Automatic Labeled Data Generation for Person Named Entity Disambiguation on the Ming Shilu

1. Abstract

One important task of historical research in the digital humanities (DH) is identifying person names in historical texts. This task can be divided into two subtasks: person named entity recognition (PNER) and person named entity disambiguation (PNED). PNED links each person named entity (PNE) mention to a specific person profile in a reference knowledge base. The main challenge for machine-learning-based PNED is the lack of annotated data. We therefore design an automatic approach to labeling training data. We choose the Ming Shilu as our target historical text and use the Ming-Qing Archives Name Authority Database as our reference knowledge base, which contains 14,070 government officials who lived during the Ming dynasty. Our BERT-based model reaches an accuracy of 90.1%, which shows that our approach can generate high-quality labeled data for the PNED task on Chinese historical texts. When trivial instances are included (the general situation), the accuracy is even higher (~98%).

Richard Tzong-Han Tsai (thtsai@csie.ncu.edu.tw), Department of Computer Science and Information Engineering, National Central University, Taiwan, and Research Center for Humanities and Social Sciences, Academia Sinica, Taiwan; Cheng-Han Wu, Department of Computer Science and Information Engineering, National Central University, Taiwan; Pi-Ling Pai, Research Center for Humanities and Social Sciences, Academia Sinica, Taiwan; and I-Chun Fan, Institute of History and Philology, Academia Sinica, Taiwan
