CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2015-10-15
Jie Zhou, Bi-cheng Li, Gang Chen. Automatically building large-scale named entity recognition corpora from Chinese Wikipedia[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1500067
Abstract: Constructing NER corpora is important but labor-intensive. This paper proposes an automatic method for constructing NER corpora from Chinese Wikipedia, and the experiments show promising results. The work is meaningful, and the constructed corpora can be used in many Chinese NLP applications. The paper is well written.
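Automatic generation of large-scale named entity recognition corpora from Chinese Wikipedia

Innovations: Based on the characteristics of Chinese Wikipedia, this paper designs four categories of heuristic rules and combines them with a supervised named entity classifier, so that the entity types of Chinese Wikipedia articles are identified accurately and comprehensively. To avoid the missing annotations caused by missing wiki links, the paper uses the boundary information of outgoing links to discover implicit mentions in Wikipedia documents, and uses entity linking to identify the entity types of ambiguous mentions. The paper also proposes an annotated-corpus selection method based on core article expansion, which adapts the corpus to the domain of the test data.

Method: The overall workflow is shown in Fig. 2 of the original paper. The method consists of three main steps: entity classification of explicit mentions, type identification of implicit mentions, and annotated-corpus selection. For explicit mentions, heuristic rules are combined with a supervised entity classifier to identify entity types accurately and comprehensively. For implicit mentions, a new method is proposed that discovers implicit mentions in Wikipedia documents and identifies the entity types of ambiguous mentions. For corpus selection, a method based on core article expansion is proposed to achieve domain adaptation to the test data.

Conclusion: The experimental results show that the proposed method can automatically generate large-scale Chinese NER corpora. In addition, NER models trained on the generated corpora combined with gold-standard corpora achieve better performance.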
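The implicit-mention step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "boundary information of outgoing links" can be approximated by reusing the anchor texts of explicit `[[target|anchor]]` links to tag later unlinked occurrences of the same string. The function name `annotate`, the `entity_types` lookup table, and the `LOC` label are hypothetical, and the entity-linking step for ambiguous mentions is replaced here by simply keeping the first type seen for each anchor.

```python
import re

# Matches [[Target]] and [[Target|anchor text]] wiki links.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def annotate(wikitext, entity_types):
    """Strip wiki links and return (plain_text, spans).

    entity_types is a hypothetical mapping from article title to entity
    type, standing in for the output of the explicit-mention classifier.
    Each span is (start, end, entity_type, kind), where kind is
    "explicit" (came from a link) or "implicit" (unlinked repeat).
    """
    parts, spans, anchors = [], [], {}
    pos = 0      # read position in the wikitext
    out_len = 0  # length of the plain text produced so far
    for m in WIKILINK.finditer(wikitext):
        before = wikitext[pos:m.start()]
        parts.append(before)
        out_len += len(before)
        target = m.group(1).strip()
        anchor = (m.group(2) or m.group(1)).strip()
        parts.append(anchor)
        etype = entity_types.get(target)
        if etype is not None and anchor:
            spans.append((out_len, out_len + len(anchor), etype, "explicit"))
            # Simplification: keep the first type seen for an anchor
            # instead of disambiguating with entity linking.
            anchors.setdefault(anchor, etype)
        out_len += len(anchor)
        pos = m.end()
    parts.append(wikitext[pos:])
    plain = "".join(parts)

    # Tag unlinked occurrences of known anchors, longest anchors first,
    # skipping any occurrence that overlaps an already tagged span.
    covered = [(s, e) for s, e, _, _ in spans]
    for anchor in sorted(anchors, key=len, reverse=True):
        for m in re.finditer(re.escape(anchor), plain):
            s, e = m.span()
            if all(e <= cs or s >= ce for cs, ce in covered):
                spans.append((s, e, anchors[anchor], "implicit"))
                covered.append((s, e))
    spans.sort()
    return plain, spans


if __name__ == "__main__":
    text = "[[北京市|北京]]是[[中国]]的首都。北京有很多大学。"
    plain, spans = annotate(text, {"北京市": "LOC", "中国": "LOC"})
    print(plain)
    for span in spans:
        print(span)
```

In this example the second, unlinked occurrence of 北京 is tagged as an implicit mention with the type of the article linked earlier in the same document. A fuller pipeline would need real disambiguation for anchors that can refer to more than one article.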
Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou
310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2025 Journal of Zhejiang University-SCIENCE