Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2015 Vol.16 No.7 P.541-552
BUEES: a bottom-up event extraction system
Abstract: Traditional event extraction systems focus mainly on event type identification and event participant extraction, and rely on pre-specified event type paradigms and manually annotated corpora. However, different domains have different event type paradigms, so moving to a new domain means building a new event type paradigm and annotating a new corpus from scratch. This requires massive human effort and prevents event extraction from being widely applicable. In this paper, we present BUEES, a bottom-up event extraction system that extracts events from the web in a completely unsupervised way. The system automatically builds an event type paradigm from the input corpus, and then extracts a large number of instance patterns for these events. It then extracts event arguments according to these patterns. In a series of experiments, we evaluate BUEES and compare it with a state-of-the-art supervised Chinese event extraction system. Experimental results show that BUEES performs comparably to the supervised system (5% higher F-measure in event type identification and 3% higher F-measure in event argument extraction) without any human effort.
Key words: Event extraction, Unsupervised learning, Bottom-up
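Innovation: This paper is the first to propose a clustering-based method for automatically discovering event types. Unlike traditional event extraction techniques, the method requires neither predefined event types nor prior domain knowledge. It is therefore an attempt at domain transfer, and is particularly suited to domains with limited knowledge and resources.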
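The idea of discovering event types by clustering trigger words, rather than predefining them, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the source clusters domain event words using HowNet, whereas `TOY_FEATURES`, `toy_similarity`, the greedy average-link procedure, and the threshold value are all hypothetical stand-ins, and the English words are used only for readability.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical stand-in for a lexical resource such as HowNet:
# each trigger word maps to a small set of made-up semantic features.
TOY_FEATURES: Dict[str, FrozenSet[str]] = {
    "acquire": frozenset({"transfer", "ownership", "business"}),
    "purchase": frozenset({"transfer", "ownership", "money"}),
    "merge": frozenset({"combine", "ownership", "business"}),
    "release": frozenset({"publish", "music"}),
    "publish": frozenset({"publish", "music", "text"}),
}


def toy_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the toy feature sets (an assumption, not HowNet)."""
    fa, fb = TOY_FEATURES[a], TOY_FEATURES[b]
    return len(fa & fb) / len(fa | fb)


def cluster_triggers(
    words: List[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Greedy clustering: put each word into the cluster with the highest
    average similarity, or open a new cluster if none reaches the threshold."""
    clusters: List[List[str]] = []
    for word in words:
        best, best_score = None, threshold
        for cluster in clusters:
            score = sum(similarity(word, m) for m in cluster) / len(cluster)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([word])  # the word starts a new event type
        else:
            best.append(word)
    return clusters


event_types = cluster_triggers(list(TOY_FEATURES), toy_similarity, threshold=0.4)
print(event_types)
```

Each resulting cluster plays the role of one automatically discovered event type. The number of types is not fixed in advance; it falls out of the similarity threshold, which is the design choice that removes the need for a predefined paradigm.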
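Method: Because predicate verbs are key units for characterizing domain events, the method uses dependency parsing information to extract domain event words and clusters them with HowNet to obtain the different event types (Fig. 2). Event arguments are then extracted. The paper proposes a bootstrapping-based framework for event argument extraction with three core components: (1) Pattern acquisition: event seeds are submitted as web search queries to retrieve event instances, from which initial event patterns are generated according to a set of rules (Fig. 3). (2) Pattern generalization: the initial patterns are too rigid and miss many event matches, so a generalization method relaxes them according to set rules, raising recall as far as possible while keeping precision unchanged (Algorithm 3). (3) Pattern filtering: generalization introduces some noise, so a set of filtering rules is proposed to keep that noise to a minimum (Table 3).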
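The three bootstrapping stages (pattern acquisition, generalization, filtering) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the seed format, the slot markers `<ARG1>`/`<ARG2>`, the specific generalization and filtering rules, and the in-memory `corpus` standing in for web search results are hypothetical, and the paper's actual rules (its Fig. 3, Algorithm 3, and Table 3) are not reproduced. The company names are invented.

```python
import re
from collections import Counter
from typing import List, Tuple

Seed = Tuple[str, str, str]  # (trigger, arg1, arg2): a hypothetical seed format
SLOT1, SLOT2 = "<ARG1>", "<ARG2>"


def acquire_patterns(seeds: List[Seed], sentences: List[str]) -> List[str]:
    """Acquisition: turn sentences that contain a seed into slot patterns."""
    patterns = []
    for trigger, arg1, arg2 in seeds:
        for sent in sentences:
            if trigger in sent and arg1 in sent and arg2 in sent:
                patterns.append(sent.replace(arg1, SLOT1).replace(arg2, SLOT2))
    return patterns


def generalize(pattern: str) -> str:
    """Generalization (toy rule): drop context outside the slot-to-slot span."""
    i, j = pattern.index(SLOT1), pattern.index(SLOT2)
    return pattern[min(i, j): max(i + len(SLOT1), j + len(SLOT2))]


def filter_patterns(patterns: List[str], min_support: int = 2) -> List[str]:
    """Filtering (toy rule): keep patterns seen often enough that still
    contain some text besides the two slots."""
    kept = []
    for pat, count in Counter(patterns).items():
        middle = pat.replace(SLOT1, "").replace(SLOT2, "").strip()
        if count >= min_support and middle:
            kept.append(pat)
    return kept


def extract_arguments(patterns: List[str], sentence: str) -> List[Tuple[str, str]]:
    """Apply the surviving patterns to a new sentence."""
    results = []
    for pat in patterns:
        regex = (
            re.escape(pat)
            .replace(re.escape(SLOT1), r"(?P<arg1>\w+)")
            .replace(re.escape(SLOT2), r"(?P<arg2>\w+)")
        )
        match = re.search(regex, sentence)
        if match:
            results.append((match.group("arg1"), match.group("arg2")))
    return results


# Invented seed and sentences; the source retrieves instances from the web.
seeds = [("acquired", "AlphaCorp", "BetaSoft")]
corpus = [
    "In spring AlphaCorp acquired BetaSoft for a large sum",
    "Reports say AlphaCorp acquired BetaSoft",
]
initial = acquire_patterns(seeds, corpus)
final = filter_patterns([generalize(p) for p in initial])
print(final)
print(extract_arguments(final, "Yesterday GammaInc acquired DeltaLabs"))
```

The initial patterns match only sentences with identical surrounding context; trimming that context lets one pattern cover both corpus sentences and the unseen one, which is the recall gain generalization is meant to deliver. The support threshold in the filter is one simple way to discard noisy patterns that generalization may introduce.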
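Conclusion: A bottom-up event extraction system is proposed. On the public ACE corpus, the system outperforms the current best baseline method. It also achieves strong results on our manually constructed music-domain and finance-domain datasets, which indicates that the method adapts well to new domains.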
DOI: 10.1631/FITEE.1400405
CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2015-06-08