Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2015 Vol.16 No.7 P.541-552
BUEES: a bottom-up event extraction system
Abstract: Traditional event extraction systems focus mainly on event type identification and event participant extraction based on pre-specified event type paradigms and manually annotated corpora. However, different domains have different event type paradigms. When transferring to a new domain, we have to build a new event type paradigm and annotate a new corpus from scratch. Such conventional systems require massive human effort, which prevents event extraction from being widely applicable. In this paper, we present BUEES, a bottom-up event extraction system, which extracts events from the web in a completely unsupervised way. The system automatically builds an event type paradigm from the input corpus, and then extracts a large number of instance patterns of these events. Subsequently, the system extracts event arguments according to these patterns. Through a series of experiments, we demonstrate the good performance of BUEES and compare it to a state-of-the-art supervised Chinese event extraction system. Experimental results show that BUEES performs comparably (5% higher F-measure in event type identification and 3% higher F-measure in event argument extraction), but without any human effort.
Key words: Event extraction, Unsupervised learning, Bottom-up
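Innovation: This paper is the first to propose a clustering-based method for automatically discovering event types. Compared with traditional event extraction techniques, the method requires neither predefined event types nor prior domain knowledge. It is therefore an attempt at domain transfer, and is especially suited to domains with limited knowledge and resources.
Method: Because predicate verbs are important units for characterizing domain events, the method uses dependency-syntax information to extract domain event words and uses HowNet to cluster them into different event types (Fig. 2), and then extracts event arguments. The paper proposes a bootstrapping-based framework for event argument extraction with three core parts: (1) Pattern acquisition: event seeds are used as web search queries to retrieve event instances, and initial event patterns are generated from these instances according to certain rules (Fig. 3); (2) Pattern generalization: because the initial patterns are too rigid and miss many event matches, a generalization method relaxes the original patterns according to certain rules, raising recall as far as possible while keeping precision unchanged (Algorithm 3); (3) Pattern filtering: generalized patterns introduce some noise, so a set of filtering rules is proposed to reduce the noise introduced by generalization (Table 3).
Conclusions: A bottom-up event extraction system is proposed. On the public ACE corpus, the system outperforms the best current baseline method. It also achieves strong results on our manually constructed music-domain and finance-domain data sets, which indicates that the method adapts well to new domains.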
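The event type discovery step in the method summary above (grouping domain event words by lexical similarity) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TOY_SIMILARITY` table stands in for a HowNet-based similarity measure, the English words and scores are invented, and the single-link threshold clustering with `threshold=0.7` is a stand-in, not necessarily the clustering procedure used in the paper.

```python
from itertools import combinations

# Hypothetical pairwise scores standing in for a HowNet-based similarity measure.
TOY_SIMILARITY = {
    frozenset(("buy", "purchase")): 0.9,
    frozenset(("buy", "acquire")): 0.8,
    frozenset(("purchase", "acquire")): 0.85,
    frozenset(("sing", "perform")): 0.8,
}


def similarity(a, b):
    """Return the toy similarity of two event words (0.0 if the pair is unknown)."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.0)


def cluster_event_words(words, threshold=0.7):
    """Single-link clustering: words whose similarity reaches the threshold share a cluster."""
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    # Sort for a deterministic result; each cluster is one candidate event type.
    return sorted(sorted(c) for c in clusters.values())


clusters = cluster_event_words(
    ["buy", "purchase", "acquire", "sing", "perform", "resign"]
)
print(clusters)
```

Each resulting cluster plays the role of one candidate event type; a word with no sufficiently similar neighbor (here "resign") forms a singleton type of its own.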
DOI: 10.1631/FITEE.1400405
CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2015-06-08