|
Frontiers of Information Technology & Electronic Engineering
ISSN 2095-9184 (print), ISSN 2095-9230 (online)
2024 Vol.25 No.1 P.121-134
Enhancing low-resource cross-lingual summarization from noisy data with fine-grained reinforcement learning
Abstract: Cross-lingual summarization (CLS) is the task of generating a summary in a target language from a document in a source language. Recently, end-to-end CLS models have achieved impressive results using large-scale, high-quality datasets typically constructed by translating monolingual summary corpora into CLS corpora. However, due to the limited performance of low-resource language translation models, translation noise can seriously degrade the performance of these models. In this paper, we propose a fine-grained reinforcement learning approach to address low-resource CLS based on noisy data. We introduce the source language summary as a gold signal to alleviate the impact of the translated noisy target summary. Specifically, we design a reinforcement reward by calculating the word correlation and word missing degree between the source language summary and the generated target language summary, and combine it with cross-entropy loss to optimize the CLS model. To validate the performance of our proposed model, we construct Chinese-Vietnamese and Vietnamese-Chinese CLS datasets. Experimental results show that our proposed model outperforms the baselines in terms of both the ROUGE score and BERTScore.
Key words: Cross-lingual summarization; Low-resource language; Noisy data; Fine-grained reinforcement learning; Word correlation; Word missing degree
1昆明理工大学信息工程与自动化学院,中国昆明市,650504
2昆明理工大学云南省人工智能重点实验室,中国昆明市,650504
摘要:跨语言摘要是从源语言文档生成目标语言摘要的任务。最近,端到端跨语言摘要模型通过使用大规模、高质量数据集取得令人瞩目的结果,这些数据集通常是通过将单语摘要语料库翻译成跨语言摘要语料库而构建的。然而,由于低资源语言翻译模型性能有限,翻译噪声会严重降低模型性能。提出一种细粒度强化学习方法解决基于噪声数据的低资源跨语言摘要问题。引入源语言摘要作为黄金信号,减轻翻译后噪声目标摘要的影响。具体来说,通过计算源语言摘要和生成目标语言摘要之间的词相关性和词缺失度设计强化奖励,并将其与交叉熵损失相结合优化跨语言摘要模型。为验证所提出模型性能,构建汉语-越南语和越南语-汉语跨语言摘要数据集。实验结果表明,所提出模型在ROUGE分数和BERTScore方面优于其他基线。
关键词组:
References:
Open peer comments: Debate/Discuss/Question/Opinion
<1>
DOI:
10.1631/FITEE.2300296
CLC number:
TP391
Download Full Text:
Downloaded:
1480
Download summary:
<Click Here>Downloaded:
319Clicked:
1569
Cited:
0
On-line Access:
2024-08-27
Received:
2023-10-17
Revision Accepted:
2024-05-08
Crosschecked:
2023-10-22