Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Enhancing low-resource cross-lingual summarization from noisy data with fine-grained reinforcement learning

Abstract: Cross-lingual summarization (CLS) is the task of generating a summary in a target language from a document in a source language. Recently, end-to-end CLS models have achieved impressive results using large-scale, high-quality datasets typically constructed by translating monolingual summary corpora into CLS corpora. However, due to the limited performance of low-resource language translation models, translation noise can seriously degrade the performance of these models. In this paper, we propose a fine-grained reinforcement learning approach to address low-resource CLS based on noisy data. We introduce the source language summary as a gold signal to alleviate the impact of the translated noisy target summary. Specifically, we design a reinforcement reward by calculating the word correlation and word missing degree between the source language summary and the generated target language summary, and combine it with cross-entropy loss to optimize the CLS model. To validate the performance of our proposed model, we construct Chinese-Vietnamese and Vietnamese-Chinese CLS datasets. Experimental results show that our proposed model outperforms the baselines in terms of both the ROUGE score and BERTScore.

Key words: Cross-lingual summarization; Low-resource language; Noisy data; Fine-grained reinforcement learning; Word correlation; Word missing degree
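The reward described in the abstract can be sketched in code. The following is an illustrative reconstruction, not the paper's exact formulation: the word-correlation and word-missing-degree measures, the toy bilingual `lexicon` used to compare words across languages, and the `gamma` mixing weight are all hypothetical stand-ins for the authors' method.

```python
# Illustrative sketch (not the paper's exact formulas): a fine-grained
# reward built from word correlation and word missing degree between a
# source-language summary and a generated target-language summary.
# The bilingual `lexicon` (target word -> source word) is a hypothetical
# stand-in for a real cross-lingual alignment component.

def word_correlation(src_words, gen_words, lexicon):
    """Fraction of generated words whose source-language counterpart
    appears in the source-language summary (precision-like signal)."""
    if not gen_words:
        return 0.0
    src_set = set(src_words)
    hits = sum(1 for w in gen_words if lexicon.get(w, w) in src_set)
    return hits / len(gen_words)

def word_missing_degree(src_words, gen_words, lexicon):
    """Fraction of source-summary words with no counterpart in the
    generated summary (recall-like penalty; higher = more missing)."""
    if not src_words:
        return 0.0
    covered = {lexicon.get(w, w) for w in gen_words}
    missing = sum(1 for w in src_words if w not in covered)
    return missing / len(src_words)

def fine_grained_reward(src_words, gen_words, lexicon):
    """Reward summaries that echo the source summary's content words
    and penalize summaries that drop source content."""
    return (word_correlation(src_words, gen_words, lexicon)
            - word_missing_degree(src_words, gen_words, lexicon))

def mixed_loss(ce_loss, reward, baseline, log_prob, gamma=0.5):
    """Combine a REINFORCE-style term (reward relative to a baseline,
    scaled by the sampled sequence's log-probability) with the usual
    cross-entropy loss; `gamma` balances the two objectives."""
    rl_loss = -(reward - baseline) * log_prob
    return gamma * rl_loss + (1.0 - gamma) * ce_loss
```

For example, with a Vietnamese source summary `["mèo", "ngồi", "trên", "thảm"]` and a generated Chinese summary `["猫", "坐", "毯"]` under a three-entry toy lexicon, all generated words are correlated with the source (correlation 1.0) while one source word is uncovered (missing degree 0.25), giving a reward of 0.75.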

Chinese Summary: 基于细粒度强化学习增强噪声数据的低资源跨语言摘要 (Enhancing low-resource cross-lingual summarization from noisy data with fine-grained reinforcement learning)

Huang Yuxin1,2, Gu Huailing1,2, Yu Zhengtao1,2, Gao Yumeng1,2, Pan Tong1,2, Xu Jialong1,2
1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China
2 Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650504, China


DOI: 10.1631/FITEE.2300296
CLC number: TP391


On-line Access: 2024-02-19
Received: 2023-04-27
Revision Accepted: 2024-02-19
Crosschecked: 2023-10-22

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE