Frontiers of Information Technology & Electronic Engineering

ISSN 2095-9184 (print), ISSN 2095-9230 (online)

Shared-weight multimodal translation model for recognizing Chinese variant characters

Abstract: Recognizing Chinese variant characters addresses the semantic ambiguity and confusion that such characters introduce, which pose potential risks to the security of Web content and complicate the governance of sensitive words. Most existing approaches prioritize acquiring contextual knowledge from Chinese corpora and vocabularies during pretraining, and often overlook the inherent phonological and morphological characteristics of the Chinese language. To address these issues, we propose a shared-weight multimodal translation model (SMTM) based on multimodal information of Chinese characters, which integrates the phonology of Pinyin and the morphology of fonts into each Chinese character token to learn the deeper semantics of variant text. Specifically, we encode the Pinyin features of Chinese characters using an embedding layer and extract their font features directly with convolutional neural networks. Considering the multimodal similarity between the source and target sentences in the Chinese variant-character-recognition task, we design a shared-weight embedding mechanism that uses heuristic information from the source sentences to generate target sentences during training. Simulation results show that the proposed SMTM achieves 89.550% on the bilingual evaluation understudy (BLEU) metric and 79.480% on the F1 metric, a significant improvement over state-of-the-art baseline models.

Key words: Chinese variant characters; Multimodal model; Translation model; Phonology and morphology
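
The embedding scheme described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the class name `SharedMultimodalEmbedding`, the embedding width, the toy 4x4 glyph bitmaps, the single 2x2 convolution with global max pooling, and the fusion by element-wise summation are all assumptions made for illustration. The sketch only shows the general idea of combining a character embedding, a Pinyin embedding, and a convolutional glyph feature into one token vector, and of reusing the same embedding weights on the source and target sides.

```python
import random

random.seed(0)   # fixed seed so the toy weights are reproducible
EMB_DIM = 4      # hypothetical embedding width


def make_table(vocab, dim):
    """Random embedding table: one vector per vocabulary entry."""
    return {tok: [random.uniform(-1, 1) for _ in range(dim)] for tok in vocab}


def conv2d_maxpool(glyph, kernel):
    """Valid 2D cross-correlation followed by global max pooling (one scalar)."""
    kh, kw = len(kernel), len(kernel[0])
    best = float("-inf")
    for i in range(len(glyph) - kh + 1):
        for j in range(len(glyph[0]) - kw + 1):
            s = sum(glyph[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            best = max(best, s)
    return best


class SharedMultimodalEmbedding:
    """Toy token embedding = character + Pinyin + glyph features (assumed sum)."""

    def __init__(self, chars, pinyins, dim=EMB_DIM):
        self.char_table = make_table(chars, dim)
        self.pinyin_table = make_table(pinyins, dim)
        # One random 2x2 kernel per output dimension (stand-in for a CNN).
        self.kernels = [[[random.uniform(-1, 1) for _ in range(2)]
                         for _ in range(2)] for _ in range(dim)]

    def embed(self, char, pinyin, glyph):
        c = self.char_table[char]
        p = self.pinyin_table[pinyin]
        g = [conv2d_maxpool(glyph, k) for k in self.kernels]
        # Fusion by element-wise summation is an assumption of this sketch.
        return [ci + pi + gi for ci, pi, gi in zip(c, p, g)]


# Hypothetical 4x4 bitmaps standing in for rendered font images.
GLYPH_VARIANT = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 1, 1, 1], [1, 0, 0, 1]]
GLYPH_STANDARD = [[0, 1, 1, 0], [1, 0, 0, 1], [1, 1, 1, 0], [1, 0, 0, 1]]

# Weight sharing: the encoder (source) and decoder (target) sides use the
# very same embedding object, so both read the same parameters.
encoder_embedding = SharedMultimodalEmbedding(chars=["薇", "微"], pinyins=["wei"])
decoder_embedding = encoder_embedding

src_vec = encoder_embedding.embed("薇", "wei", GLYPH_VARIANT)   # variant character
tgt_vec = decoder_embedding.embed("微", "wei", GLYPH_STANDARD)  # standard character
print(len(src_vec), len(tgt_vec))
```

In this toy setting the variant and standard characters share the same Pinyin vector and have similar glyph features, which is the kind of source-target similarity the shared-weight design is meant to exploit.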

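Chinese Summary: Shared-weight multimodal translation model for recognizing Chinese variant characters (面向中文变体字识别的共享权重多模态翻译模型)

Yuankang SUN (孙元康)1,2, Bing LI (李冰)1,2, Lexiang LI (李乐翔)1,2, Peng YANG (杨鹏)1,2, Dongmei YANG (杨冬梅)3
1School of Computer Science and Engineering, Southeast University, Nanjing 210000, China
2Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, Nanjing 210000, China
3School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
Abstract: The task of recognizing Chinese variant characters aims to resolve the semantic ambiguity and confusion found in Chinese characters, which pose potential risks to the security of Web content and make the management of sensitive words more complex. Most existing methods focus on acquiring contextual semantics from Chinese corpora and vocabularies during pretraining, and often overlook the inherent phonological and morphological features of Chinese. To address these problems, this paper proposes a shared-weight multimodal translation model for recognizing Chinese variant characters. The model integrates the phonological features of Pinyin and the morphological features of fonts into each Chinese token to learn the deep semantic features of variant text. Specifically, the phonological features of Pinyin are encoded through an embedding layer, and the morphological features of fonts are learned with convolutional neural networks. Considering the multimodal feature similarity between source and target sentences in the Chinese variant-character-recognition task, a shared-weight embedding mechanism is designed to generate target sentences using heuristic information from the source sentences during training. Experimental results show that the proposed shared-weight multimodal translation model reaches 89.550% on bilingual evaluation understudy (BLEU) and 79.480% on F1, a significant improvement over current state-of-the-art baseline models.

Key words: Chinese variant characters; Multimodal model; Translation model; Phonology and morphology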

DOI: 10.1631/FITEE.2400504

CLC number: TP391

On-line Access: 2025-07-28

Received: 2024-06-11

Revision Accepted: 2024-10-10

Crosschecked: 2025-07-30

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952276; Fax: +86-571-87952331; E-mail: jzus@zju.edu.cn
Copyright © 2000~ Journal of Zhejiang University-SCIENCE