
CLC number: TP391.1
On-line Access: 2026-01-08
Received: 2025-05-08
Revision Accepted: 2025-09-04
Crosschecked: 2026-01-08
Li WEIGANG, Pedro Carvalho BROM. Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation[J]. Frontiers of Information Technology & Electronic Engineering, 2025, 26(11): 2176-2203.
@article{title="Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation",
author="Li WEIGANG and Pedro Carvalho BROM",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="26",
number="11",
pages="2176-2203",
year="2025",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2500298"
}
%0 Journal Article
%T Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation
%A Li WEIGANG
%A Pedro Carvalho BROM
%J Frontiers of Information Technology & Electronic Engineering
%V 26
%N 11
%P 2176-2203
%@ 2095-9184
%D 2025
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2500298
TY - JOUR
T1 - Paradox of poetic intent in back-translation: evaluating the quality of large language models in Chinese translation
A1 - Li WEIGANG
A1 - Pedro Carvalho BROM
JO - Frontiers of Information Technology & Electronic Engineering
VL - 26
IS - 11
SP - 2176
EP - 2203
SN - 2095-9184
Y1 - 2025
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2500298
ER -
Abstract: Large language models (LLMs) excel at multilingual translation tasks, yet often struggle with culturally and semantically rich Chinese texts. This study introduces LLM-BT, a back-translation (BT) framework powered by LLMs, to evaluate Chinese → intermediate language → Chinese translation quality across five LLMs and three traditional systems. We construct a diverse corpus of scientific abstracts, historical paradoxes, and literary metaphors that reflects the lexical and semantic complexity of Chinese. Using our modular NLPMetrics system, which combines bilingual evaluation understudy (BLEU), character F-score (CHRF), translation edit rate (TER), and semantic similarity (SS), we find that LLMs outperform traditional tools on cultural and literary tasks. However, the results also uncover a high-dimensional behavioral phenomenon, the paradox of poetic intent, in which surface fluency is preserved while metaphorical or emotional depth is lost. Additionally, some models exhibit verbatim BT, suggesting a form of data-driven quasi-self-awareness, particularly under repeated or cross-model evaluation. To address BLEU’s limitations for Chinese, we propose a Jieba-segmentation BLEU variant that incorporates word-frequency and n-gram weighting, improving sensitivity to lexical segmentation and term consistency. Supplementary tests show that in certain semantic dimensions, LLM outputs approach the fidelity of human poetic translations, despite lacking deeper metaphorical intent. Overall, this study reframes the traditional fidelity vs. fluency evaluation as a richer, multi-layered analysis of LLM behavior, offering a transparent framework that contributes to explainable artificial intelligence and identifies new research pathways in cultural natural language processing and multilingual LLM alignment.
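The Jieba-segmentation BLEU variant described in the abstract rests on a simple observation: for Chinese, BLEU changes markedly depending on whether n-grams are counted over raw characters or over segmented words. The sketch below is a minimal, self-contained illustration of that sensitivity, not the paper's implementation: the word segmentation is done by hand (standing in for Jieba output), and add-one smoothing stands in for the paper's word-frequency weighting, for which the `weights` parameter is only a hook.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=4, weights=None):
    """Sentence-level BLEU (Papineni et al., 2002) with add-one smoothing
    for n > 1. `weights` is where a frequency- or n-gram-weighted variant
    could plug in; uniform weights recover standard BLEU."""
    if weights is None:
        weights = [1.0 / max_n] * max_n
    log_prec = 0.0
    for n, w in zip(range(1, max_n + 1), weights):
        ref_counts = Counter(ngrams(reference, n))
        hyp_counts = Counter(ngrams(hypothesis, n))
        # modified (clipped) n-gram precision
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n == 1:
            prec = max(overlap / max(total, 1), 1e-9)
        else:  # add-one smoothing keeps short segments from scoring 0
            prec = (overlap + 1) / (total + 1)
        log_prec += w * math.log(prec)
    # brevity penalty
    bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return bp * math.exp(log_prec)

# One original sentence and its back-translation, scored at two granularities.
ref_chars = list("机器翻译质量评估")          # character tokens
hyp_chars = list("机器翻译质量评价")
ref_words = ["机器", "翻译", "质量", "评估"]  # hand-segmented word tokens
hyp_words = ["机器", "翻译", "质量", "评价"]

char_bleu = bleu(ref_chars, hyp_chars)
word_bleu = bleu(ref_words, hyp_words)
print(f"character-level BLEU: {char_bleu:.3f}")
print(f"word-level BLEU:      {word_bleu:.3f}")
```

Because 评估 and 评价 share the character 评, character-level n-grams overlap heavily while the word-level tokens differ outright, so the word-segmented score comes out noticeably lower. That sensitivity to segmentation and term choice is the behavior a word-aware BLEU variant is designed to capture.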