CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2018-01-08
Heung-yeung Shum, Xiao-dong He, Di Li. From Eliza to XiaoIce: challenges and opportunities with social chatbots[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(1): 10-26.
@article{title="From Eliza to XiaoIce: challenges and opportunities with social chatbots",
author="Heung-yeung Shum and Xiao-dong He and Di Li",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="19",
number="1",
pages="10-26",
year="2018",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1700826"
}
%0 Journal Article
%T From Eliza to XiaoIce: challenges and opportunities with social chatbots
%A Heung-yeung Shum
%A Xiao-dong He
%A Di Li
%J Frontiers of Information Technology & Electronic Engineering
%V 19
%N 1
%P 10-26
%@ 2095-9184
%D 2018
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1700826
TY - JOUR
T1 - From Eliza to XiaoIce: challenges and opportunities with social chatbots
A1 - Heung-yeung Shum
A1 - Xiao-dong He
A1 - Di Li
JO - Frontiers of Information Technology & Electronic Engineering
VL - 19
IS - 1
SP - 10
EP - 26
SN - 2095-9184
Y1 - 2018
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1700826
ER -
Abstract: Conversational systems have come a long way since their inception in the 1960s. After decades of research and development, we have seen progress from Eliza and Parry in the 1960s and 1970s, to task-completion systems as in the Defense Advanced Research Projects Agency (DARPA) Communicator program in the 2000s, to intelligent personal assistants such as Siri in the 2010s, to today’s social chatbots like XiaoIce. Social chatbots’ appeal lies not only in their ability to respond to users’ diverse requests, but also in their ability to establish an emotional connection with users. The latter is done by satisfying users’ need for communication, affection, and social belonging. To further the advancement and adoption of social chatbots, their design must focus on user engagement and take both intellectual quotient (IQ) and emotional quotient (EQ) into account. Users should want to engage with a social chatbot; as such, we define the success metric for social chatbots as conversation-turns per session (CPS). Using XiaoIce as an illustrative example, we discuss key technologies in building social chatbots, from core chat to visual awareness to skills. We also show how XiaoIce can dynamically recognize emotion and engage the user throughout long conversations with appropriate interpersonal responses. As we become the first generation of humans ever living with artificial intelligence (AI), we have a responsibility to design social chatbots to be both useful and empathetic, so they will become ubiquitous and help society as a whole.
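To make the CPS metric named in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that CPS is the mean number of conversation turns over a set of sessions and that one turn is a user message paired with the chatbot's reply. The `Session` type, the function name, and the sample data are all hypothetical and do not come from the paper.

```python
# Hypothetical data model (an assumption, not from the paper):
# a session is a list of turns, and a turn is a (user_message, bot_reply) pair.
Session = list[tuple[str, str]]


def conversation_turns_per_session(sessions: list[Session]) -> float:
    """Return the average number of turns per session (CPS).

    Assumes CPS is the arithmetic mean of turn counts across sessions.
    """
    if not sessions:
        raise ValueError("at least one session is required")
    total_turns = sum(len(session) for session in sessions)
    return total_turns / len(sessions)


# Made-up sample sessions, for illustration only.
sample_sessions: list[Session] = [
    [("hi", "hello!"), ("how are you?", "good, and you?")],              # 2 turns
    [("any movie tips?", "what genre do you like?")],                     # 1 turn
    [("i feel sad", "what happened?"),
     ("bad day at work", "that sounds hard"),
     ("thanks", "anytime")],                                              # 3 turns
]

cps = conversation_turns_per_session(sample_sessions)
print(f"CPS = {cps:.1f}")
```

Under these assumptions, a higher CPS indicates that users stay in the conversation longer, which is how the abstract frames engagement. How a session boundary is drawn (for example, by an idle timeout) is left open here.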