JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering 2018 Vol.19 No.1 P.10-26

From Eliza to XiaoIce: challenges and opportunities with social chatbots

Author(s): Heung-yeung Shum, Xiao-dong He, Di Li
Affiliation(s): Microsoft Corporation, Redmond, WA 98052, USA
Corresponding email(s): hshum@microsoft.com, xiaohe@microsoft.com, lidi@microsoft.com
Key Words: Conversational system, Social chatbot, Intelligent personal assistant, Artificial intelligence, XiaoIce

Heung-yeung Shum, Xiao-dong He, Di Li. From Eliza to XiaoIce: challenges and opportunities with social chatbots[J]. Frontiers of Information Technology & Electronic Engineering, 2018, 19(1): 10-26.

@article{Shum2018,
title="From Eliza to XiaoIce: challenges and opportunities with social chatbots",
author="Heung-yeung Shum and Xiao-dong He and Di Li",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="19",
number="1",
pages="10--26",
year="2018",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.1700826"
}

%0 Journal Article
%T From Eliza to XiaoIce: challenges and opportunities with social chatbots
%A Heung-yeung Shum
%A Xiao-dong He
%A Di Li
%J Frontiers of Information Technology & Electronic Engineering
%V 19
%N 1
%P 10-26
%@ 2095-9184
%D 2018
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1700826

TY  - JOUR
T1  - From Eliza to XiaoIce: challenges and opportunities with social chatbots
A1  - Shum, Heung-yeung
A1  - He, Xiao-dong
A1  - Li, Di
JF  - Frontiers of Information Technology & Electronic Engineering
VL  - 19
IS  - 1
SP  - 10
EP  - 26
SN  - 2095-9184
Y1  - 2018
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.1700826
ER  -

Abstract: Conversational systems have come a long way since their inception in the 1960s. After decades of research and development, we have seen progress from Eliza and Parry in the 1960s and 1970s, to task-completion systems such as those in the Defense Advanced Research Projects Agency (DARPA) Communicator program in the 2000s, to intelligent personal assistants such as Siri in the 2010s, to today’s social chatbots like XiaoIce. The appeal of social chatbots lies not only in their ability to respond to users’ diverse requests, but also in their ability to establish an emotional connection with users, which they do by satisfying users’ need for communication, affection, and social belonging. To further the advancement and adoption of social chatbots, their design must focus on user engagement and take both intellectual quotient (IQ) and emotional quotient (EQ) into account. Users should want to engage with a social chatbot; accordingly, we define the success metric for social chatbots as conversation-turns per session (CPS). Using XiaoIce as an illustrative example, we discuss key technologies for building social chatbots, from core chat to visual awareness to skills. We also show how XiaoIce can dynamically recognize emotion and engage the user throughout long conversations with appropriate interpersonal responses. As we become the first generation of humans ever to live with artificial intelligence (AI), we have a responsibility to design social chatbots to be both useful and empathetic, so that they will become ubiquitous and help society as a whole.
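The abstract defines the success metric for social chatbots as conversation-turns per session (CPS). The short Python sketch below shows one way such an average could be computed from logged sessions. It is not XiaoIce's implementation: the log format (each session as an ordered list of speaker labels), the function names `count_turns` and `conversation_turns_per_session`, and the rule that a turn is a maximal run of consecutive messages from one speaker are all illustrative assumptions.

```python
from typing import List, Sequence


def count_turns(speakers: Sequence[str]) -> int:
    """Count the turns in one session.

    Assumption (hypothetical rule): a turn is a maximal run of consecutive
    messages from the same speaker, so ["user", "user", "bot"] is two turns.
    """
    turns = 0
    previous = None
    for speaker in speakers:
        if speaker != previous:  # speaker changed, so a new turn starts
            turns += 1
            previous = speaker
    return turns


def conversation_turns_per_session(sessions: List[Sequence[str]]) -> float:
    """Average number of turns per session (CPS); 0.0 if there are no sessions."""
    if not sessions:
        return 0.0
    total = sum(count_turns(s) for s in sessions)
    return total / len(sessions)


if __name__ == "__main__":
    logs = [
        ["user", "bot", "user", "bot"],  # 4 turns
        ["user", "user", "bot"],         # 2 turns: the two user messages merge
    ]
    print(conversation_turns_per_session(logs))  # 3.0
```

Counting every message as its own turn, or counting only user-to-bot exchanges, are equally plausible readings of "turn"; the choice changes the absolute CPS value, so it should be fixed before comparing systems.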

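From Eliza to XiaoIce: opportunities and challenges of social chatbots

Summary: Through decades of research and development, conversational systems have taken many forms: from Eliza and Parry in the 1960s and 1970s, to the automatic task-completion systems of the Airline Travel Information System (ATIS) project, to the intelligent personal assistant Siri, and on to the social chatbot Microsoft XiaoIce. The appeal of social chatbots lies not only in their ability to respond to users’ diverse requests, but also in their ability to establish emotional connections with users; the latter is achieved by satisfying users’ emotional needs for communication, affection, and social belonging. The design of social chatbots must focus on user engagement while also taking both IQ and EQ into account. Because the goal is for users to want to talk with the chatbot, we measure the success of a social chatbot by the number of conversation turns per session (CPS). Using XiaoIce as an example, this paper discusses a series of key technologies for building social chatbots, from core chat and visual awareness to extensible skills, and demonstrates XiaoIce’s ability to dynamically recognize users’ emotions and to engage users with appropriate interpersonal responses over long interactions. For us, the first generation of humans to coexist with artificial intelligence, emotionally rich and highly capable social chatbots will soon become an indispensable part of our lives.

Key words: conversational system; social chatbot; intelligent personal assistant; artificial intelligence; XiaoIce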

