CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2017-01-01
Yu-xin Peng, Wen-wu Zhu, Yao Zhao, Chang-sheng Xu, Qing-ming Huang, Han-qing Lu, Qing-hua Zheng, Tie-jun Huang, Wen Gao. Cross-media analysis and reasoning: advances and directions[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(1): 44-57.
@article{Peng2017CrossMedia,
title="Cross-media analysis and reasoning: advances and directions",
author="Yu-xin Peng, Wen-wu Zhu, Yao Zhao, Chang-sheng Xu, Qing-ming Huang, Han-qing Lu, Qing-hua Zheng, Tie-jun Huang, Wen Gao",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="18",
number="1",
pages="44-57",
year="2017",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601787"
}
%0 Journal Article
%T Cross-media analysis and reasoning: advances and directions
%A Yu-xin Peng
%A Wen-wu Zhu
%A Yao Zhao
%A Chang-sheng Xu
%A Qing-ming Huang
%A Han-qing Lu
%A Qing-hua Zheng
%A Tie-jun Huang
%A Wen Gao
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 1
%P 44-57
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1601787
TY - JOUR
T1 - Cross-media analysis and reasoning: advances and directions
A1 - Yu-xin Peng
A1 - Wen-wu Zhu
A1 - Yao Zhao
A1 - Chang-sheng Xu
A1 - Qing-ming Huang
A1 - Han-qing Lu
A1 - Qing-hua Zheng
A1 - Tie-jun Huang
A1 - Wen Gao
JO - Frontiers of Information Technology & Electronic Engineering
VL - 18
IS - 1
SP - 44
EP - 57
SN - 2095-9184
Y1 - 2017
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1601787
ER -
Abstract: Cross-media analysis and reasoning is an active research area in computer science, and a promising direction for artificial intelligence. However, to the best of our knowledge, no existing work has summarized the state-of-the-art methods for cross-media analysis and reasoning or presented advances, challenges, and future directions for the field. To address these issues, we provide an overview as follows: (1) theory and model for cross-media uniform representation; (2) cross-media correlation understanding and deep mining; (3) cross-media knowledge graph construction and learning methodologies; (4) cross-media knowledge evolution and reasoning; (5) cross-media description and generation; (6) cross-media intelligent engines; and (7) cross-media intelligent applications. By presenting approaches, advances, and future directions in cross-media analysis and reasoning, our goal is not only to draw more attention to the state-of-the-art advances in the field, but also to provide technical insights by discussing the challenges and research directions in these areas.
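Topic (1), cross-media uniform representation, is classically grounded in canonical correlation analysis (Hotelling, 1936), which several of the references below extend (e.g., Andrew et al., 2013; Rasiwasia et al., 2014). The following minimal sketch is illustrative only and not the method proposed in the paper: it projects synthetic image and text features into a shared space with scikit-learn's CCA and ranks texts against an image query; the feature dimensions and random data are assumptions made here for the example.

# Illustrative sketch of cross-media uniform representation via CCA.
# Synthetic placeholder features; not the paper's actual pipeline.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_pairs = 500                                  # co-occurring image-text pairs
img_feats = rng.normal(size=(n_pairs, 128))    # e.g., CNN image descriptors
txt_feats = rng.normal(size=(n_pairs, 64))     # e.g., bag-of-words text vectors

# Learn projections that maximize correlation between the two modalities.
cca = CCA(n_components=10)
cca.fit(img_feats, txt_feats)
img_shared, txt_shared = cca.transform(img_feats, txt_feats)

# Cross-media retrieval: rank all texts by cosine similarity to one image query
# in the shared space.
def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

scores = cosine_sim(img_shared[:1], txt_shared)
ranking = np.argsort(-scores[0])
print("Top-5 retrieved text indices:", ranking[:5])

Deep variants surveyed in the paper replace these linear projections with neural networks, e.g., deep CCA (Andrew et al., 2013) and correspondence autoencoders (Feng et al., 2014).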
[1]Aamodt, A., Plaza, E., 1994. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun., 7(1):39-59.
[2]Adib, F., Hsu, C.Y., Mao, H., et al., 2015. Capturing the human figure through a wall. ACM Trans. Graph., 34(6):219.
[3]Andrew, G., Arora, R., Bilmes, J., et al., 2013. Deep canonical correlation analysis. Int. Conf. on Machine Learning, p.1247-1255.
[4]Antenucci, D., Li, E., Liu, S., et al., 2013. Ringtail: a generalized nowcasting system. Proc. VLDB Endow., 6(12): 1358-1361.
[5]Antol, S., Agrawal, A., Lu, J., et al., 2015. VQA: visual question answering. IEEE Int. Conf. on Computer Vision, p.2425-2433.
[6]Babenko, A., Slesarev, A., Chigorin, A., et al., 2014. Neural codes for image retrieval. European Conf. on Computer Vision, p.584-599.
[7]Brownson, R.C., Gurney, J.G., Land, G.H., 1999. Evidence-based decision making in public health. J. Publ. Health Manag. Pract., 5(5):86-97.
[8]Carlson, A., Betteridge, J., Kisiel, B., et al., 2010. Toward an architecture for never-ending language learning. AAAI Conf. on Artificial Intelligence, p.1306-1313.
[9]Chen, D.P., Weber, S.C., Constantinou, P.S., et al., 2007. Clinical arrays of laboratory measures, or “clinarrays”, built from an electronic health record enable disease subtyping by severity. AMIA Annual Symp. Proc., p.115-119.
[10]Chen, X., Shrivastava, A., Gupta, A., 2013. NEIL: extracting visual knowledge from web data. IEEE Int. Conf. on Computer Vision, p.1409-1416.
[11]Chen, Y., Carroll, R.J., Hinz, E.R.M., et al., 2013. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. J. Am. Med. Inform. Assoc., 20(e2):253-259.
[12]Cilibrasi, R.L., Vitanyi, P.M.B., 2007. The Google similarity distance. IEEE Trans. Knowl. Data Eng., 19(3):370-383.
[13]Culotta, A., 2014. Estimating county health statistics with twitter. ACM Conf. on Human Factors in Computing Systems, p.1335-1344.
[14]Daras, P., Manolopoulou, S., Axenopoulos, A., 2012. Search and retrieval of rich media objects supporting multiple multimodal queries. IEEE Trans. Multim., 14(3):734-746.
[15]Davenport, T.H., Prusak, L., 1998. Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press, Boston, p.5.
[16]Deng, J., Dong, W., Socher, R., et al., 2009. ImageNet: a large-scale hierarchical image database. IEEE Conf. on Computer Vision and Pattern Recognition, p.248-255.
[17]Dong, X., Gabrilovich, E., Heitz, G., et al., 2014. Knowledge vault: a Web-scale approach to probabilistic knowledge fusion. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.601-610.
[18]Fang, Q., Xu, C., Sang, J., et al., 2016. Folksonomy-based visual ontology construction and its applications. IEEE Trans. Multim., 18(4):702-713.
[19]Fellbaum, C., Miller, G., 1998. WordNet: an Electronic Lexical Database. MIT Press, Cambridge, MA.
[20]Feng, F., Wang, X., Li, R., 2014. Cross-modal retrieval with correspondence autoencoder. ACM Int. Conf. on Multimedia, p.7-16.
[21]Ferrucci, D., Levas, A., Bagchi, S., et al., 2013. Watson: beyond Jeopardy! Artif. Intell., 199-200:93-105.
[22]Fuentes-Pacheco, J., Ruiz-Ascencio, J., Rendón-Mancha, J.M., 2015. Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev., 43(1):55-81.
[23]Garfield, E., 2004. Historiographic mapping of knowledge domains literature. J. Inform. Sci., 30(2):119-145.
[24]Gibney, E., 2015. DeepMind algorithm beats people at classic video games. Nature, 518(7540):465-466.
[25]Ginsberg, J., Mohebbi, M., Patel, R.S., et al., 2009. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012-1014.
[26]Gong, Y., Ke, Q., Isard, M., et al., 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vis., 106(2):210-233.
[27]Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neur. Comput., 9(8):1735-1780.
[28]Hodosh, M., Young, P., Hockenmaier, J., 2013. Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res., 47(1):853-899.
[29]Hotelling, H., 1936. Relations between two sets of variates. Biometrika, 28(3-4):321-377. https://doi.org/10.1093/biomet/28.3-4.321
[30]Hsu, F., 2002. Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton University Press, Princeton, USA.
[31]Hua, Y., Wang, S., Liu, S., et al., 2014. TINA: cross-modal correlation learning by adaptive hierarchical semantic aggregation. IEEE Int. Conf. on Data Mining, p.190-199.
[32]Jia, X., Gavves, E., Fernando, B., et al., 2015. Guiding long-short term memory for image caption generation. arXiv:1509.04942.
[33]Johnson, J., Krishna, R., Stark, M., et al., 2015. Image retrieval using scene graphs. IEEE Conf. on Computer Vision and Pattern Recognition, p.3668-3678.
[34]Karpathy, A., Li, F.F., 2015. Deep visual-semantic alignments for generating image descriptions. IEEE Conf. on Computer Vision and Pattern Recognition, p.3128-3137.
[35]Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, p.1097-1105.
[36]Kulkarni, G., Premraj, V., Dhar, S., et al., 2011. Baby talk: understanding and generating simple image descriptions. IEEE Conf. on Computer Vision and Pattern Recognition, p.1601-1608.
[37]Kumar, S., Sanderford, M., Gray, V.E., et al., 2012. Evolutionary diagnosis method for variants in personal exomes. Nat. Meth., 9(9):855-856.
[38]Kuznetsova, P., Ordonez, V., Berg, T.L., et al., 2014. TREETALK: composition and compression of trees for image descriptions. Trans. Assoc. Comput. Ling., 2:351-362.
[39]Lazaric, A., 2012. Transfer in reinforcement learning: a frame-work and a survey. In: Wiering, M., van Otterlo, M. (Eds.), Reinforcement Learning: State-of-the-Art. Springer Berlin Heidelberg, Berlin, p.143-173.
[40]Lazer, D., Kennedy, R., King, G., et al., 2014. The parable of Google flu: traps in big data analysis. Science, 343(6176): 1203-1205.
[41]Lew, M.S., Sebe, N., Djeraba, C., et al., 2006. Content-based multimedia information retrieval: state of the art and challenges. ACM Trans. Multim. Comput. Commun. Appl., 2(1):1-19.
[42]Lin, T., Pantel, P., Gamon, M., et al., 2012. Active objects: actions for entity-centric search. ACM Int. Conf. on World Wide Web, p.589-598.
[43]Luo, G., Tang, C., 2008. On iterative intelligent medical search. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.3-10.
[44]Mao, X., Lin, B., Cai, D., et al., 2013. Parallel field alignment for cross media retrieval. ACM Int. Conf. on Multimedia, p.897-906.
[45]McGurk, H., MacDonald, J., 1976. Hearing lips and seeing voices. Nature, 264(5588):746-748.
[46]MIT Technology Review, 2014. Data driven healthcare. https://www.technologyreview.com/business-report/data-driven-health-care/free [Dec. 06, 2016].
[47]Mnih, V., Kavukcuoglu, K., Silver, D., et al., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.
[48]Ngiam, J., Khosla, A., Kim, M., et al., 2011. Multimodal deep learning. Int. Conf. on Machine Learning, p.689-696.
[49]Ordonez, V., Kulkarni, G., Berg, T.L., 2011. Im2text: describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, p.1143-1151.
[50]Pan, Y.H., 2016. Heading toward artificial intelligence 2.0. Engineering, 2(4):409-413.
[51]Pearl, J., 2000. Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge, UK.
[52]Peng, Y., Huang, X., Qi, J., 2016a. Cross-media shared representation by hierarchical learning with multiple deep networks. Int. Joint Conf. on Artificial Intelligence, p.3846-3853.
[53]Peng, Y., Zhai, X., Zhao, Y., et al., 2016b. Semi-supervised cross-media feature learning with unified patch graph regularization. IEEE Trans. Circ. Syst. Video Technol., 26(3):583-596.
[54]Prabhu, N., Babu, R.V., 2015. Attribute-Graph: a graph based approach to image ranking. IEEE Int. Conf. on Computer Vision, p.1071-1079.
[55]Radinsky, K., Davidovich, S., Markovitch, S., 2012. Learning causality for news events prediction. Int. Conf. on World Wide Web, p.909-918.
[56]Rasiwasia, N., Costa Pereira, J., Coviello, E., et al., 2010. A new approach to cross-modal multimedia retrieval. ACM Int. Conf. on Multimedia, p.251-260.
[57]Rasiwasia, N., Mahajan, D., Mahadevan, V., et al., 2014. Cluster canonical correlation analysis. Int. Conf. on Artificial Intelligence and Statistics, p.823-831.
[58]Rautaray, S.S., Agrawal, A., 2015. Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev., 43(1):1-54.
[59]Roller, S., Schulte im Walde, S., 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. Conf. on Empirical Methods in Natural Language Processing, p.1146-1157.
[60]Sadeghi, F., Divvala, S.K., Farhadi, A., 2015. VisKE: visual knowledge extraction and question answering by visual verification of relation phrases. IEEE Conf. on Computer Vision and Pattern Recognition, p.1456-1464.
[61]Singhal, A., 2012. Introducing the knowledge graph: things, not strings. Official Blog of Google.
[62]Socher, R., Lin, C., Ng, A.Y., et al., 2011. Parsing natural scenes and natural language with recursive neural networks. Int. Conf. on Machine Learning, p.129-136.
[63]Socher, R., Karpathy, A., Le, Q., et al., 2014. Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Ling., 2:207-218.
[64]Srivastava, N., Salakhutdinov, R., 2012. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems, p.2222-2230.
[65]Suchanek, F., Weikum, G., 2014. Knowledge bases in the age of big data analytics. Proc. VLDB Endow., 7(13):1713-1714.
[66]Uyar, A., Aliyu, F.M., 2015. Evaluating search features of Google Knowledge Graph and Bing Satori: entity types, list searches and query interfaces. Onl. Inform. Rev., 39(2):197-213.
[67]Vinyals, O., Toshev, A., Bengio, S., et al., 2015. Show and tell: a neural image caption generator. IEEE Conf. on Computer Vision and Pattern Recognition, p.3156-3164.
[68]Wang, D., Cui, P., Ou, M., et al., 2015. Learning compact hash codes for multimodal representations using orthogonal deep structure. IEEE Trans. Multim., 17(9): 1404-1416.
[69]Wang, W., Ooi, B.C., Yang, X., et al., 2014. Effective multi-modal retrieval based on stacked auto-encoders. Proc. VLDB Endow., 7(8):649-660.
[70]Wang, Y., Wu, F., Song, J., et al., 2014. Multi-modal mutual topic reinforce modeling for cross-media retrieval. ACM Int. Conf. on Multimedia, p.307-316.
[71]Wei, Y., Zhao, Y., Lu, C., et al., 2017. Cross-modal retrieval with CNN visual features: a new baseline. IEEE Trans. Cybern., 47(2):449-460.
[72]Wu, W., Xu, J., Li, H., 2010. Learning similarity function between objects in heterogeneous spaces. Technical Report MSR-TR-2010-86, Microsoft.
[73]Xu, K., Ba, J., Kiros, R., et al., 2015. Show, attend and tell: neural image caption generation with visual attention. Int. Conf. on Machine Learning, p.2048-2057.
[74]Yang, Y., Zhuang, Y., Wu, F., et al., 2008. Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Trans. Multim., 10(3):437-446.
[75]Yang, Y., Teo, C.L., Daume, H., et al., 2011. Corpus-guided sentence generation of natural images. Conf. on Empirical Methods in Natural Language Processing, p.444-454.
[76]Yang, Y., Nie, F., Xu, D., et al., 2012. A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans. Patt. Anal. Mach. Intell., 34(4):723-742.
[77]Yuan, L., Pan, C., Ji, S., et al., 2014. Automated annotation of developmental stages of Drosophila embryos in images containing spatial patterns of expression. Bioinformatics, 30(2):266-273.
[78]Zhai, X., Peng, Y., Xiao, J., 2014. Learning cross-media joint representation with sparse and semi-supervised regularization. IEEE Trans. Circ. Syst. Video Technol., 24(6):965-978.
[79]Zhang, H., Yang, Y., Luan, H., et al., 2014a. Start from scratch: towards automatically identifying, modeling, and naming visual attributes. ACM Int. Conf. on Multimedia, p.187-196.
[80]Zhang, H., Yuan, J., Gao, X., et al., 2014b. Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback. ACM Int. Conf. on Multimedia, p.953-956.
[81]Zhang, H., Shang, X., Luan, H., et al., 2016. Learning from collective intelligence: feature learning using social images and tags. ACM Trans. Multim. Comput. Commun. Appl., 13(1):1.
[82]Zhang, J., Wang, S., Huang, Q., 2015. Location-based parallel tag completion for geo-tagged social image retrieval. ACM Int. Conf. on Multimedia Retrieval, p.355-362.
[83]Zhu, Y., Zhang, C., Ré, C., et al., 2015. Building a large-scale multimodal knowledge base system for answering visual queries. arXiv:1507.05670.