CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2017-01-01
Yu-xin Peng, Wen-wu Zhu, Yao Zhao, Chang-sheng Xu, Qing-ming Huang, Han-qing Lu, Qing-hua Zheng, Tie-jun Huang, Wen Gao. Cross-media analysis and reasoning: advances and directions[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(1): 44-57.
@article{Peng2017CrossMedia,
title="Cross-media analysis and reasoning: advances and directions",
author="Yu-xin Peng, Wen-wu Zhu, Yao Zhao, Chang-sheng Xu, Qing-ming Huang, Han-qing Lu, Qing-hua Zheng, Tie-jun Huang, Wen Gao",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="18",
number="1",
pages="44-57",
year="2017",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601787"
}
%0 Journal Article
%T Cross-media analysis and reasoning: advances and directions
%A Yu-xin Peng
%A Wen-wu Zhu
%A Yao Zhao
%A Chang-sheng Xu
%A Qing-ming Huang
%A Han-qing Lu
%A Qing-hua Zheng
%A Tie-jun Huang
%A Wen Gao
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 1
%P 44-57
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1601787
TY - JOUR
T1 - Cross-media analysis and reasoning: advances and directions
A1 - Yu-xin Peng
A1 - Wen-wu Zhu
A1 - Yao Zhao
A1 - Chang-sheng Xu
A1 - Qing-ming Huang
A1 - Han-qing Lu
A1 - Qing-hua Zheng
A1 - Tie-jun Huang
A1 - Wen Gao
JO - Frontiers of Information Technology & Electronic Engineering
VL - 18
IS - 1
SP - 44
EP - 57
SN - 2095-9184
Y1 - 2017
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.1601787
ER -
Abstract: Cross-media analysis and reasoning is an active research area in computer science, and a promising direction for artificial intelligence. However, to the best of our knowledge, no existing work has summarized the state-of-the-art methods for cross-media analysis and reasoning or presented advances, challenges, and future directions for the field. To address these issues, we provide an overview as follows: (1) theory and model for cross-media uniform representation; (2) cross-media correlation understanding and deep mining; (3) cross-media knowledge graph construction and learning methodologies; (4) cross-media knowledge evolution and reasoning; (5) cross-media description and generation; (6) cross-media intelligent engines; and (7) cross-media intelligent applications. By presenting approaches, advances, and future directions in cross-media analysis and reasoning, our goal is not only to draw more attention to the state-of-the-art advances in the field, but also to provide technical insights by discussing the challenges and research directions in these areas.
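Topic (1), cross-media uniform representation, is classically grounded in canonical correlation analysis (Hotelling, 1936), which several of the references below extend (e.g., Andrew et al., 2013; Rasiwasia et al., 2014). The following minimal sketch is illustrative only and not the method proposed in the paper: it projects synthetic image and text features into a shared space with scikit-learn's CCA and ranks texts against an image query; the feature dimensions and random data are assumptions made here for the example.

# Illustrative sketch of cross-media uniform representation via CCA.
# Synthetic placeholder features; not the paper's actual pipeline.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_pairs = 500                                  # co-occurring image-text pairs
img_feats = rng.normal(size=(n_pairs, 128))    # e.g., CNN image descriptors
txt_feats = rng.normal(size=(n_pairs, 64))     # e.g., bag-of-words text vectors

# Learn projections that maximize correlation between the two modalities.
cca = CCA(n_components=10)
cca.fit(img_feats, txt_feats)
img_shared, txt_shared = cca.transform(img_feats, txt_feats)

# Cross-media retrieval: rank all texts by cosine similarity to one image query
# in the shared space.
def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

scores = cosine_sim(img_shared[:1], txt_shared)
ranking = np.argsort(-scores[0])
print("Top-5 retrieved text indices:", ranking[:5])

Deep variants surveyed in the paper replace these linear projections with neural networks, e.g., deep CCA (Andrew et al., 2013) and correspondence autoencoders (Feng et al., 2014).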
[1]Aamodt, A., Plaza, E., 1994. Case-based reasoning: foundational issues, methodological variations, and system approaches. AI Commun., 7(1):39-59.
[2]Adib, F., Hsu, C.Y., Mao, H., et al., 2015. Capturing the human figure through a wall. ACM Trans. Graph., 34(6):219.
[3]Andrew, G., Arora, R., Bilmes, J., et al., 2013. Deep canonical correlation analysis. Int. Conf. on Machine Learning, p.1247-1255.
[4]Antenucci, D., Li, E., Liu, S., et al., 2013. Ringtail: a generalized nowcasting system. Proc. VLDB Endow., 6(12): 1358-1361.
[5]Antol, S., Agrawal, A., Lu, J., et al., 2015. VQA: visual question answering. IEEE Int. Conf. on Computer Vision, p.2425-2433.
[6]Babenko, A., Slesarev, A., Chigorin, A., et al., 2014. Neural codes for image retrieval. European Conf. on Computer Vision, p.584-599.
[7]Brownson, R.C., Gurney, J.G., Land, G.H., 1999. Evidence-based decision making in public health. J. Publ. Health Manag. Pract., 5(5):86-97.
[8]Carlson, A., Betteridge, J., Kisiel, B., et al., 2010. Toward an architecture for never-ending language learning. AAAI Conf. on Artificial Intelligence, p.1306-1313.
[9]Chen, D.P., Weber, S.C., Constantinou, P.S., et al., 2007. Clinical arrays of laboratory measures, or “clinarrays”, built from an electronic health record enable disease subtyping by severity. AMIA Annual Symp. Proc., p.115-119.
[10]Chen, X., Shrivastava, A., Gupta, A., 2013. NEIL: extracting visual knowledge from web data. IEEE Int. Conf. on Computer Vision, p.1409-1416.
[11]Chen, Y., Carroll, R.J., Hinz, E.R.M., et al., 2013. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. J. Am. Med. Inform. Assoc., 20(e2):253-259.
[12]Cilibrasi, R.L., Vitanyi, P.M.B., 2007. The Google similarity distance. IEEE Trans. Knowl. Data Eng., 19(3):370-383.
[13]Culotta, A., 2014. Estimating county health statistics with twitter. ACM Conf. on Human Factors in Computing Systems, p.1335-1344.
[14]Daras, P., Manolopoulou, S., Axenopoulos, A., 2012. Search and retrieval of rich media objects supporting multiple multimodal queries. IEEE Trans. Multim., 14(3):734-746.
[15]Davenport, T.H., Prusak, L., 1998. Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press, Boston, p.5.
[16]Deng, J., Dong, W., Socher, R., et al., 2009. ImageNet: a large-scale hierarchical image database. IEEE Conf. on Computer Vision and Pattern Recognition, p.248-255.
[17]Dong, X., Gabrilovich, E., Heitz, G., et al., 2014. Knowledge vault: a Web-scale approach to probabilistic knowledge fusion. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, p.601-610.
[18]Fang, Q., Xu, C., Sang, J., et al., 2016. Folksonomy-based visual ontology construction and its applications. IEEE Trans. Multim., 18(4):702-713.
[19]Fellbaum, C., Miller, G., 1998. WordNet: an Electronic Lexical Database. MIT Press, Cambridge, MA.
[20]Feng, F., Wang, X., Li, R., 2014. Cross-modal retrieval with correspondence autoencoder. ACM Int. Conf. on Multimedia, p.7-16.
[21]Ferrucci, D., Levas, A., Bagchi, S., et al., 2013. Watson: beyond Jeopardy! Artif. Intell., 199-200:93-105.
[22]Fuentes-Pacheco, J., Ruiz-Ascencio, J., Rendón-Mancha, J.M., 2015. Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev., 43(1):55-81.
[23]Garfield, E., 2004. Historiographic mapping of knowledge domains literature. J. Inform. Sci., 30(2):119-145.
[24]Gibney, E., 2015. DeepMind algorithm beats people at classic video games. Nature, 518(7540):465-466.
[25]Ginsberg, J., Mohebbi, M., Patel, R.S., et al., 2009. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012-1014.
[26]Gong, Y., Ke, Q., Isard, M., et al., 2014. A multi-view embedding space for modeling internet images, tags, and their semantics. Int. J. Comput. Vis., 106(2):210-233.
[27]Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neur. Comput., 9(8):1735-1780.
[28]Hodosh, M., Young, P., Hockenmaier, J., 2013. Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Intell. Res., 47(1):853-899.
[29]Hotelling, H., 1936. Relations between two sets of variates. Biometrika, 28(3-4):321-377. https://doi.org/10.1093/biomet/28.3-4.321
[30]Hsu, F., 2002. Behind Deep Blue: Building the Computer that Defeated the World Chess Champion. Princeton University Press, Princeton, USA.
[31]Hua, Y., Wang, S., Liu, S., et al., 2014. TINA: cross-modal correlation learning by adaptive hierarchical semantic aggregation. IEEE Int. Conf. on Data Mining, p.190-199.
[32]Jia, X., Gavves, E., Fernando, B., et al., 2015. Guiding long-short term memory for image caption generation. arXiv:1509.04942.
[33]Johnson, J., Krishna, R., Stark, M., et al., 2015. Image retrieval using scene graphs. IEEE Conf. on Computer Vision and Pattern Recognition, p.3668-3678.
[34]Karpathy, A., Li, F.F., 2015. Deep visual-semantic alignments for generating image descriptions. IEEE Conf. on Computer Vision and Pattern Recognition, p.3128-3137.
[35]Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, p.1097-1105.
[36]Kulkarni, G., Premraj, V., Dhar, S., et al., 2011. Baby talk: understanding and generating simple image descriptions. IEEE Conf. on Computer Vision and Pattern Recognition, p.1601-1608.
[37]Kumar, S., Sanderford, M., Gray, V.E., et al., 2012. Evolutionary diagnosis method for variants in personal exomes. Nat. Meth., 9(9):855-856.
[38]Kuznetsova, P., Ordonez, V., Berg, T.L., et al., 2014. TREETALK: composition and compression of trees for image descriptions. Trans. Assoc. Comput. Ling., 2:351-362.
[39]Lazaric, A., 2012. Transfer in reinforcement learning: a frame-work and a survey. In: Wiering, M., van Otterlo, M. (Eds.), Reinforcement Learning: State-of-the-Art. Springer Berlin Heidelberg, Berlin, p.143-173.
[40]Lazer, D., Kennedy, R., King, G., et al., 2014. The parable of Google flu: traps in big data analysis. Science, 343(6176): 1203-1205.
[41]Lew, M.S., Sebe, N., Djeraba, C., et al., 2006. Content-based multimedia information retrieval: state of the art and challenges. ACM Trans. Multim. Comput. Commun. Appl., 2(1):1-19.
[42]Lin, T., Pantel, P., Gamon, M., et al., 2012. Active objects: actions for entity-centric search. ACM Int. Conf. on World Wide Web, p.589-598.
[43]Luo, G., Tang, C., 2008. On iterative intelligent medical search. ACM SIGIR Conf. on Research and Development in Information Retrieval, p.3-10.
[44]Mao, X., Lin, B., Cai, D., et al., 2013. Parallel field alignment for cross media retrieval. ACM Int. Conf. on Multimedia, p.897-906.
[45]McGurk, H., MacDonald, J., 1976. Hearing lips and seeing voices. Nature, 264(5588):746-748.
[46]MIT Technology Review, 2014. Data driven healthcare. https://www.technologyreview.com/business-report/data-driven-health-care/free [Dec. 06, 2016].
[47]Mnih, V., Kavukcuoglu, K., Silver, D., et al., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.
[48]Ngiam, J., Khosla, A., Kim, M., et al., 2011. Multimodal deep learning. Int. Conf. on Machine Learning, p.689-696.
[49]Ordonez, V., Kulkarni, G., Berg, T.L., 2011. Im2text: describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, p.1143-1151.
[50]Pan, Y.H., 2016. Heading toward artificial intelligence 2.0. Engineering, 2(4):409-413.
[51]Pearl, J., 2000. Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge, UK.
[52]Peng, Y., Huang, X., Qi, J., 2016a. Cross-media shared representation by hierarchical learning with multiple deep networks. Int. Joint Conf. on Artificial Intelligence, p.3846-3853.
[53]Peng, Y., Zhai, X., Zhao, Y., et al., 2016b. Semi-supervised cross-media feature learning with unified patch graph regularization. IEEE Trans. Circ. Syst. Video Technol., 26(3):583-596.
[54]Prabhu, N., Babu, R.V., 2015. Attribute-Graph: a graph based approach to image ranking. IEEE Int. Conf. on Computer Vision, p.1071-1079.
[55]Radinsky, K., Davidovich, S., Markovitch, S., 2012. Learning causality for news events prediction. Int. Conf. on World Wide Web, p.909-918.
[56]Rasiwasia, N., Costa Pereira, J., Coviello, E., et al., 2010. A new approach to cross-modal multimedia retrieval. ACM Int. Conf. on Multimedia, p.251-260.
[57]Rasiwasia, N., Mahajan, D., Mahadevan, V., et al., 2014. Cluster canonical correlation analysis. Int. Conf. on Artificial Intelligence and Statistics, p.823-831.
[58]Rautaray, S.S., Agrawal, A., 2015. Vision based hand gesture recognition for human computer interaction: a survey. Artif. Intell. Rev., 43(1):1-54.
[59]Roller, S., Schulte im Walde, S., 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. Conf. on Empirical Methods in Natural Language Processing, p.1146-1157.
[60]Sadeghi, F., Divvala, S.K., Farhadi, A., 2015. VisKE: visual knowledge extraction and question answering by visual verification of relation phrases. IEEE Conf. on Computer Vision and Pattern Recognition, p.1456-1464.
[61]Singhal, A., 2012. Introducing the knowledge graph: things, not strings. Official Blog of Google.
[62]Socher, R., Lin, C., Ng, A.Y., et al., 2011. Parsing natural scenes and natural language with recursive neural networks. Int. Conf. on Machine Learning, p.129-136.
[63]Socher, R., Karpathy, A., Le, Q., et al., 2014. Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Ling., 2:207-218.
[64]Srivastava, N., Salakhutdinov, R., 2012. Multimodal learning with deep Boltzmann machines. Advances in Neural Information Processing Systems, p.2222-2230.
[65]Suchanek, F., Weikum, G., 2014. Knowledge bases in the age of big data analytics. Proc. VLDB Endow., 7(13):1713-1714.
[66]Uyar, A., Aliyu, F.M., 2015. Evaluating search features of Google Knowledge Graph and Bing Satori: entity types, list searches and query interfaces. Onl. Inform. Rev., 39(2):197-213.
[67]Vinyals, O., Toshev, A., Bengio, S., et al., 2015. Show and tell: a neural image caption generator. IEEE Conf. on Computer Vision and Pattern Recognition, p.3156-3164.
[68]Wang, D., Cui, P., Ou, M., et al., 2015. Learning compact hash codes for multimodal representations using orthogonal deep structure. IEEE Trans. Multim., 17(9): 1404-1416.
[69]Wang, W., Ooi, B.C., Yang, X., et al., 2014. Effective multi-modal retrieval based on stacked auto-encoders. Proc. VLDB Endow., 7(8):649-660.
[70]Wang, Y., Wu, F., Song, J., et al., 2014. Multi-modal mutual topic reinforce modeling for cross-media retrieval. ACM Int. Conf. on Multimedia, p.307-316.
[71]Wei, Y., Zhao, Y., Lu, C., et al., 2017. Cross-modal retrieval with CNN visual features: a new baseline. IEEE Trans. Cybern., 47(2):449-460.
[72]Wu, W., Xu, J., Li, H., 2010. Learning similarity function between objects in heterogeneous spaces. Technical Report MSR-TR-2010-86, Microsoft.
[73]Xu, K., Ba, J., Kiros, R., et al., 2015. Show, attend and tell: neural image caption generation with visual attention. Int. Conf. on Machine Learning, p.2048-2057.
[74]Yang, Y., Zhuang, Y., Wu, F., et al., 2008. Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. IEEE Trans. Multim., 10(3):437-446.
[75]Yang, Y., Teo, C.L., Daume, H., et al., 2011. Corpus-guided sentence generation of natural images. Conf. on Empirical Methods in Natural Language Processing, p.444-454.
[76]Yang, Y., Nie, F., Xu, D., et al., 2012. A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans. Patt. Anal. Mach. Intell., 34(4):723-742.
[77]Yuan, L., Pan, C., Ji, S., et al., 2014. Automated annotation of developmental stages of Drosophila embryos in images containing spatial patterns of expression. Bioinformatics, 30(2):266-273.
[78]Zhai, X., Peng, Y., Xiao, J., 2014. Learning cross-media joint representation with sparse and semi-supervised regularization. IEEE Trans. Circ. Syst. Video Technol., 24(6):965-978.
[79]Zhang, H., Yang, Y., Luan, H., et al., 2014a. Start from scratch: towards automatically identifying, modeling, and naming visual attributes. ACM Int. Conf. on Multimedia, p.187-196.
[80]Zhang, H., Yuan, J., Gao, X., et al., 2014b. Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback. ACM Int. Conf. on Multimedia, p.953-956.
[81]Zhang, H., Shang, X., Luan, H., et al., 2016. Learning from collective intelligence: feature learning using social images and tags. ACM Trans. Multim. Comput. Commun. Appl., 13(1):1.
[82]Zhang, J., Wang, S., Huang, Q., 2015. Location-based parallel tag completion for geo-tagged social image retrieval. ACM Int. Conf. on Multimedia Retrieval, p.355-362.
[83]Zhu, Y., Zhang, C., Ré, C., et al., 2015. Building a large-scale multimodal knowledge base system for answering visual queries. arXiv:1507.05670.