Abstract
This paper introduces an intelligent multimedia information system that exploits machine learning and database technologies. The system automatically extracts the semantic contents of videos from the visual, auditory, and textual modalities, and then stores the extracted contents in an appropriate format so that they can be retrieved efficiently in subsequent requests for information. The semantic contents are first extracted from each of the three modalities separately. The outputs of the modalities are then fused to increase the accuracy of the object extraction process. The semantic contents extracted through information fusion are stored in an intelligent fuzzy object-oriented database system. To answer user queries efficiently, a multidimensional indexing mechanism is developed that combines the extracted high-level semantic information with the low-level video features. The proposed system is implemented as a prototype, and its performance is evaluated on news video datasets for content-based and concept-based queries over each modality and over the fused data. The performance results show that the developed multimedia information system is robust and scalable for large-scale multimedia applications.
















Acknowledgements
This work is supported by a research grant from TUBITAK under grant number “MFAG-114R082”. We thank all previous researchers of the Multimedia DB Lab. at METU and Ahmet Cosar, who have contributed to this research.
Appendix: Example Screenshots of Developed System
Example screen for the semantic concept extractor: a given video is divided into shots (upper-left table). For the selected shot (shot 18), four keyframes are detected (second table). The selected keyframe is segmented, and the objects in the segmented parts are recognized (image). One of the segmented parts is selected and marked in red. The semantic content extractor identifies this object as a football player with a score of 1.0 (shown in the table under the image).
Example screen for a query-by-example (QBE): an image (shown at the lower part of the screen) is given as an example, and videos containing similar images are queried. The video shot shown at the top is one of the answers returned by the system. The query image shows a car accident, and the answer image shows a car crashing into a shop.
An example multimodal query combining the visual, audio, and text modalities: this query searches for tennis-related video shots that contain a tennis court and tennis players in the visual modality, applause and crowd events in the audio modality, and the tennis player Federer in the text modality. The matching video shots are displayed in decreasing order of their matching scores. The best-matched result is shown at the top of the screen capture.
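The ranking step in the multimodal query caption above (per-modality matches combined into one matching score, results listed in decreasing order) can be illustrated with a minimal sketch. It assumes a simple weighted-average late fusion; the fusion rule used by the system itself is not specified in this section, and the names `ShotScores`, `fuse`, and `rank_shots`, the weights, and the example scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Mapping, Tuple


@dataclass(frozen=True)
class ShotScores:
    """Per-modality match scores in [0, 1] for one video shot (hypothetical layout)."""
    shot_id: str
    visual: float
    audio: float
    text: float


def fuse(scores: ShotScores, weights: Mapping[str, float]) -> float:
    """Weighted average of the three modality scores (an assumed late-fusion rule)."""
    total = weights["visual"] + weights["audio"] + weights["text"]
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = (
        weights["visual"] * scores.visual
        + weights["audio"] * scores.audio
        + weights["text"] * scores.text
    )
    return weighted / total


def rank_shots(
    shots: Iterable[ShotScores], weights: Mapping[str, float]
) -> List[Tuple[str, float]]:
    """Return (shot_id, fused_score) pairs sorted by fused score, best first."""
    fused = [(s.shot_id, fuse(s, weights)) for s in shots]
    return sorted(fused, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative scores only; these are not results from the paper.
    candidates = [
        ShotScores("shot_a", visual=0.9, audio=0.8, text=1.0),
        ShotScores("shot_b", visual=0.5, audio=0.2, text=0.0),
        ShotScores("shot_c", visual=0.7, audio=0.9, text=0.5),
    ]
    equal_weights = {"visual": 1.0, "audio": 1.0, "text": 1.0}
    for shot_id, score in rank_shots(candidates, equal_weights):
        print(f"{shot_id}: {score:.3f}")
```

Keeping the weights as an explicit argument makes it easy to see how the ordering changes when one modality is trusted more than the others, for example when the text modality is given a larger weight for a named-person query.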
Cite this article
Yazici, A., Koyuncu, M., Yilmaz, T. et al. An intelligent multimedia information system for multimodal content extraction and querying. Multimed Tools Appl 77, 2225–2260 (2018). https://doi.org/10.1007/s11042-017-4378-6
DOI: https://doi.org/10.1007/s11042-017-4378-6