A contribution of the Visual Analytics Research Group of the TIB – Leibniz Information Center for Science and Technology received the "Best Paper Award" at this year's ACM International Conference on Multimedia Retrieval (ICMR), which took place from 10 to 13 June 2019 in Ottawa (Canada). A total of 84 papers were submitted for review in the "Full Paper" category, of which 26 were accepted for presentation at the conference.
Christian Otto, Matthias Springstein and Ralph Ewerth (all TIB) as well as Avishek Anand (L3S Research Center and Assistant Professor at the Faculty of Electrical Engineering and Computer Science of Leibniz Universität Hannover) show in the article "Understanding, Categorizing and Predicting Semantic Image-Text Relations" how the relationship between visual and related textual information can be described formally.
In the paper, the current state of the art on image-text relations is extended by a further dimension. So far, image-text combinations have been characterized using two metrics: "Cross-modal Mutual Information" (CMI) ("How many objects/persons do image and text have in common?") and "Semantic Correlation" (SC) ("How much interpretation and context do image and text share?"). The winning paper adds a third dimension: the status relation of image and text. This relation describes whether both modalities – text and image – are equally important in conveying information, or whether one of them plays a superior role.
It is further shown how these three metrics can be used to derive a categorization of semantic image-text classes that allows image-text pairs to be (automatically) classified according to their type. The authors worked in an interdisciplinary manner, taking up research results from the communication sciences and transferring them to the field of multimedia information retrieval.
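The idea of deriving classes from the three metrics can be sketched as follows. This is a minimal, hypothetical illustration, not the taxonomy or thresholds from the paper: the metric ranges, the status values, and the class labels ("uncorrelated", "complementary", "anchoring", "illustration") are assumptions chosen only to show the mechanism.

```python
from dataclasses import dataclass

@dataclass
class ImageTextPair:
    cmi: float   # Cross-modal Mutual Information; assumed range 0.0 (no shared entities) to 1.0
    sc: float    # Semantic Correlation; assumed range -1.0 (contradictory) to 1.0
    status: str  # status relation: "equal", "image_dominant", or "text_dominant" (assumed labels)

def categorize(pair: ImageTextPair) -> str:
    """Map the three metric values to a coarse image-text class (illustrative only)."""
    # Hypothetical rule: no entity overlap and no semantic link -> unrelated pair
    if pair.cmi < 0.2 and abs(pair.sc) < 0.2:
        return "uncorrelated"
    # Both modalities carry equal weight in conveying the information
    if pair.status == "equal":
        return "complementary"
    # One modality dominates: text fixing the image's meaning vs. image illustrating the text
    return "anchoring" if pair.status == "text_dominant" else "illustration"
```

For example, a decorative stock photo next to an unrelated paragraph would score low on both CMI and SC and fall into the "uncorrelated" class, while a labeled diagram and its explanatory caption would be "complementary".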
The authors present a system based on deep neural networks ("deep learning") that can automatically determine these image-text metrics and classes. To train such networks and to support future research, an (almost completely) automatically generated dataset is made publicly available.
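Such a system's final classification step can be sketched in a few lines. This is not the authors' architecture: the embedding dimensions, the number of classes, the random weights, and the simple concatenate-and-score fusion are all assumptions; a real system would use trained deep networks for both the embeddings and the classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM, TXT_DIM, N_CLASSES = 512, 300, 8  # hypothetical sizes

# Untrained placeholder weights for a single linear classification layer
W = rng.standard_normal((IMG_DIM + TXT_DIM, N_CLASSES)) * 0.01
b = np.zeros(N_CLASSES)

def classify(img_emb: np.ndarray, txt_emb: np.ndarray) -> int:
    """Fuse the two modality embeddings by concatenation and pick the
    highest-scoring image-text class (illustrative sketch only)."""
    fused = np.concatenate([img_emb, txt_emb])
    logits = fused @ W + b
    return int(np.argmax(logits))

# Stand-ins for embeddings that would come from image and text encoders
img = rng.standard_normal(IMG_DIM)
txt = rng.standard_normal(TXT_DIM)
pred = classify(img, txt)
```

The design point is simply that both modalities enter a joint representation before classification, which is what allows the status relation between them, and not just each modality in isolation, to influence the predicted class.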
Applications of this work can be found, for example, in web-based learning or in schools, where user-specific or topic-specific content can be filtered or sorted by relevance. Beyond that, the results can potentially be applied to many different tasks involving multimodal information (generation of image descriptions, automatic question answering, search engines, etc.), as they provide a deeper insight into the interplay of image and text from a computer science perspective.