Advancing the Relevance Criteria for Video Search and Visual Summarization

Abstract

To help users find relevant information in ever-growing multimedia collections, a number of multimedia information retrieval solutions have been proposed in recent years. The essential element of any such solution is the relevance criterion deployed to select or rank the items from a multimedia collection that are presented to the user. Because computational approaches cannot interpret multimedia items and their semantic relations in the same way as humans, the research community has mainly focused on relevance criteria that modern computers can handle, e.g., finding images or videos depicting a particular object, setting or event. In practice, however, user information needs are often specified at a higher semantic (abstraction) level, which creates a strong need for multimedia information retrieval mechanisms operating with more complex relevance criteria, such as those referring to the topicality, aesthetic appeal and sentiment of multimedia items. By considering the practical use cases associated with different types of multimedia collections, in this thesis we investigate the possibilities of enabling video search and visual summarization based on relevance criteria defined at a higher semantic level.

We first address the problem of video search at the level of semantic theme (general topic, subject matter) in the setting of an unlabeled professional video collection. For this purpose, we propose a retrieval framework based on the query performance prediction principle that makes use of the noisy output of automatic speech recognition and visual concept detection. We demonstrate that valuable information about the semantic theme of a video can be automatically extracted from both its spoken content and its visual channel, which makes effective retrieval within the proposed framework possible despite the presence of noise and the absence of suitable annotations.

The focus of the thesis then moves to the problem of visual summarization in information-rich social media environments. We first investigate the possibilities for improved computation of semantic similarities between images through a multimodal integration of resources, ranging from image content and the associated social annotations to information derived from an analysis of the social network in which the images are contextualized. Building on the outcomes of this investigation, and inspired by the prospect of using social media in tourist applications, we then propose an approach to the automatic creation of visual summaries composed of community-contributed images and depicting various aspects of a selected geographic area. Although the proposed visual summarization approach proves effective in yielding good coverage of a targeted geographic area, like most approaches presented in related work it suffers from the drawback that the user's judgment about the suitability of an image for the visual summary is not directly incorporated into the summarization algorithm. This observation inspires what is probably the most daring research question addressed in the thesis: is it possible to learn to automatically identify the images that humans would select if asked to create a visual summary?
We give a positive answer to this question and present an image selection approach that makes use of reference visual summaries obtained through crowdsourcing, together with a versatile image representation that goes beyond the analysis of image content and context to incorporate an analysis of the images' aesthetic appeal and the sentiment they evoke in users. Finally, we address the problem of automatically evaluating the quality of visual summaries, and of image sets in general, first by using image metadata only and then based on human-created references.

In conclusion, with this thesis we believe we have pushed the boundaries of the relevance criteria that can be deployed in automated multimedia information retrieval systems by demonstrating that video search and visual summarization can be performed at a higher semantic level. We also show, however, that the effective deployment of advanced relevance criteria requires innovative and unconventional multimedia representations for improved capturing of semantic similarities between multimedia items. Additionally, we demonstrate that properly addressing user information needs often requires a much more complex mix of relevance criteria than commonly assumed, and we prove that learning their interplay is possible. Finally, we point out that social media analysis and emerging technologies such as crowdsourcing show great promise for better understanding and automatically modeling actual user information needs and the way users interpret and interact with multimedia.