Modeling Audio Fingerprints

Structure, Distortion, Capacity

Abstract

An audio fingerprint is a compact, low-level representation of a multimedia signal that can be used to identify audio files or fragments reliably. The use of audio fingerprints for identification consists of two phases. In the enrollment phase, known content is fingerprinted and ingested into a database together with all relevant metadata. In the identification phase, unknown audio content is fingerprinted, and the resulting fingerprint forms the query to the database. The query fingerprint is compared to the fingerprints in the database; if a similar fingerprint is found, the metadata corresponding to that fingerprint is returned.

In this thesis we develop models for audio fingerprints. The emphasis is on fingerprint extraction and the properties of the fingerprint, not on matching the query fingerprint against the fingerprints in the database or on the actual identification. We also do not develop new practical fingerprinting algorithms. There is a wide variety of applications for audio fingerprinting, including broadcast monitoring, audience measurement, forensic applications, blacklisting of unauthorized content, 'name that tune' services, and linking of special offers to television or radio commercials.

Content that uses the same recorded source material, but is in a different representation or distorted in different ways, will generate similar audio fingerprints. This distinguishes audio fingerprints from hashes and from content-based retrieval. The hash of an audio file changes when a single sample changes, so two perceptually equal audio items can have completely different hash values while still generating similar fingerprints. Content-based retrieval looks for audio items that share a similar concept, such as genre, artist or style, while fingerprinting looks for reuse of the recorded content. Of course, the exact requirements for a fingerprinting system strongly depend on the application.
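
The contrast between a hash and a fingerprint can be illustrated with a minimal sketch. The fingerprint used here is a hypothetical toy (one bit per pair of adjacent frames, set when the frame energy rises); it is not the PRH or any algorithm from the thesis, and the test signal, frame size and function names are all assumptions made for the illustration.

```python
import hashlib
import math
import struct

def pcm_bytes(samples):
    """Pack floats in [-1, 1] as 16-bit little-endian PCM."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)

def toy_fingerprint(samples, frame=256):
    """Hypothetical toy fingerprint: bit is 1 when the energy of a frame
    exceeds the energy of the previous frame."""
    energies = [sum(x * x for x in samples[i:i + frame])
                for i in range(0, len(samples) - frame + 1, frame)]
    return [int(b > a) for a, b in zip(energies, energies[1:])]

def hamming(a, b):
    """Number of positions in which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

# Amplitude-modulated tone standing in for recorded content (assumed signal).
FS = 8000
original = [
    math.sin(2 * math.pi * 440 * n / FS)
    * (0.5 + 0.4 * math.sin(2 * math.pi * 3 * n / FS))
    for n in range(8192)
]

# Same content with a single sample changed slightly.
modified = list(original)
modified[1000] += 0.001

hash_a = hashlib.sha256(pcm_bytes(original)).hexdigest()
hash_b = hashlib.sha256(pcm_bytes(modified)).hexdigest()
fp_a = toy_fingerprint(original)
fp_b = toy_fingerprint(modified)

print("hashes equal:      ", hash_a == hash_b)
print("fingerprint distance:", hamming(fp_a, fp_b), "of", len(fp_a), "bits")
```

The cryptographic hash reacts to the single changed sample, while the coarse energy-based bits do not. A practical fingerprint must in addition be discriminative between different recordings, which this toy makes no attempt to achieve.
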
Relevant aspects for the topics discussed in this thesis are robustness, uniqueness, accuracy (notably the False Acceptance Rate and False Rejection Rate), granularity, and the size of the fingerprints.

In this thesis we make three contributions in the form of models. First, we model the structure of a particular type of audio fingerprint, the Philips Robust Hash (PRH). The PRH extracts a series of spectral-energy-related features from the audio signal, which are represented efficiently but coarsely as a binary time series. The time series captures the temporal and spectral dynamics of the audio signal, and has a very particular structure that depends mainly on a limited number of parameters in the fingerprint extraction. The model describes the structure of the PRH as a function of these parameters. It can be used to better understand, and potentially optimize, the fingerprinting system. We experimentally verify the model on synthetic Gaussian iid data and conclude that the model captures the structure of the PRH fingerprint well. This analysis was reformulated and extended by Balado, Hurley, McCarthy and Silvestre.

Second, we observe that distortions in the audio, which affect the quality of the audio signal, are reflected in changes in the corresponding fingerprint. The idea is to estimate the amount of distortion in the audio signal by comparing its fingerprint to a reference fingerprint extracted from a high-quality copy of the same audio. In this way one could extend the functionality of a fingerprinting system. We implement a number of algorithms from the literature, compare their behaviour, and observe that the distance between corresponding fingerprints behaves similarly under compression. We model the effect of particular distortions in the audio, due to compression or additive white noise, on the difference introduced in the PRH fingerprints.
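
The extraction and the distortion comparison described above can be sketched as follows. This is a simplified PRH-style extractor, not the reference implementation: the frame length, hop size, band layout and sampling rate are assumptions taken from commonly cited descriptions of the PRH (roughly 0.37 s frames with 31/32 overlap and 33 logarithmically spaced bands between 300 and 2000 Hz), and the function names are hypothetical. Each bit is the sign of the energy difference between adjacent bands, differenced once more over consecutive frames.

```python
import numpy as np

def prh_like_fingerprint(x, frame_len=2048, hop=64, n_bands=33,
                         f_lo=300.0, f_hi=2000.0, fs=5512.5):
    """PRH-style binary fingerprint (simplified sketch, assumed parameters).

    Returns an array of shape (n_frames - 1, n_bands - 1) with entries 0/1.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)

    # Energy per frame in logarithmically spaced, non-overlapping bands.
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    energy = np.stack([power[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
                       for lo, hi in zip(edges[:-1], edges[1:])], axis=1)

    band_diff = energy[:, :-1] - energy[:, 1:]      # difference over frequency
    time_diff = band_diff[1:] - band_diff[:-1]      # difference over time
    return (time_diff > 0).astype(np.uint8)

def add_white_noise(x, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio."""
    noise_power = np.mean(x ** 2) / 10 ** (snr_db / 10)
    return x + rng.standard_normal(len(x)) * np.sqrt(noise_power)

def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits between two equal-shape fingerprints."""
    return float(np.mean(fp_a != fp_b))

rng = np.random.default_rng(0)
clean = rng.standard_normal(5 * 5512)   # about 5 s of synthetic Gaussian iid data
fp_clean = prh_like_fingerprint(clean)

# Distance to the reference fingerprint at a few noise levels.
ber = {snr: bit_error_rate(fp_clean,
                           prh_like_fingerprint(add_white_noise(clean, snr, rng)))
       for snr in (30, 10, 0)}
for snr, value in ber.items():
    print(f"SNR {snr:>2} dB -> bit error rate {value:.3f}")
```

Under these assumptions the bit error rate grows as the SNR drops, which is the qualitative behaviour that makes the fingerprint distance usable as a distortion estimate. The sketch only measures this empirically; it does not reproduce the models developed in the thesis.
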
The main result of our modeling effort is a closed-form relation between the Signal-to-Noise Ratio (SNR) and the average fingerprint distance for PRH audio fingerprints of independent, identically distributed (iid) signals. We also verify the developed models experimentally. The model fits perfectly for synthetic signals, and captures the behaviour observed in a wider variety of fingerprinting algorithms on actual music.

Third, we consider an information-theoretical framework developed by Westover and O'Sullivan (WOS). The main question is 'how many signals can be identified by a fingerprinting system under certain conditions?'. The conditions relate to characteristics of the fingerprint (its size and representation) and characteristics of the environment in which the system operates (the representation and statistical characteristics of the signals that need to be identified, and how much distortion is allowed). We use the results of the model developed for the PRH fingerprint to estimate up to how many signals can be identified with a binary fingerprint like the PRH. Finally, we check whether the changes in the fingerprints that we observe in practice due to distortions in the audio signals, and that have been modeled in this thesis, fit within the information-theoretical framework of the WOS model. We outline the differences between the WOS model and practical implementations.

We finish with a list of recommendations: to extend the models to jointly consider distortion and uniqueness characteristics; to take more distortion types into account and to extend to images and video; to develop an evaluation framework for audio fingerprinting; to integrate psycho-acoustics; and to develop a theoretical framework for comparing specific algorithms to the capacity bound.
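
To give a feel for how fingerprint size and allowed distortion limit the number of identifiable signals, the following back-of-envelope sketch may help. It is not the WOS capacity analysis. It assumes that unrelated fingerprints consist of independent, equiprobable bits (an idealization; the structure of the PRH described above suggests real bits are not independent), that a match is declared when the Hamming distance is at most a threshold, and that a union bound over the database is acceptable. All function names and numbers are hypothetical.

```python
from math import comb

def false_accept_prob(n_bits, threshold):
    """P(Hamming distance <= threshold) between two independent fingerprints
    of n_bits iid fair bits (idealized assumption)."""
    favourable = sum(comb(n_bits, k) for k in range(threshold + 1))
    return favourable / 2 ** n_bits

def max_entries(n_bits, threshold, target_far):
    """Largest database size for which the union bound on the overall
    false acceptance probability stays at or below target_far."""
    return int(target_far / false_accept_prob(n_bits, threshold))

# Example: 256-bit fingerprint block, accepting up to 25% differing bits.
n_bits, threshold = 256, 64
p = false_accept_prob(n_bits, threshold)
print(f"pairwise false acceptance probability: {p:.3e}")
print(f"entries supported at overall FAR 1e-6: {max_entries(n_bits, threshold, 1e-6)}")
```

Raising the threshold tolerates more distortion but increases the pairwise false acceptance probability, and so reduces the number of entries that can be distinguished. A capacity analysis makes this trade-off precise under explicit signal and distortion models.
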