J.H. Kim | TU Delft Repository

Still Making Noise

Improving Deep-Learning-Based Side-Channel Analysis

Journal article (2025) - Jaehun Kim, Stjepan Picek, Annelie Heuser, Shivam Bhasin, Alan Hanjalic

Editor’s notes: Side-channel attacks have been undermining cryptosystems for almost three decades. Advances in machine learning techniques have shown great promise in improving the performance and efficiency of side-channel attacks, even on systems with countermeasures. This article provides a systematic approach to applying ML techniques for side-channel attacks. ...

The power of deep without going deep? A study of HDPGMM music representation learning

Conference paper (2022) - Jaehun Kim, C.C.S. Liem

In the previous decade, Deep Learning (DL) has proven to be one of the most effective machine learning methods to tackle a wide range of Music Information Retrieval (MIR) tasks. It offers highly expressive learning capacity that can fit any music representation needed for MIR-relevant downstream tasks. However, it has been criticized for sacrificing interpretability. On the other hand, the Bayesian nonparametric (BN) approach promises similar positive properties as DL, such as high flexibility, while being robust to overfitting and preserving interpretability. Therefore, the primary motivation of this work is to explore the potential of Bayesian nonparametric models in comparison to DL models for music representation learning. More specifically, we assess the music representation learned from the Hierarchical Dirichlet Process Gaussian Mixture Model (HDPGMM), an infinite mixture model based on the Bayesian nonparametric approach, on MIR tasks, including classification, auto-tagging, and recommendation. The experimental results suggest that the HDPGMM music representation can outperform DL representations in certain scenarios and is comparable overall. ...

Increasing trust in complex machine learning systems

Studies in the music domain

Doctoral thesis (2021) - Jaehun Kim

Machine learning (ML) has become a core technology for many real-world applications. Modern ML models are applied to unprecedentedly complex and difficult challenges, including very large and subjective problems. For instance, applications towards multimedia understanding have been advanced substantially. Here, it is already prevalent that cultural/artistic objects such as music and videos are analyzed and served to users according to their preference, enabled through ML techniques. One of the most recent breakthroughs in ML is Deep Learning (DL), which has been widely adopted to tackle such complex problems. DL allows for higher learning capacity, making end-to-end learning possible, which reduces the need for substantial engineering effort, while achieving high effectiveness. At the same time, this also makes DL models more complex than conventional ML models. Reports in several domains indicate that such more complex ML models may have potentially critical hidden problems: various biases embedded in the training data can emerge in the prediction, and extremely sensitive models can make unaccountable mistakes. Furthermore, the black-box nature of the DL models hinders the interpretation of the mechanisms behind them. Such unexpected drawbacks have a significant impact on the trustworthiness of the systems in which the ML models serve as the core component. In this thesis, a series of studies investigates aspects of trustworthiness for complex ML applications, namely reliability and explainability. Specifically, we focus on music as the primary domain of interest, considering its complexity and subjectivity. Due to this nature of music, ML models for music are necessarily complex for achieving meaningful effectiveness. As such, the reliability and explainability of music ML models are crucial in the field. ...

Generative autoregressive networks for 3d dancing move synthesis from music

Journal article (2020) - Hyemin Ahn, Jaehun Kim, Kihyun Kim, Songhwai Oh

This letter proposes a framework that can generate a sequence of three-dimensional human dance poses for a given piece of music. The proposed framework consists of three components: a music feature encoder, a pose generator, and a music genre classifier. We focus on integrating these components to generate realistic 3D human dance moves from music, which can be applied to artificial agents and humanoid robots. The trained dance pose generator, which is a generative autoregressive model, is able to synthesize a dance sequence longer than 1,000 pose frames. Experimental results of generated dance sequences from various songs show how the proposed method generates human-like dance moves for a given piece of music. In addition, a generated 3D dance sequence is applied to a humanoid robot, showing that the proposed framework can make a robot dance just by listening to music. ...

“Butter lyrics over hominy grit”

Comparing audio and psychology-based text features in MIR tasks

Conference paper (2020) - Jaehun Kim, Andrew M. Demetriou, Sandy Manolios, M. Stella Tavella, Cynthia C.S. Liem

Psychology research has shown that song lyrics are a rich source of data, yet they are often overlooked in the field of MIR compared to audio. In this paper, we provide an initial assessment of the usefulness of features drawn from lyrics for various fields, such as MIR and Music Psychology. To do so, we assess the performance of lyric-based text features on three MIR tasks, in comparison to audio features. Specifically, we draw sets of text features from the fields of Natural Language Processing and Psychology. Further, we estimate their effect on performance while statistically controlling for the effect of audio features, by using a hierarchical regression model. Lyric-based features show a small but statistically significant effect, which warrants further research. Implications and directions for future studies are discussed. ...

Make Some Noise

Unleashing the Power of Convolutional Neural Networks for Profiled Side-channel Analysis

Journal article (2019) - J.H. Kim, Stjepan Picek, Annelie Heuser, Shivam Bhasin, Alan Hanjalic

Profiled side-channel analysis based on deep learning, and more precisely Convolutional Neural Networks, is a paradigm showing significant potential. The results, although scarce for now, suggest that such techniques are even able to break cryptographic implementations protected with countermeasures. In this paper, we start by proposing a new Convolutional Neural Network instance able to reach high performance for a number of considered datasets. We compare our neural network with one designed for a particular dataset with a masking countermeasure, and we show that both are good designs but also that neither can be considered superior to the other. Next, we address how the addition of artificial noise to the input signal can actually be beneficial to the performance of the neural network. Such noise addition is equivalent to the regularization term in the objective function. By using this technique, we are able to reduce the number of measurements needed to reveal the secret key by orders of magnitude for both neural networks. Our new convolutional neural network instance with added noise is able to break the implementation protected with the random delay countermeasure by using only 3 traces in the attack phase. To further strengthen our experimental results, we investigate the performance with a varying number of training samples, noise levels, and epochs. Our findings show that adding noise is beneficial throughout all training set sizes and epochs. ...
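
The idea of adding artificial noise to the input signal can be illustrated with a minimal sketch. This is not the procedure from the article: the function name `add_gaussian_noise`, the toy trace values, and the noise level `sigma` are assumptions made purely for illustration.

```python
import random


def add_gaussian_noise(traces, sigma, seed=None):
    """Return a copy of `traces` with zero-mean Gaussian noise added to every sample.

    `traces` is a list of measurements, each a list of floats.
    `sigma` is the standard deviation of the added noise (hypothetical value).
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, sigma) for x in trace] for trace in traces]


# Toy "traces": two measurements with four samples each (made-up values).
traces = [[0.1, 0.4, 0.3, 0.2], [0.0, 0.5, 0.2, 0.1]]

# Noisy copy that would be fed to the network in place of the clean traces.
noisy = add_gaussian_noise(traces, sigma=0.05, seed=0)
```

In a training loop, a fresh noisy copy would typically be drawn for every epoch, so the network never sees exactly the same input twice; the original traces stay untouched.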

One deep music representation to rule them all? A comparative analysis of different representation learning strategies

Journal article (2019) - Jaehun Kim, Julián Urbano, Cynthia C.S. Liem, Alan Hanjalic

Inspired by the success of deploying deep learning in the fields of Computer Vision and Natural Language Processing, this learning paradigm has also found its way into the field of Music Information Retrieval. In order to benefit from deep learning in an effective, but also efficient manner, deep transfer learning has become a common approach. In this approach, it is possible to reuse the output of a pre-trained neural network as the basis for a new learning task. The underlying hypothesis is that if the initial and new learning tasks show commonalities and are applied to the same type of input data (e.g., music audio), the generated deep representation of the data is also informative for the new task. Since, however, most of the networks used to generate deep representations are trained using a single initial learning source, their representation is unlikely to be informative for all possible future tasks. In this paper, we present the results of our investigation into the most important factors for generating deep representations for the data and learning tasks in the music domain. We conducted this investigation via an extensive empirical study that involves multiple learning sources, as well as multiple deep learning architectures with varying levels of information sharing between sources, in order to learn music representations. We then validate these representations considering multiple target datasets for evaluation. The results of our experiments yield several insights into how to approach the design of methods for learning widely deployable deep data representations in the music domain. ...

Are Nearby Neighbors Relatives?

Testing Deep Music Embeddings

Journal article (2019) - Jaehun Kim, Julián Urbano, Cynthia C.S. Liem, Alan Hanjalic

Deep neural networks have frequently been used to directly learn representations useful for a given task from raw input data. In terms of overall performance metrics, machine learning solutions employing deep representations frequently have been reported to greatly outperform those using hand-crafted feature representations. At the same time, they may pick up on aspects that are predominant in the data, yet not actually meaningful or interpretable. In this paper, we therefore propose a systematic way to test the trustworthiness of deep music representations, considering musical semantics. The underlying assumption is that if a deep representation is to be trusted, distance consistency between known related points should be maintained both in the input audio space and in the corresponding latent deep space. We generate known related points through semantically meaningful transformations, considering both imperceptible and more severe transformations. Then, we examine within- and between-space distance consistencies, considering both audio space and latent embedded space, the latter being the result of either a conventional feature extractor or a deep encoder. We illustrate how our method, as a complement to task-specific performance, provides interpretable insight into what a network may have captured from training data signals. ...
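
One way to picture a distance-consistency check between the input space and a latent space is sketched below. The measure used here (the fraction of variant pairs whose distance ordering agrees in both spaces) and the names `euclidean` and `rank_agreement` are assumptions chosen for illustration; they are not claimed to be the measures used in the article, and the vectors are made-up toy data.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_agreement(ref_in, variants_in, ref_lat, variants_lat):
    """Fraction of variant pairs ordered the same way in both spaces.

    For every pair of transformed variants, compare which one is closer
    to the reference in the input space and in the latent space.
    """
    d_in = [euclidean(ref_in, v) for v in variants_in]
    d_lat = [euclidean(ref_lat, v) for v in variants_lat]
    pairs = list(combinations(range(len(d_in)), 2))
    agree = sum((d_in[i] < d_in[j]) == (d_lat[i] < d_lat[j]) for i, j in pairs)
    return agree / len(pairs)


# Toy data: a reference point and three increasingly strong transformations.
ref_in, variants_in = [0.0, 0.0], [[0.1, 0.0], [0.5, 0.0], [1.0, 0.0]]
ref_lat = [0.0, 0.0, 0.0]
consistent_lat = [[0.2, 0.0, 0.0], [0.4, 0.0, 0.0], [0.9, 0.0, 0.0]]
inverted_lat = [[0.9, 0.0, 0.0], [0.4, 0.0, 0.0], [0.2, 0.0, 0.0]]

score_consistent = rank_agreement(ref_in, variants_in, ref_lat, consistent_lat)
score_inverted = rank_agreement(ref_in, variants_in, ref_lat, inverted_lat)
```

A score near 1 means stronger transformations in the input space also move points further away in the latent space; a score near 0 means the latent space reverses that ordering.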

Beyond Explicit Reports

Comparing Data-Driven Approaches to Studying Underlying Dimensions of Music Preference

Conference paper (2019) - Jaehun Kim, Sandy Manolios, Andrew Demetriou, Cynthia Liem

Prior research from the field of music psychology has suggested that there are factors common to music preference beyond individual genres. Specifically, research has shown that self-reported ratings of preference for individual musical genres can be reduced to 4 or 5 dimensions, which in turn have been shown to correlate to relevant psychological constructs, such as personality. However, the number of dimensions emerging from multiple studies has varied despite the care taken in conducting such research. Data-driven approaches offer opportunities to further this line of research with actual listening data, at a scale and scope surpassing that of traditional psychological studies. Although listening data can be considered more direct and comprehensive evidence of listening preference, transforming this data into meaningful measurements is non-trivial. In the current paper, we report on investigations seeking to find interpretable underlying dimensions of music taste, using implicit large-scale listening data. Offering a critical reflection on potential researchers' degrees of freedom, we adopt an explicit systematic approach, investigating the impact of varying different parameters, analysis, and normalization techniques. More precisely, we consider various ways to extract listening preference information from two large, openly available datasets of music listening behavior, making use of principal component analysis and variational autoencoders to extract potential underlying dimensions. Results and implications are discussed in light of prior psychological theory, and the potential of user listening data to further research on music preference. ...

On the Performance of Convolutional Neural Networks for Side-Channel Analysis

Conference paper (2018) - Stjepan Picek, Ioannis Petros Samiotis, Jaehun Kim, Annelie Heuser, Shivam Bhasin, Axel Legay

In this work, we ask whether Convolutional Neural Networks are more suitable for side-channel attacks than other machine learning techniques and, if so, in what situations. Our results indicate that Convolutional Neural Networks indeed outperform other machine learning techniques in several scenarios when considering accuracy. Still, often there is no compelling reason to use such a complex technique. In fact, when comparing techniques without extra steps like preprocessing, we see an obvious advantage for Convolutional Neural Networks when the level of noise is small, and the number of measurements and features is high. The other tested settings show that simpler machine learning techniques, for a significantly lower computational cost, perform similarly or sometimes even better. The experiments with guessing entropy indicate that methods like Random Forest or XGBoost could perform better than Convolutional Neural Networks for the datasets we investigated. ...
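
Guessing entropy, mentioned above, is commonly described as the average position of the correct key among all key guesses ranked by an attack's scores. The sketch below assumes 1-based ranks and made-up score vectors; the helper names `key_rank` and `guessing_entropy` are hypothetical and the paper's exact evaluation protocol may differ.

```python
def key_rank(scores, true_key):
    """1-based position of the true key when guesses are sorted by descending score."""
    return 1 + sum(1 for s in scores if s > scores[true_key])


def guessing_entropy(score_vectors, true_key):
    """Average rank of the true key over several independent attack experiments."""
    ranks = [key_rank(scores, true_key) for scores in score_vectors]
    return sum(ranks) / len(ranks)


# Toy example: three experiments, three key candidates, true key at index 1.
experiments = [
    [0.1, 0.6, 0.3],  # true key ranked first
    [0.5, 0.2, 0.3],  # true key ranked third
    [0.2, 0.3, 0.5],  # true key ranked second
]
ge = guessing_entropy(experiments, true_key=1)
```

Lower values are better: a guessing entropy of 1 under this convention would mean the attack always ranks the correct key first.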

Towards Seed-Free Music Playlist Generation

Enhancing Collaborative Filtering with Playlist Title Information

Conference paper (2018) - Jaehun Kim, Minz Won, Cynthia C.S. Liem, Alan Hanjalic

In this paper, we propose a hybrid Neural Collaborative Filtering (NCF) model trained with a multi-objective function to build a music playlist generation system. The proposed approach focuses particularly on the cold-start problem (playlists with no seed tracks) and uses a text encoder employing a Recurrent Neural Network (RNN) to exploit textual information given by the playlist title. To accelerate the training, we first apply Weighted Regularized Matrix Factorization (WRMF) as the basic recommendation model to pre-learn latent factors of playlists and tracks. These factors then feed into the proposed multi-objective optimization that also involves embeddings of playlist titles. The experimental study indicates that the proposed approach can effectively suggest suitable music tracks for a given playlist title, compensating for the poor original recommendation results made on empty playlists by the WRMF model. ...

Transfer Learning of Artist Group Factors to Musical Genre Classification

Conference paper (2018) - Jaehun Kim, Minz Won, Xavier Serra, Cynthia C. S. Liem

The automated recognition of music genres from audio information is a challenging problem, as genre labels are subjective and noisy. Artist labels are less subjective and less noisy, while certain artists may relate more strongly to certain genres. At the same time, at prediction time, it is not guaranteed that artist labels are available for a given audio segment. Therefore, in this work, we propose to apply the transfer learning framework, learning artist-related information which will be used at inference time for genre classification. We consider different types of artist-related information, expressed through artist group factors, which will allow for more efficient learning and stronger robustness to potential label noise. Furthermore, we investigate how to achieve the highest validation accuracy on the given FMA dataset, by experimenting with various kinds of transfer methods, including single-task transfer, multi-task transfer and finally multi-task learning. ...