A. Hanjalic | TU Delft Repository

Introduction to the Special Issue on Deep Multimodal Generation and Retrieval

Journal article (2025) - Hao Fei, Wei Ji, Yinwei Wei, Zhedong Zheng, Jialie Shen, Alan Hanjalic, Roger Zimmermann

Still Making Noise

Improving Deep-Learning-Based Side-Channel Analysis

Journal article (2025) - Jaehun Kim, Stjepan Picek, Annelie Heuser, Shivam Bhasin, Alan Hanjalic

Editor’s notes: Side-channel attacks have been undermining cryptosystems for almost three decades. Advances in machine learning techniques have shown great promise in improving the performance and efficiency of side-channel attacks, even on systems with countermeasures. This article provides a systematic approach to applying ML techniques for side-channel attacks. ...

Estimating nodal spreading influence using partial temporal networks

Journal article (2025) - Tianrui Mao, Shilun Zhang, Alan Hanjalic, Huijuan Wang

Networks facilitate the spread of information and epidemics. The average number of nodes infected via a spreading process on a network starting from a single seed node over a given long period is called the influence of that node. Estimating nodal influence early in time is essential for mitigating epidemics and misinformation. Influence estimation has been investigated in static networks, identifying the relation between the topological properties of a node and its influence under the assumption that the networks are completely known. However, the networks underlying spreading processes such as social interactions are not static but temporal networks, whose links are activated or deactivated over time. When predicting nodal influence in the long-term future, the temporal network is usually observable only until the time of prediction and only locally around the node due to data accessibility. To bridge this gap, we address the question of how to utilize the partially observed temporal network (local and of short duration) around each node to estimate the ranking of nodes in spreading influence on the full network over a long period. This would also enable us to understand which network properties of a node in its partially observed temporal network determine its influence. Centrality metrics (nodal properties) have been proposed recently in temporal networks. However, using such a metric derived for each node from its partial network to estimate the ranking of nodes in influence is likely to be limiting. This is because information can spread through any time-respecting path, not only the shortest time-respecting path considered by existing metrics. To address this disparity, we systematically propose a set of novel nodal centrality metrics that encode diverse properties of (time-respecting) walks to predict nodal influence rankings. 
The proposed metrics derived from partial network information, in general, outperform classic centrality metrics utilizing either full or partial temporal network information. It is found that distinct centrality metrics perform the best depending on the infection probability of the spreading process. For a broad range of the infection probability, a node tends to be influential if it can reach many distinct nodes via time-respecting walks and if these nodes can be reached early in time.
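
The closing observation (influential nodes reach many distinct nodes, and reach them early, via time-respecting walks) can be illustrated with a minimal sketch. It is not the metric set proposed in the article: it assumes undirected contacts given as hypothetical `(u, v, t)` tuples and an infection probability of 1, so that spreading from a seed becomes deterministic. The function names are illustrative only.

```python
def reachable_set(contacts, seed):
    """Earliest arrival time at every node reachable from `seed`.

    `contacts` is a list of (u, v, t) tuples: an undirected contact between
    u and v at time step t. Assumption: infection probability 1, and a node
    infected at time t can only infect others at a strictly later time.
    """
    arrival = {seed: float("-inf")}  # the seed is infected before any contact
    for u, v, t in sorted(contacts, key=lambda c: c[2]):
        for src, dst in ((u, v), (v, u)):
            if src in arrival and arrival[src] < t and dst not in arrival:
                arrival[dst] = t
    return arrival


def rank_by_reach(contacts):
    """Rank nodes by how many distinct nodes they reach (ties broken by name)."""
    contacts = list(contacts)
    nodes = {n for u, v, _ in contacts for n in (u, v)}
    reach = {n: len(reachable_set(contacts, n)) - 1 for n in nodes}
    return sorted(nodes, key=lambda n: (-reach[n], str(n)))


# Toy temporal network: a chain of contacts followed by one late contact.
contacts = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("a", "d", 4)]
ranking = rank_by_reach(contacts)
```

With an infection probability below 1, influence would instead have to be averaged over many simulated spreading runs; the arrival times returned here indicate how early each node can be reached at best.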

A Reproducibility Study of Product-side Fairness in Bundle Recommendation

Conference paper (2025) - Huy Son Nguyen, M. Mansoury, A. Hanjalic

Recommender systems are known to exhibit fairness issues, particularly on the product side, where products and their associated suppliers receive unequal exposure in recommended results. While this problem has been widely studied in traditional recommendation settings, its implications for bundle recommendation (BR) remain largely unexplored. This emerging task introduces additional complexity: recommendations are generated at the bundle level, yet user satisfaction and product (or supplier) exposure depend on both the bundle and the individual items it contains. Existing fairness frameworks and metrics designed for traditional recommender systems may not directly translate to this multi-layered setting. In this paper, we conduct a comprehensive reproducibility study of product-side fairness in BR across three real-world datasets using four state-of-the-art BR methods. We analyze exposure disparities at both the bundle and item levels using multiple fairness metrics, uncovering important patterns. Our results show that exposure patterns differ notably between bundles and items, revealing the need for fairness interventions that go beyond bundle-level assumptions. We also find that fairness assessments vary considerably depending on the metric used, reinforcing the need for multi-faceted evaluation. Furthermore, user behavior plays a critical role: when users interact more frequently with bundles than with individual items, BR systems tend to yield fairer exposure distributions across both levels. Overall, our findings offer actionable insights for building fairer bundle recommender systems and establish a vital foundation for future research in this emerging domain. ...
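
To make the bundle-level versus item-level distinction concrete, the sketch below computes exposure at both levels from the same recommendation lists and summarizes each with a Gini coefficient. The Gini coefficient is used here only as one common inequality measure; the abstract does not say which metrics the paper uses, and the data layout (`rec_lists`, `bundle_items`) is a hypothetical one chosen for illustration.

```python
from collections import Counter


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def exposure_gini(rec_lists, bundle_items):
    """Return (bundle-level Gini, item-level Gini) of exposure counts.

    rec_lists: list of recommended bundle-id lists, one per user.
    bundle_items: dict mapping every bundle id to the list of items it contains.
    An item is counted as exposed once for each recommended bundle containing it.
    """
    bundle_exp = Counter(b for recs in rec_lists for b in recs)
    item_exp = Counter(i for recs in rec_lists for b in recs for i in bundle_items[b])
    all_items = {i for items in bundle_items.values() for i in items}
    return (
        gini([bundle_exp[b] for b in bundle_items]),
        gini([item_exp[i] for i in all_items]),
    )


bundle_items = {"B1": ["x", "y"], "B2": ["x", "z"], "B3": ["w"]}
rec_lists = [["B1", "B2"], ["B1", "B2"]]
bundle_gini, item_gini = exposure_gini(rec_lists, bundle_items)
```

In this toy case the two levels already disagree, because item `x` belongs to both recommended bundles and so accumulates more exposure than any single bundle does.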

Preface

Journal article (2024) - Stevan Rudinac, Alan Hanjalic, Cynthia Liem, Marcel Worring, Björn Þór Jónsson, Bei Liu, Yoko Yamakata

MultiMedia Modeling

30th International Conference, MMM 2024, Amsterdam, The Netherlands, January 29 – February 2, 2024, Proceedings, Part III ...

Preface

Journal article (2024) - Stevan Rudinac, Alan Hanjalic, Cynthia Liem, Marcel Worring, Björn Þór Jónsson, Bei Liu, Yoko Yamakata

MultiMedia Modeling

30th International Conference, MMM 2024, Amsterdam, The Netherlands, January 29 – February 2, 2024, Proceedings, Part II ...

Predicting nodal influence via local iterative metrics

Journal article (2024) - Shilun Zhang, Alan Hanjalic, Huijuan Wang

Nodal spreading influence is the capability of a node to activate the rest of the network when it is the seed of spreading. Combining nodal properties (centrality metrics) derived from local and global topological information respectively has been shown to better predict nodal influence than using a single metric. In this work, we investigate to what extent local and global topological information around a node contributes to the prediction of nodal influence and whether relatively local information is sufficient for the prediction. We show that by leveraging the iterative process used to derive a classical nodal centrality such as eigenvector centrality, we can define an iterative metric set that progressively incorporates more global information around the node. We propose to predict nodal influence using an iterative metric set that consists of the iterative metrics of orders 1 to K produced in an iterative process, encoding gradually more global information as K increases. Three iterative metrics are considered, which converge to three classical node centrality metrics, respectively. In various real-world networks and synthetic networks with community structures, we find that the prediction quality of each iterative-metric-based model converges to its optimum when the metrics of relatively low orders (K∼4) are included and increases only marginally when further increasing K. This fast convergence of prediction quality with K is further explained by analyzing the correlation between the iterative metric and nodal influence, the convergence rate of each iterative process and network properties. The prediction quality of the best performing iterative metric set with K=4 is comparable with the benchmark method that combines seven centrality metrics: their prediction quality ratio is within the range [91%,106%] across all three quality measures and networks. 
In two spatially embedded networks with an extremely large diameter, however, iterative metrics of higher orders, and thus a large K, are needed to achieve comparable prediction quality with the benchmark.
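
A rough sketch of the idea of an iterative metric set, using the power iteration that underlies eigenvector centrality: starting from a uniform vector, each step sums the neighbours' current values, so the order-k value of a node depends only on its k-hop neighbourhood. The uniform start and the sum-to-one normalization are assumptions made for this illustration; the article's three iterative metrics may be defined differently.

```python
def iterative_metric_set(adj, K):
    """Return {node: [order-1 value, ..., order-K value]}.

    adj: dict mapping each node to an iterable of its neighbours
         (undirected, unweighted graph).
    Order 1 is proportional to the degree; on a connected, non-bipartite
    graph the values approach eigenvector centrality as the order grows.
    """
    nodes = sorted(adj)
    x = {n: 1.0 for n in nodes}          # assumption: uniform starting vector
    features = {n: [] for n in nodes}
    for _ in range(K):
        nxt = {n: sum(x[m] for m in adj[n]) for n in nodes}
        norm = sum(nxt.values()) or 1.0  # assumption: normalize to sum 1
        x = {n: value / norm for n, value in nxt.items()}
        for n in nodes:
            features[n].append(x[n])
    return features


# Path graph 0 - 1 - 2
adj = {0: [1], 1: [0, 2], 2: [1]}
features = iterative_metric_set(adj, K=2)
```

The K values per node could then serve as input features of a regression model that predicts nodal influence, which is where the choice of K discussed above comes in.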

Preface

Journal article (2024) - Stevan Rudinac, Alan Hanjalic, Cynthia Liem, Marcel Worring, Björn Þór Jónsson, Bei Liu, Yoko Yamakata

MultiMedia Modeling

30th International Conference, MMM 2024, Amsterdam, The Netherlands, January 29 – February 2, 2024, Proceedings, Part I ...

Mitigating Mainstream Bias in Recommendation via Cost-sensitive Learning

Conference paper (2023) - Roger Zhe Li, Julián Urbano, Alan Hanjalic

Mainstream bias, where some users receive poor recommendations because their preferences are uncommon or simply because they are less active, is an important aspect to consider regarding fairness in recommender systems. Existing methods to mitigate mainstream bias do not explicitly model the importance of these non-mainstream users or, when they do, it is in a way that is not necessarily compatible with the data and recommendation model at hand. In contrast, we use the recommendation utility as a more generic and implicit proxy to quantify mainstreamness, and propose a simple user-weighting approach to incorporate it into the training process while taking the cost of potential recommendation errors into account. We provide extensive experimental results showing that quantifying mainstreamness via utility is better at identifying non-mainstream users, and that they are indeed better served when training the model in a cost-sensitive way. This is achieved with negligible or no loss in overall recommendation accuracy, meaning that the models learn a better balance across users. In addition, we show that research of this kind, which evaluates recommendation quality at the individual user level, may not be reliable if not using enough interactions when assessing model performance. ...
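
The user-weighting idea can be sketched as follows. The weighting formula is purely hypothetical (the abstract does not give one): it assumes a per-user utility in [0, 1], such as a normalized ranking score, and gives users with lower utility a larger weight in the training loss.

```python
def user_weights(utilities, beta=1.0):
    """Hypothetical weights: low-utility (non-mainstream) users weigh more.

    utilities: per-user recommendation utility, assumed to lie in [0, 1].
    beta: assumed exponent controlling how strongly low utility is emphasized.
    Weights are rescaled so that their mean is 1.
    """
    raw = [(1.0 - u) ** beta for u in utilities]
    mean = sum(raw) / len(raw)
    if mean == 0:
        return [1.0] * len(raw)
    return [r / mean for r in raw]


def weighted_loss(per_user_losses, weights):
    """Cost-sensitive training loss: weighted average of per-user losses."""
    return sum(w * l for w, l in zip(weights, per_user_losses)) / sum(weights)


weights = user_weights([0.9, 0.1])        # second user is poorly served
loss = weighted_loss([0.0, 1.0], weights)
```

Under this scheme an error on the poorly served user costs more than the same error on the well-served user, which is the general sense of cost-sensitive learning.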

Multi-label Node Classification On Graph-Structured Data

Journal article (2023) - T. Zhao, Ngan Thi Dong, A. Hanjalic, M. Khosla

Graph Neural Networks (GNNs) have shown state-of-the-art improvements in node classification tasks on graphs. While these improvements have been largely demonstrated in a multi-class classification scenario, a more general and realistic scenario in which each node could have multiple labels has so far received little attention. The first challenge in conducting focused studies on multi-label node classification is the limited number of publicly available multi-label graph datasets. Therefore, as our first contribution, we collect and release three real-world biological datasets and develop a multi-label graph generator to generate datasets with tunable properties. While the success of GNNs is usually attributed to high label similarity (high homophily), we argue that a multi-label scenario does not follow the usual semantics of homophily and heterophily so far defined for a multi-class scenario. As our second contribution, we define homophily and Cross-Class Neighborhood Similarity for the multi-label scenario and provide a thorough analysis of the collected multi-label datasets. Finally, we perform a large-scale comparative study with methods and datasets and analyse the performances of the methods to assess the progress made by current state of the art in the multi-label node classification scenario. We release our benchmark at https://github.com/Tianqi-py/MLGNC. ...
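
One plausible way to carry homophily over to label sets is to average the Jaccard similarity of the two endpoint label sets over all edges. The sketch below implements that measure as an illustration; that it matches the paper's definition term by term is an assumption, as is the choice to skip edges whose endpoints are both unlabeled.

```python
def label_homophily(edges, labels):
    """Average Jaccard similarity of label sets across edges.

    edges: iterable of (u, v) node pairs.
    labels: dict mapping each node to a set of labels.
    Assumption: edges between two unlabeled nodes are ignored.
    """
    scores = []
    for u, v in edges:
        union = labels[u] | labels[v]
        if union:
            scores.append(len(labels[u] & labels[v]) / len(union))
    return sum(scores) / len(scores) if scores else 0.0


edges = [(0, 1), (1, 2)]
labels = {0: {"a", "b"}, 1: {"b"}, 2: {"c"}}
h = label_homophily(edges, labels)
```

In the multi-class special case, where every label set has exactly one element, each edge scores either 1 or 0 and the measure reduces to the usual fraction of same-label edges.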

Weakly-supervised Learning for Fine-grained Emotion Recognition using Physiological Signals

Journal article (2023) - Tianyi Zhang, Abdallah El Ali, Chen Wang, Alan Hanjalic, Pablo Cesar

Instead of predicting just one emotion for one activity (e.g., video watching), fine-grained emotion recognition enables more temporally precise recognition. Previous works on fine-grained emotion recognition require segment-by-segment, fine-grained emotion labels to train the recognition algorithm. However, experiments to collect these labels are costly and time-consuming compared with collecting only one emotion label after the user has watched the stimulus (i.e., the post-stimuli emotion labels). To recognize emotions at a finer granularity level when trained with only post-stimuli labels, we propose an emotion recognition algorithm based on Deep Multiple Instance Learning (EDMIL) using physiological signals. EDMIL recognizes fine-grained valence and arousal (V-A) labels by identifying which instances represent the post-stimuli V-A annotated by users after watching the videos. Instead of fully-supervised training, the instances are weakly-supervised by the post-stimuli labels in the training stage. The V-A of each instance is estimated from the instance gains, which indicate the probability that an instance predicts the post-stimuli labels. We tested EDMIL on three different datasets, CASE, MERCA and CEAP-360VR, collected in three different environments: desktop, mobile and HMD-based Virtual Reality, respectively. Recognition results validated with the fine-grained V-A self-reports show that for subject-independent 3-class classification (high/neutral/low), EDMIL obtains promising recognition accuracies: 75.63% and 79.73% for V-A on CASE, 70.51% and 67.62% for V-A on MERCA and 65.04% and 67.05% for V-A on CEAP-360VR. Our ablation study shows that all components of EDMIL contribute to both the classification and regression tasks. 
Our experiments also show that (1) compared with fully-supervised learning, weakly-supervised learning can reduce the problem of overfitting caused by the temporal mismatch between fine-grained annotations and physiological signals, (2) instance segment lengths between 1-2 s result in the highest recognition accuracies and (3) EDMIL performs best if post-stimuli annotations consist of less than 30% or more than 60% of the entire video watching.

Few-shot Learning for Fine-grained Emotion Recognition using Physiological Signals

Journal article (2022) - Tianyi Zhang, Abdallah El Ali, Alan Hanjalic, Pablo Cesar

Fine-grained emotion recognition can model the temporal dynamics of emotions, which is more precise than predicting one emotion retrospectively for an activity (e.g., video clip watching). Previous works require large amounts of continuously annotated data to train an accurate recognition model; however, experiments to collect such large amounts of continuously annotated physiological signals are costly and time-consuming. To overcome this challenge, we propose an Emotion recognition algorithm based on Deep Siamese Networks (EmoDSN) which can rapidly converge on a small amount of training data, typically less than 10 samples per class (i.e., <10 shot). EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. We tested EmoDSN on three different datasets collected in three different environments: desktop, mobile and HMD-based virtual reality, respectively. The results from our experiments show that EmoDSN achieves promising results for both one-dimensional binary (high/low V-A, 1D-2C) and two-dimensional 5-class (four quadrants of V-A space + neutral, 2D-5C) classification. We get an averaged accuracy of 76.04, 76.62 and 57.62% for 1D-2C valence, 1D-2C arousal, and 2D-5C, respectively, by using only 5 shots of training data. Our experiments show that EmoDSN can achieve better results if we select training samples from the changing points of emotion or the ending moments of video watching. ...
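
The principle of pushing apart segments with different V-A labels can be illustrated with the standard contrastive loss often used to train Siamese networks. This is a generic textbook formulation, not EmoDSN's actual loss or architecture, which the abstract does not specify; the embeddings below stand in for the outputs of a shared encoder and the `margin` value is an assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Standard contrastive loss for one pair of embeddings.

    Pairs with the same label are pulled together (loss = d^2).
    Pairs with different labels are pushed apart until their distance
    exceeds `margin` (loss = max(0, margin - d)^2).
    """
    d = euclidean(emb_a, emb_b)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2


# A pair with different labels that is still too close gets a positive loss.
loss_close = contrastive_loss([0.0, 0.0], [0.5, 0.0], same_label=False)
# The same kind of pair, already far apart, contributes nothing.
loss_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=False)
```

Because training operates on pairs, a handful of labelled segments per class already yields many training pairs, which is one common reason Siamese approaches are used in few-shot settings.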

Topological-Temporal properties of evolving networks

Journal article (2022) - Alberto Ceria, Shlomo Havlin, Alan Hanjalic, Huijuan Wang

Many real-world complex systems, including human interactions, can be represented by temporal (or evolving) networks, where links activate or deactivate over time. Characterizing temporal networks is crucial to compare different real-world networks and to detect their common patterns or differences. A systematic method that can simultaneously characterize the temporal and topological relations of the time-specific interactions (also called contacts or events) of a temporal network is still missing. In this article, we propose a method to characterize to what extent contacts that happen close in time also occur close in topology. Specifically, we study the interrelation between temporal and topological properties of the contacts from three perspectives: (1) the correlation (among the elements) of the activity time series which records the total number of contacts in a network that happen at each time step; (2) the interplay between the topological distance and time difference of two arbitrary contacts; (3) the temporal correlation of contacts within the local neighbourhood centred at each link (so-called ego-network) to explore whether such contacts that happen close in topology are also close in time. By applying our method to 13 real-world temporal networks, we found that temporal-topological correlation of contacts is more evident in virtual contact networks than in physical contact networks. This could be due to the lower cost and easier access of online communications than physical interactions, allowing and possibly facilitating social contagion, that is, interactions of one individual may influence the activity of its neighbours. We also identify different patterns between virtual and physical networks and among physical contact networks at, for example, school and workplace, in the formation of correlation in local neighbourhoods. 
Patterns and differences detected via our method may further inspire the development of more realistic temporal network models that could jointly reproduce temporal and topological properties of contacts.
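
Perspective (1) can be sketched in a few lines: build the activity time series (number of contacts per time step) and measure its correlation with a shifted copy of itself. The `(u, v, t)` contact format and the use of the Pearson coefficient at a single lag are assumptions for this illustration, not necessarily the article's exact procedure.

```python
import math
from collections import Counter


def activity_series(contacts, num_steps):
    """Number of contacts at each time step 0 .. num_steps - 1."""
    counts = Counter(t for _, _, t in contacts)
    return [counts.get(t, 0) for t in range(num_steps)]


def autocorrelation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag`."""
    if not 1 <= lag < len(series):
        raise ValueError("lag must satisfy 1 <= lag < len(series)")
    x, y = series[:-lag], series[lag:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant segment has no defined correlation
    return cov / math.sqrt(vx * vy)


contacts = [("a", "b", 0), ("b", "c", 0), ("a", "c", 2)]
series = activity_series(contacts, num_steps=3)
```

A slowly decaying autocorrelation over increasing lags would indicate bursts of activity that persist in time.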

Joint Feature Synthesis and Embedding

Adversarial Cross-Modal Retrieval Revisited

Journal article (2022) - Xing Xu, Kaiyi Lin, Yang Yang, Alan Hanjalic, Heng Tao Shen

Recently, generative adversarial networks (GANs) have shown a strong ability to model data distributions via adversarial learning. Cross-modal GAN, which attempts to utilize the power of GANs to model the cross-modal joint distribution and to learn compatible cross-modal features, is becoming a research hotspot. However, the existing cross-modal GAN approaches typically 1) require labeled multimodal data of massive labor cost to establish cross-modal correlation; 2) utilize the vanilla GAN model that results in an unstable training procedure and meaningless synthetic features; and 3) lack extensibility for retrieving cross-modal data of new classes. In this article, we revisit the adversarial learning in existing cross-modal GAN methods and propose Joint Feature Synthesis and Embedding (JFSE), a novel method that jointly performs multimodal feature synthesis and common embedding space learning to overcome the above three shortcomings. Specifically, JFSE deploys two coupled conditional Wasserstein GAN modules for the input data of two modalities, to synthesize meaningful and correlated multimodal features under the guidance of the word embeddings of class labels. Moreover, three advanced distribution alignment schemes with advanced cycle-consistency constraints are proposed to preserve the semantic compatibility and enable the knowledge transfer in the common embedding space for both the true and synthetic cross-modal features. All these add-ons in JFSE not only help to learn a more effective common embedding space that captures the cross-modal correlation but also facilitate knowledge transfer to multimodal data of new classes. Extensive experiments are conducted on four widely used cross-modal datasets, and the comparisons with more than ten state-of-the-art approaches show that our JFSE method achieves remarkable accuracy improvements on both standard retrieval and the newly explored zero-shot and generalized zero-shot retrieval tasks.

Subjective QoE Evaluation of User-Centered Adaptive Streaming of Dynamic Point Clouds

Conference paper (2022) - Shishir Subramanyam, Irene Viola, Jack Jansen, Evangelos Alexiou, Alan Hanjalic, Pablo Cesar

Technological advances in head-mounted displays and novel real-time 3D acquisition and reconstruction solutions have fostered the development of 6 Degrees of Freedom (6DoF) teleimmersive systems for social VR applications. Point clouds have emerged as a popular format for such applications, owing to their simplicity and versatility; yet, dense point cloud contents are too large to deliver directly over bandwidth-limited networks. In this context, user-adaptive delivery mechanisms are a promising solution to exploit the increased range of motion offered by 6DoF VR applications to yield gains in perceived quality of 3D point cloud user representations, while reducing their bandwidth requirements. In this paper, we perform a user study in VR to quantify the gains adaptive tile selection strategies can bring with respect to non-adaptive solutions. In particular, we define an auxiliary utility function, we employ established methods from the literature and newly-proposed schemes for distributing the bit budget across the tiles, and we evaluate them together with non-adaptive streaming baselines through subjective QoE assessment. Results confirm that considerable gains can be obtained with user-adaptive streaming, achieving bit rate gains of up to 65% with respect to a non-adaptive approach to deliver comparable quality. Our analysis provides useful insights for the design and development of social VR applications. ...

Task-Aware Connectivity Learning for Incoming Nodes Over Growing Graphs

Journal article (2022) - Bishwadeep Das, Alan Hanjalic, Elvin Isufi

Data processing over graphs is usually done on graphs of fixed size. However, graphs often grow with new nodes arriving over time. Knowing the connectivity information of these nodes, and thus the expanded graph, is crucial for processing data over the expanded graph. When this information is absent, inferring it and then processing the data become essential. This paper provides contributions along this direction by considering task-driven data processing for incoming nodes without connectivity information. We model the incoming node attachment as a random process dictated by the parameterized vectors of probabilities and weights of attachment. The attachment is driven by the existing graph topology, the corresponding graph signal, and an associated processing task. We consider two such tasks: interpolation at the incoming node and graph signal smoothness. We show that the model implicitly bounds the spectral perturbation between the nominal topology of the expanded graph and the drawn realizations. In the absence of connectivity information, our topology-, task-, and data-aware stochastic attachment performs better than purely data-driven and topology-driven stochastic attachment rules, as is confirmed by numerical results over synthetic and real data. ...
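
The stochastic attachment described above can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes that the incoming node attaches to each existing node independently, with the given probability and weight, and that interpolation is a weighted sum of the existing signal over the attached nodes. All names and numbers are hypothetical.

```python
import random


def sample_attachment(probs, weights, rng):
    """Draw one attachment vector: entry i is weights[i] with probability probs[i], else 0."""
    return [w if rng.random() < p else 0.0 for p, w in zip(probs, weights)]


def expected_attachment(probs, weights):
    """Mean attachment vector under independent draws (elementwise product)."""
    return [p * w for p, w in zip(probs, weights)]


def interpolate(attachment, signal):
    """Assumed interpolation rule: weighted sum of the signal over attached nodes."""
    return sum(a * x for a, x in zip(attachment, signal))


rng = random.Random(0)               # fixed seed for reproducibility
probs = [0.9, 0.5, 0.1, 0.0]         # hypothetical attachment probabilities
weights = [0.5, 0.25, 0.25, 1.0]     # hypothetical attachment weights
signal = [2.0, 4.0, 8.0, 100.0]      # graph signal on the existing nodes

# Average the interpolated value over many drawn realizations.
draws = [sample_attachment(probs, weights, rng) for _ in range(2000)]
mean_estimate = sum(interpolate(a, signal) for a in draws) / len(draws)

# Closed-form counterpart using the expected attachment vector.
expected_estimate = interpolate(expected_attachment(probs, weights), signal)
print(mean_estimate, expected_estimate)
```

The sample average over the drawn realizations should lie close to the value obtained from the expected attachment vector, which is the quantity a task-aware method would tune through the probabilities and weights.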

Influence of clustering coefficient on network embedding in link prediction

Journal article (2022) - Omar F. Robledo, Xiu Xiu Zhan, Alan Hanjalic, Huijuan Wang

Multiple network embedding algorithms have been proposed to perform the prediction of missing or future links in complex networks. However, we lack the understanding of how network topology affects their performance, or which algorithms are more likely to perform better given the topological properties of the network. In this paper, we investigate how the clustering coefficient of a network, i.e., the probability that the neighbours of a node are also connected, affects network embedding algorithms’ performance in link prediction, in terms of the AUC (area under the ROC curve). We evaluate classic embedding algorithms, i.e., Matrix Factorisation, Laplacian Eigenmaps and node2vec, in both synthetic networks and (rewired) real-world networks with variable clustering coefficient. Specifically, a rewiring algorithm is applied to each real-world network to change the clustering coefficient while keeping key network properties. We find that a higher clustering coefficient tends to lead to a higher AUC in link prediction, except for Matrix Factorisation, which is not sensitive to the change of clustering coefficient. To understand such influence of the clustering coefficient, we (1) explore the relation between the link rating (probability that a node pair is the missing link) derived from the aforementioned algorithms and the number of common neighbours of the node pair, and (2) evaluate these embedding algorithms’ ability to reconstruct the original training (sub)network. All the network embedding algorithms that we tested tend to assign higher likelihood of connection to node pairs that share an intermediate or high number of common neighbours, independently of the clustering coefficient of the training network. Then, the predicted networks will have more triangles, thus a higher clustering coefficient. As the clustering coefficient increases, all the algorithms but Matrix Factorisation could also better reconstruct the training network. 
These two observations may partially explain why increasing the clustering coefficient improves the prediction performance. ...
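
To make the quantities in this abstract concrete, the sketch below computes the average clustering coefficient of a toy network and the AUC of a link score on held-out node pairs. It uses a plain common-neighbour count as a stand-in score, not any of the embedding algorithms evaluated in the article; the toy network and the held-out pairs are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Return an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return linked / (k * (k - 1) / 2)


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


def common_neighbours(adj, u, v):
    """Stand-in link score: number of neighbours shared by u and v."""
    return len(adj.get(u, set()) & adj.get(v, set()))


def auc(pos_scores, neg_scores):
    """Probability that a missing link outscores a non-link (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy training network: two triangles sharing node 2, plus a pendant node 5.
train_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4), (4, 5)]
adj = build_adjacency(train_edges)

missing_links = [(1, 3), (0, 4)]  # hypothetical held-out links
non_links = [(0, 5), (1, 5)]      # hypothetical pairs that are truly unconnected

clustering = average_clustering(adj)
score = auc(
    [common_neighbours(adj, u, v) for u, v in missing_links],
    [common_neighbours(adj, u, v) for u, v in non_links],
)
print(clustering, score)
```

In this toy case both held-out links share a neighbour while the non-links share none, so the common-neighbour score separates them perfectly; the article's experiments address how this kind of separation changes with the clustering coefficient for actual embedding methods.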

Guest Editorial Learning From Noisy Multimedia Data

Review (2022) - Jian Zhang, Alan Hanjalic, Ramesh Jain, Xiansheng Hua, Shin'ichi Satoh, Yazhou Yao, Dan Zeng

This special issue provides a premier forum for researchers in multimedia big data to share challenges and recent advancements in learning from noisy multimedia data. The multimedia age and its proliferation of devices and platforms are fueling exponential data growth. As computational power and deep learning algorithms rapidly evolve, the web has become a rich source of potential training data for robust machine learning, with search engines such as Google and Bing, as well as Twitter, TikTok, Instagram, and short video sharing platforms, offering large-scale data points in the hundreds of millions. The concurrent shift in the Internet to richer web data modalities such as text, audio, image, and video reveals further opportunities to leverage large-scale data for the automatic construction of a variety of datasets for model training and testing. However, the ubiquity of multimedia data means noise is a fundamental challenge, with label noise and domain mismatch the most critical issues in automatically collected datasets. Learning from noisy multimedia data tends towards poor performance, making it increasingly essential to address these challenges. ...

Temporal Network Prediction and Interpretation

Journal article (2022) - Li Zou, Xiu xiu Zhan, Jie Sun, Alan Hanjalic, Huijuan Wang

Temporal networks refer to networks like physical contact networks whose topology changes over time. Predicting the future temporal network is crucial, e.g., to forecast epidemics. Existing prediction methods are either relatively accurate but black-box, or white-box but less accurate. The lack of interpretable and accurate prediction methods motivates us to explore what intrinsic properties/mechanisms facilitate the prediction of temporal networks. We use interpretable learning algorithms, Lasso Regression and Random Forest, to predict, based on the current activities (i.e., connected or not) of all links, the activity of each link at the next time step. From the coefficients learned from each algorithm, we construct the prediction backbone network that presents the influence of all links in determining each link's future activity. Analysis of the backbone and of its relation to the link activity time series and to the time-aggregated network reveals which properties of temporal networks are captured by the learning algorithms. Via six real-world contact networks, we find that the next-step activity of a particular link is mainly influenced by (a) its current activity and (b) links strongly correlated in the time series to that particular link and close in distance (in hops) in the aggregated network. ...
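
The Lasso-based prediction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a hand-written coordinate-descent Lasso without an intercept, a hypothetical regularisation strength `alpha`, and a toy activity series in which link 1 simply repeats the previous activity of link 0.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, y, alpha=0.05, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n, m = len(X), len(X[0])
    w = [0.0] * m
    for _ in range(n_iter):
        for j in range(m):
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            if z == 0.0:
                w[j] = 0.0
                continue
            rho = 0.0
            for i in range(n):
                # Prediction from all other features, excluding feature j.
                others = sum(X[i][k] * w[k] for k in range(m) if k != j)
                rho += X[i][j] * (y[i] - others)
            w[j] = soft_threshold(rho / n, alpha) / z
    return w


def prediction_backbone(activity, alpha=0.05):
    """Map (source link, target link) to its non-zero Lasso coefficient."""
    X = activity[:-1]                      # link activities at time t
    edges = {}
    for target in range(len(activity[0])):
        y = [row[target] for row in activity[1:]]  # target activity at t + 1
        for source, coef in enumerate(lasso_fit(X, y, alpha)):
            if coef != 0.0:
                edges[(source, target)] = coef
    return edges


# Toy activity series: one row per time step, one column per link (1 = active).
activity = [
    [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 0, 1], [1, 0, 1],
    [0, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 0, 1],
]
backbone_edges = prediction_backbone(activity)
print(backbone_edges)
```

For target link 1, the only surviving coefficient comes from link 0, so the backbone contains the edge (0, 1) and no edges from links 1 or 2 into link 1; the L1 penalty shrinks the weaker dependencies to exactly zero, which is what makes the resulting backbone sparse and interpretable.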

Generating Images from Spoken Descriptions

Journal article (2021) - Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg

Text-based technologies, such as text translation from one language to another, and image captioning, are gaining popularity. However, approximately half of the world's languages are estimated to be lacking a commonly used written form. Consequently, these languages cannot benefit from text-based technologies. This paper presents 1) a new speech technology task, i.e., a speech-to-image generation (S2IG) framework which translates speech descriptions to photo-realistic images 2) without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed speech-to-image framework, referred to as S2IGAN, consists of a speech embedding network and a relation-supervised densely-stacked generative model. The speech embedding network learns speech embeddings with the supervision of corresponding visual information from images. The relation-supervised densely-stacked generative model synthesizes images, conditioned on the speech embeddings produced by the speech embedding network, that are semantically consistent with the corresponding spoken descriptions. Extensive experiments are conducted on four public benchmark databases: two databases that are commonly used in text-to-image generation tasks, i.e., CUB-200 and Oxford-102 for which we created synthesized speech descriptions, and two databases with natural speech descriptions which are often used in the field of cross-modal learning of speech and images, i.e., Flickr8k and Places. Results on these databases demonstrate the effectiveness of the proposed S2IGAN on synthesizing high-quality and semantically-consistent images from the speech signal, yielding a good performance and a solid baseline for the S2IG task. ...