M. Mansoury | TU Delft Repository

Path-Level Explainability in Knowledge Graph Recommenders

Decoding Recommendations: Insights from Knowledge Graph Paths

Master thesis (2026) - S. LIU, M. Mansoury, M. Khosla, U.K. Gadiraju

Knowledge Graph-based Recommender Systems (KGRS) have attracted significant attention due to their potential to enhance both recommendation accuracy and interpretability by leveraging structured relational knowledge. Despite the widespread use of reasoning paths as explanations, systematic evaluation of explanation quality across different KGRS paradigms, and its relationship with recommendation correctness, remains limited. This study provides a comprehensive empirical analysis of three representative KGRS paradigms, including path-based, embedding-based, and hybrid methods, focusing on multiple dimensions of explanation quality, including temporal relevance, popularity, diversity, semantic consistency, and faithfulness.

The analysis investigates differences in explanation characteristics across paradigms, revealing that models such as TMER achieve high recommendation accuracy through constrained path structures, whereas reinforcement learning-based models, including TPRec and PGPR, generate explanations with stronger lexical alignment to user review rationales. The study also examines explanation quality for correctly recommended (relevant) and incorrectly recommended (irrelevant) items. The results show that explanation-ground truth consistency metrics (Precision, Recall, and F1) exhibit greater disparities between correctly and incorrectly recommended items than other evaluation metrics. In RippleNet, the impact of ripple set size and explanation-oriented neighbor sampling strategies on recommendation performance and explanation quality is analyzed. Non-uniform sampling guided by temporal relevance, popularity, or diversity effectively shapes ripple sets to enhance explanation properties without significantly affecting overall recommendation accuracy.

The findings highlight trade-offs between recommendation accuracy and explanation quality, demonstrating that careful model design and sampling strategies can produce interpretable, user-aligned recommendations while maintaining high performance. These insights provide guidance for developing KGRS capable of delivering accurate predictions accompanied by semantically rich, temporally relevant, and user-preferred explanations, thereby improving transparency, trust, and user satisfaction in real-world applications.

Algorithmic Bias in Recommender Systems

Investigating the behavior of Recommender Systems bias and fairness interventions

Master thesis (2026) - P.I. Petrov, A. Hanjalic, A. Anand, M. Mansoury

Recommender systems have seen considerable adoption in recent years, driven by modern streaming services, social media, and novel LLM applications. However, recommender systems show clear statistical biases and can expose stakeholders to social biases. This phenomenon is caused by several factors, such as the data, which tends to be too dirty for the statistical methods used in recommendation; the pipeline, which might directly introduce bias into recommendations; and the evaluation, which is often unaware of the underlying biases in the task. As such, we first reviewed current literature on the topic and discovered several discrepancies. Firstly, fairness-aware solutions, which aim to reduce social bias, are often not tested on Missing-at-Random (MAR) data, which is cleaner than the traditional Missing-not-at-Random (MNAR) data used in recommendations. Further, we establish a connection between statistical bias and social bias, and identify the need for user-group-based studies. As such, we first tested whether fairness-aware solutions benefit from MAR data similarly to debiasing solutions, which aim to reduce statistical bias. Then we investigated the extent to which debiasing solutions can address fairness issues, before finally delving into more detail on the individual user-group performance of some of our configurations. We found that some fairness-aware algorithms benefit from MAR data, though this does not appear to be universal. We also observed a noticeable benefit from diversification enabled by debiasing solutions, and we identified interesting insights into how interventions impact users based on the share of popular items they interact with. ...

Personalized Recommender Systems for Gym Workouts: A Reinforcement Learning Approach

Master thesis (2026) - R. Rosema, M. Mansoury, H. Torkamaan

A good workout is more than a list of exercises. In the gym, recommendations must also decide how much work a user should do, whether that workload is realistic, and how the next recommendation should adapt when a user starts skipping exercises. This makes gym workout recommendation a sequential decision problem rather than a standard item-ranking task. This thesis studies whether reinforcement learning (RL) improves workout recommendation when the problem is extended from exercise selection to full prescription. Starting from the Home-Fitness RL framework of Tragos et al., we develop a simulator-based gym recommendation framework with four environments: exercise-only and full-prescription settings, each with and without skip-based interaction. The full-prescription environments recommend exercise, sets, repetitions, and load, while the skip-enabled environments use skip-only feedback for online personalization. Because suitable real-world gym interaction data was not available, synthetic user pools were used for training and evaluation under static, dynamic, and stress-test conditions. The results show that the value of reinforcement learning depends strongly on the structure of the problem. In the exercise-only setting, the RL algorithm Proximal Policy Optimization (PPO) clearly outperforms random recommendation and remains competitive with Particle Swarm Optimization (PSO), but it does not outperform a strong greedy baseline. In the full-prescription setting, PPO becomes the strongest method and outperforms all baselines, showing that reinforcement learning becomes more useful once the recommendation task includes dose and user-specific capacity. Skip-enabled environments lead to more adherence-aware behavior, but also introduce trade-offs between completion and other reward components.
Finally, PPO remains stable under realistic gradual user drift, while highly chaotic user changes substantially reduce performance, especially when online personalization is involved. Overall, the thesis shows that reinforcement learning is not uniformly superior for workout recommendation, but becomes clearly more convincing when the problem is extended to realistic gym prescription and user interaction.

Bridging the Semantic-Collaborative Gap

Unified Item Quantization for LLM-based Generative Recommendation

Master thesis (2026) - B. Lu, M. Mansoury, A. Hanjalic, S. Tan

Large Language Model (LLM)-based generative recommendation reformulates item retrieval as an autoregressive sequence generation problem, representing items through discrete semantic identifiers (SIDs) constructed via the vector quantization of item embeddings. However, a critical yet underexplored limitation of existing tokenization methods is the semantic-collaborative gap. SIDs derived purely from item content fail to capture latent user preference patterns encoded in historical interaction data, whereas purely collaborative identifiers lack semantic grounding and generalize poorly to sparse or cold-start scenarios.

To bridge this gap, we propose the Unified Q-Former (UQF), a novel pre-quantization fusion framework designed to explicitly integrate semantic and collaborative signals into a unified item representation before discretization. Inspired by the query-based multimodal alignment of BLIP-2, UQF employs a set of learnable queries, parallel cross-attention over pre-trained item text embeddings and graph-based collaborative embeddings (via LightGCN), and adaptive gated fusion to dynamically extract complementary information from both modalities. To ensure robustness and structure preservation, the framework is optimized using a hybrid contrastive learning objective—incorporating both structural and semantic neighbors—coupled with asymmetric modality dropout.

The resulting unified representations are quantized into discrete SIDs via residual vector quantization (RQ-VAE) and utilized as target generation tokens for a downstream LLM recommender. Extensive experiments on two real-world Amazon Review datasets (Office Products and Musical Instruments) demonstrate that UQF consistently improves state-of-the-art LC-Rec and TIGER-style generative recommendation backbones. Our framework outperforms strong traditional, sequential, and recent unified generative baselines, yielding highly interpretable, hierarchical SID structures with significantly improved semantic and collaborative consistency.
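
The residual quantization step that turns a continuous item representation into a multi-level SID can be illustrated with a small sketch. This shows only the nearest-codeword lookup over residuals with fixed, hand-written codebooks; in RQ-VAE the codebooks are learned jointly with an encoder and decoder, which is omitted here. The function names and the toy codebooks are illustrative assumptions, not taken from the thesis.

```python
from math import inf


def nearest_code(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, inf
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes what the previous levels missed.

    Returns the list of chosen indices (the SID) and the final residual.
    """
    residual = list(vec)
    sid = []
    for codebook in codebooks:
        idx = nearest_code(codebook, residual)
        sid.append(idx)
        # Subtract the chosen codeword so the next level sees only the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return sid, residual


# Toy example: two levels with two codewords each (hypothetical values).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # coarse level
    [[0.0, 0.0], [0.2, -0.1]],  # fine level
]
sid, leftover = residual_quantize([1.2, 0.9], toy_codebooks)
```

Because each level refines the previous one, items that share a prefix of indices are close at the coarse level, which is what gives such identifiers a hierarchical structure.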

Comparative Analysis of Recommendation Models on Scopus Data

Unveiling Patterns in Sparse Interactions for Academic Discovery

Master thesis (2025) - B.V. Yıldız, M. Mansoury, Amin Tabatabaei

This thesis presents the design, implementation, and evaluation of a scalable, modular recommendation framework for academic article discovery on the Scopus platform. The research addresses limitations in Scopus’s existing “Related documents” module, which produces static, non-personalized suggestions based solely on metadata keyword overlap. To overcome these constraints, the proposed framework introduces a dual-mode retrieval strategy capable of generating both personalized recommendations, informed by historical user interactions, and non-personalized recommendations, based solely on the context of a target article.

The study begins with an extensive exploratory data analysis (EDA) of 2024 Scopus interaction logs, comprising over 31 million user-item events. A novel data transformation pipeline is developed to convert implicit feedback signals, such as downloads, views, and exports, into continuous-valued preference scores that are suitable for collaborative filtering models. This enables the application of state-of-the-art algorithms despite the absence of explicit ratings.

Four recommendation models are implemented and compared: Bayesian Personalized Ranking (BPR), Factored Item Similarity Model (FISM), Light Graph Convolutional Network (LightGCN), and Knowledge Graph Attention Network (KGAT). Model evaluation is performed using both traditional offline ranking metrics (Recall@10, Precision@10, NDCG@10, MRR@10, Hit Rate@10) and a novel Large Language Model (LLM) based evaluation framework leveraging GPT-4o for semantic assessment of relevance and serendipity.

Results show that LightGCN consistently outperforms other models in both personalized and non-personalized scenarios, achieving the highest accuracy and scalability. Non-personalized recommendations remain valuable in cold-start and anonymous browsing contexts. The integration of LLM-based evaluation offers deeper qualitative insights into recommendation quality, capturing semantic alignment and novelty beyond what is reflected in traditional metrics. The proposed framework demonstrates that a unified embedding-based architecture can effectively serve heterogeneous recommendation needs on large-scale scholarly platforms. The methodology and findings have broader implications for the design of academic recommender systems in data-sparse and mixed-user environments.
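
For readers unfamiliar with the offline ranking metrics named above, the following sketch gives their standard binary-relevance definitions for a single user. The function names and the toy data are illustrative assumptions; the thesis may compute or average these metrics differently.

```python
from math import log2


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant item in the top-k (0 if none)."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Discounted gain of the top-k, normalized by the best achievable ordering."""
    dcg = sum(
        1.0 / log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: the user's held-out items are "b" and "d".
ranked_items = ["a", "b", "c", "d"]
held_out = {"b", "d"}
```

Hit Rate@k for a single user is simply 1 if any relevant item appears in the top-k and 0 otherwise; all of these are then averaged over users.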

Enhancing Privacy of Course Recommendation Systems

A Privacy-Focused Matrix Factorization Approach

Master thesis (2025) - D. Šterns, Z. Erkin, M. Mansoury

Personalized course-recommendation systems can help students make better academic choices and improve learning outcomes. Matrix factorization (MF) is a well-established and effective approach for this task, producing accurate recommendations from historical student–course performance data. However, the deployment of MF-based recommenders is hindered by privacy and regulatory risks, particularly when sensitive student records are processed by third-party or centralized systems. In the privacy-preserving setting, MF models exhibit reduced accuracy: when combined with differential privacy, accuracy is fundamentally degraded by the added noise, while existing cryptography-based approaches omit bias terms, resulting in a measurable accuracy gap with their plaintext equivalents.
This thesis enhances a Homomorphic-Encryption-based recommendation protocol to support biased Matrix Factorization through two additions: data centering and vector augmentation. These modifications maintain the security guarantees of the original protocol under the semi-honest adversary model while enabling the model to incorporate user and item biases. Evaluated in the plaintext domain on the MovieLens-100k dataset, the enhanced model achieved a test RMSE of 0.9213, a notable improvement over the baseline's 0.9507, and reached the baseline’s best RMSE with only 15 training iterations instead of 145. Beyond accuracy and efficiency, separating bias terms from the student–course interaction extends the system from a simple grade predictor into a tool for academic discovery, allowing for recommendations that consider inherent compatibility, not solely predicted grades. Although demonstrated in a course-recommendation setting, the approach is applicable to any privacy-preserving recommender system, offering reduced computational costs and narrowing the accuracy gap with non-private methods. ...
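
The abstract names data centering and vector augmentation without spelling out the construction. The sketch below shows one common way these two ideas let a protocol that only evaluates plain dot products reproduce a biased prediction of the form mu + b_u + b_i + p_u·q_i; whether the thesis uses exactly this layout is an assumption, and all names here are hypothetical. It runs in plaintext only and involves no encryption.

```python
def dot(x, y):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def center(ratings):
    """Data centering: subtract the global mean so the model fits deviations from it."""
    mu = sum(ratings.values()) / len(ratings)
    return mu, {key: value - mu for key, value in ratings.items()}


def augment_user(p_u, b_u):
    """Append the user bias and a constant 1 to the user factor vector."""
    return list(p_u) + [b_u, 1.0]


def augment_item(q_i, b_i):
    """Append a constant 1 and the item bias to the item factor vector."""
    return list(q_i) + [1.0, b_i]


def predict_biased(mu, b_u, b_i, p_u, q_i):
    """Reference biased-MF prediction with explicit bias terms."""
    return mu + b_u + b_i + dot(p_u, q_i)


def predict_augmented(mu, p_aug, q_aug):
    """Same prediction using only a dot product of augmented vectors plus the mean."""
    return mu + dot(p_aug, q_aug)


# Toy values (hypothetical): two latent factors, one student, one course.
mu, b_u, b_i = 3.5, 0.3, -0.1
p_u, q_i = [0.5, -1.0], [2.0, 0.25]
reference = predict_biased(mu, b_u, b_i, p_u, q_i)
augmented = predict_augmented(mu, augment_user(p_u, b_u), augment_item(q_i, b_i))
```

In this layout the constant 1 entries must stay fixed during training so that the remaining augmented coordinates keep their meaning as bias terms.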

Fairness in Collaborative Filtering Recommender Systems

A Comparative Analysis of Trade-offs Across Model Architectures

Bachelor thesis (2025) - Jeeyoon Kang, Masoud Mansoury, Nergis Tömen

Recommender systems personalize content by predicting user preferences, but this often results in unequal treatment of users and items—for example, some users may receive lower-quality recommendations, while niche items remain underexposed. Although fairness-enhancing interventions exist, they can obscure the extent to which disparities stem from model architecture alone.
This study investigates how collaborative filtering architectures affect both accuracy and fairness. We evaluate six models, including two non-personalized baselines, across two public datasets using a unified pipeline without fairness-specific interventions.
Our results reveal a general trade-off: models with higher accuracy often exhibit greater fairness disparities, particularly on the user side. For example, LightGCN combines strong accuracy with relatively high item-side fairness, while SLIMElastic ranks high in accuracy but worsens unfairness. However, this trade-off is not uniform across datasets; NeuMF degrades notably on sparser data.
These findings demonstrate that model architecture alone can shape fairness–accuracy trade-offs, highlighting the importance of considering dataset characteristics and model design when selecting or developing recommender systems. ...

Fairness and Bias in Recommender Systems

Alleviating the unfairness issue with knowledge-aware recommendation models

Bachelor thesis (2025) - Y.Z. Popov, M. Mansoury, M. Mansoury

This study investigates fairness in knowledge-aware recommender systems by evaluating their performance across both accuracy and fairness metrics. Using the MovieLens 1M dataset, we compare general, knowledge-aware, and fairness-optimized models through a custom RecBole-based pipeline. Results indicate knowledge-aware models offer some fairness benefits without major accuracy loss, though no model excels universally. Adjusting loss component weights reveals complex trade-offs and component importance, underscoring the need for nuanced fairness optimization. ...

LLM-augmented counterfactual explanations

Improving faithfulness and user-preference alignment

Master thesis (2025) - A. Hasami, M. Mansoury, A. Hanjalic, M.S. Pera

Counterfactual explanations (CFEs) offer a tangible and actionable way to explain recommendations by showing users a "what-if" scenario that demonstrates how small changes in their history would alter the system’s output. However, existing CFE methods are susceptible to bias, generating explanations that might misalign with the user's actual preferences. In this thesis, we study ACCENT, a neural CFE framework, and analyze its behavior through the lens of popularity bias. We introduce two alignment metrics, popularity distribution similarity (PDS) and expected popularity deviation (EPD), and evaluate 736 users with strongly niche- or blockbuster-oriented histories on MovieLens 1M and Amazon Video Games. Analysis shows that ACCENT’s explanations are systematically misaligned with historical user popularity preference. To address this, we propose a pre-processing step that leverages large language models to identify and filter out-of-character history items before generating explanations. Compared to simple heuristics and embedding-based filters, LLM-based filtering yields counterfactuals that are more closely aligned with each user’s popularity preferences, while preserving explanation conciseness and fidelity. A comparison between 4B and 8B parameter models further reveals that larger LLMs provide more stable, instruction-following behavior and stronger alignment, at the cost of increased computational overhead. ...

Opening the Black Box: Interpretable Remedies for Popularity Bias in Recommender Systems

Master thesis (2025) - P. Ahmadov, M. Mansoury, A. Hanjalic, P.K. Murukannaiah

Popularity bias is a long-standing challenge in recommender systems, where a small set of highly popular items dominates recommendations, while the majority of less popular items are overlooked. This imbalance undermines fairness, decreases recommendation diversity, and negatively impacts users’ ability to discover novel or niche content. Although existing mitigation methods address this issue to some extent, they often lack transparency in how they operate - they fix the symptoms without exposing the internal mechanisms that generate the bias, which makes the effects of bias difficult to interpret or control. As modern recommendation models increasingly rely on deep learning based architectures, which are inherently hard to interpret, this opacity has become a fundamental limitation.

This thesis introduces PopSteer, a novel and interpretable post-processing strategy to analyze and mitigate popularity bias in deep recommender systems. PopSteer builds on a Sparse Autoencoder (SAE) that converts dense embeddings into a sparse feature space where individual neurons align with human-readable features. PopSteer consists of three stages: (i) Sparse Autoencoder training, (ii) neuron analysis through synthetic data, and (iii) neuron steering. In the training stage, an SAE is attached to the hidden representation of a pretrained model to generate a relatively disentangled feature space, where individual neurons correspond to features. In the neuron analysis stage, two synthetic user sets are passed through the SAE, one set favoring popular items and the other favoring unpopular items. Each neuron’s alignment with popularity is quantified from the difference in activation between the two sets using Cohen’s d. In the neuron steering stage, activations of the most biased neurons are adjusted.

Experimental results show that PopSteer consistently increases exposure fairness with only minor accuracy degradation compared to state-of-the-art baselines. Effects are stronger when synthetic sets are used in the neuron analysis stage, as opposed to real data, because they better isolate the bias by providing extreme preference patterns. Furthermore, its neuron-level analysis provides insights into how popularity bias emerges within model embeddings, and validates the interpretability of individual neurons. Sensitivity analysis demonstrates that the hyperparameters have a predictable but asymmetric effect on accuracy and fairness. The count of steered neurons matters less than selecting the right neurons, which keeps intervention focused and efficient. Results show that inference and training costs stay modest, indicating deployability. Overall, the results indicate that PopSteer provides an effective and interpretable way to reduce popularity bias in deep recommender systems, while keeping accuracy loss manageable.
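
The neuron analysis stage described above scores each SAE neuron by Cohen's d between two groups of activations. A minimal sketch of that scoring follows, assuming the pooled-standard-deviation form of Cohen's d and activations stored as one list per synthetic user; the data layout and function names are illustrative assumptions, and the steering rule itself is not shown because the abstract does not specify it.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        return 0.0  # no spread in either group; treat the neuron as uninformative
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def rank_neurons(popular_acts, unpopular_acts):
    """Score every neuron and sort by absolute effect size, largest first.

    popular_acts[u][j] is the activation of neuron j for synthetic user u
    in the popular-item set; unpopular_acts is laid out the same way.
    """
    n_neurons = len(popular_acts[0])
    scores = []
    for j in range(n_neurons):
        acts_popular = [row[j] for row in popular_acts]
        acts_unpopular = [row[j] for row in unpopular_acts]
        scores.append((j, cohens_d(acts_popular, acts_unpopular)))
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)


# Toy activations (hypothetical): neuron 0 separates the two sets, neuron 1 does not.
popular = [[1.0, 0.2], [3.0, 0.1], [5.0, 0.3]]
unpopular = [[0.0, 0.2], [2.0, 0.1], [4.0, 0.3]]
ranking = rank_neurons(popular, unpopular)
```

A positive score marks a neuron that fires more for the popular-item users and a negative score the opposite, so the sign indicates the direction in which a steering step would need to push.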