Bridging the Semantic-Collaborative Gap

Unified Item Quantization for LLM-based Generative Recommendation

Master Thesis (2026)
Author(s)

B. Lu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

M. Mansoury – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

A. Hanjalic – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S. Tan – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
28-05-2026
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Model (LLM)-based generative recommendation reformulates item retrieval as an autoregressive sequence generation problem, representing items through discrete semantic identifiers (SIDs) constructed by vector quantization of item embeddings. However, existing tokenization methods share a critical yet underexplored limitation, the semantic-collaborative gap: SIDs derived purely from item content fail to capture the latent user preference patterns encoded in historical interaction data, whereas purely collaborative identifiers lack semantic grounding and generalize poorly to sparse or cold-start scenarios.

To bridge this gap, we propose the Unified Q-Former (UQF), a novel pre-quantization fusion framework that explicitly integrates semantic and collaborative signals into a unified item representation before discretization. Inspired by the query-based multimodal alignment of BLIP-2, UQF combines three components: a set of learnable queries; parallel cross-attention over pre-trained item text embeddings and graph-based collaborative embeddings (obtained with LightGCN); and adaptive gated fusion, which dynamically extracts complementary information from both modalities. For robustness and structure preservation, the framework is optimized with a hybrid contrastive learning objective that incorporates both structural and semantic neighbors, coupled with asymmetric modality dropout.
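The query-based fusion described above can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not specify projection layers, the number of heads or queries, or the form of the gate, so the single-head attention without learned projections, the per-query scalar sigmoid gate, and every name below (`cross_attend`, `gated_fusion`, `gate_w`, `gate_b`) are assumptions made for illustration. The queries and embeddings are random stand-ins for learned parameters and for the pre-trained text and LightGCN embeddings.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        attended.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return attended


def gated_fusion(sem, col, gate_w, gate_b):
    """Assumed gate: g = sigmoid(w . [sem; col] + b), output g*sem + (1-g)*col."""
    fused = []
    for s, c in zip(sem, col):
        g = 1.0 / (1.0 + math.exp(-(dot(gate_w, s + c) + gate_b)))
        fused.append([g * a + (1.0 - g) * b for a, b in zip(s, c)])
    return fused


def random_vectors(count, dim):
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(count)]


random.seed(0)
dim, n_queries = 4, 2
queries = random_vectors(n_queries, dim)   # learnable in practice, random here
text_emb = random_vectors(5, dim)          # stand-in for item text embeddings
collab_emb = random_vectors(3, dim)        # stand-in for LightGCN embeddings
gate_w = [random.gauss(0, 1) for _ in range(2 * dim)]

# The same queries attend to each modality in parallel, then the two views are gated.
sem = cross_attend(queries, text_emb)
col = cross_attend(queries, collab_emb)
unified = gated_fusion(sem, col, gate_w, gate_b=0.0)
print(len(unified), len(unified[0]))  # 2 4
```

Because the gate is a convex combination, each fused coordinate lies between its semantic and collaborative counterparts; a learned gate can therefore lean on content for sparse items and on interaction structure for well-connected ones.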

The resulting unified representations are quantized into discrete SIDs with a residual-quantized variational autoencoder (RQ-VAE) and used as the generation targets of a downstream LLM recommender. Extensive experiments on two real-world Amazon Review datasets (Office Products and Musical Instruments) demonstrate that UQF consistently improves state-of-the-art LC-Rec and TIGER-style generative recommendation backbones. Our framework outperforms strong traditional, sequential, and recent unified generative baselines, and yields highly interpretable, hierarchical SID structures with significantly improved semantic and collaborative consistency.
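The residual quantization step that turns a continuous representation into a hierarchical SID can be sketched as follows. This covers only the codebook lookup: a full RQ-VAE also trains an encoder, a decoder, and the codebooks themselves, none of which is shown. The tiny fixed codebooks, the function names, and the `<a_1><b_1>` token format are hypothetical choices for illustration, not details taken from the thesis.

```python
def nearest(vec, codebook):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def residual_quantize(vec, codebooks):
    """Quantize vec level by level; each level encodes the previous residual."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        codes.append(idx)
        recon = [r + c for r, c in zip(recon, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, recon


def to_sid_tokens(codes):
    """Hypothetical token format: one prefix letter per quantization level."""
    return "".join(f"<{chr(ord('a') + level)}_{code}>" for level, code in enumerate(codes))


# Toy codebooks: level 1 is coarse, level 2 refines what level 1 missed.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
codes, recon = residual_quantize([1.2, 0.9], codebooks)
print(codes, recon, to_sid_tokens(codes))  # [1, 1] [1.25, 1.0] <a_1><b_1>
```

Items that share a first-level code share a coarse region of the representation space, which is what gives the identifiers their hierarchical structure: a prefix match on the SID corresponds to proximity in the unified embedding.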

Files

Full-thesis.pdf
(pdf | 20.7 MB)
License info not available