Understanding Memorization in Large Language Models

What controls memorization rate? From entropy to conditional entropy or conditioning structure

Master Thesis (2026)

Author(s)

R. Alvarez Lucendo (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Kubilay Atasu – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J.C. van Gemert – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Jérémie Decouchant – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Madhur Panwar – Mentor (École Polytechnique Fédérale de Lausanne)

Faculty

Electrical Engineering, Mathematics and Computer Science

Transformers, Memorization, Conditional Entropy

To reference this document use

https://resolver.tudelft.nl/uuid:547c7533-be8c-400e-a218-33b7b023eda2

More Info

Publication Year

2026

Language

English

Graduation Date

29-05-2026

Awarding Institution

Delft University of Technology

Programme

Computer Science

Abstract

Large language models (LLMs) can reproduce passages from their training data verbatim, raising privacy and copyright concerns. Prior work attributes memorization to factors such as model size, sequence entropy, context length, and repetition, but these findings lack a unified explanation. This thesis proposes a disambiguation complexity framework: memorization speed is governed not by the information content of a sequence, but by the difficulty of identifying it, specifically by the complexity of the minimal conditioning structure the model must extract from context to uniquely determine the correct continuation.

We demonstrate a counterintuitive regime in which random token sequences are memorized faster than structured natural language, contradicting standard explanations. We formalize a hierarchy of conditioning levels and introduce K-arity, a scalar complexity measure that counts the number of prefix tokens jointly required to make a continuation deterministic. Through controlled experiments on synthetic datasets, we show that conditioning level and K-arity predict memorization behavior. Attention analysis reveals that disambiguating cues are most clearly visible in early attention patterns. Natural language experiments show that, in text rich with redundant linguistic cues, isolated manipulations of conditioning complexity do not produce detectable differences, highlighting the gap between synthetic and naturalistic settings. The disambiguation complexity principle thus brings input representation, entropy, identifying tokens, and context length under a common theoretical lens.
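To make the K-arity idea concrete, the sketch below computes one possible reading of it on a toy dataset. The thesis's formal definition is not reproduced on this page, so everything here is an assumption for illustration: the function name `k_arity`, the representation of the data as `(prefix, continuation)` pairs with equal-length prefixes, and the interpretation of "jointly required" as the smallest set of prefix positions whose tokens, taken together, map to exactly one continuation across the dataset.

```python
from itertools import combinations


def k_arity(examples):
    """Hypothetical K-arity: the smallest number of prefix positions whose
    tokens jointly determine the continuation across all examples.

    examples: list of (prefix, continuation) pairs; every prefix is a
    sequence of the same length. Returns None if even the full prefix
    leaves the continuation ambiguous.
    """
    prefix_len = len(examples[0][0])
    # Try subset sizes from smallest to largest, so the first hit is minimal.
    for k in range(prefix_len + 1):
        for positions in combinations(range(prefix_len), k):
            seen = {}  # tokens at the chosen positions -> continuation
            deterministic = True
            for prefix, continuation in examples:
                key = tuple(prefix[i] for i in positions)
                # setdefault stores the first continuation seen for this key;
                # a later, different continuation means the key is ambiguous.
                if seen.setdefault(key, continuation) != continuation:
                    deterministic = False
                    break
            if deterministic:
                return k
    return None


# Continuation depends on the first token only: one cue suffices.
single_cue = [(("a", "x"), "A"), (("a", "y"), "A"),
              (("b", "x"), "B"), (("b", "y"), "B")]

# XOR-like data: neither token alone is enough, both are needed jointly.
xor_like = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print(k_arity(single_cue))  # 1
print(k_arity(xor_like))    # 2
```

Under this reading, the XOR-like dataset has higher K-arity than the single-cue dataset even though both have four equally likely prefixes, which is the kind of distinction between information content and identification difficulty that the abstract describes. The exhaustive subset search is exponential in prefix length and is only meant for small synthetic examples.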

Files

Llms-memorization.pdf

(pdf | 1.11 MB)

License info not available