Automated Attention Pattern Discovery at Scale in Large Language Models
Jonathan Katzy (TU Delft - Software Engineering)
Razvan Popescu (TU Delft - Software Engineering)
Erik Mekkes (Student TU Delft)
Arie van Deursen (TU Delft - Software Engineering)
Maliheh Izadi (TU Delft - Software Engineering)
Abstract
Large language models owe their success to scaling their capabilities to general settings. Unfortunately, the same cannot be said for their interpretability methods. The current trend in mechanistic interpretability is to provide precise explanations of specific behaviors in controlled settings; these explanations often do not generalize to other settings, or are too resource-intensive for larger studies. In this work we propose to study repeated behaviors in large language models by mining completion scenarios from Java code datasets, exploiting the structured nature of source code. We then collect the attention patterns generated in the attention heads to demonstrate that they are scalable signals for global interpretability of model components. We show that vision models offer a promising direction for analyzing attention patterns at scale. To demonstrate this, we introduce the Attention Pattern – Masked Autoencoder (AP-MAE), a vision transformer-based model that efficiently reconstructs masked attention patterns. Experiments on StarCoder2 models (3B–15B) show that AP-MAE (i) reconstructs masked attention patterns with high accuracy, (ii) generalizes to unseen models with minimal degradation, (iii) reveals recurring patterns across a large number of inferences, (iv) predicts whether a generation will be correct without access to ground truth, with accuracies ranging from 55% to 70% depending on the task, and (v) enables targeted interventions that increase accuracy by 13.6% when applied selectively, but cause rapid collapse when applied excessively. These results establish attention patterns as a scalable signal for interpretability and demonstrate that AP-MAE provides a transferable foundation for both analysis and intervention in large language models. Beyond its standalone value, AP-MAE can also serve as a selection procedure to guide more fine-grained mechanistic approaches toward the most relevant components.
We release code and models to support future work in large-scale interpretability.
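To make the masked-autoencoder setup described in the abstract concrete, the sketch below shows the patch-masking step an MAE-style model applies to an attention pattern treated as an image: the seq_len × seq_len map is split into square patches and a large fraction of patches is hidden, leaving only the visible patches for the encoder to reconstruct from. This is a minimal illustration under assumed names and hyperparameters (patch size 4, masking ratio 0.75), not the released AP-MAE implementation.

```python
import numpy as np

def patchify(attn, patch=4):
    """Split an attention map (seq_len x seq_len) into flattened square patches."""
    n = attn.shape[0] // patch
    patches = attn[: n * patch, : n * patch].reshape(n, patch, n, patch)
    return patches.transpose(0, 2, 1, 3).reshape(n * n, patch * patch)

def mask_patches(patches, ratio=0.75, rng=None):
    """Randomly hide a fraction of patches, as in a masked autoencoder."""
    rng = rng or np.random.default_rng(0)
    num = patches.shape[0]
    idx = rng.permutation(num)
    keep = idx[: int(num * (1 - ratio))]  # indices of the visible patches
    return patches[keep], keep

# Toy attention pattern: causal (lower-triangular) uniform attention.
seq = 16
attn = np.tril(np.ones((seq, seq)))
attn /= attn.sum(axis=-1, keepdims=True)

patches = patchify(attn, patch=4)      # 16 patches of 16 values each
visible, keep = mask_patches(patches)  # at 75% masking, 4 patches stay visible
```

In an MAE, only the visible patches are encoded; a lightweight decoder then reconstructs the full map, so reconstruction error over the hidden patches provides the training signal.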