Using Sparse Transformers as World Models to Improve Generalization
A. Ebersberger (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J.W. Böhmer – Mentor (TU Delft - Sequential Decision Making)
Pradeep K. Murukannaiah – Graduation committee member (TU Delft - Interactive Intelligence)
Abstract
This thesis introduces a novel sparsity-regularized transformer for use as a world model in model-based reinforcement learning, specifically targeting environments with sparse interactions. Sparse-interactive environments are a class of environments in which the state can be decomposed into meaningful components and state transitions depend primarily on a small subset of those components. Traditional neural networks often struggle to generalize in such environments because they consider all possible interactions between state components, leading to overfitting and poor sample efficiency. We formally define sparse-interactive environments and propose a simple yet effective modification to the standard transformer architecture that promotes sparsity in the attention mechanism through L1 regularization and thresholding. Through extensive experiments on the Minigrid environment, we demonstrate that our sparsity-regularized transformer achieves higher validation transition accuracy and lower variance across random initializations than an unmodified transformer baseline, particularly in low-data regimes.
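To make the idea concrete, the sketch below shows one plausible way to combine L1 regularization with thresholding in scaled dot-product attention. This is an illustrative PyTorch example only, not the thesis's actual implementation; the function name, the penalty weight `l1_weight`, and the cutoff `threshold` are hypothetical values chosen for demonstration.

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, l1_weight=1e-3, threshold=0.05):
    """Illustrative sparsity-regularized attention (assumed interface, not the thesis code)."""
    # Standard scaled dot-product attention scores.
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    attn = F.softmax(scores, dim=-1)

    # L1 penalty on the attention weights pushes most of them toward zero;
    # this term would be added to the world model's transition loss during training.
    l1_penalty = l1_weight * attn.sum()

    # Thresholding: zero out weak attention links and renormalize, so each
    # predicted state component attends to only a few input components.
    attn = attn * (attn >= threshold).float()
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-9)

    return attn @ v, l1_penalty
```

In training, the returned `l1_penalty` would be added to the prediction loss, while the thresholded attention keeps the learned transition model focused on a small subset of state components, mirroring the sparse-interaction assumption described in the abstract.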