Evaluating Adaptive Activation Functions in Language Models

Does the choice of activation function matter in smaller Language Models?

Bachelor Thesis (2024)
Author(s)

F. Ignijic (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

M. Izadi – Mentor (TU Delft - Software Engineering)

A. van Deursen – Mentor (TU Delft - Software Engineering)

Aral de Moor – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Thomas Abeel – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

Publication Year
2024
Language
English
Graduation Date
27-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The rapid expansion of large language models (LLMs) driven by the transformer architecture has raised concerns about the lack of high-quality training data. This study investigates the role of activation functions in smaller-scale language models, specifically those with approximately 10M parameters, to ensure sustained progress in LLM development despite data limitations. Activation functions, crucial for neural network performance, have evolved significantly, but comprehensive comparisons under consistent conditions remain scarce, especially for models with smaller parameter counts. This research systematically evaluates traditional and novel activation functions, including learnable variants, and introduces the Kolmogorov-Arnold Network (KAN) to language modeling. Using Hugging Face implementations of GPT-Neo and RoBERTa models, performance impacts were assessed through the BabyLM evaluation pipeline. The results indicate that activation functions do not significantly impact the performance of these models. Additionally, the KAN-based model underperformed compared to models with traditional architectures in the context of this study. These findings suggest that optimizing activation functions may not be crucial for smaller language models, emphasizing the need for further research to explore other architectural improvements.
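The "learnable variants" mentioned in the abstract can be illustrated with a minimal sketch (an assumption for illustration, not the thesis's actual implementation): a Swish-style activation x · sigmoid(βx) in which the shape parameter β would be trained alongside the network's weights.

```python
import math

def swish(x: float, beta: float = 1.0) -> float:
    # Swish-style activation: x * sigmoid(beta * x).
    # Treating beta as a trainable parameter makes the activation adaptive:
    # beta = 0 gives the linear function x / 2, beta = 1 recovers SiLU,
    # and large beta approaches ReLU.
    return x / (1.0 + math.exp(-beta * x))

# The activation interpolates between linear and ReLU-like behaviour:
print(swish(2.0, beta=0.0))   # linear regime: x / 2
print(swish(2.0, beta=1.0))   # standard SiLU
print(swish(2.0, beta=50.0))  # close to ReLU: ~x for positive x
```

In a learnable setting, β receives a gradient like any other weight, letting each layer tune how sharply its activation gates negative inputs.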
