Embedding-Based Multi-Paradigm Protein Function Prediction in Prokaryotes
M. Hielkema (TU Delft - Electrical Engineering, Mathematics and Computer Science)
T.E.P.M.F. Abeel – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Abstract
In the past decade, protein function prediction has shifted dramatically towards the use of large language models (LLMs). In this research, we set out to improve upon SAFPred, a model for prokaryotic protein function prediction that combines LLM-embedding-based sequence homology prediction with a synteny-aware component. With a new technique we refer to as stopping layers, we show that layers can be removed from pre-trained LLMs without sacrificing performance, ultimately reducing runtime by 70% and the GPU VRAM required to store the LLM by 72%. Furthermore, we show that the training dataset can be pruned to prokaryotic proteins only without any performance impact, reducing the training set by 78%. Additionally, across our evaluated models, we show that restricting the number of training matches per query protein as much as possible benefits model performance.
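The core of the stopping-layer idea can be illustrated in miniature: rather than running a pre-trained LLM through all of its transformer layers, inference halts at an earlier layer and that intermediate hidden state serves as the protein embedding, so the later layers need never be executed or held in GPU memory. The sketch below uses toy stand-in layers and illustrative names (`embed`, `stop_at`); it is not the thesis's actual implementation, only a minimal demonstration of the control flow.

```python
# Minimal sketch of a "stopping layer": halt the forward pass after
# `stop_at` layers and use the intermediate state as the embedding.
# All names here are illustrative, not from the original work.

def embed(state, layers, stop_at=None):
    """Run `state` through `layers`, halting after `stop_at` layers."""
    h = state
    for i, layer in enumerate(layers):
        if stop_at is not None and i >= stop_at:
            break  # layers beyond stop_at are never executed (or loaded)
        h = layer(h)
    return h

# Toy "layers": each appends its index, standing in for transformer blocks.
layers = [lambda h, i=i: h + [i] for i in range(12)]

full = embed([], layers)               # passes through all 12 layers
early = embed([], layers, stop_at=4)   # stops after the first 4 layers
```

In a real setting the same principle lets the unused layers be dropped from the checkpoint before loading, which is where the runtime and VRAM savings reported above come from.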