Embedding-Based Multi-Paradigm Protein Function Prediction in Prokaryotes
M. Hielkema (TU Delft - Electrical Engineering, Mathematics and Computer Science)
T.E.P.M.F. Abeel – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Abstract
In the past decade, protein function prediction has shifted dramatically towards the use of large language models (LLMs). In this research, we set out to improve upon SAFPred, a model for prokaryotic protein function prediction that combines LLM-embedding-based sequence homology prediction with a synteny-aware component. With a new technique we refer to as stopping layers, we show that layers can be removed from pre-trained LLMs without sacrificing performance, ultimately reducing runtime by 70% and the GPU VRAM required to store the LLM by 72%. Furthermore, we show that the training dataset can be pruned to prokaryotic proteins only without any performance impact, reducing the training set by 78%. Additionally, across our evaluated models, we show that restricting the number of training matches per query protein as much as possible benefits model performance.
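The core of the stopping-layer idea can be illustrated in miniature: rather than running a pre-trained LLM through all of its transformer layers, inference halts at an earlier layer and that intermediate hidden state serves as the protein embedding, so the later layers need never be executed or held in GPU memory. The sketch below uses toy stand-in layers and illustrative names (`embed`, `stop_at`); it is not the thesis's actual implementation, only a minimal demonstration of the control flow.

```python
# Minimal sketch of a "stopping layer": halt the forward pass after
# `stop_at` layers and use the intermediate state as the embedding.
# All names here are illustrative, not from the original work.

def embed(state, layers, stop_at=None):
    """Run `state` through `layers`, halting after `stop_at` layers."""
    h = state
    for i, layer in enumerate(layers):
        if stop_at is not None and i >= stop_at:
            break  # layers beyond stop_at are never executed (or loaded)
        h = layer(h)
    return h

# Toy "layers": each appends its index, standing in for transformer blocks.
layers = [lambda h, i=i: h + [i] for i in range(12)]

full = embed([], layers)               # passes through all 12 layers
early = embed([], layers, stop_at=4)   # stops after the first 4 layers
```

In a real setting the same principle lets the unused layers be dropped from the checkpoint before loading, which is where the runtime and VRAM savings reported above come from.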