In the past decade, protein function prediction has shifted dramatically towards the use of large language models (LLMs). In this research, we set out to improve upon SAFPred, a model for prokaryotic protein function prediction that combines LLM-embedding-based sequence homology with a synteny-aware component. With a new technique we refer to as stopping layers, we demonstrate that layers can be removed from pre-trained LLMs without sacrificing performance, ultimately allowing us to reduce runtime by 70% and the GPU VRAM required to store the LLM by 72%. Furthermore, we show that the training dataset can be pruned to prokaryotic proteins only, reducing it by 78% without any impact on performance. Finally, across the models we evaluated, restricting the number of training matches per query protein as much as possible is beneficial to model performance.
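To make the stopping-layers idea concrete, the sketch below shows how per-protein embeddings could be taken from an intermediate layer of a truncated protein LLM. It is only an illustration: it assumes ESM-2 (650M) via the fair-esm package, an arbitrary stopping layer of 12, and mean pooling; the embedding model, layer choice, and pooling actually used in this work may differ.

import torch
import esm

# Hypothetical stopping layer: keep only the first 12 of 33 transformer blocks.
STOP_LAYER = 12

# Assumption: ESM-2 650M as the protein language model; substitute the
# embedding model actually used if it differs.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()

# Drop every transformer block past the stopping layer, so those blocks are
# neither stored on the GPU nor executed -- this is where the runtime and
# VRAM savings come from.
model.layers = model.layers[:STOP_LAYER]

batch_converter = alphabet.get_batch_converter()
data = [("query_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # toy sequence
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[STOP_LAYER])

# Mean-pool the residue representations (dropping the BOS/EOS tokens) into a
# single per-protein embedding for the downstream homology search.
residue_reprs = out["representations"][STOP_LAYER][0, 1:-1]
protein_embedding = residue_reprs.mean(dim=0)
print(protein_embedding.shape)  # torch.Size([1280])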