The Power of Universal Contextualized Protein Embeddings in Cross-species Protein Function Prediction

Journal Article (2021)

Author(s)

Irene van den Bent (Student TU Delft)

Stavros Makrodimitris (TU Delft - Pattern Recognition and Bioinformatics)

Marcel Reinders (TU Delft - Pattern Recognition and Bioinformatics)

Research Group

Pattern Recognition and Bioinformatics

DOI related publication

https://doi.org/10.1177/11769343211062608

Keywords

Transfer learning; Annotating evolutionary distant proteins; Protein embedding; Protein function prediction; Protein language models

To reference this document use:

https://resolver.tudelft.nl/uuid:c76cd811-d4d9-4d02-af8c-cfe850394462

Publication Year

2021

Language

English

Volume number

17

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Computationally annotating proteins with a molecular function is a difficult problem that is made even harder due to the limited amount of available labeled protein training data. Unsupervised protein embeddings partly circumvent this limitation by learning a universal protein representation from many unlabeled sequences. Such embeddings incorporate contextual information of amino acids, thereby modeling the underlying principles of protein sequences insensitive to the context of species. We used an existing pre-trained protein embedding method and subjected its molecular function prediction performance to detailed characterization, first to advance the understanding of protein language models, and second to determine areas of improvement. Then, we applied the model in a transfer learning task by training a function predictor based on the embeddings of annotated protein sequences of one training species and making predictions on the proteins of several test species with varying evolutionary distance. We show that this approach successfully generalizes knowledge about protein function from one eukaryotic species to various other species, outperforming both an alignment-based and a supervised-learning-based baseline. This implies that such a method could be effective for molecular function prediction in inadequately annotated species from understudied taxonomic kingdoms.
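
The cross-species transfer setup described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the embeddings are synthetic random vectors rather than outputs of a pre-trained protein language model, the function predictor is a simple one-vs-rest logistic regression (the abstract only says "a function predictor"), and the names `make_species`, `train_logistic`, and `predict_proba` are hypothetical. It shows only the shape of the workflow: fit on the embeddings of one species, then score proteins of another.

```python
import numpy as np


def make_species(rng, n_proteins, prototypes, shift):
    """Create synthetic (embedding, label) pairs for one hypothetical species.

    Assumption: each function term has a prototype direction in embedding
    space, and a species-specific constant shift mimics evolutionary distance.
    Real embeddings would come from a pre-trained protein language model.
    """
    n_terms, dim = prototypes.shape
    labels = (rng.random((n_proteins, n_terms)) < 0.3).astype(float)
    noise = 0.5 * rng.standard_normal((n_proteins, dim))
    embeddings = labels @ prototypes + noise + shift
    return embeddings, labels


def predict_proba(X, W, b):
    """Per-term probabilities from a linear model with a sigmoid output."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def train_logistic(X, Y, lr=0.1, epochs=500, l2=1e-3):
    """Fit one-vs-rest logistic regression by batch gradient descent."""
    n, dim = X.shape
    W = np.zeros((dim, Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        residual = predict_proba(X, W, b) - Y  # gradient of the log loss
        W -= lr * (X.T @ residual / n + l2 * W)
        b -= lr * residual.mean(axis=0)
    return W, b


rng = np.random.default_rng(0)
n_terms, dim = 5, 32
prototypes = rng.standard_normal((n_terms, dim))

# Training species (annotated) and a test species with shifted embeddings.
X_train, Y_train = make_species(rng, 400, prototypes, shift=0.0)
X_test, Y_test = make_species(rng, 200, prototypes, shift=0.2)

W, b = train_logistic(X_train, Y_train)
scores = predict_proba(X_test, W, b)
accuracy = ((scores > 0.5) == (Y_test > 0.5)).mean()
print(f"per-term accuracy on the held-out species: {accuracy:.3f}")
```

The embeddings stay fixed and only the lightweight predictor is trained, which is why a single annotated species can suffice as training data in this kind of setup. The accuracy printed here reflects the synthetic data only and says nothing about the results reported in the article.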

Files

11769343211062608.pdf

(pdf | 5.14 Mb)