Evaluation of natural language processing embeddings in protein function prediction for bacteria

Bachelor thesis (2022)

Authors

B.M. Cosma Electrical Engineering, Mathematics and Computer Science

Contributors

A. Urhan Pattern Recognition and Bioinformatics - (supervisor 1)

A. Manson McGuire Broad Institute of MIT and Harvard - (supervisor 1)

T.E.P.M.F. Abeel Pattern Recognition and Bioinformatics - (supervisor 1)

S.E. Verwer Cyber Security - (supervisor 2)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:98589d38-e25c-47c2-b5c2-4be46b1b9a1d

Published Date

22-06-2022

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Motivation: Automated protein function prediction models are essential for closing the gap between the large amount of available protein sequence data and the small fraction of it that is validly annotated. Recent approaches to function prediction rely on unsupervised deep learning models, which represent protein sequences as real-valued embeddings that can serve as input to a machine-learning model. This study evaluates embedding models for protein function prediction in bacteria, organisms that are less commonly included in benchmarks of this kind. To this end, we generated embeddings with four recently developed embedding models and predicted protein function using a nearest-neighbor search in the embedding space. We evaluated these predictors on two query sets, containing proteins from Gram-positive B. subtilis and Gram-negative E. coli.
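As an illustration of the nearest-neighbor idea described above, the following is a minimal sketch and not the thesis implementation. It assumes cosine similarity as the distance measure, a configurable number of neighbors k, and a score equal to the highest similarity among the neighbors that carry a term; the function name, the three-dimensional toy embeddings, and the GO term assignments are all hypothetical.

```python
import numpy as np


def nearest_neighbor_annotations(query_emb, ref_embs, ref_terms, k=1):
    """Transfer GO terms from the k most similar reference proteins.

    Assumption: similarity is cosine similarity, and a term's score is the
    highest similarity among the selected neighbors annotated with it.
    """
    # Normalize so that a dot product equals cosine similarity.
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb)
    sims = ref @ query

    # Indices of the k reference proteins closest to the query.
    nearest = np.argsort(-sims)[:k]

    scores = {}
    for idx in nearest:
        for term in ref_terms[idx]:
            scores[term] = max(scores.get(term, 0.0), float(sims[idx]))
    return scores


# Toy data (hypothetical): three annotated reference proteins and one query.
ref_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
])
ref_terms = [
    {"GO:0003677"},
    {"GO:0016787"},
    {"GO:0003677", "GO:0005524"},
]
query_emb = np.array([0.9, 0.1, 0.0])

predicted_k1 = nearest_neighbor_annotations(query_emb, ref_embs, ref_terms, k=1)
predicted_k2 = nearest_neighbor_annotations(query_emb, ref_embs, ref_terms, k=2)
```

With k=1 the query inherits only the terms of its single closest reference protein; a larger k widens the set of candidate terms at the cost of lower-confidence transfers.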

Results: Our nearest-neighbor models outperformed BLAST sequence-based protein function annotation, according to the evaluation procedure outlined in the CAFA challenges. The results were also comparable to, and at times better than, DeepGOPlus predictions, highlighting the potential of embedding-based predictors as state-of-the-art models. On the B. subtilis dataset, our nearest-neighbor model using ESM1b embeddings scored an Fmax of 0.6 in molecular function predictions and was able to predict GO terms with a high information content. Hence, unsupervised embedding models were shown to encode information about a protein sequence that is useful for function prediction.
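For readers unfamiliar with the Fmax score reported above, the sketch below shows the protein-centric definition used in the CAFA challenges: at each score threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all benchmark proteins, and Fmax is the best resulting F-score. This is a simplified illustration, not the evaluation code used in the thesis; the function name, the threshold grid, and the toy proteins and terms are hypothetical, and propagation of terms through the GO hierarchy is omitted.

```python
def fmax(predictions, truth, thresholds=None):
    """Protein-centric Fmax over a grid of score thresholds.

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: non-empty set of true terms}
    """
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision counts only proteins with a prediction at this threshold.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over every benchmark protein.
            recall_sum += hits / len(true_terms)

        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example (hypothetical proteins, terms, and scores).
truth = {"P1": {"A", "B"}, "P2": {"C"}}
predictions = {
    "P1": {"A": 0.9, "D": 0.5, "B": 0.25},
    "P2": {"C": 0.8, "A": 0.2},
}
score = fmax(predictions, truth)
```

Because the maximum is taken over thresholds, Fmax rewards a predictor for having some operating point with a good balance of precision and recall, rather than for its behavior at a single fixed cutoff.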

Availability: The scripts used in this project are available on GitHub.

Files

Final_thesis.pdf

(.pdf | 1.13 Mb)