Evaluation of natural language processing embeddings in protein function prediction for bacteria

Cosma, Bianca

Title

Evaluation of natural language processing embeddings in protein function prediction for bacteria

Author

Cosma, Bianca (TU Delft Electrical Engineering, Mathematics and Computer Science)

Contributor

Urhan, A. (mentor)
Manson McGuire, Abigail L. (mentor)
Abeel, T.E.P.M.F. (mentor)
Verwer, S.E. (graduation committee)

Degree granting institution

Delft University of Technology

Programme

Computer Science and Engineering

Project

CSE3000 Research Project

Date

2022-06-22

Abstract

Motivation: Automated protein function prediction models are essential for closing the gap between the large amount of available protein sequence data and the fraction of it that is validly annotated. Recent approaches to function prediction rely on unsupervised deep learning models that represent protein sequences as real-valued embeddings, which can then serve as input to a machine-learning model. This study evaluates embedding models for protein function prediction in bacteria, organisms that are less commonly included in such benchmarks. To this end, we generated embeddings with four recently developed embedding models and predicted protein function using a nearest-neighbor search in the embedding space. We evaluated these predictors on two query sets, containing proteins from gram-positive B. subtilis and gram-negative E. coli.
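As a rough illustration of nearest-neighbor function prediction in an embedding space, the sketch below transfers the GO terms of the single most similar reference protein to a query protein. It is not taken from the thesis scripts: the use of cosine similarity, the choice of one neighbor, the function names, and the two-dimensional toy embeddings with their GO term assignments are all assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_terms(query_embedding, reference):
    """Transfer the GO terms of the most similar reference protein.

    reference: list of (embedding, set of GO terms) pairs.
    Returns {GO term: score}, where the score is the similarity
    to the nearest neighbor (an assumed scoring scheme).
    """
    scored = [
        (cosine_similarity(query_embedding, embedding), terms)
        for embedding, terms in reference
    ]
    best_score, best_terms = max(scored, key=lambda pair: pair[0])
    return {term: best_score for term in best_terms}


# Toy reference set: hypothetical embeddings and annotations.
reference_set = [
    ([1.0, 0.0], {"GO:0003824"}),
    ([0.0, 1.0], {"GO:0005215"}),
]
query = [0.9, 0.1]
predicted = predict_terms(query, reference_set)
print(predicted)
```

Real protein embeddings have hundreds or thousands of dimensions, and a practical search would typically use a vectorized or indexed implementation instead of this linear scan.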

Results: Our nearest-neighbor models outperformed BLAST sequence-based protein function annotation under the evaluation procedure outlined in the CAFA challenges. The results were also comparable to, and at times better than, DeepGOPlus predictions, highlighting the potential of embedding-based predictions as state-of-the-art models. On the B. subtilis dataset, our nearest-neighbor model based on ESM1b embeddings achieved an F_max of 0.6 in molecular function predictions and was able to predict GO terms with high information content. Hence, unsupervised embedding models were shown to encode information about a protein sequence that is useful for function prediction.
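For readers unfamiliar with the F_max metric used in the CAFA challenges, the following is a minimal sketch of a protein-centric computation: precision is averaged over proteins with at least one prediction at or above a score threshold, recall is averaged over all evaluated proteins, and the maximum F-measure over thresholds is reported. The function name, the threshold grid, and the toy data are assumptions for illustration. The sketch also omits propagation of terms to their ancestors in the GO hierarchy, and it assumes every protein in the ground truth has at least one term.

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric F_max.

    predictions: {protein: {term: score in [0, 1]}}
    truth: {protein: non-empty set of true terms}
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & true_terms)
            # Precision only counts proteins with a prediction at this threshold.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over all proteins in the ground truth.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy data with placeholder term names.
toy_truth = {"P1": {"A", "B"}, "P2": {"C"}}
toy_predictions = {"P1": {"A": 0.9, "D": 0.4}, "P2": {"C": 0.6}}
toy_score = f_max(toy_predictions, toy_truth)
print(round(toy_score, 3))
```

In this toy case the best trade-off occurs at thresholds between 0.4 and 0.6, where the low-scoring incorrect term is filtered out while both proteins still receive a prediction.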

Availability: The scripts used in this project are available on GitHub.

Subject

Natural Language Processing
Representation Learning
Protein Function Prediction

To reference this document use:

http://resolver.tudelft.nl/uuid:98589d38-e25c-47c2-b5c2-4be46b1b9a1d

Part of collection

Student theses

Document type

bachelor thesis

Files

PDF

final_thesis.pdf

1.13 MB
