Title: Evaluation of natural language processing embeddings in protein function prediction for bacteria
Author: Cosma, Bianca (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Urhan, A. (mentor); Manson McGuire, Abigail L. (mentor); Abeel, T.E.P.M.F. (mentor); Verwer, S.E. (graduation committee)
Degree granting institution: Delft University of Technology
Programme: Computer Science and Engineering
Project: CSE3000 Research Project
Date: 2022-06-22

Abstract

Motivation: The development of automated protein function prediction models is essential in closing the gap between the large amount of protein sequence data available and the fraction of validly annotated data. Recent approaches to function prediction rely on unsupervised deep learning models, through which protein sequences are represented as real-valued embeddings that can be used as input to a machine-learning model. This study aims to evaluate embedding models in the context of protein function prediction on bacteria, organisms that are less commonly included in these types of benchmarks. To this end, we generated embeddings with four recently developed embedding models and predicted protein function using a nearest-neighbor search in the embedding space. We evaluated these predictors on two query sets, with proteins from Gram-positive B. subtilis and Gram-negative E. coli.

Results: Our nearest-neighbor models outperformed BLAST sequence-based protein function annotation, according to the evaluation procedure outlined in the CAFA challenges. The results were also shown to be comparable to, and at times better than, DeepGOPlus predictions, highlighting the potential of embedding-based predictions as state-of-the-art models. On the B. subtilis dataset, our nearest-neighbor model from ESM1b embeddings scored an Fmax of 0.6 in molecular function predictions and was able to predict GO terms with a high information content. Hence, unsupervised embedding models were shown to encode information about a protein sequence that is useful in the task of function prediction.

Availability: The scripts used in this project are available on GitHub.

Subject: Natural Language Processing; Representation Learning; Protein Function Prediction
To reference this document use: http://resolver.tudelft.nl/uuid:98589d38-e25c-47c2-b5c2-4be46b1b9a1d
Part of collection: Student theses
Document type: bachelor thesis
Rights: © 2022 Bianca Cosma
Files: final_thesis.pdf (PDF, 1.13 MB)
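
The nearest-neighbor prediction and Fmax evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis code: the cosine similarity measure, the rule that a GO term's score is the highest similarity among the neighbors carrying it, the default k, the threshold grid, and the function names predict_go_terms and f_max are all assumptions made for illustration. The two-dimensional embeddings and their annotations are toy values, and the sketch leaves out steps a full CAFA-style evaluation would involve, such as propagating terms through the GO hierarchy.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_go_terms(query, reference, k=1):
    """Transfer GO terms from the k reference proteins most similar to `query`.

    `reference` is a list of (embedding, set_of_go_terms) pairs.
    Assumption: a term's score is the highest similarity among the
    neighbors annotated with it.
    """
    ranked = sorted(
        ((cosine_similarity(query, emb), terms) for emb, terms in reference),
        key=lambda pair: pair[0],
        reverse=True,
    )
    scores = {}
    for similarity, terms in ranked[:k]:
        for term in terms:
            scores[term] = max(scores.get(term, 0.0), similarity)
    return scores


def f_max(predictions, truths, steps=100):
    """Protein-centric Fmax over score thresholds 1/steps, 2/steps, ..., 1.

    Precision is averaged over proteins with at least one prediction at the
    threshold; recall is averaged over all proteins. Each entry of `truths`
    must be a non-empty set of GO terms.
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions, recalls = [], []
        for scores, true_terms in zip(predictions, truths):
            predicted = {t for t, s in scores.items() if s >= threshold}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(true_terms))
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy reference set: hypothetical 2-D embeddings with illustrative annotations.
reference = [
    ([1.0, 0.0], {"GO:0003677"}),
    ([0.0, 1.0], {"GO:0016787"}),
]
query_scores = predict_go_terms([0.9, 0.1], reference, k=1)
toy_fmax = f_max([query_scores], [{"GO:0003677"}])
```

In practice the embeddings would come from a pretrained protein language model and have hundreds or thousands of dimensions, so a vectorized or indexed nearest-neighbor search would replace the plain sort used here.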