Evaluation of natural language processing embeddings in protein function prediction for bacteria

Motivation: The development of automated protein function prediction models is essential for closing the gap between the large amount of available protein sequence data and the small fraction that is reliably annotated. Recent approaches to function prediction rely on unsupervised deep learning models, through which protein sequences are represented as real-valued embeddings that can serve as input to a machine-learning model. This study aims to evaluate embedding models in the context of protein function prediction for bacteria, organisms less commonly included in these types of benchmarks. To this end, we generated embeddings with four recently developed embedding models and predicted protein function using a nearest-neighbor search in the embedding space. We evaluated these predictors on two query sets, containing proteins from gram-positive B. subtilis and gram-negative E. coli.
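The nearest-neighbor transfer described above can be sketched as follows. This is a minimal illustration, not the authors' exact pipeline: the embedding matrix, annotation sets, and the choice of cosine similarity with score-by-similarity transfer are assumptions for the example.

```python
import numpy as np

# Hypothetical reference set: one embedding vector per annotated protein
# (e.g. from ESM1b), plus the GO terms of each reference protein.
rng = np.random.default_rng(0)
ref_emb = rng.random((100, 1280))
ref_emb /= np.linalg.norm(ref_emb, axis=1, keepdims=True)  # L2-normalize rows
ref_go = [{"GO:0003677"} for _ in range(100)]  # placeholder annotations

def predict_go(query_emb, k=1):
    """Transfer GO terms from the k nearest reference proteins in embedding space.

    Returns a dict mapping each transferred GO term to a confidence score,
    here taken as the cosine similarity of the closest neighbor carrying it.
    """
    q = query_emb / np.linalg.norm(query_emb)
    sims = ref_emb @ q                      # cosine similarities (rows are unit-norm)
    neighbors = np.argsort(sims)[::-1][:k]  # indices of the k most similar proteins
    scores = {}
    for i in neighbors:
        for term in ref_go[i]:
            scores[term] = max(scores.get(term, 0.0), float(sims[i]))
    return scores
```

In practice the reference set would be a large annotated database such as Swiss-Prot, and k and the similarity measure are tunable design choices.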

Results: Our nearest-neighbor models outperformed BLAST sequence-based protein function annotation, according to the evaluation procedure outlined in the CAFA challenges. The results were also comparable to, and at times better than, DeepGOPlus predictions, highlighting the potential of embedding-based predictors as state-of-the-art models. On the B. subtilis dataset, our nearest-neighbor model built on ESM1b embeddings scored an Fmax of 0.6 on molecular function predictions and was able to predict GO terms with high information content. Hence, unsupervised embedding models were shown to encode information about a protein sequence that is useful for the task of function prediction.
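For readers unfamiliar with the CAFA evaluation, the Fmax metric cited above is the maximum, over score thresholds, of the protein-centric F-measure. A simplified sketch (ignoring CAFA details such as GO-ancestor propagation and the full/partial evaluation modes) might look like this:

```python
def fmax(y_true, y_scores, steps=100):
    """Simplified CAFA-style Fmax.

    y_true:   list of sets of true GO terms, one set per protein
    y_scores: list of dicts {go_term: score in [0, 1]}, one dict per protein
    """
    best = 0.0
    for step in range(1, steps):
        t = step / steps
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_scores):
            pred = {g for g, s in scores.items() if s >= t}
            if pred:  # precision averaged only over proteins with predictions
                precisions.append(len(pred & truth) / len(pred))
            recalls.append(len(pred & truth) / len(truth))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best
```

A perfect predictor scores 1.0; an Fmax of 0.6, as reported here, means that at the best-performing threshold, the harmonic mean of averaged precision and recall reaches 0.6.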

Availability: The scripts used in this project are available on GitHub.