Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function

Journal Article (2020)
Author(s)

Amelia Villegas-Morcillo (TU Delft - Pattern Recognition and Bioinformatics)

Stavros Makrodimitris (TU Delft - Pattern Recognition and Bioinformatics)

Roeland C H J van Ham (TU Delft - Pattern Recognition and Bioinformatics)

Angel M Gomez (TU Delft - Water Resources)

Victoria Sanchez (University of Granada)

Marcel Reinders (TU Delft - Pattern Recognition and Bioinformatics)

Research Group
Pattern Recognition and Bioinformatics
Copyright
© 2020 A.O. Villegas Morcillo, S. Makrodimitris, R.C.H.J. van Ham, A.M. Gomez, Victoria Sanchez, M.J.T. Reinders
DOI related publication
https://doi.org/10.1093/bioinformatics/btaa701
Publication Year
2020
Language
English
Bibliographical Note
btaa701
Issue number
2
Volume number
37
Pages (from-to)
162-170
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Motivation: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require large amounts of labeled training data, which are not available for this task. However, very large numbers of protein sequences without functional labels are available.

Results: We applied an existing deep sequence model, pretrained in an unsupervised setting, to the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. It also partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We further show that combining this sequence representation with protein 3D structure information does not improve performance, hinting that 3D structure is also potentially learned during the unsupervised pretraining.
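The prediction setup described above can be illustrated with a minimal sketch: a two-layer perceptron that maps fixed-length protein embeddings to per-term probabilities for molecular-function labels. All dimensions, names and the random stand-in for the embeddings (EMB_DIM, HIDDEN, N_GO_TERMS, X) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 1024      # size of the unsupervised sequence embedding (assumption)
HIDDEN = 256        # hidden-layer width (assumption)
N_GO_TERMS = 50     # number of GO molecular-function terms (assumption)

# Stand-in for embeddings extracted from a pretrained deep sequence model:
# one fixed-length vector per protein (here, 8 random "proteins").
X = rng.standard_normal((8, EMB_DIM))

# Two-layer perceptron parameters (small random initialization).
W1 = rng.standard_normal((EMB_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, N_GO_TERMS)) * 0.01
b2 = np.zeros(N_GO_TERMS)

def predict(X):
    """Forward pass: ReLU hidden layer, then sigmoid outputs.

    Sigmoids (rather than a softmax) reflect that GO-term prediction is
    multi-label: each protein can carry several function terms at once.
    """
    h = np.maximum(0.0, X @ W1 + b1)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

probs = predict(X)          # shape: (n_proteins, N_GO_TERMS)
```

In practice the weights would be trained with a binary cross-entropy loss per GO term; the point of the sketch is only that, once the embedding is fixed, the predictor itself can be this simple.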