All-Atom Novel Protein Sequence Generation Using Discrete Diffusion

Master Thesis (2024)
Author(s)

G.J. Admiraal (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Amelia Villegas-Morcillo – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

J.M. Weber – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

M.J.T. Reinders – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Wendelin Böhmer – Graduation committee member (TU Delft - Sequential Decision Making)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
02-12-2024
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Advancing protein design is crucial for breakthroughs in medicine and biotechnology, yet traditional approaches often fall short because they represent protein sequences using only the 20 canonical amino acids. This thesis explores discrete diffusion models for generating novel protein sequences with an all-atom representation, specifically SELFIES, a widely used molecular string representation. This all-atom approach considers the atomic composition of each amino acid in the protein, enabling the inclusion of non-canonical amino acids and post-translational modifications. Using a modified ByteNet architecture and the D3PM framework, we compare the effects of this all-atom representation with those of the standard amino acid representation on the quality, diversity, and novelty of the generated proteins. Additionally, we examine how a uniform or absorbing noise process affects the results. While models trained on the all-atom representation struggle to generate fully valid proteins consistently, the valid proteins they do generate show improved novelty and diversity. Moreover, the all-atom models can achieve structural reliability results from OmegaFold comparable to those of the amino acid models. Lastly, our results show that an absorbing noise process is the most effective for both the all-atom and amino acid representations.
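
To make the difference between the two noise processes concrete, the following is a minimal sketch of single-step transition matrices in the style of D3PM. It is an illustration under stated assumptions, not the thesis implementation: the function names (`uniform_matrix`, `absorbing_matrix`, `corrupt`), the per-step noise level `beta`, and the choice of the mask token index are all hypothetical. Under uniform noise a token is resampled from the whole vocabulary with probability `beta`; under absorbing noise it is replaced by a dedicated mask token with probability `beta`, and a masked token stays masked.

```python
import random


def uniform_matrix(k, beta):
    """Row-stochastic k x k matrix for uniform noise (illustrative).

    With probability 1 - beta a token is kept; with probability beta it is
    resampled uniformly over all k tokens (including itself).
    """
    return [[(1.0 - beta) * (i == j) + beta / k for j in range(k)]
            for i in range(k)]


def absorbing_matrix(k, beta, mask):
    """Row-stochastic k x k matrix for absorbing noise (illustrative).

    With probability 1 - beta a token is kept; with probability beta it
    jumps to the mask index. The mask row is [0, ..., 1, ..., 0], so a
    masked token never leaves the mask state.
    """
    return [[(1.0 - beta) * (i == j) + beta * (j == mask) for j in range(k)]
            for i in range(k)]


def corrupt(tokens, q, rng):
    """One forward noising step: sample each token's successor from row q[token]."""
    states = range(len(q))
    return [rng.choices(states, weights=q[t])[0] for t in tokens]


if __name__ == "__main__":
    rng = random.Random(0)       # fixed seed so the demo is repeatable
    vocab_size, mask = 5, 4      # assumption: the last index is the mask token
    sequence = [0, 1, 2, 3, 0, 1]
    print(corrupt(sequence, uniform_matrix(vocab_size, 0.3), rng))
    print(corrupt(sequence, absorbing_matrix(vocab_size, 0.3, mask), rng))
```

The same sketch applies to either representation: only the vocabulary changes, from amino acid letters to the tokens of the all-atom string.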

Files

License info not available