Chemical reaction completion: a hybrid rule-based and language model-based approach

Master Thesis (2023)
Author(s)

M.C. van Wijngaarden (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J.M. Weber – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

MJT Reinders – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

G. Vogel – Coach (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Matthijs van Wijngaarden
Publication Year
2023
Language
English
Graduation Date
13-11-2023
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large chemical reaction databases are often incomplete: entries may lack molecules or stoichiometric information. At the same time, many computational models in predictive chemistry rely on these databases and would benefit substantially from complete reaction equations. Research in sustainable chemistry also often involves automated mass-balance tasks, which can only be evaluated properly on a full reaction. In this work, we present a hybrid approach for the computational completion of reaction equations that combines a rule-based method with a machine learning (ML) model. The rule-based method constructs a balance of atoms and charge across both sides of the reaction in an attempt to identify missing molecules. For the ML component, we adapt a pre-trained transformer model for the chemical language domain to take partial reactions as input and predict the missing molecules. Furthermore, we present a novel approach to measuring the correctness of our model's predictions, which is useful when the model is applied to an uncurated dataset for which the ground truth is unknown.
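
To make the idea of an atom and charge balance concrete, the following is a minimal sketch and not the implementation used in the thesis. It assumes molecules are given as simple molecular formulas with a net charge and a stoichiometric coefficient; the function names `parse_formula`, `side_totals` and `imbalance` are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C2H4O2' (no brackets or isotopes)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_totals(molecules):
    """Sum atom counts and charge over (formula, charge, coefficient) triples."""
    atoms, charge = Counter(), 0
    for formula, mol_charge, coeff in molecules:
        for element, n in parse_formula(formula).items():
            atoms[element] += coeff * n
        charge += coeff * mol_charge
    return atoms, charge


def imbalance(reactants, products):
    """Return what the product side lacks relative to the reactant side.

    Positive counts are missing from the products; negative counts are
    missing from the reactants. The second value is the charge difference.
    """
    r_atoms, r_charge = side_totals(reactants)
    p_atoms, p_charge = side_totals(products)
    elements = set(r_atoms) | set(p_atoms)
    diff = {e: r_atoms[e] - p_atoms[e] for e in elements if r_atoms[e] != p_atoms[e]}
    return diff, r_charge - p_charge


# Illustrative example (assumed, not from the thesis): an esterification
# recorded without its by-product.
# acetic acid + ethanol -> ethyl acetate
reactants = [("C2H4O2", 0, 1), ("C2H6O", 0, 1)]
products = [("C4H8O2", 0, 1)]
missing_atoms, missing_charge = imbalance(reactants, products)
```

In this example the product side lacks two hydrogen atoms and one oxygen atom at zero net charge, which is consistent with a missing water molecule. A rule-based completer along these lines would then have to match such a residual against candidate molecules; how the thesis does this is not described in the abstract.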
