Chemical reaction completion: a hybrid rule-based and language model-based approach

Master Thesis (2023)

Author(s)

M.C. van Wijngaarden (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J.M. Weber – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

M.J.T. Reinders – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

G. Vogel – Coach (TU Delft - Pattern Recognition and Bioinformatics)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Language Models, Cheminformatics, Chemical Reaction

To reference this document use:

https://resolver.tudelft.nl/uuid:fb806f47-1c5a-46d3-a585-b0b95eb626bc

Publication Year

2023

Language

English

Graduation Date

13-11-2023

Awarding Institution

Delft University of Technology

Programme

Computer Science

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large chemical reaction databases are often incomplete: entries may lack molecules or stoichiometric information. At the same time, many computational models being developed in predictive chemistry rely on these databases and would benefit greatly from complete reaction equations. Research in sustainable chemistry also often involves automated mass-balance tasks, which require a full reaction equation to evaluate properly. In this work, we present a hybrid approach to the computational completion of reaction equations that combines a rule-based method with a machine learning (ML) model. The rule-based method balances atoms and charge across the two sides of the reaction in an attempt to identify missing molecules. For the ML model, we adapt a pre-trained transformer to the chemical language domain so that it takes partial reactions as input and predicts the missing molecules. Furthermore, we present a novel approach to measuring the correctness of our model's predictions, which is useful when the model is applied to the uncurated dataset, where the ground truth is unknown.
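The idea behind the rule-based atom balance can be illustrated with a minimal Python sketch. This is not the thesis implementation: the helper names (`parse_formula`, `side_total`, `atom_imbalance`) are hypothetical, molecules are assumed to be given as plain molecular formulas without brackets, and the charge balance mentioned in the abstract is left out for brevity.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a simple molecular formula such as 'C2H6O'.

    Assumption: no brackets, charges, or isotopes appear in the formula.
    """
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum atom counts over (coefficient, formula) pairs on one reaction side."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in parse_formula(formula).items():
            total[element] += coefficient * n
    return total


def atom_imbalance(reactants, products):
    """Return the atoms missing from the products and from the reactants.

    Counter subtraction keeps only positive counts, so each dictionary
    lists the atoms that one side has in surplus over the other.
    """
    left, right = side_total(reactants), side_total(products)
    missing_from_products = dict(left - right)
    missing_from_reactants = dict(right - left)
    return missing_from_products, missing_from_reactants


# Hypothetical incomplete record: acetic acid + ethanol -> ethyl acetate
reactants = [(1, "C2H4O2"), (1, "C2H6O")]
products = [(1, "C4H8O2")]
print(atom_imbalance(reactants, products))  # ({'H': 2, 'O': 1}, {})
```

In this example the product side lacks two hydrogen atoms and one oxygen atom, which is consistent with a missing water molecule. Mapping such an atom surplus to a concrete candidate molecule is a separate step that this sketch does not attempt.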

Files

MScThesis_MatthijsVanWijngaard... (pdf)

(pdf | 1.08 MB)

License info not available