Completing Partial Reaction Equations with Rule and Language Model-based Methods

Journal Article (2024)
Author(s)

Matthijs van Wijngaarden (Student TU Delft)

G. Vogel (TU Delft - Pattern Recognition and Bioinformatics)

J.M. Weber (TU Delft - Pattern Recognition and Bioinformatics)

Research Group
Pattern Recognition and Bioinformatics
DOI related publication
https://doi.org/10.1016/B978-0-443-28824-1.50524-X
Publication Year
2024
Language
English
Volume number
53
Pages (from-to)
3139-3144
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large chemical reaction data sets often suffer from incompleteness, such as missing molecules or stoichiometric information. Incomplete reaction equations currently prevent automated mass balances across large sets of chemical reactions. In this work, we integrate two approaches for the computational completion of partial reaction equations. Specifically, we combine a rule-based method with a machine learning model, a tailored version of the pre-trained Molecular Transformer. The rule-based method passes sets of helper species to a linear solver and thereby balances some of the incomplete reactions. The machine learning model is trained to take partial reactions as input and predict the missing molecules and stoichiometries. We apply our methodology to the USPTO STEREO chemical reaction data set. The rule-based method completes about 50 % of the reactions. The language model achieves a top-1 accuracy of 88.3 % on our test set and high validity (> 99 % of outputs are valid SMILES).
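
The idea of balancing a partial reaction by offering helper species to a linear solver can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `nullspace_vector` and `complete_with_helper`, the hand-written element-count dictionaries (the paper works on SMILES), and the two candidate helper species are all assumptions made for illustration. The sketch builds an element-by-species matrix, looks for a one-dimensional null space using exact fractions, and accepts a helper only if every known reactant and product ends up on its expected side.

```python
from fractions import Fraction
from math import lcm


def nullspace_vector(matrix):
    """Return the integer null-space vector of `matrix` if the null space
    is exactly one-dimensional, otherwise None (hypothetical helper)."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0])
    pivots, r = [], 0
    for c in range(n_cols):  # Gauss-Jordan elimination to reduced row echelon form
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        pv = rows[r][c]
        rows[r] = [x / pv for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == len(rows):
            break
    free = [c for c in range(n_cols) if c not in pivots]
    if len(free) != 1:
        return None
    vec = [Fraction(0)] * n_cols
    vec[free[0]] = Fraction(1)
    for row_idx, c in enumerate(pivots):
        vec[c] = -rows[row_idx][free[0]]
    scale = lcm(*(v.denominator for v in vec))  # clear denominators
    return [int(v * scale) for v in vec]


def complete_with_helper(reactants, products, helpers):
    """Try each helper species in turn; return signed stoichiometric
    coefficients (negative = reactant side, positive = product side)."""
    known = {**reactants, **products}
    for helper_name, helper_formula in helpers.items():
        species = {**known, helper_name: helper_formula}
        names = list(species)
        elements = sorted({el for formula in species.values() for el in formula})
        matrix = [[species[n].get(el, 0) for n in names] for el in elements]
        vec = nullspace_vector(matrix)
        if vec is None:
            continue
        if vec[0] > 0:  # orient so the first reactant is consumed
            vec = [-v for v in vec]
        coeffs = dict(zip(names, vec))
        if (all(coeffs[n] < 0 for n in reactants)
                and all(coeffs[n] > 0 for n in products)
                and coeffs[helper_name] != 0):
            return coeffs
    return None


# Illustrative partial reaction: esterification recorded without its by-product.
reactants = {"acetic acid": {"C": 2, "H": 4, "O": 2},
             "ethanol": {"C": 2, "H": 6, "O": 1}}
products = {"ethyl acetate": {"C": 4, "H": 8, "O": 2}}
helpers = {"HCl": {"H": 1, "Cl": 1}, "H2O": {"H": 2, "O": 1}}  # assumed helper set

result = complete_with_helper(reactants, products, helpers)
print(result)
# {'acetic acid': -1, 'ethanol': -1, 'ethyl acetate': 1, 'H2O': 1}
```

In this toy case HCl is rejected because no balanced equation exists with it, while water balances the reaction as a product. A reaction whose missing species is not in the helper set returns `None`, which is the kind of case the abstract assigns to the language model.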

Files

1-s2.0-B978044328824150524X-ma... (pdf)
(pdf | 0.42 Mb)
- Embargo expired on 31-12-2024
License info not available