Completing Partial Reaction Equations with Rule and Language Model-based Methods

Journal Article (2024)

Author(s)

Matthijs van Wijngaarden (Student TU Delft)

G. Vogel (TU Delft - Pattern Recognition and Bioinformatics)

J.M. Weber (TU Delft - Pattern Recognition and Bioinformatics)

Research Group

Pattern Recognition and Bioinformatics

DOI related publication

https://doi.org/10.1016/B978-0-443-28824-1.50524-X

Chemical reaction completion; Language models; Molecular transformer; Reaction SMILES; Rule-based methods

To reference this document use:

https://resolver.tudelft.nl/uuid:5b70ada9-60f6-4a06-bc60-e4b7a50375a6

More Info

Publication Year

2024

Language

English

Volume number

53

Pages (from-to)

3139-3144

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large chemical reaction data sets are often incomplete: molecules or stoichiometric information may be missing. Such incomplete reaction equations currently prevent automated mass balances across large sets of chemical reactions. In this work, we integrate two approaches for the computational completion of partial reaction equations. Specifically, we combine a rule-based method with a machine learning model, a tailored version of the pre-trained Molecular Transformer. The rule-based method passes sets of helper species to a linear solver and thereby balances some of the incomplete reactions. The machine learning model is trained to take partial reactions as input and to predict the missing molecules and stoichiometries. We apply our methodology to the USPTO STEREO chemical reaction data set. The rule-based method completes about 50 % of the reactions. The language model achieves a top-1 accuracy of 88.3 % on our test set and high validity (> 99 % of outputs are valid SMILES).
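
To illustrate the idea of balancing a partial reaction with helper species, the following is a minimal sketch and not the authors' implementation. It assumes each species is given as a dictionary of element counts, and it replaces the linear solver mentioned in the abstract with a small brute-force search over integer coefficients. The function names, the helper set (H2O, HCl, CO2), and the `max_coeff` bound are hypothetical choices made for this example.

```python
from collections import Counter
from itertools import product


def atom_totals(species):
    """Sum element counts over (coefficient, composition) pairs."""
    totals = Counter()
    for coeff, composition in species:
        for element, count in composition.items():
            totals[element] += coeff * count
    return totals


def complete_with_helpers(reactants, products, helpers, max_coeff=3):
    """Search for helper coefficients that close the element balance.

    A positive coefficient means the helper is added to the product side,
    a negative one means it is added to the reactant side. Returns None
    if no combination within the bound balances the reaction.
    NOTE: brute-force stand-in for a linear solver (assumption of this sketch).
    """
    lhs = atom_totals(reactants)
    rhs = atom_totals(products)
    elements = set(lhs) | set(rhs)
    for composition in helpers.values():
        elements |= set(composition)
    # Atoms present on the reactant side but missing on the product side.
    imbalance = {e: lhs[e] - rhs[e] for e in elements}

    names = list(helpers)
    best = None
    for coeffs in product(range(-max_coeff, max_coeff + 1), repeat=len(names)):
        balanced = all(
            sum(c * helpers[n].get(e, 0) for c, n in zip(coeffs, names))
            == imbalance[e]
            for e in elements
        )
        if balanced:
            cost = sum(abs(c) for c in coeffs)  # prefer the fewest helper molecules
            if best is None or cost < best[0]:
                best = (cost, dict(zip(names, coeffs)))
    if best is None:
        return None
    return {name: c for name, c in best[1].items() if c != 0}


# Hypothetical helper set, given as element counts.
HELPERS = {
    "H2O": {"H": 2, "O": 1},
    "HCl": {"H": 1, "Cl": 1},
    "CO2": {"C": 1, "O": 2},
}

# Esterification with the by-product missing:
# acetic acid + ethanol -> ethyl acetate (+ ?)
reactants = [(1, {"C": 2, "H": 4, "O": 2}), (1, {"C": 2, "H": 6, "O": 1})]
products = [(1, {"C": 4, "H": 8, "O": 2})]
result = complete_with_helpers(reactants, products, HELPERS)
print(result)
```

In this example the search finds that one water molecule on the product side closes the balance. A search like this only covers reactions whose missing species belong to the chosen helper set, which is consistent with the rule-based method completing only part of the data set.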

Files

1-s2.0-B978044328824150524X-ma... (pdf)

(pdf | 0.42 Mb)

- Embargo expired on 31-12-2024

License info not available