Improving Chemical Reaction Completion using Atom-Balance Constraints in Transformer Models
M.T.W. Noordsij (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Jana M. Weber – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Marcel J.T. Reinders – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
G. Vogel – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
J. Yang – Graduation committee member (TU Delft - Web Information Systems)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Online databases contain extensive collections of (bio)chemical reactions that serve as valuable resources for a variety of applications. However, these large datasets often contain incomplete reactions that lack, for example, co-reactants and by-products. Machine learning can help predict the missing molecules in such partial reactions. In this study, we adapt an existing transformer model to improve its ability to complete incomplete reactions. We retrain the model on a more diverse dataset of atom-balanced ground-truth reactions and introduce both soft and hard atom-balance constraints to improve the completeness and chemical validity of the predictions. Our findings indicate that models trained with soft constraints in their loss function do not show improved balancing performance and require further tuning. In contrast, hard atom-balance constraints applied during constrained beam search, where tokens that would violate the atom balance of the prediction are excluded, effectively improve the performance of transformer-based models on reaction completion tasks. However, this approach also carries the risk of balancing reactions incorrectly, a limitation that is difficult to identify without chemical expertise and that underscores the need for reliable ground-truth data to evaluate the predictions.
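The idea behind a hard atom-balance constraint can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the model described above operates on tokens during beam search, whereas this sketch, as a simplifying assumption, treats whole molecular formulas as candidates. The function names (`count_atoms`, `is_balanced`, `allowed_candidates`) and the example reaction are hypothetical and chosen for illustration only.

```python
import re
from collections import Counter

# Matches an element symbol followed by an optional count, e.g. "C2", "H", "Cl3".
# Assumption: flat formulas only (no brackets, charges, or isotopes).
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def count_atoms(formula: str) -> Counter:
    """Count the atoms in a flat molecular formula such as 'C2H6O'."""
    counts = Counter()
    for element, number in _ELEMENT.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_atoms(formulas) -> Counter:
    """Sum the atom counts over all molecules on one side of a reaction."""
    total = Counter()
    for formula in formulas:
        total.update(count_atoms(formula))
    return total


def is_balanced(reactants, products) -> bool:
    """A reaction is atom-balanced if both sides contain the same atoms."""
    return side_atoms(reactants) == side_atoms(products)


def allowed_candidates(reactants, partial_products, candidates):
    """Hard constraint: keep only candidates that fit the remaining atom budget.

    The budget is the reactant atoms not yet accounted for by the products
    predicted so far. A candidate that needs more of any element than the
    budget holds would break the atom balance and is excluded.
    """
    budget = side_atoms(reactants)
    budget.subtract(side_atoms(partial_products))
    allowed = []
    for candidate in candidates:
        needed = count_atoms(candidate)
        if all(budget[element] >= n for element, n in needed.items()):
            allowed.append(candidate)
    return allowed


# Hypothetical example: esterification with the water by-product missing.
reactants = ["C2H4O2", "C2H6O"]   # acetic acid + ethanol
partial_products = ["C4H8O2"]     # ethyl acetate only
candidates = ["H2O", "CO2", "H2", "CH4"]

print(is_balanced(reactants, partial_products))
print(allowed_candidates(reactants, partial_products, candidates))
print(is_balanced(reactants, partial_products + ["H2O"]))
```

In this example the filter excludes `CO2` and `CH4` because no carbon remains in the budget, but it still admits `H2`, which would leave an oxygen atom unaccounted for. Filtering candidates is therefore only a necessary condition; a final balance check on the completed reaction is still needed, and an atom-balanced completion is not guaranteed to be chemically correct.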