Gaining scientific understanding with small data machine learning: explainable molecule representations and their consensus
Can Özkan (TU Delft - Mechanical Engineering)
Lisa Sahlmann (Helmholtz-Zentrum Hereon)
Tim Würger (Helmholtz-Zentrum Hereon)
Christian Feiler (Helmholtz-Zentrum Hereon)
Sviatlana Lamaka (Helmholtz-Zentrum Hereon)
Mikhail Zheludkevich (Helmholtz-Zentrum Hereon)
Peyman Taheri (TU Delft - Mechanical Engineering)
Arjan Mol (TU Delft - Mechanical Engineering)
Abstract
Despite the remarkable success of machine learning in materials science, challenges persist in gaining mechanistic insights, especially in low-data regimes where dataset sizes limit the applicability of machine learning. The prevailing reliance on high-confidence predictions from the models often leaves the underlying decision-making mechanisms opaque, which hinders scientific understanding. This study presents an alternative approach that emphasizes understanding the model decision-making process over individual predictions, enabling the extraction of scientifically meaningful insights from small datasets. Focusing on 107 small organic molecules and their corrosion inhibition properties as a case study, we systematically evaluate 29 molecular featurization methods and 9 target representations, generating over 12,000 model configurations to identify robust feature-target pairings. We reveal common trends by reverse engineering the best-performing models based on featurization methods of physicochemical descriptors, hashed fingerprints, and structural keys, which we integrate with domain knowledge to create a molecular substructure template for candidate molecules. Using this template, we filter a toxicity database to identify non-toxic corrosion inhibitors, aiming to replace the de facto but hazardous corrosion inhibitor hexavalent chromium. The resulting candidate’s efficacy is validated through electrochemical testing, illustrating the feasibility of achieving mechanistic insights from statistical models in data-scarce environments.
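The idea of screening many feature-target pairings on a small dataset can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the study: the six toy "molecules" and their values are invented, the three featurizers and two target representations are hypothetical stand-ins for the 29 and 9 evaluated in the paper, and a k-nearest-neighbour regressor scored by leave-one-out error is swapped in for whatever models and validation scheme the authors actually used.

```python
from itertools import product
from statistics import mean

# Invented toy data (NOT from the study): counts of two substructures
# and a made-up inhibition value per molecule.
molecules = [
    {"n_carboxyl": 0, "n_rings": 0, "inhibition": -20.0},
    {"n_carboxyl": 0, "n_rings": 1, "inhibition": 5.0},
    {"n_carboxyl": 1, "n_rings": 0, "inhibition": 40.0},
    {"n_carboxyl": 1, "n_rings": 1, "inhibition": 55.0},
    {"n_carboxyl": 2, "n_rings": 1, "inhibition": 80.0},
    {"n_carboxyl": 2, "n_rings": 2, "inhibition": 90.0},
]

# Hypothetical featurizers: name -> function returning a feature vector.
featurizers = {
    "carboxyl_only": lambda m: [m["n_carboxyl"]],
    "rings_only": lambda m: [m["n_rings"]],
    "both": lambda m: [m["n_carboxyl"], m["n_rings"]],
}

# Hypothetical target representations: raw value or a thresholded label.
targets = {
    "raw": lambda m: m["inhibition"],
    "binary": lambda m: 1.0 if m["inhibition"] > 50.0 else 0.0,
}


def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return mean(train_y[i] for i in order[:k])


def loo_error(X, y, k):
    """Mean absolute error under leave-one-out cross-validation."""
    errors = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        errors.append(abs(knn_predict(train_X, train_y, X[i], k) - y[i]))
    return mean(errors)


def evaluate_grid(molecules, featurizers, targets, ks=(1, 3)):
    """Score every (featurizer, target, k) configuration."""
    results = []
    for (fname, f), (tname, t), k in product(
        featurizers.items(), targets.items(), ks
    ):
        X = [f(m) for m in molecules]
        y = [t(m) for m in molecules]
        results.append((loo_error(X, y, k), fname, tname, k))
    return results


def best_per_target(results):
    """Pick the lowest-error configuration separately for each target,
    since errors on different target scales are not comparable."""
    best = {}
    for err, fname, tname, k in results:
        if tname not in best or err < best[tname][0]:
            best[tname] = (err, fname, k)
    return best


results = evaluate_grid(molecules, featurizers, targets)
for tname, (err, fname, k) in best_per_target(results).items():
    print(f"target={tname}: featurizer={fname}, k={k}, LOO MAE={err:.2f}")
```

Ranking configurations within each target representation, rather than across them, keeps the comparison fair because a raw value and a binary label live on different error scales. In a realistic setting the featurizers would come from a cheminformatics toolkit and the grid would also span model types and hyperparameters, which is how the number of configurations grows into the thousands.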