Data, Representation, Models and Analysis: the four horsemen of machine learning for homogeneous catalysis

Master Thesis (2025)
Author(s)

T.H. Chow (TU Delft - Applied Sciences)

Contributor(s)

Evgeny A. Pidko – Mentor (TU Delft - ChemE/Inorganic Systems Engineering)

A.V. Kalikadien – Mentor (TU Delft - ChemE/Inorganic Systems Engineering)

Faculty
Applied Sciences
Research Group
ChemE/Inorganic Systems Engineering
Publication Year
2025
Language
English
Graduation Date
25-02-2025
Awarding Institution
Delft University of Technology
Programme
Chemical Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Transition metal complexes coordinated by bidentate ligands are often used as homogeneous catalysts because they can produce compounds enantioselectively. Such compounds are of high interest to the pharmaceutical and food industries. However, identifying high-performing catalysts still relies on trial and error, which is time-consuming and costly. Data-driven predictive models could improve this process significantly by shifting most of the effort from experiment to computation. Previous work from the group attempted to develop such a model using machine learning (ML), a representation based on a manually generated static structure, and a dataset generated through high-throughput experimentation (HTE). However, these models struggled with predictive performance and with consistency across substrates. This research aims to improve them by refining the representations used in ML so that predictions become more accurate. To bring the representations closer to reality, both dynamic and new static approaches were tested, using conformer ensembles (CEs) generated by CREST. These structures were then used in DFT calculations to obtain accurate properties of the complexes. Additionally, new HTE data that more closely matches the complexes used in the simulations was incorporated to improve the training data for the ML models. The investigated reaction is the hydrogenation of norbornadiene (NBD) using Rh-NBD complexes. Classification and regression performance was compared across different representations: a cheap topological connectivity fingerprint (ECFP), semi-empirical DFT representations, and expensive fully DFT-optimized representations. The results show that none of the DFT-based representations outperforms the cheap topological fingerprint for this specific reaction. The study also highlights the importance of high-quality data for training the models. Ultimately, although the representation was improved, the much simpler topological method was the most effective for predicting catalyst performance.

Files

MEP_Tai_Hong_Chow_final.pdf
(pdf | 10.2 MB)
License info not available