L. Schulze Balhorn | TU Delft Repository

Graph-to-SFILES

Control structure prediction from process topologies using generative artificial intelligence

Journal article (2025) - Lukas Schulze Balhorn, Kevin Degens, Artur M. Schweidtmann

Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P&ID development time by supporting engineers. Previous research on generative AI in chemical process design mainly represented processes by sequences. However, graphs offer a promising alternative because of their permutation invariance. We propose the Graph-to-SFILES model, a generative AI method to predict control structures from flowsheet topologies. The Graph-to-SFILES model takes the flowsheet topology as a graph input and returns a control-extended flowsheet as a sequence in the SFILES 2.0 notation. We compare four different graph encoder architectures, one of them being a graph neural network (GNN) proposed in this work. The Graph-to-SFILES model achieves a top-5 accuracy of 73.2% when trained on 10,000 flowsheet topologies. In addition, the proposed GNN performs best among the encoder architectures. Compared to a purely sequence-based approach, the Graph-to-SFILES model improves the top-5 accuracy for a relatively small training dataset of 1,000 flowsheets from 0.9% to 28.4%. However, the sequence-based approach performs better on a large-scale dataset of 100,000 flowsheets. These results highlight the potential of graph-based AI models to accelerate P&ID development in small-data regimes, but their effectiveness on industry-relevant case studies still needs to be investigated. ...
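
The top-5 accuracy quoted in this abstract can be illustrated with a minimal sketch. It assumes the usual definition: a prediction counts as correct when the reference sequence appears among the model's k highest-ranked candidates. The function name `top_k_accuracy` and the placeholder strings are hypothetical; they are not taken from the paper and are not valid SFILES 2.0.

```python
def top_k_accuracy(ranked_predictions, targets, k=5):
    """Fraction of samples whose target is among the first k ranked candidates.

    ranked_predictions: one candidate list per sample, best candidate first.
    targets: the reference sequence for each sample.
    """
    if len(ranked_predictions) != len(targets):
        raise ValueError("need exactly one candidate list per target")
    if not targets:
        return 0.0
    hits = sum(
        target in candidates[:k]
        for candidates, target in zip(ranked_predictions, targets)
    )
    return hits / len(targets)


# Hypothetical placeholder sequences (illustration only, not real model output).
predictions = [
    ["seq_a", "seq_b", "seq_c"],
    ["seq_d", "seq_e", "seq_f"],
    ["seq_g", "seq_h", "seq_i"],
    ["seq_j", "seq_k", "seq_l"],
]
targets = ["seq_a", "seq_f", "seq_x", "seq_k"]

print(top_k_accuracy(predictions, targets, k=1))  # 0.25
print(top_k_accuracy(predictions, targets, k=3))  # 0.75
```

Exact string matching is used here for simplicity; whether the paper compares strings or an equivalent canonical form is not stated in the abstract.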

Environmental impacts prediction using graph neural networks on molecular graphs

Journal article (2025) - Qinghe Gao, Lukas Schulze Balhorn, Alessandro Laera, Raoul Meys, Jonas Goßen, Jana M. Weber, Gregor Wernet, Artur M. Schweidtmann

The chemical industry needs to undergo a significant transformation towards more sustainable and circular production systems. To guide this transformation, estimating the environmental impacts of chemical production at early product screening or development stages is highly desirable. This study leverages the molecular structure of the process products with graph neural networks (GNNs) for early-stage environmental impact approximation of chemical processes. Specifically, we use end-to-end GNN models to predict fifteen environmental impact categories, utilizing a CarbonMinds dataset of 51,905 processes producing 791 molecules produced in 91 countries, augmented with country-specific energy mix data. Our analysis begins with a comparison of Quantitative Structure-Property Relationship (QSPR) and GNN models for the climate change impact category. Specifically, we develop three different GNN models: (i) GNN with only molecular structure, (ii) GNN with molecular structure and additional geographical features, and (iii) GNN with molecular structure and additional energy mix features. The results indicate that the three GNN models show an improvement over the QSPR models. Furthermore, benchmarking our GNN models against the existing literature in the climate change impact category reveals that our models perform comparably. We then extend our approach by developing both single- and multi-task GNN models to predict all fifteen impact categories. The findings indicate that multi-task learning can improve model performance in complex environmental impact predictions compared to single-task GNNs. Therefore, we recommend using a multi-task GNN for predicting multiple impact categories, with single-task models applied to fine-tune performance on underperforming categories. Although our proposed approach shows improvements over previous models, the prediction of environmental impacts solely based on molecular information remains a rough approximation. ...

Empirical assessment of ChatGPT’s answering capabilities in natural science and engineering

Journal article (2024) - Lukas Schulze Balhorn, Jana M. Weber, Stefan Buijsman, Julian R. Hildebrandt, Martina Ziefle, Artur M. Schweidtmann

ChatGPT is a powerful language model from OpenAI that is arguably able to comprehend and generate text. ChatGPT is expected to greatly impact society, research, and education. An essential step to understand ChatGPT’s expected impact is to study its domain-specific answering capabilities. Here, we perform a systematic empirical assessment of its abilities to answer questions across the natural science and engineering domains. We collected 594 questions on natural science and engineering topics from 198 faculty members across five faculties at Delft University of Technology. After collecting the answers from ChatGPT, the participants assessed the quality of the answers using a systematic scheme. Our results show that the answers from ChatGPT are, on average, perceived as “mostly correct”. Two major trends are that the rating of the ChatGPT answers significantly decreases (i) as the educational level of the question increases and (ii) as we evaluate skills beyond scientific knowledge, e.g., critical attitude. ...

Toward autocorrection of chemical process flowsheets using large language models

Conference paper (2024) - Lukas Schulze Balhorn, Marc Caballero, Artur M. Schweidtmann

The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, the P&IDs and PFDs, hereafter called flowsheets, can contain errors causing safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet and the output of the model is a set of suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers. ...

SFILES 2.0

An extended text-based flowsheet representation

Journal article (2023) - Gabriel Vogel, Edwin Hirtreiter, Lukas Schulze Balhorn, Artur M. Schweidtmann

SFILES are a text-based notation for chemical process flowsheets. They were originally proposed by d’Anterroches (Process flow sheet generation & design through a group contribution approach) who was inspired by the text-based SMILES notation for molecules. The text-based format has several advantages compared to flowsheet images regarding the storage format, computational accessibility, and eventually for data analysis and processing. However, the original SFILES version cannot describe essential flowsheet configurations unambiguously, such as the distinction between top and bottom products. Neither is it capable of describing the control structure required for the safe and reliable operation of chemical processes. Also, there is no publicly available software for decoding or encoding chemical process topologies to SFILES. We propose the SFILES 2.0 with a complete description of the extended notation and naming conventions. Additionally, we provide open-source software for the automated conversion between flowsheet graphs and SFILES 2.0 strings. This way, we hope to encourage researchers and engineers to publish their flowsheet topologies as SFILES 2.0 strings. The ultimate goal is to set the standards for creating a FAIR database of chemical process flowsheets, which would be of great value for future data analysis and processing. ...

Digitization of chemical process flow diagrams using deep convolutional neural networks

Journal article (2023) - Maximilian F. Theisen, Kenji Nishizaki Flores, Lukas Schulze Balhorn, Artur M. Schweidtmann

Advances in deep convolutional neural networks led to breakthroughs in many computer vision applications. In chemical engineering, a number of tools have been developed for the digitization of Process and Instrumentation Diagrams. However, there is no framework for the digitization of process flow diagrams (PFDs). PFDs are difficult to digitize because of the large variability in the data, e.g., there are multiple ways to depict unit operations in PFDs. We propose a two-step framework for digitizing PFDs: (i) unit operations are detected using a deep learning powered object detection model, (ii) the connectivities between unit operations are detected using a pixel-based search algorithm. To ensure robustness, we collect and label over 1000 PFDs from diversified sources including various scientific journals and books. To cope with the high intra-class variability in the data, we define 47 distinct classes that account for different drawing styles of unit operations. Our algorithm delivers accurate and robust results on an independent test set. We report promising results for line and unit operation detection with an Average Precision at 50 percent (AP50) of 88% and an Average Precision (AP) of 68% for the detection of unit operations. ...
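
As a rough illustration of the AP50 metric mentioned in this abstract, the sketch below assumes the common object-detection convention that a predicted box matches a ground-truth box of the same class when their intersection over union (IoU) is at least 0.5. The helper names `iou` and `is_match_at_50` and the example boxes are hypothetical and do not come from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(predicted_box, truth_box):
    """Assumed AP50 matching rule: IoU of at least 0.5 (class check omitted)."""
    return iou(predicted_box, truth_box) >= 0.5


# Hypothetical boxes in pixel coordinates.
truth = (0.0, 0.0, 10.0, 10.0)
print(iou(truth, (0.0, 0.0, 10.0, 5.0)))               # 0.5
print(is_match_at_50(truth, (0.0, 0.0, 10.0, 5.0)))    # True
print(is_match_at_50(truth, (5.0, 5.0, 15.0, 15.0)))   # False
```

Average precision itself is then computed from the precision-recall curve over all detections ranked by confidence, which this sketch does not cover.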

Learning from flowsheets

A generative transformer model for autocompletion of flowsheets

Journal article (2023) - Gabriel Vogel, Lukas Schulze Balhorn, Artur M. Schweidtmann

We propose a novel method enabling autocompletion of chemical flowsheets. This idea is inspired by the autocompletion of text. We represent flowsheets as strings using the text-based SFILES 2.0 notation and learn the grammatical structure of the SFILES 2.0 language and common patterns in flowsheets using a transformer-based language model. We pre-train our model on synthetically generated flowsheet topologies to learn the flowsheet language grammar. Then, we fine-tune our model in a transfer learning step on real flowsheet topologies. Finally, we use the trained model for causal language modeling to autocomplete flowsheets. Eventually, the proposed method can provide chemical engineers with recommendations during interactive flowsheet synthesis. The results demonstrate a high potential of this approach for future AI-assisted process synthesis but also reveal the limitations at the present state and the next steps that need to be taken to deploy this technique in realistic flowsheet synthesis scenarios. ...

Data augmentation for machine learning of chemical process flowsheets

Book chapter (2023) - Lukas Schulze Balhorn, Edwin Hirtreiter, Lynn Luderer, Artur M. Schweidtmann

Artificial intelligence has great potential for accelerating the design and engineering of chemical processes. Recently, we have shown that transformer-based language models can learn to auto-complete chemical process flowsheets using the SFILES 2.0 string notation. Also, we showed that language translation models can be used to translate Process Flow Diagrams (PFDs) into Process and Instrumentation Diagrams (P&IDs). However, artificial intelligence methods require big data and flowsheet data is currently limited. To mitigate this challenge of limited data, we propose a new data augmentation methodology for flowsheet data that is represented in the SFILES 2.0 notation. We show that the proposed data augmentation improves the performance of artificial intelligence-based process design models. In our case study, flowsheet data augmentation improved the prediction uncertainty of the flowsheet autocompletion model by 14.7%. In the future, our flowsheet data augmentation can be used for other machine learning algorithms on chemical process flowsheets that are based on SFILES notation. ...

Toward automatic generation of control structures for process flow diagrams with large language models

Journal article (2023) - Edwin Hirtreiter, Lukas Schulze Balhorn, Artur M. Schweidtmann

Developing Piping and Instrumentation Diagrams (P&IDs) is a crucial step during process development. We propose a data-driven method for the prediction of control structures. Our methodology is inspired by end-to-end transformer-based human language translation models. We cast the control structure prediction as a translation task where Process Flow Diagrams (PFDs) without control structures are translated to PFDs with control structures. We represent the topology of PFDs as strings using the SFILES 2.0 notation. We pretrain our model using generated PFDs to learn the grammatical structure. Thereafter, the model is fine-tuned leveraging transfer learning on real PFDs. The model achieved a top-5 accuracy of 74.8% on 10,000 generated PFDs and 89.2% on 100,000 generated PFDs. These promising results show great potential for AI-assisted process engineering. The tests on a dataset of 312 real PFDs indicate the need for a larger PFD dataset for industry applications and hybrid artificial intelligence solutions. ...

Flowsheet Recognition using Deep Convolutional Neural Networks

Book chapter (2022) - Lukas Schulze Balhorn, Qinghe Gao, Dominik Goldstein, Artur M. Schweidtmann

Flowsheets are the most important building blocks to define and communicate the structure of chemical processes. Gaining access to large data sets of machine-readable chemical flowsheets could significantly enhance process synthesis through artificial intelligence. A large number of these flowsheets are publicly available in the scientific literature and patents but hidden among innumerable other figures. Therefore, an automatic program is needed to recognize flowsheets. In this paper, we present a deep convolutional neural network (CNN) that can identify flowsheets within images from literature. We use a transfer learning approach to initialize the CNN's parameters. The CNN reaches an accuracy of 97.9% on an independent test set. The presented algorithm can be combined with publication mining algorithms to enable autonomous flowsheet mining. This will eventually result in big chemical process databases. ...

An additively-manufactured molten salt-to-supercritical carbon di-oxide primary heat exchanger for solar thermal power generation – Design and techno-economic performance

Journal article (2022) - Ines-Noelly Tano, Erfan Rasouli, Tracey Ziev, Ziheng Wu, Nicholas Lamprinakos, Junwon Seo, Lukas Schulze Balhorn, Parth Vaishnav, Anthony Rollett, More authors...

The design and techno-economic performance of a compact additively manufactured (AM) molten salt (MS)-to-supercritical carbon di-oxide (sCO₂) primary heat exchanger (PHE) for solar thermal application is described. The PHE design consists of sCO₂ flow through an array of microscale pin fins while the MS flows through mm-scale rectangular channels. Constraints imposed by AM using laser powder bed fusion method are considered in the design. Structural and fluid flow simulations are performed to arrive at a viable design of the core and headers. A simplified one-dimensional steady state model for the PHE is developed including the impact of surface roughness from the AM process. A process-based cost model is used to determine the tradeoff between thermofluidic design and manufacturing cost. A parametric study is performed using the thermo-fluidic and cost models to determine the set of geometrical and flow variables that result in high power density and low cost, while restricting the pressure drop on the sCO₂ side to less than 2% of line pressure. Flow rates of MS and sCO₂ were varied over heat capacity rate ratios ranging from 0.2 to 1. Results indicate that it is possible to design a low-pressure drop AM PHE with an effectiveness of 90% and a power density in excess of 10 MW/m³ (including headers). Fabrication of representative nickel superalloy specimens are shown to demonstrate that low-porosity parts with the requisite dimensional tolerance of PHE core can be generated. ...