H.X. Lin | TU Delft Repository

Data-driven tools for assessing ecosystem health

Doctoral thesis (2026) - A. Spinosa, A.W. Heemink, H.X. Lin, G.Y.H. El Serafy

This thesis investigates how in situ and satellite remote sensing data, combined via statistical and data-driven approaches, can be used to monitor coastal and terrestrial ecosystems in a scalable, cost-efficient, and scientifically robust way. The main objective of this work was to develop tools supporting the assessment and understanding of ecosystem health by exploiting the growing availability of Earth observation data. This thesis work revolves around two main facets: (i) the development of cost-efficient spatially scalable tools and (ii) the investigation of data integration of different data sources.
The research builds on satellite remote sensing data from the Copernicus mission (Sentinel-1 and Sentinel-2 data), complemented by in situ measurements from other open source repositories (such as Integrated Carbon Observation Systems (ICOS) and the European Fluxes Database Cluster) and additional remotely sensed data. All the models and algorithms used or developed during the research are published and available as open source.
The thesis starts by demonstrating the potential of satellite data as a complementary alternative to traditional in situ measurements. This was done by constructing a modeling framework for the retrieval of the shoreline position from Sentinel-1 data. The model is based on the Otsu method, a global thresholding method optimal for the recognition of the water/land interface. The resulting shorelines were validated against shorelines derived from video monitoring systems, showing sub-pixel accuracy. The results highlighted that satellite data may represent a cost-effective and low-maintenance complementary alternative to in situ measurements, especially in areas lacking dense ground-based instrumentation. ...
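As a rough illustration of the Otsu method mentioned in the abstract above, the sketch below picks the threshold that maximises the between-class variance of a histogram. It is a generic NumPy version written for this listing, not the thesis code; the function name `otsu_threshold`, the bin count, and the idea of applying it to backscatter values in dB are assumptions.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Return the histogram bin centre that maximises between-class variance."""
    hist, edges = np.histogram(np.ravel(values), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = hist.astype(float) / hist.sum()

    w0 = np.cumsum(weights)                  # probability of the "low" class
    w1 = 1.0 - w0                            # probability of the "high" class
    cum_mean = np.cumsum(weights * centers)  # cumulative first moment
    total_mean = cum_mean[-1]

    # Between-class variance, evaluated only where both classes are non-empty.
    valid = (w0 > 1e-12) & (w1 > 1e-12)
    between = np.zeros_like(w0)
    between[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / (
        w0[valid] * w1[valid]
    )
    return centers[np.argmax(between)]

# Hypothetical usage on synthetic values: low values stand in for water,
# high values for land. Real Sentinel-1 data would need preprocessing first.
rng = np.random.default_rng(0)
synthetic = np.concatenate([rng.normal(-20.0, 1.0, 5000),
                            rng.normal(-8.0, 1.0, 5000)])
threshold = otsu_threshold(synthetic)
water_mask = synthetic < threshold
```

Because the threshold is global, a single value separates the two classes for the whole scene; the shoreline would then be traced along the boundary of the resulting mask.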

Modeling Urban Automated Mobility on-Demand Systems: an Agent-Based Approach

Doctoral thesis (2023) - S. Wang, H.X. Lin, G. Homem de Almeida Correia

Automated Mobility-on-Demand (AMoD) systems are expected to revolutionize urban mobility systems. However, there are uncertainties in the planning and operations of AMoD systems. We deem the agent-based approach well suited for modeling new phenomena in future AMoD systems, and we use it to shed some light on the uncertainties about the operation and the impacts of such systems. Recommendations to various stakeholders are provided through the different contributions. ...

Loss functions and neural networks

Comparing different loss functions for NLP neural networks

Bachelor thesis (2022) - J. Kirchner, H.X. Lin, P.R. van Nieuwenhuizen

Neural networks are an active research field with many open issues, for example the configuration of network architectures and the choice of training strategies. Among these issues, the choice of loss (or cost) function plays an important role in how a neural network model is optimized (trained) and how the model performs after training. Given a choice of measurement criteria, a loss function measures how far an estimated output is from its true value. The measurement criteria can change depending on the task at hand and the goal to be met. The objective of this project is to understand the role of different loss functions and to evaluate how performance depends on the loss function, using the language prediction problem. ...
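To make "how far an estimated output is from its true value" concrete, the sketch below evaluates two common loss functions on the same predicted next-token distribution. The abstract does not say which losses the thesis compares, so cross-entropy and mean squared error are assumptions chosen for illustration, as are all function names.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities, row by row."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(probs, target_idx):
    """Mean negative log-probability assigned to the correct token."""
    rows = np.arange(len(target_idx))
    return float(-np.mean(np.log(probs[rows, target_idx])))

def mean_squared_error(probs, target_idx):
    """Mean squared distance between probabilities and one-hot targets."""
    one_hot = np.eye(probs.shape[1])[target_idx]
    return float(np.mean((probs - one_hot) ** 2))

# Hypothetical 4-token vocabulary: one confident-and-correct prediction and
# one uniform (uninformative) prediction, both with token 0 as the target.
targets = np.array([0])
confident = softmax(np.array([[5.0, 0.0, 0.0, 0.0]]))
uniform = softmax(np.zeros((1, 4)))

ce_confident = cross_entropy(confident, targets)
ce_uniform = cross_entropy(uniform, targets)
mse_confident = mean_squared_error(confident, targets)
mse_uniform = mean_squared_error(uniform, targets)
```

Both losses rank the confident prediction as better, but they scale differently: cross-entropy grows without bound as the probability of the correct token approaches zero, while the squared error stays bounded, which changes the gradients seen during training.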

Using Artificial Intelligence for Aerosol Data Assimilation

Master thesis (2022) - G. van Hemert, H.X. Lin, M.B. van Hoven, O. Hasekamp, A. Tsikerdekis

Studying aerosols in the atmosphere is important for gaining a better understanding of climate change. It is therefore important to obtain accurate observations of aerosols in the atmosphere as well as accurate emission fluxes of aerosol species. Satellite instruments such as SPEXone are able to measure aerosol properties with high accuracy. Unfortunately, the instrument has a low daily global coverage. To obtain full daily global coverage, methods such as data assimilation are used. However, these methods have a high computational cost. This report investigates the use of neural networks to obtain global daily coverage of aerosol properties and emission fluxes at a lower computational cost. Two networks are trained. The first produces global coverage of the aerosol properties Aerosol Optical Depth at 550 nm (AOD), Single Scattering Albedo at 550 nm (SSA) and Ångström Exponent between 550 nm and 865 nm (AE). The second is trained for global emission fields of the species dimethylsulfide (DMS), sulfur dioxide (SO2), black carbon (BC), organic carbon (OC), sea salt (SS) and dust (DU). The results from these trained networks are compared to the results of a control experiment, which represents our prior knowledge of the aerosol fields and emissions, although not the truth. It is found that the network for aerosol properties has significantly lower errors than the control experiment. For both AOD and AE the network shows a large improvement; for SSA the improvement is slightly smaller, likely due to a lower performance of the control experiment compared to AOD and AE. The network for the emissions also shows a noticeable improvement over the control experiment for all species except DMS, where there is only a small improvement because the DMS value of the control experiment is already accurate. It is also found that the network for emissions overfits due to too little variation in the training and testing data.

Advecting Superspecies

Reduced order modeling of organic aerosols in LOTOS-EUROS using machine learning

Master thesis (2021) - P.O. Sturm, H.X. Lin, C. Vuik, A. M. M. Manders-Groot, A. J. Segers

Chemical transport models (CTMs) are used to improve our understanding of the complex processes influencing atmospheric composition, as well as provide operational air quality forecasts and model potential future air quality scenarios. Numerical tracers in CTMs track the concentration of chemical species, while operators simulate various physical processes such as advection. One such CTM, LOTOS-EUROS, uses a volatility basis set (VBS) approach to represent the formation of organic aerosol (OA) in the atmosphere, which contributes to the concentration of total particulate matter. The added dimensionality of the VBS tracers in LOTOS-EUROS slowed down computation of the advection operator by a factor of two, limiting their representation in operational forecasts.
To keep the detailed process representation of OA formation, while reducing the computational costs, we develop an unsupervised machine learning method to compress the VBS tracers to a set of superspecies for use in advection, and subsequently decompress superspecies back to the tracer space for OA-relevant calculations. The focus of this machine learning method is physical interpretability, allowing for operators to resolve equations using the superspecies. This method conserves mass to machine precision and retains important information like phase (gas or aerosol) on compression. This data-driven approach reduces the dimensionality of the system more than a second proposed approach based on partitioning theory. The ML superspecies approach was integrated into LOTOS-EUROS for online calculations, showing numerical stability over a model simulation time of two weeks under various conditions. With the superspecies, the computation time for advection is reduced by 56% to 66% of the time for advection of the VBS tracers. The results of this approach show potential for use in accelerating air quality operational forecasts, as well as pathways forward for integration of ML box models of atmospheric chemistry into CTMs. ...

Numerical Methods for Large Thermo-Mechanical Systems

Master thesis (2021) - E.I. Maquelin, C. Vuik, Victor Dolk, H.X. Lin

Numerical methods are investigated for solving large-scale sparse linear systems of equations that can be applied to thermo-mechanical models and wafer-slip models. This thesis examines efficient numerical methods, in terms of memory, number of iterations required for convergence, and computation time. To be more specific, algebraic multigrid (AMG) methods and deflation methods are considered as preconditioners for the conjugate gradient method. We investigate if smoothed aggregation AMG or adaptive smoothing and prolongation based AMG improve upon the classical Ruge-Stüben AMG. It is shown that Ruge-Stüben AMG needs fewer iterations for the test problems. However, smoothed aggregation AMG has a smaller data size, which is of interest for situations with limited memory or large systems of equations. Moreover, the mechanical problems considered have a coefficient matrix with a block structure, which can be exploited by preconditioners like block Jacobi or the incomplete block Cholesky decomposition; smoothed aggregation AMG can also take the block structure into account when creating coarser grids. Further, we examine if the results of the conjugate gradient method can be improved by adding a deflation preconditioner based on the proper orthogonal decomposition or rigid body modes. They are combined with a direct or stationary iterative preconditioner, resulting in two-level preconditioned conjugate gradient methods. The various implementations of such methods are discussed, and the deflation preconditioner is shown to generally reduce the number of iterations compared to the single preconditioner. ...
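The preconditioned conjugate gradient iteration that the abstract above builds on can be sketched in a few lines. This is a textbook version with a plain Jacobi (diagonal) preconditioner standing in for the AMG and deflation preconditioners studied in the thesis, which are considerably more involved; the function name `pcg`, the tolerance, and the 1D test matrix are assumptions.

```python
import numpy as np

def pcg(A, b, apply_preconditioner, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradients for a symmetric positive definite A.

    apply_preconditioner(r) should return an approximation of A^{-1} r.
    Returns the solution estimate and the number of iterations used.
    """
    x = np.zeros_like(b, dtype=float)
    b_norm = np.linalg.norm(b)
    if b_norm == 0.0:
        return x, 0
    r = b - A @ x
    z = apply_preconditioner(r)
    p = z.copy()
    rz = r @ z
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)          # step length along the search direction
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        z = apply_preconditioner(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p      # new A-conjugate search direction
        rz = rz_new
    return x, max_iter

# Hypothetical test problem: 1D Poisson matrix (tridiagonal, SPD).
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
diag = np.diag(A)
x, iterations = pcg(A, b, lambda r: r / diag)
```

Swapping in a different preconditioner only requires passing another `apply_preconditioner` callable; the iteration itself stays unchanged, which is why iteration counts are a natural way to compare preconditioners.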

GPU Implementation of Grid Search based Feature Selection

Using Machine Learning to Predict Hydrocarbons using High Dimensional Datasets

Master thesis (2020) - Tjalling Ament, Hai Xiang Lin, Chris te Stroet

To optimize the exploitation of oil and gas reservoirs both on- and offshore, Biodentify has developed a method to predict prospectivity of hydrocarbons before drilling. This method uses microbiological DNA analysis of shallow soil or seabed samples to detect vertical upward microseepage from hydrocarbon accumulations, which changes the composition of microbes at the surface.
Microbiological DNA analysis of shallow soil or seabed samples results in a high-dimensional dataset, which is interpreted using machine learning. Using the machine learning method Elastic Net, features (microbes) are selected from an existing DNA database to classify new shallow soil or seabed samples. Multiple models, each with a different combination of externally set parameters (called hyperparameters), are trained to improve accuracy, essentially creating a grid of models. The aim of this thesis is to investigate whether feature selection on high-dimensional datasets can be accelerated by implementing a parallel design on a GPU to train this grid of models, and to investigate the performance of this GPU implementation. Inspired by an implementation called Shotgun, which improves performance by exploiting parallelism across features when training a single model on a CPU, an implementation named GPU Shotgun (GPU-SG) was devised, which exploits parallelism across samples, features, and multiple models in the grid (of combinations of hyperparameters). Depending on the size of the grid and the hardware, GPU-SG reaches a speedup of between 0.2 and 5.26 for sparse datasets (datasets with many 0 values) when compared to standard CPU implementations. For dense datasets (datasets with few 0 values), GPU-SG achieves a speedup of between 0.5 and 10. The amount of memory available to store a dataset is lower for GPUs than for a CPU, and the design is currently limited by this, because it does not allow a dataset that is larger than the available memory. GPU-SG can be used to design improved implementations, which reduce the time when the GPU or CPU is idle to improve performance. ...
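The "grid of models" in the abstract above can be pictured with a small sequential sketch: one Elastic Net model per hyperparameter combination, each fitted by cyclic coordinate descent. This is not GPU-SG or Shotgun. It runs on a CPU without any parallelism, and it uses a least-squares objective as a simplification of the classification setting; the names `elastic_net_cd` and `fit_grid` and the `alpha`/`l1_ratio` parameterisation are assumptions.

```python
import itertools
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the L1 proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def elastic_net_cd(X, y, alpha, l1_ratio, n_sweeps=200):
    """Minimise (1/2n)||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2 by cyclic coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    residual = y.astype(float).copy()        # y - X @ w, kept up to date
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue                      # all-zero feature: weight stays 0
            rho = X[:, j] @ residual / n + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1.0 - l1_ratio)
            )
            residual += X[:, j] * (w[j] - new_wj)
            w[j] = new_wj
    return w

def fit_grid(X, y, alphas, l1_ratios):
    """Fit one model per (alpha, l1_ratio) pair; the models are independent."""
    return {
        (a, r): elastic_net_cd(X, y, a, r)
        for a, r in itertools.product(alphas, l1_ratios)
    }

# Hypothetical data: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[:2] = [3.0, -2.0]
y = X @ true_w
models = fit_grid(X, y, alphas=[0.01, 10.0], l1_ratios=[0.5, 1.0])
selected = {key: np.flatnonzero(w) for key, w in models.items()}
```

Each grid entry is independent of the others, and within one model the per-feature updates touch separate columns of `X`; those are the two axes of parallelism, alongside samples, that the abstract describes GPU-SG as exploiting.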

Deep Learning Techniques for Low-Field MRI

Master thesis (2020) - Dilan Geçmen, Martin van Gijzen, Merel de Leeuw den Bouter, Rob Remis, Hai Xiang Lin

Delft University of Technology (TU Delft), Leiden University Medical Center (LUMC), Pennsylvania State University (PSU) and Mbarara University of Science and Technology (MUST) have an ongoing collaboration to create an affordable, portable and simplified version of the magnetic resonance imaging (MRI) scanner for the CURE children’s hospital to diagnose children with hydrocephalus (water on the brain). As opposed to the conventional MRI scanner, the low-field MRI prototype uses permanent magnets to create a magnetic field on the order of milliteslas (mT). A downside of the low-field MRI application is the difficulty with spatial encoding due to small variations in the strength of the magnetic field. This is a major problem for image reconstruction. The purpose of this research was to implement a deep learning (DL) network to overcome two of the major bottlenecks in image reconstruction for low-field MRI. These are the lack of real measured data for DL purposes, and the signal model associated with the low-field MRI. For DL purposes we generated synthetic data and acquired measured data. Each dataset consists of samples, and each sample consists of an image and the corresponding signal. Due to technical limitations the measured dataset is small: 53 samples. To partially circumvent the problem, the dataset was augmented to a total of 1908 samples. In addition, we used transfer learning, which is a powerful method that applies knowledge gained from one problem to a different but related problem. We present three image reconstruction techniques, Models I, II, and III, based on convolutional and feedforward neural networks, which take MR signal data as input and directly and quickly output an image. We demonstrated that DL generates high quality images using synthetic data. In addition, we showed that Model III needs less training to reconstruct good quality images compared to Models I and III, respectively.
Finally, Models I and III were unsuccessfully applied to real measured data. However, this study shows that neural networks are able to find a mapping between signal and image; therefore, this idea can be extended to work on real measured data. ...

Deepfake Detection Using Convolutional Neural Networks

Working Towards Understanding the Effects of Design Choices

Master thesis (2020) - Bodine van Leeuwen, H.X. Lin, A.W. Heemink, M.B. van Gijzen, S.Q. Dijkhuis

When building a convolutional neural network, many design choices have to be made. In the case of Deepfake detection, there is no readily implementable recipe that guides these choices. This research aims to work towards understanding the effects of design choices in the case of Deepfake detection, using the Python library Keras and publicly available datasets. The choices analysed are dataset composition, preprocessing, dropout rate, batch size, network architecture, and specific dataset. We also analyse the difference between training a network on original images and DCT-residuals of images. Lastly, we analyse the networks' generalisation and robustness capabilities. The goal of these experiments is to work towards a readily implementable recipe for Deepfake detection algorithms. Furthermore, this research provides an overview of image manipulation algorithms, an overview of recent research into convolutional networks, and an extensive overview of the Deepfake detection research field. To analyse dataset composition, we used different subsets of FaceForensics++, with different numbers of frames per video. We trained a shallow network, containing only four convolutional layers, on all three datasets. The dataset with one frame per video was the only one that did not result in immediate overfitting, although it contained less than two thousand frames in total. We continued with the small dataset and tested different preprocessing settings and dropout rates on our shallow network. We found that preprocessing and dropout were not able to increase the maximum achievable accuracy, although they were able to curtail overfitting. It is possible that accuracy did not increase due to the high variety of artefacts in FaceForensics++. Batch size also does not have any effect on the maximum achievable accuracy. However, the runtime required for training a network increases considerably as the batch size decreases. 
We tested DenseNet-121, Inception-v3, ResNet-152, VGG16, VGG19, Xception, and our shallow network on three different datasets: FaceForensics++, Celeb-df, and DeeperForensics-1.0. The goal of this experiment was to find what type of network is most suited for detecting Deepfakes with publicly available datasets of small size but high variation. We also wanted to see if there were differences in performance achieved on the different datasets. Although all networks except the shallow one were pretrained on ImageNet, three of the networks tested immediately overfitted on all three datasets used: DenseNet-121, Inception-v3, and Xception. The other networks encountered most difficulties with Celeb-df, on which none of the networks managed to reach 70% accuracy before overfitting. The easiest dataset to train on was DeeperForensics-1.0, on which our shallow network achieved 92.5% accuracy. However, when testing the networks' robustness, none of the networks trained on DeeperForensics-1.0 reached an accuracy higher than random on Celeb-df or FaceForensics-1.0. Shallow networks might be better suited for Deepfake detection on our small dataset. The shallow model and VGG16 achieved the highest accuracies. VGG19's performance is close to VGG16's. However, ResNet-152 is the deepest network used in this research and performs better than the shallower DenseNet, Inception-v3, and Xception. Lastly, training our networks on DCT-residuals was supposed to help our network focus on statistical image content rather than semantic image content. However, performance on DCT-residuals was at best similar to performance on original images. Our suggestions for continuation of this research are to experiment with different input sizes and other types of residual filters, to collect larger (and higher quality) datasets with high variety, and to use interframe detection. ...

Predicting the air quality by combining model simulations with machine learning

Master thesis (2020) - Rick Hegeman, Hai Xiang Lin, Arnold Heemink, Martin van Gijzen

Combating air pollution has proven to be a difficult task for countries with rapidly developing economies. Poor air quality can be hazardous to people doing any outdoor activities, so being able to make accurate, short-term air quality predictions can be very useful. However, making these predictions has proven to be quite difficult, since many different physical and chemical processes are involved in the emission and transport of the various aerosols that contribute to air pollution. Instead of the more traditional Chemical Transport Models (CTMs), we use neural networks to make predictions of one of these aerosols, PM2.5. In particular, we use a Long Short-Term Memory (LSTM) network. In addition, we include the simulation results from a CTM, LOTOS-EUROS, as input data to the LSTM network to improve the performance of the neural network. One of the main drawbacks of the LSTM approach is that whenever the PM2.5 concentration changes a lot, the predictions made by the LSTM network take some time to change as well, causing a visible time delay when looking at the measurements and predictions in the same time series plot. We also try a simpler type of neural network, a Feedforward Neural Network (FNN), and compare its performance to that of the LSTM. We found that using the simulation data does indeed improve the LSTM network. The improvement shows not only in the loss function used by the neural network but, in particular, in the number of gross overestimations by the network, which we use to quantify the LSTM time delay problem. We also found that the FNN outperforms the LSTM approach, in particular on samples of high PM2.5 concentrations, which we argue is primarily caused by the low number of samples in our dataset. ...

Machine Learning Based Error Modeling for Surrogate Model in Oil Reservoir Problem

Master thesis (2019) - Jie Huang, Hai Xiang Lin, Arnold Heemink, Juan Juan Cai

This thesis focuses on the construction and optimization of a prediction model for the errors resulting from a model order reduction (MOR) procedure in oil reservoir simulation. MOR is a numerical technique that projects the physics-based model, also called the high-fidelity model (HFM), onto a lower dimension by using matrix decomposition, so that the computational speed can be greatly increased. The reduced order model (ROM) is also known as a surrogate model. The projection inevitably introduces error. We want to estimate and predict this error by building an error model, and to strengthen the surrogate model by adapting a parameter estimation. In this thesis, three statistical methods are adapted to our problem: the least absolute shrinkage and selection operator (LASSO) and two machine learning (ML) methods, long short-term memory (LSTM) and a fully-connected recurrent neural network (RNN). The training data is the error of the ROM, defined as the difference between the ROM values and the HFM values. Efforts have also been made to improve the performance of the error model, including pre-processing of the data and several model optimization techniques. The model order reduction method used here is a non-intrusive subdomain POD-RBF algorithm, which treats subsurface oil-water flow data by combining domain decomposition (DD), radial basis functions (RBF), and proper orthogonal decomposition (POD). The high-fidelity model is generated with the Matlab Reservoir Simulation Toolbox (MRST). Comparing several statistical models shows that this error is best predicted by an optimized traditional recurrent neural network. ...

Dust storm emission inversion using data assimilation

Doctoral thesis (2019) - Jianbing Jin, Hai Xiang Lin, Arnold Heemink

Severe dust storms pose great threats to the environment, property, and human health in areas downwind of arid regions. Several dynamical dust models have been developed to predict dust concentrations in the atmosphere. Currently, the accuracy of these models is limited mainly by the imperfect modeling of dust emissions. Along with progress in dust and aerosol modeling, advances in sensor technologies have made large-scale aerosol measurements feasible. These rich measurements provide opportunities to estimate uncertain emission fields and, subsequently, to improve the forecast skill. Such a process of emission optimization conditioned on measurements is usually referred to as emission inversion. Here, the term emission inversion specifically denotes deriving estimates from observations through the use of an atmospheric chemical transport model and a data assimilation method. ...

Deep Learning Architectures for PM2.5 and Visibility Predictions

Master thesis (2018) - Yu Xie, Hai Xiang Lin, Jianbing Jin

Facing the severe air pollution phenomenon in urban areas and the subsequent low visibility event in airports, it is urgent to conduct air quality and visibility predictions to better reflect their changing trends. However, the variations of PM2.5 and visibility involve complicated physical and chemical processes, which make their accurate predictions challenging.
In this thesis, methodologies to predict PM2.5, PM10, and visibility using Long Short-Term Memory Neural Networks (LSTM NN) were investigated. The first step of the proposed methodology was dataset analysis and preprocessing, which is an important step in almost all machine learning problems. Because missing, confusing, or incorrect data are common in large datasets, noise and errors were first corrected and missing rates were calculated. Afterward, the datasets were visualized to evaluate the pattern of missing values for the different features. Because strong spatiotemporal correlations were found, air quality features with high missing rates were filled by linear interpolation when the missing interval was short and by k-Nearest Neighbor (kNN) imputation when the missing interval was long.
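The split between short and long gaps can be sketched as follows. This is a minimal illustration, not the thesis's preprocessing code: the function name `fill_short_gaps` and the `max_gap` cutoff are assumptions, and longer gaps are simply left as `None` so that a separate step (for example kNN imputation) can handle them.

```python
def fill_short_gaps(values, max_gap):
    """Linearly interpolate runs of None that are at most `max_gap` long.

    Longer runs, and runs touching either end of the series, stay None so
    that another method can fill them. `max_gap` is a hypothetical cutoff.
    """
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        j = i
        while j < n and filled[j] is None:   # find the end of this run of gaps
            j += 1
        gap = j - i
        if i > 0 and j < n and gap <= max_gap:
            left, right = filled[i - 1], filled[j]
            for k in range(gap):
                # evenly spaced points between the two known neighbours
                filled[i + k] = left + (right - left) * (k + 1) / (gap + 1)
        i = j
    return filled


series = [1.0, None, 3.0, None, None, None, 7.0, None]
print(fill_short_gaps(series, max_gap=2))
```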
Furthermore, PM2.5 or PM10 prediction is usually treated as a regression task that aims to minimize the mean squared error (MSE) between the predicted and measured values. However, visibility data show high variability and a 'class-imbalance' phenomenon: most of the available data relate to 'normal' situations, and extreme conditions are rare events. Visibility prediction is therefore better handled as a classification problem. Because the most interesting cases to predict are those rare extreme events, the target was adapted to minimize the weighted cross-entropy.
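A weighted cross-entropy of the kind described here multiplies each sample's log-loss by a weight for its true class, so that errors on rare classes cost more. The sketch below is a plain-Python illustration under stated assumptions: the function name, the example weights, and averaging over the number of samples are choices made here, not details taken from the thesis.

```python
import math


def weighted_cross_entropy(probs, labels, class_weights):
    """Mean weighted cross-entropy over a batch.

    probs: one list of predicted class probabilities per sample.
    labels: true class index per sample.
    class_weights: one weight per class; rare classes get larger weights.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # clamp to avoid log(0) when a class is assigned zero probability
        total += -class_weights[y] * math.log(max(p[y], 1e-12))
    return total / len(labels)


# Class 0 = 'normal' visibility, class 1 = rare extreme event (illustrative).
probs = [[0.9, 0.1], [0.6, 0.4]]
labels = [0, 1]
print(weighted_cross_entropy(probs, labels, class_weights=[1.0, 1.0]))
print(weighted_cross_entropy(probs, labels, class_weights=[1.0, 5.0]))
```

With the larger weight on class 1, the missed extreme event in the second sample dominates the loss. Some libraries normalize by the sum of the applied weights instead of the sample count, which rescales the value but not the relative emphasis.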
The second step of the proposed methodology was to configure the frameworks. For PM2.5 predictions, feature engineering was employed to select appropriate features, and some model hyperparameters were set through grid searches and coordinate descent. A coarse-to-fine sampling scheme was used to determine the weights in the loss function for visibility predictions.
The third step of our research was performance evaluation. For PM2.5 predictions, scatter plots and confusion matrices show that the proposed spatiotemporal LSTM framework can overcome the systematic underestimation that Lotos-Euros (a system based on chemical transport models (CTMs)) generally produces. Additionally, in terms of root mean square error (RMSE) and mean absolute error (MAE), it performs better than an LSTM-based prediction framework (Fan J et al. (2017) [9]) that also considers spatial correlations among stations and performs a similar task in a similar region. Differences between the hyperparameters of these two frameworks were analyzed.
As for PM10 predictions, the training efficiency can be improved significantly by transferring knowledge from PM2.5 predictions to PM10 predictions through model fine-tuning. Compared with Lotos-Euros, the LSTM framework also performs competitively in PM10 predictions. As a first attempt at applying LSTM NN to predict visibility, the forecasts are acceptable in practice. The total accuracy rate reaches 90.61%. The recall of the normal situation (L1) is 93% and its precision is 96%, indicating strong prediction performance in normal situations. Moreover, for each visibility level, the number of correct predictions is larger than the number of incorrect predictions.

PM2.5 concentration prediction and early warning system of extreme conditions based on the LSTM

Master thesis (2018) - Siyu Guan, Hai Xiang Lin, Juanjuan Cai

This thesis project developed an alternative PM2.5 concentration prediction model and an early warning system for extreme air pollution, both based on long short-term memory (LSTM), and achieved satisfactory performance. We divided the work into two tasks. The first task was predicting the PM2.5 concentration over the next 24 hours, and the second was building an early warning system for extreme air pollution over the next 12 hours.
To solve the first task, we started from the 1-hour prediction problem, that is, predicting the PM2.5 of the next hour based on the previous hours’ data. We optimized the parameters to derive the best network architecture and obtained an RMSE of 19.7863. We then built a 24-hour prediction model, based on the optimal 1-hour prediction model, that predicts the PM2.5 concentration of the next 24 hours. The proposed 24-hour prediction model exhibited satisfactory performance, including on the 13-24 h prediction task, which is predicting the mean PM2.5 concentration over the next 13-24 hours (RMSE=49.41).
Although we obtained a satisfactory RMSE for the PM2.5 prediction problem, the predictions for extreme conditions were not accurate, which is why we continued with the second task. We regarded the highest PM2.5 value within 12 hours as the extreme air pollution of that period and divided the warning scale into 4 levels. We then built the early warning system based on the LSTM to predict the warning level of the highest PM2.5 value of the next 12 hours. As indicated by the ACC and AUC, our LSTM model achieved sound performance (ACC=86.7%, AUC=0.837).
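The labeling step for the early warning task (take the highest PM2.5 value in the next 12 hours and map it to one of 4 levels) can be sketched as below. The threshold values are hypothetical placeholders chosen only to produce four classes; the abstract does not state the actual level boundaries, and the function name is likewise an assumption.

```python
from bisect import bisect_right

# Hypothetical boundaries between the four warning levels (0 to 3).
# These numbers are placeholders, not the thresholds used in the thesis.
THRESHOLDS = [75, 150, 250]


def warning_labels(pm25, horizon=12, thresholds=THRESHOLDS):
    """For each hour t, return the warning level of the peak PM2.5
    observed over hours t+1 .. t+horizon."""
    labels = []
    for t in range(len(pm25) - horizon):
        peak = max(pm25[t + 1 : t + 1 + horizon])
        # bisect_right counts how many thresholds the peak reaches or exceeds
        labels.append(bisect_right(thresholds, peak))
    return labels


hourly = [10, 80, 20, 300, 30]
print(warning_labels(hourly, horizon=2))
```

Each label can then serve as the classification target for the hour it belongs to, which turns the warning task into a 4-class problem.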
To improve the prediction performance, we applied several model optimization techniques to the 1-hour prediction model, and each technique effectively improved the accuracy. Moreover, combining these optimization methods led to the lowest RMSE of 14.1937. The combined optimization method performed better than any single optimization method, which suggests that effective optimization methods can be used together to improve the prediction accuracy of the LSTM model. In addition, we compared our model with a random forest (RF) model, and the comparison showed that the LSTM network worked better for both tasks.

High Performance Data Traversal

Cache Aware Computing With Space Filling Curve

Master thesis (2017) - Sagar Dolas, Kees Vuik, Matthias Möller, Hai Xiang Lin, V Galavi

"What Mathematics is to Physics, Data traversal is to High-performance computing." The world of computational science has witnessed an exponential expansion of sophisticated numerical algorithms in the last few decades, mainly to understand minute details and solve complex physical problems. It has established itself as the third pillar of science after theory and experimentation and has gained immense popularity as a mainstream research area among academics and scientists working in entirely different fields. Computational science has brought together mathematicians and computer scientists to work in close collaboration on a variety of interdisciplinary research problems. The principal challenge in achieving high performance on nearly every front is data traversal, data placement, and the memory access pattern, which strongly influence floating-point performance and energy efficiency. Data traversal is the soul of high-performance computing. Indeed, it is the backbone: the way data travels to the CPU from main memory largely determines the performance of a particular computational kernel on a specific machine architecture. The majority of modern computing devices are designed to deliver high performance if data traversal can utilize the maximum bandwidth to main memory (DRAM) and make efficient use of the hierarchical memory structure. Thus, a hardware-optimal data access pattern should be designed to take advantage of the underlying hardware to scale and achieve performance, and that forms the central theme of this work.
The more important point is that expensive hardware or massive computational infrastructure does not by itself deliver high-performance computing. What does is the implementation of hardware-aware mathematical ideas, cache-efficient data traversal strategies, sensible use of parallel programming paradigms, and energy-aware management of computational resources on machines ranging from a basic NUMA system to an entire million-core server stack. In this master's thesis, we first investigate the impact of data traversal patterns on the performance of several micro-benchmarks on a Non-Uniform Memory Access machine, and in the second part, we implement a Morton-order Space Filling Curve to improve cache utilization for two numerical methods and analyze the performance impact.
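For readers unfamiliar with Morton order, the sketch below shows the usual bit-interleaving construction for 2D indices: the bits of x and y are interleaved into a single key, and cells are visited in key order, which keeps spatially close cells close in memory. It is a generic illustration limited to 16-bit coordinates, with function names chosen here; it is not the thesis's implementation.

```python
def part1by1(n):
    """Spread the low 16 bits of n so a zero bit separates each original bit."""
    n &= 0x0000FFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n


def morton_encode(x, y):
    """Interleave bits: x occupies the even bit positions, y the odd ones."""
    return part1by1(x) | (part1by1(y) << 1)


def morton_order(width, height):
    """Return all (x, y) cells of a grid sorted along the Morton (Z) curve."""
    cells = [(x, y) for y in range(height) for x in range(width)]
    return sorted(cells, key=lambda c: morton_encode(c[0], c[1]))


print(morton_order(4, 4))
```

On a 4x4 grid the traversal visits each 2x2 block completely before moving to the next, which is the locality property that cache-aware layouts rely on.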

Optimaliseren van de service van een taxisysteem met zelfrijdende voertuigen

Bachelor thesis (2017) - Irene Vooijs, Hai Xiang Lin, Jacob van der Woude, Emiel van Elderen, Leo van Iersel

This project covers the programming and application of a taxi service in which passengers want to travel to or from the train station. Integer linear programming is used to determine which trips the taxis should drive in order to maximize profit. The model is based on the model described in the 2016 report by Xiao Liang et al.
Two models are compared: in Model 1 the system is free to accept or reject requests, while in Model 2 it is decided per zone whether or not all trips are accepted. The taxi service is first applied on a small scale, after which some adjustments are made so that the model can also be applied on a large scale. In the small-scale application a loss is always made, because the ratio of taxis per zone is very high. For the large-scale application, the model agrees closely with Liang's model for large numbers of taxis, but less so for smaller numbers of taxis, because trip generation in Liang's model is less homogeneous. The optimal number of taxis to use is always 20 or 40. ...

Max-plus algebra en een toepassing in de luchtvaart

Bachelor thesis (2017) - Rolf Rengers, Jacob van der Woude, Hai Xiang Lin, Neil Budko, Emiel van Elderen

This bachelor's thesis is about max-plus algebra, an algebraic structure that can be used to model timetable scheduling. Instead of the usual addition and multiplication, the operations 'taking the maximum' and addition are used. When max-plus algebra is applied to matrices, the eigenvalues and eigenvectors of these matrices play a role. All of this is applied to an example in aviation, in which max-plus algebra is used to find an optimal timetable in a network with two hubs on different continents. ...
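A small sketch of the max-plus operations may help: the matrix product replaces multiplication with addition and addition with the maximum, and for an irreducible matrix the eigenvalue equals the maximum cycle mean. The 2x2 matrix below is a made-up example, not data from the thesis, and the function names are chosen here for illustration.

```python
NEG_INF = float("-inf")  # the max-plus "zero": neutral element for max


def maxplus_matmul(a, b):
    """Max-plus matrix product: (A (x) B)[i][j] = max_k (A[i][k] + B[k][j])."""
    return [[max(a[i][k] + b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def maxplus_matvec(a, x):
    """Max-plus matrix-vector product: (A (x) x)[i] = max_k (A[i][k] + x[k])."""
    return [max(a[i][k] + x[k] for k in range(len(x))) for i in range(len(a))]


def maxplus_eigenvalue(a):
    """Maximum cycle mean, computed as max over k = 1..n of
    (largest diagonal entry of the k-th max-plus power of A) / k."""
    n = len(a)
    power = a
    best = NEG_INF
    for k in range(1, n + 1):
        best = max(best, max(power[i][i] for i in range(n)) / k)
        power = maxplus_matmul(power, a)
    return best


# Hypothetical travel/turnaround times between two hubs.
A = [[3, 7],
     [2, 4]]
lam = maxplus_eigenvalue(A)
x = [2.5, 0.0]  # an eigenvector for this A: A (x) x equals lam added to each entry of x
print(lam, maxplus_matvec(A, x))
```

In a timetable model of the form x(k+1) = A (x) x(k), the eigenvalue is commonly read as the cycle time of the schedule and the eigenvector as a set of departure offsets that repeats with that period.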