Novel Statistical Techniques for Complex Data Structures

Doctoral Thesis (2025)
Author(s)

S. Li (TU Delft - Statistics)

Contributor(s)

G. Jongbloed – Promotor (TU Delft - Statistics)

Y. Tian – Promotor (Beijing Institute of Technology)

P. Chen – Copromotor (TU Delft - Statistics)

Research Group
Statistics
Publication Year
2025
Language
English
ISBN (electronic)
978-94-6518-125-7
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis applies statistical methods to problems in which data are hard to obtain, spatiotemporal, or non-normal. Such problems arise across a wide range of applications and often resist standard treatment, prompting the development of novel statistical models and methods. While the subjects covered in this thesis are diverse, they consistently revolve around statistical modelling for complicated data structures, where traditional methods fail to provide viable solutions. The topics include: optimization of a computationally expensive black-box function based on limited data; prediction of a spatiotemporal flow at a new input using data stored as a tensor; evaluation of a manufacturing process with non-normal data and asymmetric tolerances; and, in the presence of dependence structures, definition of the joint distribution of multivariate Poisson counts.

In this thesis, we group the topics into two classes of problems and propose tailored statistical methods for each. First, we consider the case where acquiring data is intricate.
Two specific examples are considered: the optimization of black-box functions and the prediction of spatiotemporal flows. In both cases, data are expensive and time-consuming to obtain, so the key is to achieve specific goals while minimizing collection costs. We design suitable inputs and apply a Gaussian process model as a surrogate, after which customized methods are employed. For black-box functions, where no explicit expression is available, the goal is to select inputs that are as close as possible to the true optimum. Thus, a sequential design based on Gaussian process models is conducted.
It explores the whole feasible region and identifies a local area close to the true optimum, based only on limited inputs. More points are then chosen within this region to locate the optimum.
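The sequential-design idea can be sketched in a few lines. This is only an illustration, not the thesis's actual criterion or implementation: it uses a toy objective (the Forrester test function), a fixed squared-exponential kernel, and the classical expected-improvement rule to pick each new input from a grid.

```python
# Minimal sketch: sequential design with a Gaussian process surrogate.
# Objective, kernel, and all parameter values are illustrative choices.
import numpy as np
from scipy.stats import norm

def f(x):
    # Forrester test function, a common stand-in for an expensive black box
    return (6 * x - 2) ** 2 * np.sin(12 * x - 4)

def rbf(a, b, ls=0.1):
    # squared-exponential kernel on scalar inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # standard GP regression equations via a Cholesky factorization
    K = rbf(X, X) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    Ks = rbf(X, Xs)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    mu = Ks.T @ alpha
    # k(x, x) = 1 for this kernel; floor the variance for numerical safety
    var = np.maximum(1.0 - np.sum(v ** 2, axis=0), 1e-12)
    return mu, var

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 4)            # small initial design
y = f(X)
grid = np.linspace(0, 1, 201)
for _ in range(10):                 # sequential design loop
    mu, var = gp_posterior(X, y, grid)
    sd = np.sqrt(var)
    best = y.min()
    z = (best - mu) / sd
    # expected improvement for minimization
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)
    x_new = grid[np.argmax(ei)]     # most promising next input
    X, y = np.append(X, x_new), np.append(y, f(x_new))

x_best = X[np.argmin(y)]            # best design point found so far
```

Each iteration spends exactly one expensive evaluation, trading off exploration (high posterior variance) against exploitation (low posterior mean), which is the mechanism the sequential design above relies on.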
When predicting flows, the scenario is notably different due to the multi-scale nature of the limited data in both spatial and temporal dimensions, which are typically collected and represented as a tensor. Directly applying surrogate models is not feasible because the large number of training observations leads to extremely high computational complexity.
To handle such data, a tailored multi-output Gaussian process combined with a compression method is used to extract spatial and temporal basis functions, along with low-dimensional input-dependent coefficients. Surrogate modelling is then implemented on the input-dependent features, allowing accurate and fast prediction of unobserved flows.
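The compression-then-surrogate idea can be illustrated with a deliberately simple stand-in: synthetic spatial snapshots, plain SVD for the basis extraction, and a polynomial fit per coefficient instead of the multi-output Gaussian process used in the thesis. All data and model choices below are assumptions for the sketch.

```python
# Hedged sketch: compress flow snapshots into a few basis functions plus
# low-dimensional input-dependent coefficients, then model the coefficients.
import numpy as np

inputs = np.linspace(0.1, 1.0, 8)       # 8 design inputs
space = np.linspace(0, 1, 50)           # 50 spatial locations
# toy flow field: amplitudes of two spatial modes vary with the input
snapshots = np.array([u * np.sin(np.pi * space)
                      + u ** 2 * np.sin(2 * np.pi * space)
                      for u in inputs])  # shape (8, 50)

# SVD compression: spatial basis + input-dependent coefficients
U, s, Vt = np.linalg.svd(snapshots, full_matrices=False)
r = 2                                    # retained rank
basis = Vt[:r]                           # spatial basis functions, (2, 50)
coeffs = snapshots @ basis.T             # low-dimensional coefficients, (8, 2)

# cheap surrogate on the coefficients (polynomial fit stands in for the GP)
models = [np.polyfit(inputs, coeffs[:, j], deg=2) for j in range(r)]

def predict(u_new):
    # evaluate coefficient models, then reconstruct the full field
    c = np.array([np.polyval(m, u_new) for m in models])
    return c @ basis

# reconstruction error at an unobserved input
truth = 0.55 * np.sin(np.pi * space) + 0.55 ** 2 * np.sin(2 * np.pi * space)
err = np.abs(predict(0.55) - truth).max()
```

The point of the sketch is the shape of the computation: the surrogate is trained on an 8-by-2 coefficient matrix rather than on the full 8-by-50 field, which is what keeps the approach tractable when the tensor of observations is large.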

Next, we address the challenge of analysing non-normal data, covering both continuous and discrete scenarios. The former arises when assessing manufacturing processes, and the latter when analysing multiple dependent Poisson counts. We handle non-normal distributions using the following fact: applying the cumulative distribution function (CDF) of a random variable to that random variable yields a random variable uniformly distributed on $[0, 1]$, provided the CDF is continuous. When evaluating a manufacturing process, data on a characteristic of interest are collected first; process capability indices (PCIs) can then be calculated for quality control. We focus on the case where PCIs must be defined under asymmetric tolerances and the characteristic follows an unknown continuous univariate distribution.
Two tailored PCIs are proposed, with an inverse transformation introduced to handle non-normal data. We estimate the underlying CDF of the data via B-splines, transforming the data to normality. Our PCIs can then be used to assess the capability of an in-control production process in manufacturing conforming products from non-normal data. Furthermore, when dealing with multiple Poisson counts, both the marginal and joint model specifications must accurately capture the random behaviour of the variables. Although interest in this type of data has been growing across various scientific fields, models for multivariate Poisson distributions remain relatively uncommon. To address this, we introduce a novel multivariate Poisson distribution that leverages the multivariate reduction technique and copula methods, offering considerable flexibility in modelling joint distributions and dependence structures. Various probabilistic properties and several estimation techniques are also explored.
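The transform underlying the continuous case can be demonstrated with a toy example. This sketch uses simulated exponential data and the *true* CDF for simplicity, whereas the thesis estimates the CDF from data via B-splines; the distribution and sample size are arbitrary choices.

```python
# Probability integral transform: F(X) is uniform on [0, 1] when F is the
# continuous CDF of X, so Phi^{-1}(F(X)) is approximately standard normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=2.0, size=5000)   # heavily skewed, non-normal sample

u = stats.expon(scale=2.0).cdf(x)           # transform with the true CDF
z = stats.norm.ppf(u)                       # map to the normal scale

raw_skew = stats.skew(x)                    # large for exponential data
trans_skew = stats.skew(z)                  # roughly zero after the transform
```

After this transformation, normal-theory tools such as standard PCI formulas can be applied to `z`, which is the role the inverse transformation plays in the proposed indices.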
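The copula route to dependent counts can likewise be sketched. The following is a generic Gaussian-copula construction, not the distribution proposed in the thesis (which also employs multivariate reduction); the margins, correlation level, and sample size are all illustrative assumptions.

```python
# Dependent Poisson counts via a Gaussian copula: correlate latent normals,
# push them through the normal CDF, then invert the Poisson marginal CDFs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
rho = 0.7
cov = [[1.0, rho], [rho, 1.0]]
z = rng.multivariate_normal([0.0, 0.0], cov, size=20000)  # latent normals
u = stats.norm.cdf(z)                                     # correlated uniforms

n1 = stats.poisson(3.0).ppf(u[:, 0]).astype(int)          # Poisson(3) margin
n2 = stats.poisson(5.0).ppf(u[:, 1]).astype(int)          # Poisson(5) margin

count_corr = np.corrcoef(n1, n2)[0, 1]    # positive dependence between counts
```

The appeal of the construction is the separation it provides: the marginal Poisson behaviour and the dependence structure are specified independently, which is the flexibility the proposed multivariate Poisson model exploits.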

In summary, this thesis contributes to the field by advancing statistical modelling techniques, particularly for complex and challenging data scenarios.
