Transformer-Based Synthetic Relational Data
Closing the Gap Between Diffusion-Based and Transformer-Based Synthetic Relational Data Generation
G.W.K. Paardekooper (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Lydia Y. Chen – Mentor (TU Delft - Data-Intensive Systems)
Jeroen Galjaard – Mentor (TU Delft - Data-Intensive Systems)
Christoph Lofi – Graduation committee member (TU Delft - Web Information Systems)
Abstract
Data sharing for research and industrial applications faces significant challenges due to privacy constraints and regulatory requirements, driving the need for high-quality synthetic alternatives.
Recent advances in synthetic data generation have demonstrated considerable success for single-table datasets, with emerging research extending these capabilities to multi-table relational scenarios.
While transformer and diffusion architectures achieve state-of-the-art performance in single-table generation, a notable gap emerges when they are applied to relational data, where diffusion approaches consistently outperform transformer-based methods.
This thesis examines the factors behind this performance difference, evaluating multiple baselines across both single-table and relational datasets, with REaLTabFormer and ClavaDDPM as the state-of-the-art transformer- and diffusion-based approaches, respectively.
Our investigation reveals that the gap can mainly be attributed to transformer-based models' inadequate processing of contextual relationships and suboptimal strategies for representing inter-table dependencies.
To close this gap, we introduce two modifications to transformer-based models: layer sharing, to improve parameter utilization, and contextual encoding, to better preserve relational structure.
These changes provide insight into the key design principles behind effective synthetic relational data generation using transformer-based models, particularly the need for architectures that account for context and facilitate practical knowledge transfer.
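To make the two modifications concrete, the following is a minimal sketch in PyTorch, not the thesis's actual implementation: a single transformer block is reused across depth (layer sharing), and the child-table decoder is conditioned on an encoding of the parent row by prefixing it to the token sequence (contextual encoding). All class names, dimensions, and the prefix-conditioning scheme are illustrative assumptions.

```python
# Illustrative sketch only; not the thesis's code. Shows (1) layer sharing:
# one transformer block applied repeatedly so all logical layers share weights,
# and (2) contextual encoding: the child-table decoder attends to an encoded
# parent row prefixed to its input sequence.
import torch
import torch.nn as nn


class ContextualSharedDecoder(nn.Module):
    def __init__(self, vocab_size=1024, parent_dim=32, d_model=256,
                 n_heads=4, n_logical_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.parent_enc = nn.Linear(parent_dim, d_model)  # contextual encoding
        # One block instantiated once; its parameters serve every logical layer.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.n_logical_layers = n_logical_layers
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, child_tokens, parent_features):
        # Prefix the encoded parent row so every child token can attend to it.
        ctx = self.parent_enc(parent_features).unsqueeze(1)    # (B, 1, d)
        x = torch.cat([ctx, self.embed(child_tokens)], dim=1)  # (B, 1+T, d)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        for _ in range(self.n_logical_layers):  # layer sharing: same weights
            x = self.shared_block(x, src_mask=mask)
        return self.head(x[:, 1:])  # logits for child tokens only
```

Prefix conditioning is only one way to inject parent context; cross-attention over a separate parent encoder, as in sequence-to-sequence child models, is another.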
The proposed methods yield substantial gains, including a 1.52-fold improvement in Logistic Detection and a 1.94-fold reduction in the Discriminator Measure.
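For reference, one common convention for a Logistic Detection style metric is sketched below: a logistic regression is trained to tell real rows from synthetic ones, and its cross-validated ROC AUC is mapped so that a score of 1 means the synthetic data is indistinguishable from the real data. The exact protocol and scaling used in the thesis are not reproduced here; this formulation is an assumption following a common detection-metric convention.

```python
# Hedged sketch of a Logistic Detection style score (one common convention).
# AUC near 0.5 (classifier cannot separate real from synthetic) maps to ~1.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def logistic_detection(real: np.ndarray, synthetic: np.ndarray) -> float:
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
    aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=3, scoring="roc_auc")
    # Map mean AUC in [0.5, 1.0] to a score in [0, 1]; 1 = indistinguishable.
    return float(1.0 - (max(0.5, aucs.mean()) * 2.0 - 1.0))
```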