Transformer-Based Synthetic Relational Data
Closing the Gap Between Diffusion-Based and Transformer-Based Synthetic Relational Data Generation
G.W.K. Paardekooper (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Lydia Y. Chen – Mentor (TU Delft - Data-Intensive Systems)
Jeroen Galjaard – Mentor (TU Delft - Data-Intensive Systems)
Christoph Lofi – Graduation committee member (TU Delft - Web Information Systems)
Abstract
Data sharing for research and industrial applications faces significant challenges due to privacy constraints and regulatory requirements, driving the need for high-quality synthetic alternatives.
Recent advances in synthetic data generation have demonstrated considerable success for single-table datasets, with emerging research extending these capabilities to multi-table relational scenarios.
While transformer and diffusion architectures achieve state-of-the-art performance in single-table generation, a notable gap emerges when they are applied to relational data, where diffusion approaches consistently outperform transformer-based methods.
This thesis examines the factors behind this performance difference, evaluating multiple baselines across both single-table and relational datasets, with REaLTabFormer and ClavaDDPM as the state-of-the-art transformer- and diffusion-based approaches, respectively.
Our investigation reveals that the gap can mainly be attributed to transformer-based models' inadequate processing of contextual relationships and suboptimal strategies for representing inter-table dependencies.
To close this gap, we introduce two modifications to transformer-based models: layer sharing, to improve parameter utilization, and contextual encoding, to better preserve relational structure.
These changes provide insight into the key design principles behind effective synthetic relational data generation using transformer-based models, particularly the need for architectures that account for context and facilitate practical knowledge transfer.
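To make the two modifications concrete, the following is a minimal sketch in PyTorch, not the thesis's actual implementation: a single transformer block is reused across depth (layer sharing), and the child-table decoder is conditioned on an encoding of the parent row by prefixing it to the token sequence (contextual encoding). All class names, dimensions, and the prefix-conditioning scheme are illustrative assumptions.

```python
# Illustrative sketch only; not the thesis's code. Shows (1) layer sharing:
# one transformer block applied repeatedly so all logical layers share weights,
# and (2) contextual encoding: the child-table decoder attends to an encoded
# parent row prefixed to its input sequence.
import torch
import torch.nn as nn


class ContextualSharedDecoder(nn.Module):
    def __init__(self, vocab_size=1024, parent_dim=32, d_model=256,
                 n_heads=4, n_logical_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.parent_enc = nn.Linear(parent_dim, d_model)  # contextual encoding
        # One block instantiated once; its parameters serve every logical layer.
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.n_logical_layers = n_logical_layers
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, child_tokens, parent_features):
        # Prefix the encoded parent row so every child token can attend to it.
        ctx = self.parent_enc(parent_features).unsqueeze(1)    # (B, 1, d)
        x = torch.cat([ctx, self.embed(child_tokens)], dim=1)  # (B, 1+T, d)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        for _ in range(self.n_logical_layers):  # layer sharing: same weights
            x = self.shared_block(x, src_mask=mask)
        return self.head(x[:, 1:])  # logits for child tokens only
```

Prefix conditioning is only one way to inject parent context; cross-attention over a separate parent encoder, as in sequence-to-sequence child models, is another.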
The proposed methods yield substantial gains, including a 1.52-fold improvement in Logistic Detection and a 1.94-fold reduction in the Discriminator Measure.
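For reference, one common convention for a Logistic Detection style metric is sketched below: a logistic regression is trained to tell real rows from synthetic ones, and its cross-validated ROC AUC is mapped so that a score of 1 means the synthetic data is indistinguishable from the real data. The exact protocol and scaling used in the thesis are not reproduced here; this formulation is an assumption following a common detection-metric convention.

```python
# Hedged sketch of a Logistic Detection style score (one common convention).
# AUC near 0.5 (classifier cannot separate real from synthetic) maps to ~1.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def logistic_detection(real: np.ndarray, synthetic: np.ndarray) -> float:
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(synthetic))])
    aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           cv=3, scoring="roc_auc")
    # Map mean AUC in [0.5, 1.0] to a score in [0, 1]; 1 = indistinguishable.
    return float(1.0 - (max(0.5, aucs.mean()) * 2.0 - 1.0))
```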