On the Gap Between Diffusion and Transformer Multi-Tabular Generation
Gijs Paardekooper (Cross Options)
J.M. Galjaard (TU Delft - Data-Intensive Systems)
Lydia Y. Chen (University of Neuchâtel)
Abstract
Shareable tabular data is of high importance in industry and research. While generating synthetic single-table records is well studied, research has only recently extended to relational data synthesis. In the single-table setting, diffusion and transformer models both exhibit superior performance over prior art; in the relational setting, however, diffusion models outperform transformers. This work examines the performance gap between tabular transformers and diffusion models in single-table (tabular) and multi-table (relational) settings, using REaLTabformer and ClavaDDPM as representative state-of-the-art models. We evaluate these architectures on a suite of single- and multi-table datasets and identify the root causes of the gap between the methods. In our experiments, we attribute this difference to the influence of contextual information and data representation. To bridge the gap in the relational setting, we propose two seemingly simple strategies: layer sharing and contextual cues. This work offers insights into key design considerations for single- and multi-table generative models, including the incorporation of contextual information and the reuse of existing knowledge. With the proposed methods, we achieve improvements of 1.52× and 1.94× on the Logistic Detection and Discriminator Measure metrics, respectively.
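For context, Logistic Detection is a fidelity metric that trains a logistic-regression classifier to tell real rows from synthetic ones: the closer the classifier is to chance, the better the synthesis. The sketch below illustrates the idea on Gaussian toy data; the data, the 0.5-AUC baseline, and the exact score normalization are illustrative assumptions, not details taken from this paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins for a real table and its synthetic counterpart
# (illustrative data, not from the paper's experiments).
rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=(500, 4))
synthetic = rng.normal(loc=0.1, scale=1.1, size=(500, 4))

X = np.vstack([real, synthetic])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = real, 1 = synthetic

# Cross-validated AUC of a detector trying to separate real from synthetic.
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                      cv=5, scoring="roc_auc").mean()

# One common normalization (assumed here): 1.0 when the detector is at
# chance (AUC 0.5), falling toward 0 as real and synthetic grow separable.
score = 1.0 - 2.0 * max(auc - 0.5, 0.0)
print(f"detection AUC = {auc:.3f}, logistic-detection score = {score:.3f}")
```

A higher score thus indicates that the synthetic table is harder to distinguish from the real one; the Discriminator Measure mentioned above follows the same adversarial idea with a learned discriminator in place of logistic regression.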