T-REST: A watermark for autoregressive tabular large language models

Bachelor Thesis (2024)
Author(s)

Minh Hieu Nguyen Hoang Minh (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Y. Chen – Mentor (TU Delft - Data-Intensive Systems)

Jeroen Galjaard – Mentor (TU Delft - Data-Intensive Systems)

C. Zhu – Mentor (TU Delft - Data-Intensive Systems)

R. Hai – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
28-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Tabular data is one of the most common forms of data in industry and science. Recent research on synthetic data generation employs autoregressive generative large language models (LLMs) to create highly realistic tabular data samples. With the increasing use of LLMs, there is a need to govern the data generated by these models, for instance by watermarking the model output. While the state-of-the-art Soft Red List watermarking framework has shown impressive results on standard language models, it cannot be seamlessly applied to models fine-tuned for generating tabular data due to i) column permutation and ii) the task's nature of generating low-entropy sequences. We propose Tabular Red GrEen LiST (T-REST), an adaptation of the Soft Red List watermarking algorithm to tabular LLMs that is agnostic to column permutation and improves detection efficiency by employing a weighted count method that favors columns with higher entropy. Our experiments on four real-world datasets demonstrate that T-REST introduces a non-significant drop of 3% in synthetic data quality compared to the non-watermarked data, measured by resemblance and downstream machine learning efficiency metrics, while achieving high detection accuracy with an AUROC above 0.98. T-REST is insensitive to any column or row permutation and is robust against post-editing attacks on categorical columns, maintaining a True Positive Rate (TPR) above 0.85 when 50% of categorical values are modified.
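To make the detection idea in the abstract concrete, the sketch below illustrates how a green/red-list watermark could be scored on a table with entropy-weighted counts. It is a minimal illustration only, not the T-REST implementation: the seeding on column names, the gamma parameter, and the supplied column entropies are all assumptions introduced here for the example.

    # Minimal sketch of green-list watermark detection with entropy-weighted counts.
    # Names (seed_key, gamma, column_entropies) are illustrative assumptions,
    # not the exact T-REST algorithm.
    import hashlib
    import math


    def is_green(context: str, token: str, seed_key: int, gamma: float = 0.5) -> bool:
        """Pseudo-randomly assign `token` to the green list, seeded on its context."""
        digest = hashlib.sha256(f"{seed_key}:{context}:{token}".encode()).digest()
        return (int.from_bytes(digest[:8], "big") / 2**64) < gamma


    def weighted_z_score(rows, column_entropies, seed_key: int, gamma: float = 0.5) -> float:
        """Score a table: count green cells, weighting each column by its entropy."""
        weighted_green = 0.0
        weighted_total = 0.0
        for row in rows:  # each row is a dict: column name -> cell value
            for col, value in row.items():
                w = column_entropies.get(col, 0.0)  # higher-entropy columns count more
                weighted_total += w
                if is_green(col, str(value), seed_key, gamma):
                    weighted_green += w
        if weighted_total == 0:
            return 0.0
        # one-proportion z-test against the expected green fraction gamma
        return (weighted_green - gamma * weighted_total) / math.sqrt(
            gamma * (1 - gamma) * weighted_total
        )

In this sketch, seeding the green-list assignment on the column name rather than on token position makes the score invariant to row and column reordering, which mirrors the permutation-agnosticism the abstract claims, while the entropy weights downplay low-entropy columns that carry little watermark signal.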
