Synthetic Data for Robust Language Modeling

Doctoral Thesis (2026)
Author(s)

P. Lippmann (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

G.J.P.M. Houben – Promotor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J. Yang – Copromotor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group
Web Information Systems
DOI related publication
https://doi.org/10.4233/uuid:bea358f8-ff6f-43be-a065-a6e1a0b3bc5b (Final published version)
Publication Year
2026
Language
English
Defense Date
01-06-2026
Awarding Institution
Delft University of Technology
ISBN (electronic)
978-94-6518-333-6
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

How do we ensure that large language models (LLMs) are genuinely robust, rather than merely performing well on benchmarks? This thesis investigates critical vulnerabilities of modern LLMs, from their tendency to mimic reasoning styles without logical substance to their susceptibility to high-confidence blind spots. It introduces targeted synthetic data generation, agent-guided knowledge injection, and value-sensitive escalation policies, which together offer a holistic approach to AI reliability. The resulting frameworks help localize brittleness, correct unknown unknowns, and navigate uncertain, high-stakes deployments with auditable, human-aligned decision-making.

Files

Main_print_copy.pdf
(pdf | 18.7 Mb)
License info not available