Synthetic Data for Robust Language Modeling
P. Lippmann (TU Delft - Electrical Engineering, Mathematics and Computer Science)
G.J.P.M. Houben – Promotor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J. Yang – Copromotor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
How do we ensure that large language models (LLMs) are genuinely robust, rather than merely performing well on benchmarks? This thesis investigates critical vulnerabilities of modern LLMs, ranging from their tendency to mimic reasoning styles without logical substance to their susceptibility to high-confidence blind spots. It combines targeted synthetic data generation, agent-guided knowledge injection, and value-sensitive escalation policies into a holistic approach to AI reliability. The resulting frameworks help localize brittleness, correct unknown unknowns, and navigate uncertain, high-stakes deployments with auditable, human-aligned decision-making.