J. Yang

77 records found

Quantitative Evaluation of Retrieval-Augmented Generation using the NHG-Guidelines

Bachelor thesis (2026) - A. Tageldin, J. Yang, Yannick ter Heerdt, P.K. Murukannaiah
Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) offer promising clinical decision support capabilities. However, utilizing closed-source models for this purpose requires transmitting sensitive patient data to external servers, creating severe GDPR compliance and privacy risks. Conversely, open-source models can be securely hosted locally, but their clinical reasoning capabilities in non-English medical contexts remain unproven. This research quantitatively benchmarks the performance of three closed-source and three open-source LLMs operating over the Dutch NHG-guidelines. Using a standardized RAG pipeline and an automated "LLM-as-a-Judge" evaluation framework (RAGChecker), we analyze the exact trade-offs between clinical accuracy, computational cost, and inference speed. The results reveal a significant paradigm shift: top-tier open-source models (specifically DeepSeek V4 Pro) not only match but outperform closed-source models (GPT-5.5) in clinical accuracy and speed at a fraction of the cost, offering a highly viable, privacy-preserving alternative for Dutch healthcare institutions. ...

Finding the most common failure modes of RAG systems with finetuning approaches

This study introduces a systematic, metric-driven failure taxonomy to identify and quantify errors across the document chunking, retrieval, and generation stages in retrieval-augmented generation (RAG) systems. We evaluate this framework on a benchmark derived from the Nederlands Huisartsen Genootschap (NHG) protocols, comparing factual and clinical query settings. Our results show a substantial reduction in error-free performance when moving from factual tasks (137 error-free queries) to clinical scenarios (75 error-free queries). We further observe a shift in dominant failure modes: generation-level fabrications are most common in factual queries (14%), whereas clinical queries are dominated by missed retrievals (31%). Co-occurrence analysis reveals a strong association between retrieval failures and downstream generation errors, suggesting cascading effects across the pipeline. These findings highlight retrieval quality as the main bottleneck in clinical settings and motivate domain-specific retriever fine-tuning for safer deployment in Dutch primary care. ...
Bachelor thesis (2026) - A.H.C. Straathof, J. Yang, Yannick ter Heerdt, P.K. Murukannaiah
This thesis investigates the use of Large Language Models (LLMs) to automatically generate and evaluate synthetic clinical question-answer benchmarks based on Dutch NHG guidelines. The goal is to build a reliable and reproducible Key Feature Question (KFQ) dataset for testing clinical reasoning. In the first phase, different prompting strategies were tested using gpt-4o-mini across a subset of guideline text. The results show that while baseline model extraction is highly stable, a hybrid few-shot chain-of-thought strategy performs best, achieving the highest optimization score and strong factual grounding. With this prompting strategy, a final benchmark dataset of 375 fully traceable Dutch QA pairs was constructed.
In the second phase, the feasibility of automating the benchmark evaluation was tested by comparing out-of-the-box frameworks RAGAS and RAGChecker directly against the grading of a licensed general practitioner. A significant judgement gap was found between the automated tools and human expert judgment. RAGAS systematically overestimated safety because it relies on literal word overlap, making it completely miss dangerous clinical errors like recommending a treatment that was explicitly stated to be failing. RAGChecker heavily penalized safe clinical paraphrasing and conditional reasoning due to its rigid token-level claim parsing. Ultimately, this work provides a functional pipeline for creating Dutch medical benchmarks, but highlights that standard automated evaluation toolkits require custom, domain-specific calibration before they can reliably replace human expert judgment. ...
Bachelor thesis (2026) - C.K. Bakker, Yannick ter Heerdt, J. Yang, P.K. Murukannaiah
Evaluating Retrieval-Augmented Generation systems in clinical domains requires reliable benchmarks, yet constructing these manually is costly and infeasible at a large scale. This paper presents an automated pipeline for constructing and evaluating a factual question answering benchmark over Dutch primary care guidelines. The pipeline uses large language model based question-answer generation with few-shot and chain-of-thought prompting, combined with automated filtering using BERTScore grounding and round-trip consistency to produce high quality question-answer pairs. Human validation confirmed that the final benchmark of 192 question-answer pairs across 10 Nederlands Huisartsen Genootschap guidelines achieves factual correctness, retraceability and clinical relevance. The benchmark was integrated into a Retrieval-Augmented Generation pipeline to evaluate whether RAGChecker, a claim-level automated evaluation framework, could serve as a reliable alternative to human evaluation. RAGChecker scores were consistent with human judgment though lower due to its strict claim-level checking. These results show that a reliable, automated benchmark can be constructed for Dutch primary care question answering and that RAGChecker serves as a reasonable but strict alternative to human evaluation of Retrieval-Augmented Generation systems in this domain. ...
Bachelor thesis (2026) - L. Bindt, Yannick ter Heerdt, J. Yang, P.K. Murukannaiah
As general practitioners currently experience high workloads, Large Language Models (LLMs) offer a promising opportunity to relieve some of this work by enabling faster searching of medical guidelines, saving doctors time and allowing them to deliver better care. This research aimed to answer the primary research question: How can a Retrieval-Augmented Generation (RAG) pipeline be constructed for Dutch NHG guidelines? By breaking this problem down into four distinct sub-questions focused on data processing, retrieval optimization, storage scalability, and model grounding, the research successfully demonstrates a complete, factually correct, and scalable system for general practitioners in The Netherlands. Specifically, the findings show that context-aware data splitting with minimized block sizes optimally preserves clinical cohesion while keeping costs low. For retrieval optimization, combining a traditional BM25 keyword search with an AI meaning-based vector search via Reciprocal Rank Fusion captures edge-case guidelines more effectively than either method alone. Storage scalability is achieved by pairing a Hierarchical Navigable Small World graph with memory-mapped storage, allowing the system to offload data to the disk while maintaining high throughput and low latency. Finally, the application of prompt instructions successfully enforces grounded refusal, preventing the AI from falling back on internal training data when valid clinical context is missing. ...
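The abstract above describes fusing a BM25 keyword ranking with a vector-search ranking through Reciprocal Rank Fusion. As a minimal sketch of that general technique (not the thesis's own implementation), each document can be scored by summing 1 / (k + rank) over the rankings in which it appears. The document ids, the function name, and the constant k = 60 (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k = 60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers (ids are made up).
bm25_ranking = ["g12", "g07", "g03"]
vector_ranking = ["g07", "g21", "g12"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever is still kept, which is one plausible reading of how fusion could capture edge cases that a single retriever misses.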

Steering LLM Toxicity Along User-Specified Directions

Bachelor thesis (2026) - M. Coroi, J. Yang, A. Arzberger, E. Liscio, C.E. Brandt
Toxic content is not universally defined: what one user finds offensive, another may find acceptable depending on cultural background, context, and purpose. Current LLM safety systems apply a single global toxicity threshold to every user, and adapting this behaviour after deployment is expensive. This paper asks whether a frozen LLM can instead be steered at inference time to follow individual users’ toxicity preferences across six toxicity dimensions, without retraining. A classifier-guided decoding framework driven by a per-user sensitivity vector is instantiated as three deployable strategies and evaluated on the PRISM preference dataset. All three strategies reduce per-user toxicity error by 15–21%, while preserving general-knowledge accuracy to within 0.7 pp of the unguided baseline. The central finding is directional steerability: the decoder responds to the shape of a user’s preference vector, producing category-specific reductions that align with per-user weights (median cosine similarity 0.845, p = 0.0097 above a permutation baseline). These results show that meaningful personalised toxicity control is achievable at deployment time, without retraining the model. ...

Personalized Safety Alignment via Logit Steering

Large Language Models are usually aligned toward broad preference averages, while users can differ in how they perceive toxic language. This paper studies whether training-free, in-decoding logit-difference steering can support such personalized toxicity alignment without changing model weights. The key idea is to use two internal generation behaviours: an expert generation branch that represents careful, respectful language and an anti-expert generation branch that represents language patterns to avoid. The resulting difference is added to the base model’s next-token scores during generation, with the toxicity steering category chosen from an inferred user sensitivity profile. Profiles are derived from PRISM, a participatory preference dataset, and Perspective API toxicity scores. On Llama 3.1 8B, I evaluate two methods, Anti-Expert Contrastive Decoding (ACD) and Expert–Anti-Expert Differential Steering (EADS). The results suggest that EADS gives the more balanced trade-off: stronger steering reduces measured toxicity distance while preserving general Massive Multitask Language Understanding (MMLU) utility better than ACD. EADS shows a 12.65% mean reduction in measured toxicity distance and a reduction of less than 1% in both MMLU accuracy and generated answer perplexity. The findings remain limited by the use of automatic toxicity scores as a proxy and by the coarse user-profile representation. These results show that training-free logit steering is a favorable alternative for personalized toxicity alignment, but it should be validated with human evaluation in future work. ...
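The core operation described in the abstract above, adding an expert minus anti-expert difference to the base model's next-token scores, can be sketched in a few lines. This is an illustrative sketch only: the scaling factor `alpha`, the function name, and the three-token toy vocabulary are assumptions, and the exact ACD and EADS formulas in the thesis may differ.

```python
def steer_logits(base, expert, anti_expert, alpha=1.0):
    """Return base + alpha * (expert - anti_expert) for every vocabulary entry.

    alpha is a hypothetical steering strength; alpha = 0 leaves the base
    model's scores unchanged.
    """
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Toy next-token scores over a three-token vocabulary (values are made up).
base = [2.0, 1.0, 0.5]
expert = [1.5, 2.0, 0.0]
anti_expert = [2.5, 0.5, 0.0]

steered = steer_logits(base, expert, anti_expert, alpha=0.5)
# Greedy choice of the next token after steering.
next_token = max(range(len(steered)), key=steered.__getitem__)
print(steered, next_token)
```

In this toy case the base model would pick token 0, but the steered scores favour token 1, the token the expert branch prefers and the anti-expert branch does not.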
Bachelor thesis (2026) - I. Slanina, A. Arzberger, E. Liscio, J. Yang
People differ in what they consider toxic, yet centralised alignment of large language models (LLMs) imposes a single global standard that cannot accommodate this disagreement. We propose a training-free post-decoding approach: for each prompt we generate N candidates from a fixed, pre-trained LLM and re-rank them against a per-participant toxicity profile built from PRISM ratings. Post-decoding fits the problem because it decouples generation from scoring, so the same candidate pool can be re-ranked under different profiles to separate the effect of the profile from the effect of the candidate pool, something earlier inference-time interventions cannot do. We compare four scoring modules on four matched seeds: two LLM-as-a-Judge rerankers (GPT, Claude) and two Detoxify-based geometric matchers (weighted L1, Ledoit–Wolf Mahalanobis), scored by toxicity-vector distance to each participant’s preferred PRISM response. All four reduce per-record error by 23–28% and tie at the top. The selection is genuinely personalised rather than the same generic shift toward safer text for every user: reductions concentrate on each participant’s most sensitive Perspective dimensions, the toxicity types they most consistently rated down (p < 10⁻³ under a profile-shuffle null on every module), and replacing the per-user weighting with uniform weights significantly worsens fit on both geometric matchers (Wilcoxon p < 10⁻³). Because the effect is per-user, it surfaces on a per-user-sensitive measure (a boundary-violation rate, p < 10⁻³) rather than on aggregate mean error, which averages the per-user differences away. The next step is therefore per-user-sensitive evaluation, not retraining. ...
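The post-decoding idea in the abstract above, re-ranking a fixed candidate pool against a per-participant profile, can be illustrated with the weighted L1 matcher it mentions. The sketch below is an assumption-laden toy: the two-dimensional toxicity vectors, the target profile, the weights, and the function names are all hypothetical and stand in for classifier scores and PRISM-derived profiles.

```python
def weighted_l1(candidate, target, weights):
    """Weighted L1 distance between two toxicity vectors."""
    return sum(w * abs(c - t) for c, t, w in zip(candidate, target, weights))


def rerank(candidates, target, weights):
    """Sort (text, toxicity_vector) pairs by distance to the target profile."""
    return sorted(candidates, key=lambda item: weighted_l1(item[1], target, weights))


# Hypothetical candidate pool with made-up two-dimensional toxicity scores.
candidates = [
    ("reply A", [0.10, 0.60]),
    ("reply B", [0.30, 0.10]),
    ("reply C", [0.05, 0.30]),
]
target = [0.0, 0.1]          # assumed preferred toxicity profile
user_weights = [1.0, 3.0]    # this user is more sensitive to the second dimension
uniform_weights = [1.0, 1.0]

best_personalised = rerank(candidates, target, user_weights)[0][0]
best_uniform = rerank(candidates, target, uniform_weights)[0][0]
print(best_personalised, best_uniform)
```

Because generation and scoring are decoupled, the same three candidates yield different winners under the personalised and uniform weightings, which mirrors the comparison between per-user and uniform weights described in the abstract.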

Comparing URIAL and PBPO-Lite on PRISM User Prompts Without Fine-Tuning

Bachelor thesis (2026) - A. Florea, A. Arzberger, E. Liscio, J. Yang, C.E. Brandt
Large Language Models (LLMs) often rely on one general safety standard, but this is limited because toxicity is subjective: what one user finds offensive, another user may not. At the same time, creating personalized safety by fine-tuning a model for every user is expensive and impractical. To address this, my research studies pre-decoding interventions, which modify the user’s input prompt before the model generates a response. This offers a flexible and low-cost way to personalize alignment without changing the model’s weights. I evaluate two training-free approaches on the PRISM dataset using Qwen and Llama target models: an Untuned LLMs with Restyled In-context ALignment (URIAL)-inspired method, which adds personalized safety examples to the prompt, and a Personalized Black-Box Prompt Optimization Lite (PBPO-Lite) method, which uses a secondary model to rewrite the prompt based on a user’s toxicity profile. These methods are useful because they can adapt to a user’s needs at inference time without permanent model changes. The results show that both interventions bring the outputs closer to the highest rated PRISM answers, with URIAL achieving the strongest toxicity alignment: approximately 51% on Llama and 31% on Qwen. While the methods improve fluency compared with the base models, they can reduce performance on structured knowledge tasks. Overall, the findings suggest that personalized pre-decoding is a promising low-cost approach for toxicity alignment, provided that safety gains are balanced against possible losses in knowledge-task performance. ...

A Symptom–Sign Framework and Integrated Toolkit for Practitioners

Master thesis (2026) - J.S. Beekman, J. Yang, L. Corti, K.W. Song
Large language models (LLMs) are increasingly deployed in consequential settings, yet their failures remain challenging to understand. Unlike traditional software bugs, such undesirable behavior emerges from distributed, context-dependent interactions that resist straightforward debugging. While XAI methods can surface signals about individual predictions, they do not directly support the hypothesis-driven investigative process that characterizes diagnosis in practice: forming expectations, gathering evidence, and identifying recurring failure patterns.

This thesis addresses this gap by introducing (1) a diagnostic framework that structures diagnosis through symptoms (observed undesirable outputs), signs (evidence from interpretability methods), and failure patterns (recurring, explainable combinations of symptoms and signs), complemented by a Should-Know/Really-Know lens that distinguishes task expectation from actual model knowledge; and (2) a prototype diagnostic toolkit that operationalizes this framework through integrated evaluation, run comparison, and state externalization.

An evaluation study with eight practitioners using codebook thematic analysis reveals three core findings. First, practitioners universally adopt a baseline-first strategy, building diagnostic confidence through initial evaluation before deeper probing. Second, they triangulate across samples, metrics, and interpretability outputs rather than relying on single signals, using comparison as a central sense-making operation for hypothesis testing. Third, diagnostic depth is systematically gated by three factors: interpretation friction (insufficient guidance on what methods reveal and how to act on their outputs), missing workflow glue (the absence of affordances for iterative refinement), and execution constraints (opaque platform limits that disrupt sustained diagnostic progress).

These findings reframe diagnostic tooling as infrastructure for iterative, hypothesis-driven reasoning, extending beyond the provision of isolated analytical methods. Effective diagnostic support must scaffold the full investigative cycle: from expectation formation and baseline calibration through evidence triangulation, hypothesis testing via comparison, and state externalization. This scaffolding must also account for the gating factors that shape the depth of diagnostic progress. This positions diagnosis as a knowledge and workflow challenge, with implications for tooling design, framework development, and empirical research into practitioner diagnostic workflows.
...
Doctoral thesis (2026) - P. Lippmann, G.J.P.M. Houben, J. Yang
How do we ensure large language models are genuinely robust, rather than just performing well on benchmarks? This work investigates the critical vulnerabilities of modern LLMs—from their tendency to mimic reasoning styles without logical substance, to their susceptibility to high-confidence blind spots. By introducing targeted synthetic data generation, agent-guided knowledge injection, and value-sensitive escalation policies, this thesis offers a holistic approach to AI reliability. It provides actionable frameworks to localize brittleness, correct unknown unknowns, and navigate uncertain, high-stakes deployments with auditable, human-aligned decision-making. ...
Master thesis (2025) - A.S. Kuiper, J. Yang, C. Lofi, P.K. Murukannaiah
Applying Large Language Models (LLMs) to high-stakes classification tasks like systematic review screening is challenged by prompt sensitivity and a lack of transparency. We introduce IMAPR (Iterative Multi-signal Adaptive Prompt Refinement), a novel framework where a single LLM uses its own internal signals to iteratively refine its prompts, improving classification robustness and reliability. Unlike black-box optimizers that tune the prompts using only external scores, IMAPR is a white-box approach that diagnoses why a prediction failed using three internal signals: model confidence, a rationale, and a knowledge alignment score that checks whether the evidence cited in the rationale actually covers the user-defined inclusion criteria. We evaluate IMAPR on a real-world biomedical screening task, comparing it against strong baselines including GPO and StraGo. IMAPR outperforms the best baseline (GPO) by 8.8% in Macro-F1 while maintaining high, stable recall across runs. Across seven LLMs, IMAPR yields an average 9.2% improvement in Macro-F1. An ablation shows that knowledge alignment acts as a recall safeguard: removing it leaves Macro-F1 similar but degrades recall, reducing reliability for screening. These results suggest that diagnostic, signal-driven prompt refinement is a practical alternative to black-box optimization for transparent, dependable LLM screening systems. ...

Tracing Hermeneutical Injustice in ADHD Narratives Generated by Large Language Models

Bachelor thesis (2025) - D. Zhang, J. Yang, A. Arzberger, M.L. Tielman
This study investigates how large language models (LLMs) narrate ADHD-related experiences and whether their narrative forms give rise to hermeneutical injustice. Rather than comparing experience itself, this study analyzes how experiences are narrated. Using a hybrid coding strategy based on Reflexive Thematic Analysis, it compares LLM-generated outputs with first-person narratives from ADHD communities. The analysis identifies several recurring misnarration patterns: Truncated Subjectivity, One-Way Definition, Illocutionary Disablement, and Skewed Style Replacement. Each of these patterns constrains the interpretive space for expressing ADHD experience. Sub-themes are developed to further reveal injustice embedded in LLMs. These patterns are linked to both the training data and the optimization process. In addition, the underlying mechanism of LLMs lacks the différance structure that characterizes human narration. ...
Bachelor thesis (2025) - A. Turgut, A. Arzberger, J. Yang, M.L. Tielman
Generative AI can contribute to the misunderstanding or erasure of marginalized groups because nuanced data on their lived experiences is insufficient. This limits the shared understanding of their perspectives and contributes to a phenomenon called hermeneutical epistemic injustice. This study seeks to reduce this injustice by enabling real-life users from these groups to provide feedback that corrects the behavior of the model. However, victims of hermeneutical injustice struggle with articulating themselves, and current practices lack sufficient support for user expression. To overcome these challenges, we designed an interface that enables users to give feedback on the accuracy of the model, supported by a data processing workflow to ensure feasibility and scalability. We conducted a user study with 8 individuals with ADHD to evaluate whether the interface facilitates the extraction of accurate data, and found that it enables users to provide more concrete and precise feedback than existing methods, as it includes more guidance and control for the user. ...

An Empirical Study on ADHD-Related Causal Reasoning

Large Language Models are increasingly integrated into everyday applications, but their responses often reflect dominant cultural narratives, which can lead to misrepresentation of marginalized communities. This paper addresses the underexplored issue of hermeneutical epistemic injustice (HEI) in LLM outputs, particularly how these systems fail to accurately represent the lived experiences of people with ADHD when answering causal questions, and whether different prompting techniques can influence and improve the justice reflected in their responses. We introduce a practical framework for measuring HEI based on four proxies: intelligibility, conceptual fit, recognition of structural barriers, and expression style. Through a within-subjects user study with seven adults with ADHD, we evaluated three prompting strategies: Vanilla (baseline), Step-Back, and Human Persona + System 2. Our findings show that Human Persona + System 2 prompting stood out for its empathetic tone, balanced perspectives, and non-judgmental framing, thereby improving fairness across multiple HEI dimensions. Surprisingly, Vanilla prompts performed comparably well overall, while Step-Back responses offered clear practical information and contextual relevance, but were limited by an impassive, matter-of-fact tone. These results suggest that prompt design can meaningfully affect how well LLMs represent marginalized experiences. We conclude that advancing epistemic justice in generative AI requires thoughtful prompt design and may benefit from deeper engagement with affected communities to more accurately and respectfully represent their realities. ...

How are hermeneutical injustices encoded in Reinforcement Learning from Human Feedback (RLHF) in the context of LLMs?

Bachelor thesis (2025) - I. Mockaitytė, A. Arzberger, J. Yang, M.L. Tielman
This study investigates how hermeneutical injustices can become encoded in the Reinforcement Learning from Human Feedback (RLHF) processes used to fine-tune large language models (LLMs). While current research on LLMs has focused on bias and fairness, there remains a significant gap concerning subtler harms such as hermeneutical injustice. Using adults diagnosed with ADHD as a case study, this research explores how their unique communication and cognitive patterns may be misrepresented or excluded from the RLHF pipeline. The research adopts a qualitative literature review methodology, focusing specifically on real-world RLHF implementations by AI companies. The RLHF pipeline was divided into stages of human feedback collection, reward modeling, and policy optimization. These stages were then analyzed through the lens of hermeneutical injustice using three interpretive desiderata: representation, flexibility, and authenticity. The findings highlight several conceptual risks. Limited annotator diversity and restrictive feedback formats may exclude neurodivergent voices. Reward models can unintentionally suppress atypical expressions, while policy optimization strategies, especially those prone to mode collapse, can erase some communication styles. Overall, the study shows that without deliberate attention to epistemic inclusion, RLHF processes may perpetuate hermeneutical injustices and undermine the epistemic fairness of LLMs. ...

Reinforcing Hermeneutical Justice in Annotation Design for ADHD Voices

Bachelor thesis (2025) - A. Yotkov, J. Yang, A. Arzberger, M.L. Tielman
The main way large language models (LLMs) learn to represent and interpret various experiences is through the process of supervised fine-tuning (SFT). However, current practices are not designed to be inclusive for people with ADHD, which leads to generative hermeneutical ignorance due to misrepresentation. Several ADHD characteristics clash with modern annotation task structures, so those voices remain underrepresented. We performed a literature-driven gap analysis, derived five design requirements and evaluation criteria, and built an annotation interface that embodied those requirements. Subsequently, a mixed-approach user study with seven self-identified ADHD participants was conducted to measure behavioral metrics and collect post-task reflections. The results indicated that three of five design criteria were met, which is promising. However, the average mislabeling rate remained quite high, meaning that accuracy is still an open issue. Finally, our study demonstrated that small design adjustments accommodate a more diverse annotator pool. We therefore offer a framework that can be used to reinforce hermeneutical epistemic justice in annotation practices. ...
The use of research assistants has increased significantly, providing support and automation for researchers. However, there is limited research on researchers using research assistants and what assistance researchers require for each research stage.
We interview researchers to gather insights into their opinions and usage, and afterwards develop a prototype. Researchers highlight that the most difficult part of research is the experiment design, which is reflected in the lack of literature.
Participants use research assistants for reading literature and writing, but request more support. The research assistant must be factual, transparent, and correct, and including a conversation allows for feedback and discussion.
We evaluate the prototype for the experiment design phase, highlighting the effectiveness of the component architecture by generating correct experiments. ...
This thesis addresses the semantic gap in visual understanding, improving visual models with semantic reasoning capabilities so they can handle tasks like image captioning, question-answering, and scene understanding. The main focus is on integrating visual and textual data, leveraging human cognitive insights, and developing a robust multi-modal foundation model. The research starts with the exploration of multi-modal data integration to enhance semantic and contextual reasoning in fine-grained scene recognition. The proposed multi-modal models, which combine visual and textual inputs, outperform traditional models that rely solely on visuals. This is particularly true in complex urban environments where visual ambiguities often occur. This method emphasizes the significance of semantic enrichment through multi-modal integration, which helps resolve visual ambiguities and improve scene understanding. ...
Large language models (LLMs) are widely used tools that assist us by answering various questions. Humans implicitly use contrast as a natural way to think about and seek explanations (i.e., "Why A and not B?"). Explainability is a challenging aspect of LLMs, as we do not truly understand how good the LLM answers are. The challenge is understanding to what extent LLMs can generate effective contrastive self-explanations for users. We introduce the Contrastive Self-Explanation Method (CoSEM) to narrow the gap between LLMs and explainability. It generates contrastive self-explanations and evaluates them through automation and a user study on generality, usefulness, readability, and relevance. Our results indicate that LLMs are capable of generating effective contrastive self-explanations. Lexical analysis of contrastive explanations indicates that explanations are not less general than the text they explain, and semantic analysis shows that more complex models generalize self-explanations more consistently. Although it is challenging to evaluate contrast in self-explanations semantically, the user study shows that some models (Llama3-8B) help users understand the contrast. Moreover, task selection affects how readable users find the explanations: self-explanations on general topics (movie reviews) are more readable than those on more specific topics (medical diagnoses). Lastly, some models, such as Llama3-8B, excel at generating contrastive self-explanations that contain relevant information regarding the input text. ...