J. Yang
77 records found
Benchmarking Open-Source vs. Closed-Source LLMs for Dutch Medical Guidelines
Quantitative Evaluation of Retrieval-Augmented Generation using the NHG-Guidelines
Failure analysis of RAG in healthcare
Finding the most common failure modes of RAG systems with finetuning approaches
In the second phase, the feasibility of automating the benchmark evaluation was tested by comparing the out-of-the-box frameworks RAGAS and RAGChecker directly against the grading of a licensed general practitioner. A significant judgment gap was found between the automated tools and the human expert. RAGAS systematically overestimated safety because it relies on literal word overlap, causing it to miss dangerous clinical errors such as recommending a treatment that was explicitly stated to be failing. RAGChecker heavily penalized safe clinical paraphrasing and conditional reasoning due to its rigid token-level claim parsing. Ultimately, this work provides a functional pipeline for creating Dutch medical benchmarks, but highlights that standard automated evaluation toolkits require custom, domain-specific calibration before they can reliably replace human expert judgment.
Automated Benchmark Construction for Factual Question Answering over NHG Guidelines
A Foundation for RAG Evaluation in Dutch Primary Care
reliable benchmarks, yet constructing these manually is costly and infeasible at a large scale. This paper presents an automated pipeline for constructing and evaluating a factual question answering benchmark over Dutch primary care guidelines. The pipeline uses large language model-based question-answer generation with few-shot and chain-of-thought prompting, combined with automated filtering using BERTScore grounding and round-trip consistency to produce high-quality question-answer pairs. Human validation confirmed that the final benchmark of 192 question-answer pairs across 10 Nederlands Huisartsen Genootschap guidelines achieves factual correctness, retraceability and clinical relevance. The benchmark was integrated into a Retrieval-Augmented Generation pipeline to evaluate whether RAGChecker, a claim-level automated evaluation framework, could serve as a reliable alternative to human evaluation. RAGChecker scores were consistent with human judgment though lower due to its strict claim-level checking. These results show that a reliable, automated benchmark can be constructed for Dutch primary care question answering and that RAGChecker serves as a reasonable but strict alternative to human evaluation of Retrieval-Augmented Generation systems in this domain.
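The two filtering steps named in this abstract (grounding and round-trip consistency) can be illustrated with a minimal sketch. This is not the authors' implementation: a simple token-overlap score is swapped in for BERTScore so the example stays dependency-free, and the names `grounding_score`, `token_f1`, `keep_pair`, `answer_fn` and both thresholds are hypothetical.

```python
import re
from collections import Counter
from typing import Callable


def _tokens(text: str) -> list[str]:
    """Lowercased word tokens with punctuation stripped."""
    return re.findall(r"\w+", text.lower())


def grounding_score(answer: str, passage: str) -> float:
    """Fraction of answer tokens that occur in the source passage.

    Assumption: a crude lexical stand-in for BERTScore grounding.
    """
    answer_tokens = _tokens(answer)
    if not answer_tokens:
        return 0.0
    passage_vocab = set(_tokens(passage))
    return sum(t in passage_vocab for t in answer_tokens) / len(answer_tokens)


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two short answers."""
    cand, ref = _tokens(candidate), _tokens(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def keep_pair(
    question: str,
    answer: str,
    passage: str,
    answer_fn: Callable[[str, str], str],
    grounding_min: float = 0.8,   # hypothetical threshold
    round_trip_min: float = 0.6,  # hypothetical threshold
) -> bool:
    """Keep a generated QA pair only if it passes both filters."""
    # Filter 1: the answer must be supported by the passage it came from.
    if grounding_score(answer, passage) < grounding_min:
        return False
    # Filter 2 (round trip): re-answer the question from the passage
    # and require agreement with the originally generated answer.
    regenerated = answer_fn(question, passage)
    return token_f1(regenerated, answer) >= round_trip_min
```

In practice `answer_fn` would wrap a language model call; any callable that takes a question and a passage and returns an answer string fits this sketch.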
Personalised Classifier-Guided Decoding
Steering LLM Toxicity Along User-Specified Directions
STEER-Away
Personalized Safety Alignment via Logit Steering
toxicity types they most consistently rated down (p < 10⁻³ under a profile-shuffle null on every module), and replacing the per-user weighting with uniform weights significantly worsens fit on both geometric matchers (Wilcoxon p < 10⁻³). Because the effect is per-user, it surfaces on a per-user-sensitive measure (a boundary-violation rate, p < 10⁻³) rather than on aggregate mean error, which averages the per-user differences away. The next step is therefore per-user-sensitive evaluation, not retraining.
Personalized Pre-Decoding Alignment for Training-Free Toxicity Reduction
Comparing URIAL and PBPO-Lite on PRISM User Prompts Without Fine-Tuning
Diagnosing Failure Patterns in Large Language Models
A Symptom–Sign Framework and Integrated Toolkit for Practitioners
This thesis addresses this gap by introducing (1) a diagnostic framework that structures diagnosis through symptoms (observed undesirable outputs), signs (evidence from interpretability methods), and failure patterns (recurring, explainable combinations of symptoms and signs), complemented by a Should-Know/Really-Know lens that distinguishes task expectation from actual model knowledge; and (2) a prototype diagnostic toolkit that operationalizes this framework through integrated evaluation, run comparison, and state externalization.
An evaluation study with eight practitioners using codebook thematic analysis reveals three core findings. First, practitioners universally adopt a baseline-first strategy, building diagnostic confidence through initial evaluation before deeper probing. Second, they triangulate across samples, metrics, and interpretability outputs rather than relying on single signals, using comparison as a central sense-making operation for hypothesis testing. Third, diagnostic depth is systematically gated by three factors: interpretation friction (insufficient guidance on what methods reveal and how to act on their outputs), missing workflow glue (the absence of affordances for iterative refinement), and execution constraints (opaque platform limits that disrupt sustained diagnostic progress).
These findings reframe diagnostic tooling as infrastructure for iterative, hypothesis-driven reasoning, extending beyond the provision of isolated analytical methods. Effective diagnostic support must scaffold the full investigative cycle: from expectation formation and baseline calibration through evidence triangulation, hypothesis testing via comparison, and state externalization. This scaffolding must also account for the gating factors that shape the depth of diagnostic progress. This positions diagnosis as a knowledge and workflow challenge, with implications for tooling design, framework development, and empirical research into practitioner diagnostic workflows.
Unheard and Misunderstood
Tracing Hermeneutical Injustice in ADHD Narratives Generated by Large Language Models
Incorporating User Feedback into Post-Training LLM Improvement to Promote Hermeneutical Justice
An interface to amplify marginalized voices
Prompt Engineering for Hermeneutical Justice in LLMs
An Empirical Study on ADHD-Related Causal Reasoning
Unheard and Misunderstood: Addressing Injustice in LLMs
How are hermeneutical injustices encoded in Reinforcement Learning from Human Feedback (RLHF) in the context of LLMs?
Unheard and Misunderstood
Reinforcing Hermeneutical Justice in Annotation Design for ADHD Voices
We interview researchers to gather insights into their opinions and usage, and afterwards develop a prototype. Researchers highlight that the most difficult part of research is the experiment design, which is reflected in the lack of literature.
Participants use research assistants for reading literature and writing, but request more support. The research assistant must be factual, transparent, and correct, and including a conversation allows for feedback and discussion.
We evaluate the prototype for the experiment design phase, highlighting the effectiveness of the component architecture by generating correct experiments.