Introduction
Paediatric palliative care (PPC) aims to optimise the quality of life of children with life-limiting or life-threatening conditions by addressing the physical, psychosocial, emotional and spiritual needs of both the children and their family members. Advance care planning (ACP) is a central element of PPC, as it helps children and family members formulate values, needs, and goals for future care. However, ACP documentation is time-consuming and burdensome for healthcare professionals (HCPs). Large Language Models (LLMs) may support this process by automatically extracting and structuring ACP outcomes. This study explored whether open-source LLMs can support the summarisation of ACP outcomes from Individual Care Plans (ICPs) in a Dutch PPC setting.
Methods
We constructed a pseudonymised dataset of 38 ICPs, with reference ACP summaries structured around three guiding questions: (1) Who are you?, (2) What is important to you?, and (3) What are your goals and wishes for future care and treatment? Two open-source decoder-only LLMs were selected: Llama-3.1-8B-instruct (Llama-3.1) and Fietje-2-instruct (Fietje-2). We evaluated their performance under zero-shot prompting, in-context learning (ICL) with up to eight examples, and QLoRA fine-tuning on 30 training samples. Outputs were assessed with automatic metrics (BLEU, ROUGE-L, BERTScore, MEDCON), complemented by textual analysis and a human reader study.
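To make the fine-tuning condition concrete, the sketch below shows a typical QLoRA setup for Llama-3.1 with the Hugging Face transformers, peft and bitsandbytes libraries. It is a minimal illustration under assumed hyperparameters (rank, alpha, target modules), not the exact configuration used in this study.

```python
# Minimal QLoRA sketch: 4-bit base weights plus trainable low-rank adapters.
# Hyperparameters below are illustrative assumptions, not the study's settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.1-8B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantise frozen base weights to 4 bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                    # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],     # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only adapter weights remain trainable
model.print_trainable_parameters()
```

Training then proceeds with a standard causal-language-modelling loss over the ICP–summary training pairs.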
Results
Automatic metrics indicated comparable overall performance of both models across conditions, with semantic similarity exceeding syntactic similarity.
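As background to this comparison, the sketch below shows how the syntactic (BLEU, ROUGE-L) and semantic (BERTScore) metrics can be computed with common Python packages; the input texts are placeholders, and MEDCON, which relies on UMLS medical-concept matching, is omitted here.

```python
# Hedged sketch of the automatic evaluation, assuming the sacrebleu,
# rouge-score and bert-score packages; texts are placeholders.
import sacrebleu
from rouge_score import rouge_scorer
from bert_score import score as bert_score

candidate = "Model-generated ACP summary ..."
reference = "Expert-written reference summary ..."

# Syntactic overlap: n-gram precision (BLEU) and longest common subsequence (ROUGE-L).
bleu = sacrebleu.sentence_bleu(candidate, [reference]).score
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

# Semantic similarity: contextual-embedding matching (BERTScore), here for Dutch text.
p, r, f1 = bert_score([candidate], [reference], lang="nl")

print(f"BLEU {bleu:.1f} | ROUGE-L {rouge_l:.3f} | BERTScore-F1 {f1.item():.3f}")
```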
For ICL, Llama-3.1 showed a slight increase in BLEU and BERTScore with a single example (ICL-1), whereas Fietje-2 performed best under zero-shot prompting. However, textual analysis revealed that both models frequently copied content from the in-context example's reference summary rather than from the source ICPs, yielding only limited structural improvement. Hallucinations and interpretation shifts were observed in all generated texts, with Fietje-2 producing fewer structured answers and omitting key details such as age and condition.
For QLoRA fine-tuning, automatic metrics showed only minor improvements, mainly in recall, without meaningful gains in precision or overall quality. Textual analysis indicated that Llama-3.1 produced summaries in the correct structure more consistently than Fietje-2, though it often misplaced information or repeated segments. Fietje-2 generated shorter, less repetitive texts, but with more hallucinations and nonsensical statements. In the human reader evaluation, experts unanimously preferred the reference summaries over the outputs of both models. When comparing model outputs, most readers favoured Llama-3.1 for completeness, while Fietje-2 was preferred for conciseness; no clear preference emerged for correctness.
Conclusion
This study demonstrates that while open-source LLMs can extract some relevant ACP outcomes from ICPs, their outputs remain incomplete and too unreliable for clinical use. Neither ICL nor QLoRA fine-tuning substantially improved summarisation quality under the current data and computational constraints. The limited dataset size, single-reference summaries, and complex prompts likely constrained model performance. Future research should focus on larger, clinically representative datasets derived from transcribed ACP conversations, the inclusion of multiple expert-written reference summaries, and systematic prompt optimisation in collaboration with ACP experts, PPC patients and their family members. With careful dataset construction, iterative fine-tuning, and human evaluation, LLMs may in the future help reduce administrative workload and support ACP implementation in PPC.