A well-functioning democracy depends on an informed population. Summaries of the arguments made in political transcripts can help inform citizens. One approach to argument summarization distills the arguments into higher-level key points; within this approach, mapping arguments to key points is an important subtask. This study examines how model selection, prompting strategy, choice of domain, and input batching influence the performance of large language models (LLMs) in matching arguments to key points. We introduce a self-annotated dataset built from U.S. Congress committee transcripts and evaluate both generative and embedding-based models on this task. Generative LLMs (GPT-3.5-turbo, o4-mini) outperform both untuned and fine-tuned RoBERTa in zero-shot argument-to-key-point mapping (up to 0.880 macro-F1), while sparse two-shot prompting yields no gains. Moderate batching (n=32) boosts throughput without losing accuracy. These results show that a fully automated key point analysis (KPA) pipeline, covering argument extraction, key-point generation, and mapping, is achievable with current LLMs.
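To make the mapping setup concrete, the sketch below shows one way a zero-shot, batched argument-to-key-point prompt could be issued through an OpenAI-style chat interface. It is a minimal illustration: the prompt wording, the example key points, the `match_batch` and `batched` helpers, and the batch size of 32 are assumptions for exposition, not the paper's actual prompts or code.

```python
# Minimal sketch of zero-shot argument-to-key-point matching with a chat LLM.
# Prompt wording, key points, and helper names are illustrative assumptions,
# not the exact setup used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

KEY_POINTS = [
    "The policy increases the federal deficit.",
    "The bill strengthens oversight of federal agencies.",
]  # hypothetical key points for illustration


def match_batch(arguments, key_points, model="gpt-3.5-turbo"):
    """Ask the model, for each argument, which key point it matches (or 'none')."""
    numbered_kps = "\n".join(f"{i + 1}. {kp}" for i, kp in enumerate(key_points))
    numbered_args = "\n".join(f"{i + 1}. {arg}" for i, arg in enumerate(arguments))
    prompt = (
        "You are matching arguments from congressional hearings to key points.\n"
        f"Key points:\n{numbered_kps}\n\n"
        f"Arguments:\n{numbered_args}\n\n"
        "For each argument, answer with its number followed by the matching "
        "key-point number, or 'none' if no key point matches. One line per argument."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.splitlines()


def batched(items, n=32):
    """Yield groups of n arguments so several are mapped per request."""
    for i in range(0, len(items), n):
        yield items[i:i + n]
```

Batching several arguments per request (here 32) trades a longer prompt for fewer API calls, which is the throughput-versus-accuracy consideration the study examines.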