TimelineQA: Benchmark for List and Temporal Understanding

Abstract

Existing benchmarks for question answering (QA) systems often under-represent questions whose answers are lists, a task we refer to in this work as ListQA. Such questions offer valuable insight into a system’s ability to structure its internal knowledge. In this work, we introduce TimelineQA (TLQA), a specialized subset of ListQA that requires both list comprehension and temporal understanding to produce correct answers. We automatically curate a benchmark for TLQA and use it to evaluate generative large language models (LLMs) commonly employed in QA systems. Our findings reveal significant shortcomings in current models, particularly their failure to provide complete answers and to align facts accurately in time. To address these issues, we explore how Retrieval-Augmented Generation (RAG) can enhance performance on this task. Our results indicate that while RAG improves performance, especially in retrieving relevant contextual information, considerable room for improvement remains. Both the benchmark and the model evaluations can be publicly accessed here.