TimelineQA: Benchmark for List and Temporal Understanding

Abstract

Existing benchmarks for question answering (QA) systems often under-represent questions whose answers are lists, a task we refer to in this work as ListQA. Such questions offer valuable insight into a system’s ability to structure its internal knowledge. In this work, we introduce TimelineQA (TLQA), a specialized subset of ListQA that requires both list comprehension and temporal understanding to produce correct answers. We automatically curate a benchmark for TLQA and use it to evaluate generative large language models (LLMs) commonly employed in QA systems. Our findings reveal significant shortcomings in current models, particularly their failure to provide complete answers and to align facts accurately in time. To address these issues, we explore how Retrieval-Augmented Generation (RAG) can enhance performance on this task. Our results indicate that while RAG improves performance, especially in retrieving relevant contextual information, considerable room for improvement remains. Both the benchmark and the model evaluations can be publicly accessed here.