Jonas Wallat | TU Delft Repository

TempRetriever

Fusion-based Temporal Dense Passage Retrieval for Time-Sensitive Questions

Conference paper (2026) - Abdelrahman Abdallah, Bhawna Piryani, Jonas Wallat, Avishek Anand, Adam Jatowt

Temporal information is crucial for information retrieval, yet most dense retrieval systems focus exclusively on semantic similarity while neglecting temporal alignment between queries and documents. We propose TempRetriever, a lightweight framework that explicitly incorporates temporal information into dense passage retrieval through learned fusion techniques. Unlike existing approaches requiring extensive architectural modifications or specialized pre-training, TempRetriever enhances standard dense retrievers by combining semantic embeddings with temporal representations using four fusion strategies: Feature Stacking, Vector Summation, Relative Embeddings, and Element-Wise Interaction. Our approach introduces a learned temporal encoder and a time-based negative sampling strategy to address temporal misalignment during training. We evaluate TempRetriever on three temporal question answering datasets (ArchivalQA, ChroniclingAmericaQA, NobelPrize) that together span the years 1800 to 2022. TempRetriever achieves substantial improvements over standard DPR: 6.86% on ArchivalQA (Recall@1) and 4.40% on ChroniclingAmericaQA (Recall@1). Our method also outperforms state-of-the-art temporal retrieval systems, obtaining 9.62% improvement over BiTimeBERT and 5.16% over TS-Retriever. Notably, TempRetriever's fusion techniques can enhance existing temporal methods, improving BiTimeBERT by 5.12% and TS-Retriever by 6.17%, demonstrating modularity and practical value. Zero-shot evaluation confirms strong generalization across domains, and integration with retrieval-augmented generation shows consistent end-to-end improvements. ...
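The abstract names the fusion strategies but does not define them, so the following is only a rough sketch of what three of them could look like, read literally from their names: Feature Stacking as concatenation, Vector Summation as element-wise addition, and Element-Wise Interaction as an element-wise product. These readings, the function names, and the toy vectors are assumptions for illustration and not the paper's implementation. Relative Embeddings is left out because its definition cannot be inferred from the name alone.

```python
from typing import List

Vector = List[float]


def _check_same_length(semantic: Vector, temporal: Vector) -> None:
    # Summation and element-wise product only make sense for equal dimensions.
    if len(semantic) != len(temporal):
        raise ValueError("semantic and temporal vectors must have the same length")


def feature_stacking(semantic: Vector, temporal: Vector) -> Vector:
    # Assumed reading: concatenate both vectors; output dimension is the sum of both.
    return list(semantic) + list(temporal)


def vector_summation(semantic: Vector, temporal: Vector) -> Vector:
    # Assumed reading: element-wise addition; output keeps the input dimension.
    _check_same_length(semantic, temporal)
    return [s + t for s, t in zip(semantic, temporal)]


def element_wise_interaction(semantic: Vector, temporal: Vector) -> Vector:
    # Assumed reading: element-wise (Hadamard) product of the two vectors.
    _check_same_length(semantic, temporal)
    return [s * t for s, t in zip(semantic, temporal)]


# Toy vectors standing in for a semantic embedding and a temporal representation.
semantic_vec = [0.5, 1.0, -2.0]
temporal_vec = [2.0, 0.0, 1.0]

stacked = feature_stacking(semantic_vec, temporal_vec)
summed = vector_summation(semantic_vec, temporal_vec)
interacted = element_wise_interaction(semantic_vec, temporal_vec)
```

Under these assumptions, stacking doubles the embedding dimension, while summation and element-wise product keep it unchanged, so only the latter two can reuse an existing retrieval index without changing its vector size.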

Correctness is not Faithfulness in Retrieval Augmented Generation Attributions

Conference paper (2025) - Jonas Wallat, Maria Heuss, Maarten De Rijke, Avishek Anand

Large language models (LLMs) have transformed information retrieval through chat interfaces, but their hallucination tendencies pose significant risks. While Retrieval Augmented Generation (RAG) with citations has emerged as a solution by allowing users to verify responses through source attribution, current evaluation approaches focus primarily on citation correctness - whether cited documents support the corresponding statements. This is insufficient, and we therefore introduce citation faithfulness - whether the model's reliance on cited documents is genuine rather than post-rationalized to fit pre-existing knowledge. Our contributions are threefold: (i) we introduce coherent notions of attribution and the concept of citation faithfulness; (ii) we propose desiderata for citations beyond correctness and accuracy needed for trustworthy systems; and (iii) we emphasize evaluating citation faithfulness by studying post-rationalization. Through experimentation, we reveal prevalent post-rationalization issues, finding that up to 57% of citations lack faithfulness. This undermines reliable attribution and may result in misplaced trust, highlighting a critical gap in current LLM-based IR systems. We demonstrate why both citation correctness and faithfulness must be considered when deploying LLMs in IR applications, contributing to a broader discussion of building more reliable and transparent information access systems. ...

Temporal Blind Spots in Large Language Models

Conference paper (2024) - Jonas Wallat, Adam Jatowt, Avishek Anand

Large language models (LLMs) have recently gained significant attention due to their unparalleled zero-shot performance on various natural language processing tasks. However, the pre-training data utilized in LLMs is often confined to a specific corpus, resulting in inherent freshness and temporal scope limitations. Consequently, this raises concerns regarding the effectiveness of LLMs for tasks involving temporal intents. In this study, we aim to investigate the underlying limitations of general-purpose LLMs when deployed for tasks that require a temporal understanding. We pay particular attention to handling factual temporal knowledge through three popular temporal QA datasets. Specifically, we observe low performance on detailed questions about the past and, surprisingly, for rather new information. In manual and automatic testing, we find multiple temporal errors and characterize the conditions under which QA performance deteriorates. Our analysis contributes to understanding LLM limitations and offers valuable insights into developing future models that can better cater to the demands of temporally-oriented tasks. The code is available at https://github.com/jwallat/temporalblindspots. ...

Causal Probing for Dual Encoders

Conference paper (2024) - Jonas Wallat, Hauke Hinrichs, Avishek Anand

Dual encoders are highly effective and widely deployed in the retrieval phase for passage and document ranking, question answering, or retrieval-augmented generation (RAG) setups. Most dual-encoder models use transformer models like BERT to map input queries and output targets to a common vector space encoding the semantic similarity. Despite their prevalence and impressive performance, little is known about the inner workings of dense encoders for retrieval. We investigate neural retrievers using the probing paradigm to identify well-understood IR properties that causally result in ranking performance. Unlike existing works that have probed cross-encoders to show query-document interactions, we provide a principled approach to probe dual-encoders. Importantly, we employ causal probing to avoid correlation effects that might be artefacts of vanilla probing. We conduct extensive experiments on one such dual encoder (TCT-ColBERT) to check for the existence and relevance of six properties: term importance, lexical matching (BM25), semantic matching, question classification, and the two linguistic properties of named entity recognition and coreference resolution. Our layer-wise analysis shows important differences between re-rankers and dual encoders, establishing which tasks are not only understood by the model but also used for inference. ...
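As a minimal illustration of the dual-encoder setup described above, the sketch below ranks documents by the dot product between a query vector and independently computed document vectors. The dot product as the similarity function, the function name `rank_by_dot_product`, and the two-dimensional toy embeddings are assumptions for illustration; a real system would obtain the vectors from a trained transformer encoder.

```python
from typing import Dict, List


def rank_by_dot_product(query_vec: List[float],
                        doc_vecs: Dict[str, List[float]]) -> List[str]:
    # Score each document independently of the others: query and documents
    # only interact through the final similarity computation.
    scores = {
        doc_id: sum(q * d for q, d in zip(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    }
    # Return document ids ordered from highest to lowest score.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical, hand-written embeddings in a shared 2-dimensional space.
query_embedding = [1.0, 0.0]
document_embeddings = {
    "d1": [0.9, 0.1],
    "d2": [0.1, 0.9],
    "d3": [0.5, 0.5],
}

ranking = rank_by_dot_product(query_embedding, document_embeddings)
```

Because documents are encoded without seeing the query, their vectors can be precomputed and indexed, which is what distinguishes this setup from the cross-encoders mentioned in the abstract.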

Probing BERT for Ranking Abilities

Conference paper (2023) - Jonas Wallat, Fabian Beringer, Abhijit Anand, Avishek Anand

Contextual models like BERT are highly effective in numerous text-ranking tasks. However, it is still unclear whether contextual models understand well-established notions of relevance that are central to IR. In this paper, we use probing, a recent approach used to analyze language models, to investigate the ranking abilities of BERT-based rankers. Most of the probing literature has focussed on linguistic and knowledge-aware capabilities of models or axiomatic analysis of ranking models. We fill an important gap in the information retrieval literature by conducting a layer-wise probing analysis using four probes based on lexical matching, semantic similarity, as well as linguistic properties like coreference resolution and named entity recognition. Our experiments show an interesting trend: BERT-rankers better encode ranking abilities at intermediate layers. Based on our observations, we train a ranking model by augmenting the ranking data with the probe data to show initial yet consistent performance improvements (the code is available at https://github.com/yolomeus/probing-search/). ...