G. Penha | TU Delft Repository

Do the Findings of Document and Passage Retrieval Generalize to the Retrieval of Responses for Dialogues?

Conference paper (2023) - Gustavo Penha, Claudia Hauff

A number of learned sparse and dense retrieval approaches have recently been proposed and proven effective in tasks such as passage retrieval and document retrieval. In this paper we analyze, through a replicability study, whether the lessons learned generalize to the retrieval of responses for dialogues, an important task for the increasingly popular field of conversational search. Unlike passage and document retrieval, where documents are usually longer than queries, in response ranking for dialogues the queries (dialogue contexts) are often longer than the documents (responses). Additionally, dialogues have a particular structure, i.e. multiple utterances by different users. With these differences in mind, we evaluate how generalizable the following major findings from previous works are: (F1) query expansion outperforms a no-expansion baseline; (F2) document expansion outperforms a no-expansion baseline; (F3) zero-shot dense retrieval underperforms sparse baselines; (F4) dense retrieval outperforms sparse baselines; (F5) hard negative sampling is better than random sampling for training dense models. Our experiments (https://github.com/Guzpenha/transformer_rankers/tree/full_rank_retrieval_dialogues), based on three different information-seeking dialogue datasets, reveal that four out of five findings (F2–F5) generalize to our domain. ...

Designing and Diagnosing Models for Conversational Search and Recommendation

Doctoral thesis (2023) - G. Penha

Conversational search is a sub-field of Information Retrieval (IR) that focuses on solving information needs through natural language conversations. Searching for information is an inherently interactive task, and conversations offer a promising solution, one that might change the current search paradigm. In this thesis, we focus on retrieval and ranking approaches for conversational search systems, which are core IR technologies that have been progressing for decades. First, we contribute resources that we created and that are used throughout the thesis. Namely, we introduce a novel dataset of information-seeking dialogues, MANtIS, as well as a library to train and evaluate models for the task of conversation response ranking, transformer-rankers. Considering a two-stage pipeline for conversational search, we propose approaches for retrieval and also for re-ranking responses. We start by empirically comparing sparse and dense approaches for the first-stage retrieval of responses for dialogues. Next, we go to the second stage of the pipeline and use notions of difficulty to improve response re-rankers. We start with a curriculum learning approach that begins with easy dialogues and moves progressively to harder ones during training. We also investigate how difficult a dialogue can be when predicting the relevance of responses, by proposing models that allow for estimating their uncertainty. Finally, we move on to evaluating the behavior and limitations of retrieval and ranking models for conversational search. We start by evaluating the effect of categories of language variations of queries in retrieval pipelines. Additionally, we evaluate the capabilities of heavily pre-trained language models for different conversational recommendation tasks. With this thesis, we make scientific contributions to the field by providing resources, improving retrieval and re-rankers, and enabling a better understanding of models.
We hope our contributions can be used as a foundation for future work in conversational search, enabling agents that can improve information-seeking interactions. ...

Evaluating the Robustness of Retrieval Pipelines with Query Variation Generators

Conference paper (2022) - Gustavo Penha, Arthur Câmara, Claudia Hauff

Heavily pre-trained transformers for language modeling, such as BERT, have been shown to be remarkably effective for Information Retrieval (IR) tasks, typically applied to re-rank the results of a first-stage retrieval model. IR benchmarks evaluate the effectiveness of retrieval pipelines based on the premise that a single query is used to instantiate the underlying information need. However, previous research has shown that (I) queries generated by users for a fixed information need are extremely variable and, in particular, (II) neural models are brittle and often make mistakes when tested with modified inputs. Motivated by those observations, we aim to answer the following question: how robust are retrieval pipelines with respect to different variations in queries that do not change the queries’ semantics? In order to obtain queries that are representative of users’ querying variability, we first created a taxonomy based on the manual annotation of transformations occurring in a dataset (UQV100) of user-created query variations. For each syntax-changing category of our taxonomy, we employed different automatic methods that, when applied to a query, generate a query variation. Our experimental results across two datasets for two IR tasks reveal that retrieval pipelines are not robust to these query variations, with effectiveness drops of ≈ 20% on average. The code and datasets are available at https://github.com/Guzpenha/query_variation_generators. ...

The Seventh Workshop on Search-Oriented Conversational Artificial Intelligence (SCAI'22)

Conference paper (2022) - Gustavo Penha, Svitlana Vakulenko, Ondrej Dusek, Leigh Clark, Vaishali Pal, Vaibhav Adlakha

The goal of the seventh edition of SCAI (https://scai.info) is to bring together and further grow a community of researchers and practitioners interested in conversational systems for information access. The previous iterations of the workshop already demonstrated the breadth and multidisciplinarity inherent in the design and development of conversational search agents. The proposed shift from traditional web search to search interfaces enabled via human-like dialogue leads to a number of challenges, and although such challenges have received more attention in recent years, there are many pending research questions that should be addressed by the information retrieval community and can largely benefit from a collaboration with other research fields, such as natural language processing, machine learning, human-computer interaction and dialogue systems. This workshop is intended as a platform enabling a continuous discussion of the major research challenges that surround the design of search-oriented conversational systems. This year, participants have the opportunity to meet in person and have more in-depth interactive discussions with a full-day onsite workshop. ...

Helping Voice Shoppers Make Purchase Decisions

Conference paper (2022) - Gustavo Penha, Eyal Krikon, Vanessa Murdock, Sandeep Avula

Online shoppers have a lot of information at their disposal when making a purchase decision. They can look at images of the product, read reviews, make comparisons with other products, do research online, read expert reviews, and more. Voice shopping (purchasing items via a Voice assistant such as Amazon Alexa or Google Assistant) is different. Voice introduces novel challenges as the communication channel is limited in terms of the amount of information people can and are willing to absorb. Because of this, the system should choose the single most effective nugget of information to help the customer, and present the information succinctly. In this paper we report on a within-subject user study (N = 24), in which we employed three template-based methods that use information from customer reviews, product attributes and search relevance signals to generate helpful supporting information. Our results suggest that: (1) supporting information from customer reviews significantly improves participants' perception of system effectiveness (helping them make good decisions); (2) supporting information based on search relevance signals improves user perception of system transparency (providing insight into how the system works). We discuss the implications of our findings for providing supporting information for customers shopping by Voice. ...

Pairwise review-based explanations for voice product search

Conference paper (2022) - Gustavo Penha, Eyal Krikon, Vanessa Murdock

Explanations describe product recommendations in a human interpretable way in order to achieve a goal, e.g. persuade users to buy. Unlike web product search, where users have access to diverse information as to why the products might be suitable for their needs, in the voice product search domain the amount of information that can be disclosed is inherently limited. Users in general evaluate a maximum of two products and usually buy low consideration products when using the voice channel [3]. In order to enable decision making in voice product searches we propose here a framework for generating pointwise and pairwise review-based explanations that disclose further information about the products. The POINTWISE method selects a helpful sentence from the top review of the recommended product based on a BERT-based model and uses the extracted sentence to fill a response template. The PAIRWISE method first selects a diverse pair of products - in terms of their review-based representations - from the top-k ranked products for a query, then chooses a helpful review sentence for each product in the pair, and finally fills a template with the sentences. Besides further describing the product, the PAIRWISE method gives a reference point to the users and enables a comparison of the recommendations based on two diverse products for the same information need. Our crowd-sourced evaluation of explanations based on queries from a widely used e-commerce platform shows that the proposed pairwise explanations provide statistically significant improvements compared to the POINTWISE and BASELINE methods for two goals: Effectiveness, i.e. helping users to make good decisions, and Transparency, i.e. explaining how the system works. The gains of PAIRWISE over POINTWISE and BASELINE are consistent for different subsets of data based on the diversity of the selected pairs, average product price associated with the query and the query ambiguity. ...
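
The pair-selection step of the PAIRWISE method described above can be illustrated with a minimal sketch. It assumes each top-k product already has a review-based vector and treats "diverse" as "lowest cosine similarity"; the function names, the toy vectors and the similarity measure are assumptions for illustration, not the authors' implementation.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_diverse_pair(product_vectors):
    """Return the pair of product ids whose vectors are least similar.

    product_vectors: dict mapping a product id to its (hypothetical)
    review-based representation for the top-k ranked products.
    """
    return min(
        combinations(product_vectors, 2),
        key=lambda pair: cosine_similarity(
            product_vectors[pair[0]], product_vectors[pair[1]]
        ),
    )


# Toy example: p1 and p2 have near-identical reviews, p3 differs.
top_k = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0]}
pair = most_diverse_pair(top_k)
```

In this toy example the selected pair is `("p1", "p3")`, the two products whose review representations are orthogonal.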

Weakly Supervised Label Smoothing

Conference paper (2021) - Gustavo Penha, Claudia Hauff

We study Label Smoothing (LS), a widely used regularization technique, in the context of neural learning to rank (L2R) models. LS combines the ground-truth labels with a uniform distribution, encouraging the model to be less confident in its predictions. We analyze the relationship between the non-relevant documents—specifically how they are sampled—and the effectiveness of LS, discussing how LS can be capturing “hidden similarity knowledge” between the relevant and non-relevant document classes. We further analyze LS by testing if a curriculum-learning approach, i.e., starting with LS and after a number of iterations using only ground-truth labels, is beneficial. Inspired by our investigation of LS in the context of neural L2R models, we propose a novel technique called Weakly Supervised Label Smoothing (WSLS) that takes advantage of the retrieval scores of the negative sampled documents as a weak supervision signal in the process of modifying the ground-truth labels. WSLS is simple to implement, requiring no modification to the neural ranker architecture. Our experiments across three retrieval tasks—passage retrieval, similar question retrieval and conversation response ranking—show that WSLS for pointwise BERT-based rankers leads to consistent effectiveness gains. The source code is available at https://github.com/Guzpenha/transformer_rankers/tree/wsls. ...
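
The label modification discussed in this abstract can be sketched as follows. The first function is standard label smoothing for a two-class (relevant / non-relevant) pointwise ranker. The second is only a hypothetical reading of the WSLS idea, in which the min-max normalized retrieval score of a sampled negative sets its soft label. The function names, the normalization and the value of `epsilon` are assumptions, not the released implementation.

```python
def smooth_label(label, epsilon=0.1, num_classes=2):
    """Standard label smoothing: mix the hard label with a uniform distribution."""
    return (1.0 - epsilon) * label + epsilon / num_classes


def weakly_supervised_label(label, retrieval_score, min_score, max_score, epsilon=0.1):
    """Hypothetical WSLS-style target (an assumption, see lead-in).

    Relevant documents keep their ground-truth label. A sampled negative
    receives a small positive label that grows with its retrieval score,
    so negatives that the first-stage retriever scored highly are treated
    as "less non-relevant".
    """
    if label == 1.0:
        return 1.0
    if max_score == min_score:
        return 0.0
    normalized = (retrieval_score - min_score) / (max_score - min_score)
    return epsilon * normalized


uniform_negative = smooth_label(0.0)                            # same for every negative
weak_negative = weakly_supervised_label(0.0, 15.0, 10.0, 20.0)  # depends on the score
```

The contrast is that uniform smoothing gives every negative the same soft label, while the score-based variant differentiates negatives by how highly the retriever ranked them.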

On the Calibration and Uncertainty of Neural Learning to Rank Models for Conversational Search

Conference paper (2021) - Gustavo Penha, Claudia Hauff

According to the Probability Ranking Principle (PRP), ranking documents in decreasing order of their probability of relevance leads to an optimal document ranking for ad-hoc retrieval. The PRP holds when two conditions are met: [C1] the models are well calibrated, and [C2] the probabilities of relevance are reported with certainty. We know, however, that deep neural networks (DNNs) are often not well calibrated and have several sources of uncertainty, and thus [C1] and [C2] might not be satisfied by neural rankers. Given the success of neural Learning to Rank (L2R) approaches, and here especially BERT-based approaches, we first analyze under which circumstances deterministic neural rankers are calibrated for conversational search problems. Then, motivated by our findings, we use two techniques to model the uncertainty of neural rankers, leading to the proposed stochastic rankers, which output a predictive distribution of relevance as opposed to point estimates. Our experimental results on the ad-hoc retrieval task of conversation response ranking reveal that (i) BERT-based rankers are not robustly calibrated and that stochastic BERT-based rankers yield better calibration; and (ii) uncertainty estimation is beneficial both for risk-aware neural ranking, i.e. taking into account the uncertainty when ranking documents, and for predicting unanswerable conversational contexts. ...
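
A minimal sketch of risk-aware ranking, assuming a stochastic ranker that yields several sampled relevance scores per response (for example from repeated stochastic forward passes). Each response is ranked by its mean score minus a penalty proportional to the variance of its samples. The function name, the `risk_aversion` parameter and the variance penalty are assumptions for illustration; the paper's exact formulation may differ.

```python
from statistics import mean, pvariance


def risk_aware_ranking(sampled_scores, risk_aversion=1.0):
    """Rank responses by mean predicted relevance minus an uncertainty penalty.

    sampled_scores: dict mapping a response id to a list of relevance scores
    sampled from the ranker's predictive distribution (hypothetical input).
    """
    adjusted = {
        response: mean(scores) - risk_aversion * pvariance(scores)
        for response, scores in sampled_scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)


# "a" has the higher mean but very uncertain scores; "b" is stable.
samples = {"a": [0.9, 0.1, 0.9, 0.1], "b": [0.45, 0.45, 0.45, 0.45]}
risk_neutral = risk_aware_ranking(samples, risk_aversion=0.0)
risk_averse = risk_aware_ranking(samples, risk_aversion=1.0)
```

With `risk_aversion=0.0` the ranking reduces to ordering by the mean (a point estimate); with a positive value, the uncertain response "a" drops below the stable response "b".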

Exploiting Performance Estimates for Augmenting Recommendation Ensembles

Conference paper (2020) - Gustavo Penha, Rodrygo L.T. Santos

Ensembling multiple recommender systems via stacking has been shown to be effective at improving collaborative recommendation. Recent work extends stacking to use additional user performance predictors (e.g., the total number of ratings made by the user) to help determine how much each base recommender should contribute to the ensemble. Nonetheless, despite the cost of handcrafting discriminative predictors, which typically requires deep knowledge of the strengths and weaknesses of each recommender in the ensemble, only minor improvements have been observed. To overcome this limitation, instead of engineering complex features to predict the performance of different recommenders for a given user, we propose to directly estimate these performances by leveraging the user's own historical ratings. Experiments on real-world datasets from multiple domains demonstrate that using performance estimates as additional features can significantly improve the accuracy of state-of-the-art ensemblers, achieving nDCG@20 improvements of 23% on average over not using them. ...

Challenges in the evaluation of conversational search systems

Conference paper (2020) - Gustavo Penha, Claudia Hauff

The area of conversational search has gained significant traction in the IR research community, motivated by the widespread use of personal assistants. An often researched task in this setting is conversation response ranking, that is, to retrieve the best response for a given ongoing conversation from a corpus of historic conversations. While this is intuitively an important step towards (retrieval-based) conversational search, the empirical evaluation currently employed to evaluate trained rankers is very far from this setup: typically, an extremely small number (e.g., 10) of non-relevant responses and a single relevant response are presented to the ranker. In a real-world scenario, a retrieval-based system has to retrieve responses from a large (e.g., several millions) pool of responses or determine that no appropriate response can be found. In this paper we aim to highlight these critical issues in the offline evaluation schemes for tasks related to conversational search. With this paper, we argue that the currently in-use evaluation schemes have critical limitations and simplify the conversational search tasks to a degree that makes it questionable whether we can trust the findings they deliver. ...

Curriculum Learning Strategies for IR: An Empirical Study on Conversation Response Ranking

Conference paper (2020) - Gustavo Penha, Claudia Hauff

Neural ranking models are traditionally trained on a series of random batches, sampled uniformly from the entire training set. Curriculum learning has recently been shown to improve neural models’ effectiveness by sampling batches non-uniformly, going from easy to difficult instances during training. In the context of neural Information Retrieval (IR), curriculum learning has not been explored yet, and so it remains unclear (1) how to measure the difficulty of training instances and (2) how to transition from easy to difficult instances during training. To address both challenges and determine whether curriculum learning is beneficial for neural ranking models, we need large-scale datasets and a retrieval task that allows us to conduct a wide range of experiments. For this purpose, we resort to the task of conversation response ranking: ranking responses given the conversation history. In order to deal with challenge (1), we explore scoring functions to measure the difficulty of conversations based on different input spaces. To address challenge (2), we evaluate different pacing functions, which determine the pace at which we go from easy to difficult instances. We find that, overall, by just intelligently sorting the training data (i.e., by performing curriculum learning) we can improve the retrieval effectiveness by up to 2%. The source code is available at https://github.com/Guzpenha/transformers_cl. ...
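
The role of a pacing function can be illustrated with a short sketch. It assumes the training instances are already sorted from easy to difficult by some scoring function, and that the pacing function returns the fraction of that sorted data available for sampling at a given training step. The two functions below (linear and square-root), their names and the `start_fraction` default are assumptions for illustration, not necessarily the ones evaluated in the paper.

```python
import math


def linear_pacing(step, total_steps, start_fraction=0.33):
    """Fraction of the sorted data available at `step`, growing linearly to 1."""
    progress = min(step / total_steps, 1.0)
    return min(1.0, start_fraction + (1.0 - start_fraction) * progress)


def root_pacing(step, total_steps, start_fraction=0.33):
    """Square-root pacing: harder instances are introduced earlier than with linear pacing."""
    progress = min(step / total_steps, 1.0)
    return min(1.0, math.sqrt(start_fraction ** 2 + (1.0 - start_fraction ** 2) * progress))


def available_instances(sorted_instances, step, total_steps, pacing=root_pacing):
    """Return the easiest instances that may be sampled from at `step`."""
    fraction = pacing(step, total_steps)
    cutoff = max(1, math.ceil(fraction * len(sorted_instances)))
    return sorted_instances[:cutoff]


# Ten instances sorted from easiest (0) to most difficult (9).
instances = list(range(10))
at_start = available_instances(instances, step=0, total_steps=100)
at_end = available_instances(instances, step=100, total_steps=100)
```

At step 0 only the easiest part of the data is available; by the final step the whole training set is, at which point batches are again sampled from all instances.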

What does BERT know about books, movies and music? Probing BERT for Conversational Recommendation

Conference paper (2020) - Gustavo Penha, Claudia Hauff

Heavily pre-trained transformer models such as BERT have recently been shown to be remarkably powerful at language modelling, achieving impressive results on numerous downstream tasks. It has also been shown that they implicitly store factual knowledge in their parameters after pre-training. Understanding what the pre-training procedure of LMs actually learns is a crucial step for using and improving them for Conversational Recommender Systems (CRS). We first study how much off-the-shelf pre-trained BERT "knows" about recommendation items such as books, movies and music. In order to analyze the knowledge stored in BERT's parameters, we use different probes (i.e., tasks to examine a trained model regarding certain properties) that require different types of knowledge to solve, namely content-based and collaborative-based. Content-based knowledge is knowledge that requires the model to match the titles of items with their content information, such as textual descriptions and genres. In contrast, collaborative-based knowledge requires the model to match items with similar ones, according to community interactions such as ratings. We resort to BERT's Masked Language Modelling (MLM) head to probe its knowledge about the genre of items, with cloze-style prompts. In addition, we employ BERT's Next Sentence Prediction (NSP) head and representations' similarity (SIM) to compare relevant and non-relevant search and recommendation query-document inputs to explore whether BERT can, without any fine-tuning, rank relevant items first. Finally, we study how BERT performs in a conversational recommendation downstream task. To this end, we fine-tune BERT to act as a retrieval-based CRS. Overall, our experiments show that: (i) BERT has knowledge stored in its parameters about the content of books, movies and music; (ii) it has more content-based knowledge than collaborative-based knowledge; and (iii) it fails on conversational recommendation when faced with adversarial data. ...

Online learning to rank for sequential music recommendation

Conference paper (2019) - Bruno L. Pereira, Alberto Ueda, Gustavo Penha, Rodrygo L.T. Santos, Nivio Ziviani

The prominent success of music streaming services has brought increasingly complex challenges for music recommendation. In particular, in a streaming setting, songs are consumed sequentially within a listening session, which should cater not only for the user's historical preferences, but also for eventual preference drifts, triggered by a sudden change in the user's context. In this paper, we propose a novel online learning to rank approach for music recommendation that aims to learn continuously from the user's listening feedback. In contrast to existing online learning approaches for music recommendation, we leverage implicit feedback as the only signal of the user's preference. Moreover, to adapt rapidly to preference drifts over millions of songs, we represent each song in a lower dimensional feature space and explore multiple directions in this space as duels of candidate recommendation models. Our thorough evaluation using listening sessions from Last.fm demonstrates the effectiveness of our approach at learning faster and better compared to state-of-the-art online learning approaches. ...

Document performance prediction for automatic text classification

Conference paper (2019) - Gustavo Penha, Raphael Campos, Sérgio Canuto, Marcos André Gonçalves, Rodrygo L.T. Santos

Query performance prediction (QPP) is a fundamental task in information retrieval, which concerns predicting the effectiveness of a ranking model for a given query in the absence of relevance information. Despite being an active research area, this task has not yet been explored in the context of automatic text classification. In this paper, we study the task of predicting the effectiveness of a classifier for a given document, which we refer to as document performance prediction (DPP). Our experiments on several text classification datasets for both categorization and sentiment analysis attest to the effectiveness and complementarity of several DPP approaches inspired by related QPP approaches. Finally, we also explore the usefulness of DPP for improving the classification itself, by using the predictions as additional features in a classification ensemble. ...