Investigating the case of weak baselines in Ad-hoc Retrieval and Question Answering

Abstract

Weak baselines have been present in Information Retrieval (IR) for
decades. They have been associated with stagnating progress in IR, with a
bias toward selecting baselines that make results easier to publish, and
with reproducibility issues that prevent independent research teams from
validating reported model effectiveness. The IR community has studied
weak baselines; however, the focus has been almost exclusively on ad-hoc
retrieval, the most popular IR task, leaving out other IR tasks and
recently developed datasets. Current deep neural IR research is
particularly vulnerable to the problems of weak baselines due to the hype
surrounding deep learning.
In this thesis we investigate cases of weak baselines in ad-hoc
retrieval and question answering (QA), two representative IR tasks among
the 13 cases of weak baselines we found in recent deep neural IR research
from the EMNLP 2018 conference. In particular, we study whether the
recently introduced deep neural IR models are actually significantly more
effective than the reported IR baselines, or than LambdaMART, the
Learning to Rank (LTR) model we propose combined with hyperparameter
optimization (HPO). We also benchmark two HPO methods, Random Search (RS)
and Bayesian Optimization HyperBand (BOHB), to determine which is more
efficient at finding a good hyperparameter configuration.
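To make the setup concrete, the following is a minimal sketch of the kind of RS loop described above, using LightGBM's LGBMRanker (whose lambdarank objective is a LambdaMART implementation) as the LTR model. The toy data, search space, and budget of 20 configurations are illustrative assumptions, not the thesis's actual experimental setup.

```python
import numpy as np
import lightgbm as lgb
from sklearn.metrics import ndcg_score

rng = np.random.default_rng(0)

# Toy learning-to-rank data (illustrative only): 100 queries with 10
# candidate documents each, 5 features per document and graded relevance
# labels in {0, 1, 2}.
n_queries, docs_per_query, n_features = 100, 10, 5
X = rng.normal(size=(n_queries * docs_per_query, n_features))
y = rng.integers(0, 3, size=n_queries * docs_per_query)
split = 80 * docs_per_query  # first 80 queries for training, 20 for validation
X_tr, y_tr = X[:split], y[:split]
X_va, y_va = X[split:], y[split:]

best_ndcg, best_params = -1.0, None
for _ in range(20):  # RS: 20 randomly sampled configurations
    params = {
        "num_leaves": int(rng.integers(8, 128)),
        "learning_rate": float(10 ** rng.uniform(-3.0, -0.5)),
        "n_estimators": int(rng.integers(50, 300)),
    }
    model = lgb.LGBMRanker(objective="lambdarank", verbose=-1, **params)
    model.fit(X_tr, y_tr, group=[docs_per_query] * 80)
    # NDCG@10 averaged over validation queries (one row per query).
    ndcg = ndcg_score(y_va.reshape(-1, docs_per_query),
                      model.predict(X_va).reshape(-1, docs_per_query), k=10)
    if ndcg > best_ndcg:
        best_ndcg, best_params = ndcg, params

print(f"best NDCG@10 = {best_ndcg:.3f} with {best_params}")
```

BOHB would replace the uniform sampling above with a model-based, multi-fidelity search; the selection criterion (validation NDCG) stays the same.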
Through our experiments, we show that the effectiveness of the
novel deep neural IR models can be difficult to replicate, may be lower
than reported, and is not necessarily significantly higher than that of
the baselines. Furthermore, we demonstrate that BOHB is more efficient
than RS, but that the HPO process does not always significantly improve
the effectiveness of LambdaMART.
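As an illustration of the significance testing implied above, the sketch below compares per-query NDCG scores of a hypothetical neural model against a baseline with a paired t-test, a common choice in IR evaluation; all numbers are synthetic placeholders, not results from the thesis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-query NDCG scores for 50 queries (placeholders).
baseline_ndcg = rng.uniform(0.3, 0.7, size=50)
neural_ndcg = baseline_ndcg + rng.normal(0.01, 0.05, size=50)  # small, noisy gain

# Two-sided paired t-test over the per-query score differences.
t_stat, p_value = stats.ttest_rel(neural_ndcg, baseline_ndcg)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A p-value above 0.05 means the apparent gain is not statistically
# significant at the conventional level.
```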