Large Language Models (LLMs) have demonstrated impressive capabilities on a wide range of tasks, including tasks that entail complex reasoning. They can also adapt to new tasks without further training when provided with exemplars demonstrating how to solve the task, owing to emergent capabilities such as In-Context Learning (ICL), where the model acquires the skills required for a task from the demonstration samples provided. Exemplar selection methods can be categorized as static, where exemplars are selected offline at the task level, or dynamic (instance-level), where exemplars are selected per test query. Dynamic, instance-level selection has been shown to be more accurate than static, task-level selection, but it is difficult to use in practice because of its high inference-time computational cost. To mitigate this issue, we propose a novel perspective that casts exemplar selection as a ranking problem and uses learning-to-rank (LTR) models, trained on automatically generated BERTScore-based relevance labels, to assign a utility to each exemplar. However, randomly selecting exemplars and collecting LLM feedback on them may not yield the best data for training LTR models; principled exploration of the exemplar space is critical to learning a selection policy offline that can then be employed for dynamic exemplar selection at inference time. We tackle these problems with CASE Rank, a novel non-linear gap-index bandit framework that reduces inference-time overhead by learning an exemplar utility estimator offline without sacrificing performance. CASE Rank combines a gap-index-based bandit framework, which judiciously samples LLM feedback, with PiRank, a lightweight neural ranking model built on differentiable sorting, used as a non-linear surrogate loss within the bandit framework; the learned policy enables fast, per-instance exemplar selection at inference time. Experiments on GSM8K, AQUA-RAT, and WMT19 show that CASE Rank improves reasoning performance over prior methods while substantially lowering computational requirements. Our results highlight that principled, efficient exemplar selection can be achieved by combining exploration strategies with learning-to-rank models tailored to LLM response behavior.
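To make the labeling step concrete, the sketch below shows one plausible way to generate BERTScore-based relevance labels for candidate exemplars. It uses the real `bert_score` package; the caller-supplied `llm_generate` callback and the overall labeling recipe are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch: BERTScore-based relevance labels for candidate
# exemplars. The labeling recipe here is an assumption, not the paper's
# exact procedure.
from bert_score import score

def relevance_labels(exemplars, query, reference_answer, llm_generate):
    """llm_generate(exemplar, query) -> str is supplied by the caller and
    wraps whatever LLM prompting interface is available (an assumption).
    Each exemplar is labeled by the BERTScore F1 between the LLM's output
    (when prompted with that exemplar) and the reference answer."""
    outputs = [llm_generate(ex, query) for ex in exemplars]
    # bert_score.score returns (precision, recall, F1) tensors over pairs
    _, _, f1 = score(outputs, [reference_answer] * len(outputs), lang="en")
    return f1.tolist()  # higher F1 => higher relevance label for LTR training
```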
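The abstract does not spell out the exact gap index, so the following is a hedged sketch in the spirit of gap-index pure-exploration bandits such as UGapE (Gabillon et al., 2012): arms are candidate exemplars, and a pull queries the LLM and records the resulting relevance score.

```python
# Illustrative gap-index exploration in the spirit of UGapE; treat this as
# a sketch rather than CASE Rank's exact index.
# Arms = candidate exemplars; pull(arm) queries the LLM and returns a
# relevance score (e.g., the BERTScore label above) in [0, 1].
import math

def gap_index_explore(n_arms, pull, budget, c=1.0):
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(budget):
        if t < n_arms:                       # initialize: pull each arm once
            arm = t
        else:
            width = [c * math.sqrt(math.log(budget) / counts[i])
                     for i in range(n_arms)]
            ucb = [means[i] + width[i] for i in range(n_arms)]
            lcb = [means[i] - width[i] for i in range(n_arms)]
            # Gap index: optimistic gap between the best competing arm and
            # arm i; a small index marks the arm whose rank is most uncertain.
            gap = [max(ucb[j] for j in range(n_arms) if j != i) - lcb[i]
                   for i in range(n_arms)]
            arm = min(range(n_arms), key=gap.__getitem__)
        reward = pull(arm)                   # LLM feedback for this exemplar
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means                             # empirical utilities per exemplar
```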
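PiRank (Swezey et al., 2021) is a published differentiable-sorting surrogate built on the NeuralSort relaxation (Grover et al., 2019). The sketch below shows a minimal PiRank-style soft-NDCG loss under those assumptions; it is not the paper's exact training objective.

```python
# Minimal PiRank-style surrogate: soft NDCG via the NeuralSort relaxation.
import torch
import torch.nn.functional as F

def neuralsort(s: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """NeuralSort (Grover et al., 2019): s is an (n,) score vector; returns
    an (n, n) row-stochastic matrix approaching the permutation matrix that
    sorts s in descending order as tau -> 0."""
    n = s.numel()
    s = s.view(n, 1)
    A = (s - s.t()).abs()                      # pairwise |s_i - s_j|
    B = A.sum(dim=1, keepdim=True)             # A @ 1
    ranks = torch.arange(1, n + 1, dtype=s.dtype)
    C = s * (n + 1 - 2 * ranks).view(1, n)     # C[j, i] = s_j * (n + 1 - 2i)
    return F.softmax((C - B).t() / tau, dim=-1)

def pirank_style_loss(scores, labels, tau=1.0):
    """1 - soft NDCG, where labels are the relevance labels (e.g., the
    BERTScore-based labels sketched above)."""
    P_hat = neuralsort(scores, tau)            # (n, n) soft permutation
    gains = 2.0 ** labels - 1.0
    soft_gains = P_hat @ gains                 # expected gain at each rank
    n = labels.numel()
    discounts = 1.0 / torch.log2(torch.arange(2, n + 2, dtype=scores.dtype))
    dcg = (soft_gains * discounts).sum()
    ideal = torch.sort(gains, descending=True).values
    idcg = (ideal * discounts).sum().clamp_min(1e-8)
    return 1.0 - dcg / idcg
```

Because the soft permutation is differentiable in the scores, this loss can be minimized by gradient descent on any neural utility estimator, which is what makes it usable as the non-linear surrogate inside a bandit-driven training loop.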