Exploring Bandit Algorithms in User-Interactive Systems

Influence of Delay on Contextual Multi-Armed Bandits

Bachelor Thesis (2024)
Author(s)

D.C. Arsene (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Julia Olkhovskaya – Mentor (TU Delft - Sequential Decision Making)

Rangarao Venkatesha Prasad – Graduation committee member (TU Delft - Networked Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
expand_more
Publication Year
2024
Language
English
Graduation Date
25-06-2024
Awarding Institution
Delft University of Technology
Project
['CSE3000 Research Project']
Programme
['Computer Science and Engineering']
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Delay is a frequently encountered phenomenon in Multi-armed bandit problems that affects the accuracy of choosing the optimal arm. One example of this phenomenon is online shopping, where there is a delay between a user being recommended a product and placing the order. This study investigates the influence of delay in contextual multi-armed bandit settings, focusing on the performance of the OTFLinUCB algorithm, an adaptation of the Lin- UCB algorithm designed to handle delayed feed- back. The cumulative regret associated with various bandit algorithms, including UCB, Exp3, Lin- UCB, and OTFLinUCB, is examined under different delayed environments. Experiments con- ducted using artificial data generated to simulate real-world scenarios reveal that while delay cannot be entirely mitigated, its impact can be minimized through the strategic selection of hyperparameters, particularly the time window size for OTFLinUCB. The findings indicate that OTFLinUCB’s performance is highly dependent on the chosen window size, which balances the trade-off between estimator accuracy and memory consumption. A method for determining the optimal window size is proposed by introducing an intermediary distribution- agnostic term, conversion rate. Empirical evidence suggests that maintaining a conversion rate of around 70% achieves a balance between minimizing cumulative regret and optimizing memory usage. As a result, this research contributes to the understanding of the delays’ effect on bandit algorithms in contextual environments and offers practical guidelines for tuning OTFLinUCB’s time window.

Files

Final_RP_Paper_Dragos.pdf
(pdf | 1.32 Mb)
License info not available