Exploring Bandit Algorithms in User-Interactive Systems

Influence of Delay on Contextual Multi-Armed Bandits

More Info
expand_more

Abstract

Delay is a frequently encountered phenomenon in Multi-armed bandit problems that affects the accuracy of choosing the optimal arm. One example of this phenomenon is online shopping, where there is a delay between a user being recommended a product and placing the order. This study investigates the influence of delay in contextual multi-armed bandit settings, focusing on the performance of the OTFLinUCB algorithm, an adaptation of the Lin- UCB algorithm designed to handle delayed feed- back. The cumulative regret associated with various bandit algorithms, including UCB, Exp3, Lin- UCB, and OTFLinUCB, is examined under different delayed environments. Experiments con- ducted using artificial data generated to simulate real-world scenarios reveal that while delay cannot be entirely mitigated, its impact can be minimized through the strategic selection of hyperparameters, particularly the time window size for OTFLinUCB. The findings indicate that OTFLinUCB’s performance is highly dependent on the chosen window size, which balances the trade-off between estimator accuracy and memory consumption. A method for determining the optimal window size is proposed by introducing an intermediary distribution- agnostic term, conversion rate. Empirical evidence suggests that maintaining a conversion rate of around 70% achieves a balance between minimizing cumulative regret and optimizing memory usage. As a result, this research contributes to the understanding of the delays’ effect on bandit algorithms in contextual environments and offers practical guidelines for tuning OTFLinUCB’s time window.