Timur Akhunov
Wabash College
Title
Hypoelliptic PDE-based characterization of Thompson sampling
Abstract
The multi-armed bandit is a foundational problem in reinforcement learning with extensive industrial applications. Named after a gambler seeking the most rewarding slot machine, it epitomizes the exploration-exploitation tradeoff. Thompson sampling is a classic strategy for navigating this tradeoff and minimizing the cost of selecting suboptimal arms (regret) over the long run. In classical applications, the expected rewards of the arms differ by a fixed, if unknown, amount. In recent problems motivated by the large samples used in machine learning applications, however, the gaps between these rewards diminish over time, and few foundational results are known in this regime.
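For readers unfamiliar with the algorithm, the following is a minimal sketch of Thompson sampling for Bernoulli-reward arms (an illustrative textbook setting, not the diminishing-gap regime the abstract studies; all names and parameters here are for illustration only):

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Illustrative Thompson sampling for a Bernoulli bandit.

    Each arm keeps a Beta(successes + 1, failures + 1) posterior over
    its unknown mean reward.  At every round we draw one sample from
    each posterior and play the arm with the largest sample, so arms
    are chosen with probability proportional to how plausible it is
    that they are optimal.
    """
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        # Observe a Bernoulli reward and update that arm's posterior.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Example: with a fixed gap between arms, the best arm (mean 0.7)
# accumulates the overwhelming majority of the pulls.
counts = thompson_sampling([0.3, 0.5, 0.7], horizon=2000)
```

In the fixed-gap setting above, concentration of the posteriors drives the per-round probability of a suboptimal pull to zero; the abstract's diminishing-gap regime is precisely where this classical picture breaks down.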
While bandit games are inherently discrete, in joint work with Vlad Kobzar, we have characterized the continuous-time limit of this problem by a degenerate parabolic PDE. We verified that this PDE satisfies Hörmander's bracket condition and is therefore hypoelliptic. Is hypoellipticity enough to guarantee regret bounds for our gambler? We will share known results and open questions from our work in progress.
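For context, Hörmander's theorem (stated here in its general form; the specific vector fields arising from the bandit PDE are not given in this abstract) concerns second-order operators built from smooth vector fields $X_0, X_1, \dots, X_m$:

```latex
L \;=\; \sum_{i=1}^{m} X_i^{2} \;+\; X_0 .
```

The bracket condition asks that the Lie algebra generated by $X_0, \dots, X_m$ (the fields together with their iterated commutators $[X_i, X_j]$, $[X_i,[X_j,X_k]]$, \dots) span the tangent space at every point. When it holds, $L$ is hypoelliptic: any distributional solution of $Lu = f$ with $f$ smooth is itself smooth, even though $L$ may be degenerate (non-elliptic) in some directions.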