
Smooth Contextual Bandits: Bridging the Parametric and Nondifferentiable Regret Regimes.

Authors :
Hu, Yichun
Kallus, Nathan
Mao, Xiaojie
Source :
Operations Research; Nov/Dec 2022, Vol. 70, Issue 6, p3261-3281, 21p, 5 Graphs
Publication Year :
2022

Abstract

Dynamic Personalized Decision Making Beyond the Super-Extrapolatable and Super-Local Cases

Contextual bandit problems model the inherent trade-off between exploration and exploitation in personalized decision making in marketing, healthcare, revenue management, and more. Specifically, the trade-off is characterized by the optimal growth rate of the regret. Intuitively, the optimal rate should depend on how complex the underlying supervised learning problem is, namely, how much observing the reward in one context can tell us about mean rewards in another. To formalize this intuitive relationship, Hu, Kallus, and Mao study in "Smooth Contextual Bandits: Bridging the Parametric and Nondifferentiable Regret Regimes" a nonparametric contextual bandit problem in which mean reward functions are β-times differentiable (more generally, Hölder β-smooth). This interpolates between two extremes previously studied in isolation: nondifferentiable bandits (β ≤ 1), for which running separate noncontextual bandits in different context regions achieves rate-optimal regret, and parametric-response bandits (β = ∞), for which rate-optimal regret can be achieved with minimal or no exploration because of infinite extrapolatability across contexts. The authors develop a rate-optimal algorithm that operates neither fully locally nor fully globally, revealing the optimal regret rate in this in-between smooth setting and shedding light on the crucial interplay of functional complexity and regret in dynamic personalized decision making.

We study a nonparametric contextual bandit problem in which the expected reward functions belong to a Hölder class with smoothness parameter β. We show how this interpolates between two extremes that were previously studied in isolation: nondifferentiable bandits (β ≤ 1), for which rate-optimal regret is achieved by running separate noncontextual bandits in different context regions, and parametric-response bandits (β = ∞), for which rate-optimal regret can be achieved with minimal or no exploration because of infinite extrapolatability. We develop a novel algorithm that carefully adjusts to all smoothness settings, and we prove its regret is rate-optimal by establishing matching upper and lower bounds, recovering the existing results at the two extremes. In this sense, our work bridges the gap between the existing literature on parametric and nondifferentiable contextual bandit problems and between bandit algorithms that exclusively use global or local information, shedding light on the crucial interplay of complexity and regret in contextual bandits.

Funding: N. Kallus acknowledges support from the National Science Foundation [Grant 1846210]. X. Mao acknowledges support from the National Natural Science Foundation of China [Grants 72201150 and 72293561].

Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2021.2237.

[ABSTRACT FROM AUTHOR]
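
To make the fully local (β ≤ 1) extreme concrete, the sketch below runs a separate noncontextual UCB1 bandit inside each of several context bins, which is the strategy the abstract describes as rate-optimal for nondifferentiable rewards. This is an illustration only, not the paper's algorithm: the authors' rate-optimal procedure additionally shares information across context regions so as to adapt to all smoothness levels. All names and parameters here (BinnedUCB, n_bins, the reward functions in the usage example) are hypothetical choices for demonstration.

import numpy as np

# Minimal sketch, assuming contexts in [0, 1): one independent UCB1
# bandit per context bin, i.e., the fully local strategy for beta <= 1.
class BinnedUCB:
    def __init__(self, n_arms, n_bins):
        self.n_arms = n_arms
        self.n_bins = n_bins
        self.counts = np.zeros((n_bins, n_arms))  # pulls per (bin, arm)
        self.means = np.zeros((n_bins, n_arms))   # running mean reward
        self.t = np.zeros(n_bins)                 # rounds seen per bin

    def _bin(self, x):
        # Map a context x in [0, 1) to its bin index.
        return min(int(x * self.n_bins), self.n_bins - 1)

    def select(self, x):
        b = self._bin(x)
        self.t[b] += 1
        # Pull each arm once in this bin before scoring with UCB.
        untried = np.where(self.counts[b] == 0)[0]
        if untried.size > 0:
            return b, int(untried[0])
        ucb = self.means[b] + np.sqrt(2 * np.log(self.t[b]) / self.counts[b])
        return b, int(np.argmax(ucb))

    def update(self, b, arm, reward):
        self.counts[b, arm] += 1
        n = self.counts[b, arm]
        self.means[b, arm] += (reward - self.means[b, arm]) / n

# Usage with a hypothetical two-arm environment whose mean rewards
# vary smoothly in the context x (illustrative functions only):
rng = np.random.default_rng(0)
policy = BinnedUCB(n_arms=2, n_bins=8)
for _ in range(10000):
    x = rng.random()
    b, a = policy.select(x)
    mean = (0.5 + 0.3 * x) if a == 0 else (0.8 - 0.3 * x)
    policy.update(b, a, mean + 0.1 * rng.standard_normal())

Because each bin learns in isolation, this sketch cannot extrapolate across contexts; the gap between it and a smoothness-adaptive method is exactly what the paper's interpolation between the local and parametric regimes quantifies.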

Details

Language :
English
ISSN :
0030-364X
Volume :
70
Issue :
6
Database :
Complementary Index
Journal :
Operations Research
Publication Type :
Academic Journal
Accession number :
161062983
Full Text :
https://doi.org/10.1287/opre.2021.2237