Bandits with Global Convex Constraints and Objective.

Authors :: Agrawal, Shipra; Devanur, Nikhil R.
Source :: Operations Research; Sep/Oct 2019, Vol. 67 Issue 5, p1486-1502, 17p
Publication Year :: 2019
Abstract :: Multiarmed bandit (MAB) is a classic model for capturing the exploration–exploitation trade-off inherent in many sequential decision-making problems. The classic MAB framework, however, only allows "local" constraints on decisions and "sum of rewards" as objective. In many real-world applications, there are multiple complex constraints on resources that are consumed during the entire decision process, and performance may be evaluated through nonlinear utility functions on aggregate rewards. This article presents a new MAB framework that allows such "global" convex constraints and concave objective functions along with new algorithmic techniques with provably near-optimal performance bounds. The authors discuss applications in several domains, such as network revenue management, crowdsourcing, and pay-per-click advertising, which benefit from the new more general framework by admitting richer models and more efficient risk-averse solutions.

We consider a very general model for managing the exploration–exploitation trade-off, which allows global convex constraints and concave objective on the aggregate decisions over time in addition to the customary limitation on the time horizon. This model provides a natural framework to study many sequential decision-making problems with long-term convex constraints and concave utility and subsumes the classic multiarmed bandit (MAB) model and the bandits with knapsacks problem as special cases. We demonstrate that a natural extension of the upper confidence bound family of algorithms for MAB provides a polynomial time algorithm with near-optimal regret guarantees for this substantially more general model. We also provide computationally more efficient algorithms by establishing interesting connections between this problem and other well-studied problems/algorithms, such as the Blackwell approachability problem, online convex optimization, and the Frank–Wolfe technique for convex optimization. We give several concrete examples of applications, particularly in risk-sensitive revenue management under unknown demand distributions, in which this more general bandit model of sequential decision making allows for richer formulations and more efficient solutions of the problem. [ABSTRACT FROM AUTHOR]
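
Illustrative note: the abstract describes its method as an extension of the upper confidence bound (UCB) family of algorithms for MAB. As background only, the following is a minimal sketch of the textbook UCB1 rule for the classic MAB special case (sum-of-rewards objective, no global constraints). It is not the algorithm of the article. The function name `ucb1`, the Bernoulli reward model, and the example arm means are all assumptions chosen for illustration.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run textbook UCB1 on simulated Bernoulli arms (illustrative assumption).

    Returns (pull counts per arm, total reward collected).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # number of pulls of each arm
    sums = [0.0] * n_arms    # cumulative reward of each arm
    total = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialisation: pull every arm once.
            arm = t - 1
        else:
            # Pick the arm with the largest optimistic estimate:
            # empirical mean plus a confidence bonus that shrinks with pulls.
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward

    return counts, total


if __name__ == "__main__":
    counts, total = ucb1([0.2, 0.5, 0.8], horizon=2000)
    print("pulls per arm:", counts)
    print("total reward:", total)
```

In this sketch each decision depends only on per-arm statistics and the objective is the plain sum of rewards, which is the "local" setting the abstract contrasts with its global convex constraints and concave objective.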