Regret Analysis of a Markov Policy Gradient Algorithm for Multiarm Bandits.
- Source :
- Mathematics of Operations Research; Aug 2023, Vol. 48 Issue 3, p1553-1588, 36p
- Publication Year :
- 2023
Abstract
- We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm rather than using a deterministic time-decreasing learning rate. The state of the algorithm forms a Markov chain on the probability simplex. We apply Foster–Lyapunov techniques to analyze the stability of this Markov chain. We prove that, if learning rates are well chosen, then the policy gradient algorithm is a transient Markov chain, and the state of the chain converges to the optimal arm with logarithmic or polylogarithmic regret. [ABSTRACT FROM AUTHOR]
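- The following is a minimal illustrative sketch of the kind of algorithm the abstract describes: a softmax-parameterized policy gradient update for a Bernoulli bandit whose step size depends on the current state of the policy rather than on a deterministic time schedule. The function name `policy_gradient_bandit`, the softmax parameterization, and the particular state-dependent step size `eta = 1 - p.max()` are assumptions for illustration, not the exact scheme analyzed in the paper.

```python
import numpy as np

def policy_gradient_bandit(true_means, horizon, rng=None):
    """Policy gradient on a Bernoulli bandit (hypothetical sketch).

    The softmax parameterization and the state-dependent learning
    rate below are illustrative assumptions, not the paper's scheme.
    """
    rng = np.random.default_rng(rng)
    K = len(true_means)
    theta = np.zeros(K)                 # softmax parameters
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        p = np.exp(theta - theta.max())
        p /= p.sum()                    # current state on the probability simplex
        arm = rng.choice(K, p=p)
        reward = float(rng.random() < true_means[arm])
        regret += best - true_means[arm]
        # State-dependent learning rate (assumption): shrink as the
        # policy concentrates, instead of a deterministic 1/t decay.
        eta = 1.0 - p.max()
        # REINFORCE gradient of the expected reward under the softmax policy:
        # reward * (e_arm - p)
        grad = -reward * p
        grad[arm] += reward
        theta += eta * grad
    return p, regret

# Example usage with assumed arm means and horizon:
final_policy, cumulative_regret = policy_gradient_bandit([0.3, 0.5, 0.7], horizon=10_000)
```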
- Subjects :
- MARKOV processes
MULTI-ARMED bandit problems
REGRET
POLICY analysis
ALGORITHMS
Details
- Language :
- English
- ISSN :
- 0364-765X
- Volume :
- 48
- Issue :
- 3
- Database :
- Complementary Index
- Journal :
- Mathematics of Operations Research
- Publication Type :
- Academic Journal
- Accession number :
- 169834707
- Full Text :
- https://doi.org/10.1287/moor.2022.1311