Policy Optimization with Stochastic Mirror Descent

Authors :
Yang, Long
Zhang, Yu
Zheng, Gang
Zheng, Qian
Li, Pengfei
Huang, Jianhang
Wen, Jun
Pan, Gang
Source :
AAAI2022
Publication Year :
2019

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the $\mathtt{VRMPO}$ algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In $\mathtt{VRMPO}$, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms state-of-the-art policy gradient methods in various settings.
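The abstract frames policy optimization as stochastic mirror descent over the policy. For context only, below is a minimal sketch of a generic stochastic mirror descent update on the probability simplex with a negative-entropy mirror map, whose Bregman proximal step has the closed-form multiplicative-weights solution; it is not the paper's VRMPO estimator, and the function name `mirror_descent_step`, the step size, and the bandit setup are illustrative assumptions.

```python
import numpy as np

def mirror_descent_step(policy, grad_estimate, step_size):
    """One stochastic mirror descent step on the probability simplex.

    With the negative-entropy mirror map, the proximal update reduces to
    a multiplicative-weights rule: pi_new(a) is proportional to
    pi(a) * exp(step_size * grad_estimate(a)).
    """
    logits = np.log(policy) + step_size * grad_estimate
    logits -= logits.max()                 # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Hypothetical 3-action bandit example: the stochastic gradient estimate
# is a noisy reward vector observed from the environment.
rng = np.random.default_rng(0)
policy = np.full(3, 1.0 / 3.0)
true_rewards = np.array([1.0, 0.2, -0.5])
for _ in range(200):
    noisy_grad = true_rewards + rng.normal(scale=0.5, size=3)
    policy = mirror_descent_step(policy, noisy_grad, step_size=0.05)
print(policy)  # probability mass concentrates on the highest-reward action
```

VRMPO combines this kind of mirror descent policy update with a variance-reduced gradient estimator; the details of that estimator are given in the paper, not in the sketch above.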

Details

Database :
arXiv
Journal :
AAAI2022
Publication Type :
Report
Accession number :
edsarx.1906.10462
Document Type :
Working Paper