Policy Optimization with Stochastic Mirror Descent

Authors :
Yang, Long
Zhang, Yu
Zheng, Gang
Zheng, Qian
Li, Pengfei
Huang, Jianhang
Wen, Jun
Pan, Gang
Source :
AAAI2022
Publication Year :
2019

Abstract

Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper proposes the $\mathtt{VRMPO}$ algorithm: a sample-efficient policy gradient method with stochastic mirror descent. In $\mathtt{VRMPO}$, a novel variance-reduced policy gradient estimator is presented to improve sample efficiency. We prove that the proposed $\mathtt{VRMPO}$ needs only $\mathcal{O}(\epsilon^{-3})$ sample trajectories to achieve an $\epsilon$-approximate first-order stationary point, which matches the best known sample complexity for policy optimization. Extensive experimental results demonstrate that $\mathtt{VRMPO}$ outperforms state-of-the-art policy gradient methods in various settings.
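The abstract frames policy optimization as stochastic mirror descent over the policy. For context only, below is a minimal sketch of a generic stochastic mirror descent update on the probability simplex with a negative-entropy mirror map, whose Bregman proximal step has the closed-form multiplicative-weights solution; it is not the paper's VRMPO estimator, and the function name `mirror_descent_step`, the step size, and the bandit setup are illustrative assumptions.

```python
import numpy as np

def mirror_descent_step(policy, grad_estimate, step_size):
    """One stochastic mirror descent step on the probability simplex.

    With the negative-entropy mirror map, the proximal update reduces to
    a multiplicative-weights rule: pi_new(a) is proportional to
    pi(a) * exp(step_size * grad_estimate(a)).
    """
    logits = np.log(policy) + step_size * grad_estimate
    logits -= logits.max()                 # numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum()

# Hypothetical 3-action bandit example: the stochastic gradient estimate
# is a noisy reward vector observed from the environment.
rng = np.random.default_rng(0)
policy = np.full(3, 1.0 / 3.0)
true_rewards = np.array([1.0, 0.2, -0.5])
for _ in range(200):
    noisy_grad = true_rewards + rng.normal(scale=0.5, size=3)
    policy = mirror_descent_step(policy, noisy_grad, step_size=0.05)
print(policy)  # probability mass concentrates on the highest-reward action
```

VRMPO combines this kind of mirror descent policy update with a variance-reduced gradient estimator; the details of that estimator are given in the paper, not in the sketch above.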

Details

Database :
arXiv
Journal :
AAAI2022
Publication Type :
Report
Accession number :
edsarx.1906.10462
Document Type :
Working Paper