
Deep reinforcement learning using least‐squares truncated temporal‐difference.

Authors :
Ren, Junkai
Lan, Yixing
Xu, Xin
Zhang, Yichuan
Fang, Qiang
Zeng, Yujun
Source :
CAAI Transactions on Intelligence Technology; Apr 2024, Vol. 9 Issue 2, p425-439, 15p
Publication Year :
2024

Abstract

Policy evaluation (PE) is a critical sub‐problem in reinforcement learning: it estimates the value function of a given policy and can be used for policy improvement. However, current PE methods still suffer from limitations such as low sample efficiency and local convergence, especially on complex tasks. In this study, a novel PE algorithm called Least‐Squares Truncated Temporal‐Difference learning (LST2D) is proposed. In LST2D, an adaptive truncation mechanism is designed that effectively combines the fast convergence of Least‐Squares Temporal‐Difference (LSTD) learning with the asymptotic convergence of Temporal‐Difference (TD) learning. Two feature pre‐training methods are then utilised to improve the approximation ability of LST2D. Furthermore, an Actor‐Critic algorithm based on LST2D and pre‐trained feature representations (ACLPF) is proposed, in which LST2D is integrated into the critic network to improve learning‐prediction efficiency. Comprehensive simulation studies were conducted on four robotic tasks, and the results illustrate the effectiveness of LST2D. The proposed ACLPF algorithm outperformed DQN, ACER and PPO in terms of sample efficiency and stability, demonstrating that LST2D can be applied to online learning control problems by incorporating it into the actor‐critic architecture. [ABSTRACT FROM AUTHOR]
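To make the idea of blending a least‐squares batch solution with incremental TD updates concrete, the following is a minimal sketch of linear policy evaluation that solves LSTD on an initial block of transitions and then refines the weights with TD(0). It is not the authors' LST2D algorithm (which uses an adaptive truncation mechanism and pre‐trained features); the function names, the fixed `truncation_horizon` parameter, and the random feature data are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: fixed-horizon switch from batch LSTD to incremental TD(0).
# All names (lstd_policy_eval, td0_policy_eval, truncated_ls_td, truncation_horizon)
# are hypothetical and do not correspond to the paper's implementation.
import numpy as np


def lstd_policy_eval(phi, phi_next, rewards, gamma=0.99, reg=1e-3):
    """Batch LSTD: solve A w = b with A = Phi^T (Phi - gamma * Phi'), b = Phi^T r."""
    A = phi.T @ (phi - gamma * phi_next) + reg * np.eye(phi.shape[1])
    b = phi.T @ rewards
    return np.linalg.solve(A, b)


def td0_policy_eval(w, phi, phi_next, rewards, gamma=0.99, alpha=0.01):
    """Incremental linear TD(0) updates over the remaining transitions."""
    for f, f_next, r in zip(phi, phi_next, rewards):
        td_error = r + gamma * (f_next @ w) - (f @ w)
        w = w + alpha * td_error * f
    return w


def truncated_ls_td(phi, phi_next, rewards, gamma=0.99, truncation_horizon=500):
    """Use LSTD on the first `truncation_horizon` samples for a fast initial
    estimate, then refine with TD(0) on the rest (a crude stand-in for an
    adaptive truncation mechanism)."""
    k = min(truncation_horizon, len(rewards))
    w = lstd_policy_eval(phi[:k], phi_next[:k], rewards[:k], gamma)
    return td0_policy_eval(w, phi[k:], phi_next[k:], rewards[k:], gamma)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 2000, 8
    phi = rng.normal(size=(n, d))        # feature vectors phi(s_t)
    phi_next = rng.normal(size=(n, d))   # feature vectors phi(s_{t+1})
    rewards = rng.normal(size=n)         # observed rewards r_t
    print("estimated value weights:", np.round(truncated_ls_td(phi, phi_next, rewards), 3))
```

In an actor‐critic setting such as the paper's ACLPF, a least‐squares/TD estimate of this kind would play the role of the critic, with the features supplied by a pre‐trained representation rather than the random vectors used here.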

Details

Language :
English
ISSN :
2468-2322
Volume :
9
Issue :
2
Database :
Complementary Index
Journal :
CAAI Transactions on Intelligence Technology
Publication Type :
Academic Journal
Accession number :
176690760
Full Text :
https://doi.org/10.1049/cit2.12202