Author: "Yang, Zhuoran" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Yang, Zhuoran"' showing total 647 results

Start Over Author "Yang, Zhuoran"

647 results on '"Yang, Zhuoran"'

51. Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes

Author: Lu, Miao, Min, Yifei, Wang, Zhaoran, and Yang, Zhuoran
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Statistics Theory, Statistics - Methodology, Statistics - Machine Learning
Abstract: We study offline reinforcement learning (RL) in partially observable Markov decision processes. In particular, we aim to learn an optimal policy from a dataset collected by a behavior policy which possibly depends on the latent state. Such a dataset is confounded in the sense that the latent state simultaneously affects the action and the observation, which is prohibitive for existing offline RL algorithms. To this end, we propose the \underline{P}roxy variable \underline{P}essimistic \underline{P}olicy \underline{O}ptimization (\texttt{P3O}) algorithm, which addresses the confounding bias and the distributional shift between the optimal and behavior policies in the context of general function approximation. At the core of \texttt{P3O} is a coupled sequence of pessimistic confidence regions constructed via proximal causal inference, which is formulated as minimax estimation. Under a partial coverage assumption on the confounded dataset, we prove that \texttt{P3O} achieves a $n^{-1/2}$-suboptimality, where $n$ is the number of trajectories in the dataset. To our best knowledge, \texttt{P3O} is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset., Comment: Updates. 52 pages
Published: 2022

52. Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency

Author: Wang, Lingxiao, Cai, Qi, Yang, Zhuoran, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Systems and Control, Statistics - Machine Learning
Abstract: Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy.~(i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step feature. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To our best knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces., Comment: Accepted by ICLR 2022
Published: 2022

53. Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation

Author: Chen, Xiaoyu, Zhong, Han, Yang, Zhuoran, Wang, Zhaoran, and Wang, Liwei
Subjects: Computer Science - Machine Learning
Abstract: We study human-in-the-loop reinforcement learning (RL) with trajectory preferences, where instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer. The goal of the agent is to learn the optimal policy which is most preferred by the human overseer. Despite the empirical successes, the theoretical understanding of preference-based RL (PbRL) is only limited to the tabular case. In this paper, we propose the first optimistic model-based algorithm for PbRL with general function approximation, which estimates the model using value-targeted regression and calculates the exploratory policies by solving an optimistic planning problem. Our algorithm achieves the regret of $\tilde{O} (\operatorname{poly}(d H) \sqrt{K} )$, where $d$ is the complexity measure of the transition and preference model depending on the Eluder dimension and log-covering numbers, $H$ is the planning horizon, $K$ is the number of episodes, and $\tilde O(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with $n$-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation.
Published: 2022

54. Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning

Author: Lyu, Boxiang, Wang, Zhaoran, Kolar, Mladen, and Yang, Zhuoran
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning
Abstract: Dynamic mechanism design has garnered significant attention from both computer scientists and economists in recent years. By allowing agents to interact with the seller over multiple rounds, where agents' reward functions may change with time and are state-dependent, the framework is able to model a rich class of real-world problems. In these works, the interaction between agents and sellers is often assumed to follow a Markov Decision Process (MDP). We focus on the setting where the reward and transition functions of such an MDP are not known a priori, and we are attempting to recover the optimal mechanism using an a priori collected data set. In the setting where the function approximation is employed to handle large state spaces, with only mild assumptions on the expressiveness of the function class, we are able to design a dynamic mechanism using offline reinforcement learning algorithms. Moreover, learned mechanisms approximately have three key desiderata: efficiency, individual rationality, and truthfulness. Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set. To the best of our knowledge, our work provides the first offline RL algorithm for dynamic mechanism design without assuming uniform coverage., Comment: 52 pages
Published: 2022

55. Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency

Author: Cai, Qi, Yang, Zhuoran, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces, which remains less investigated theoretically. To this end, we make the first attempt at bridging partial observability and function approximation for a class of POMDPs with a linear structure. In detail, we propose a reinforcement learning algorithm (Optimistic Exploration via Adversarial Integral Equation or OP-TENET) that attains an $\epsilon$-optimal policy within $O(1/\epsilon^2)$ episodes. In particular, the sample complexity scales polynomially in the intrinsic dimension of the linear structure and is independent of the size of the observation and state spaces. The sample efficiency of OP-TENET is enabled by a sequence of ingredients: (i) a Bellman operator with finite memory, which represents the value function in a recursive manner, (ii) the identification and estimation of such an operator via an adversarial integral equation, which features a smoothed discriminator tailored to the linear structure, and (iii) the exploration of the observation and state spaces via optimism, which is based on quantifying the uncertainty in the adversarial integral equation.
Published: 2022

56. Tuning the stiffness and stretchability of micro-scale-structured polymer membrane simultaneously by integrating mechanical modeling and spatiotemporal controllable photolithograph

Author: Shi, Huibin, Yan, Ziming, Shi, Jiaxin, Yang, Zhuoran, Gao, Yue, Xu, Jun, and Liu, Zhanli
Published: 2024
Full Text: View/download PDF

57. Creep-fatigue interaction of soft adhesive under shear loading: Damage diagram and life prediction model

Author: Xia, Yan, Yao, Chengbin, Li, Zheyu, Yang, Zhuoran, Jiang, Han, and Zhu, Zhongmeng
Published: 2024
Full Text: View/download PDF

58. Magnetically recyclable three-dimensional aerogel water evaporator for organic pollutant removal via solar-powered interfacial water evaporation cooperated with photocatalysis

Author: Kan, Hao, Li, Wenting, Yang, Chunyu, Zhang, Xiaotong, Yang, Zhuoran, and Guo, Wei
Published: 2024
Full Text: View/download PDF

59. Learn to Match with No Regret: Reinforcement Learning in Markov Matching Markets

Author: Min, Yifei, Wang, Tianhao, Xu, Ruitu, Wang, Zhaoran, Jordan, Michael I., and Yang, Zhuoran
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, Mathematics - Statistics Theory
Abstract: We study a Markov matching market involving a planner and a set of strategic agents on the two sides of the market. At each step, the agents are presented with a dynamical context, where the contexts determine the utilities. The planner controls the transition of the contexts to maximize the cumulative social welfare, while the agents aim to find a myopic stable matching at each step. Such a setting captures a range of applications including ridesharing platforms. We formalize the problem by proposing a reinforcement learning framework that integrates optimistic value iteration with maximum weight matching. The proposed algorithm addresses the coupled challenges of sequential exploration, matching stability, and function approximation. We prove that the algorithm achieves sublinear regret., Comment: 40 pages
Published: 2022

60. The Best of Both Worlds: Reinforcement Learning with Logarithmic Regret and Policy Switches

Author: Velegkas, Grigoris, Yang, Zhuoran, and Karbasi, Amin
Subjects: Computer Science - Machine Learning
Abstract: In this paper, we study the problem of regret minimization for episodic Reinforcement Learning (RL) both in the model-free and the model-based setting. We focus on learning with general function classes and general model classes, and we derive results that scale with the eluder dimension of these classes. In contrast to the existing body of work that mainly establishes instance-independent regret guarantees, we focus on the instance-dependent setting and show that the regret scales logarithmically with the horizon $T$, provided that there is a gap between the best and the second best action in every state. In addition, we show that such a logarithmic regret bound is realizable by algorithms with $O(\log T)$ switching cost (also known as adaptivity complexity). In other words, these algorithms rarely switch their policy during the course of their execution. Finally, we complement our results with lower bounds which show that even in the tabular setting, we cannot hope for regret guarantees lower than $o(\log T)$.
Published: 2022

61. Learning Dynamic Mechanisms in Unknown Environments: A Reinforcement Learning Approach

Author: Qiu, Shuang, Lyu, Boxiang, Meng, Qinglin, Wang, Zhaoran, Yang, Zhuoran, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Dynamic mechanism design studies how mechanism designers should allocate resources among agents in a time-varying environment. We consider the problem where the agents interact with the mechanism designer according to an unknown Markov Decision Process (MDP), where agent rewards and the mechanism designer's state evolve according to an episodic MDP with unknown reward functions and transition kernels. We focus on the online setting with linear function approximation and propose novel learning algorithms to recover the dynamic Vickrey-Clarke-Grove (VCG) mechanism over multiple rounds of interaction. A key contribution of our approach is incorporating reward-free online Reinforcement Learning (RL) to aid exploration over a rich policy space to estimate prices in the dynamic VCG mechanism. We show that the regret of our proposed method is upper bounded by $\tilde{\mathcal{O}}(T^{2/3})$ and further devise a lower bound to show that our algorithm is efficient, incurring the same $\tilde{\mathcal{O}}(T^{2 / 3})$ regret as the lower bound, where $T$ is the total number of rounds. Our work establishes the regret guarantee for online RL in solving dynamic mechanism design problems without prior knowledge of the underlying model., Comment: Minor Revision for JMLR. The first three authors contribute equally
Published: 2022

62. Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning

Author: Bai, Chenjia, Wang, Lingxiao, Yang, Zhuoran, Deng, Zhihong, Garg, Animesh, Liu, Peng, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning
Abstract: Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions. Previous methods tackle such problem by penalizing the Q-values of OOD actions or constraining the trained policy to be close to the behavior policy. Nevertheless, such methods typically prevent the generalization of value functions beyond the offline data and also lack precise characterization of OOD data. In this paper, we propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolating error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yields provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms., Comment: ICLR 2022
Published: 2022

63. Sequential Information Design: Markov Persuasion Process and Its Efficient Reinforcement Learning

Author: Wu, Jibang, Zhang, Zixuan, Feng, Zhe, Wang, Zhaoran, Yang, Zhuoran, Jordan, Michael I., and Xu, Haifeng
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, Computer Science - Machine Learning, Economics - Theoretical Economics
Abstract: In today's economy, it becomes important for Internet platforms to consider the sequential information design problem to align its long term interest with incentives of the gig service providers. This paper proposes a novel model of sequential information design, namely the Markov persuasion processes (MPPs), where a sender, with informational advantage, seeks to persuade a stream of myopic receivers to take actions that maximizes the sender's cumulative utilities in a finite horizon Markovian environment with varying prior and utility functions. Planning in MPPs thus faces the unique challenge in finding a signaling policy that is simultaneously persuasive to the myopic receivers and inducing the optimal long-term cumulative utilities of the sender. Nevertheless, in the population level where the model is known, it turns out that we can efficiently determine the optimal (resp. $\epsilon$-optimal) policy with finite (resp. infinite) states and outcomes, through a modified formulation of the Bellman equation. Our main technical contribution is to study the MPP under the online reinforcement learning (RL) setting, where the goal is to learn the optimal signaling policy by interacting with with the underlying MPP, without the knowledge of the sender's utility functions, prior distributions, and the Markov transition kernels. We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles. Our algorithm enjoys sample efficiency by achieving a sublinear $\sqrt{T}$-regret upper bound. Furthermore, both our algorithm and theory can be applied to MPPs with large space of outcomes and states via function approximation, and we showcase such a success under the linear setting.
Published: 2022

64. Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets

Author: Zhong, Han, Xiong, Wei, Tan, Jiyuan, Wang, Liwei, Zhang, Tong, Wang, Zhaoran, and Yang, Zhuoran
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning
Abstract: We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. When the dataset does not have uniform coverage over all policy pairs, finding an approximate NE involves challenges in three aspects: (i) distributional shift between the behavior policy and the optimal policy, (ii) function approximation to handle large state space, and (iii) minimax optimization for equilibrium solving. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI), which overcomes the distributional shift by constructing pessimistic estimates of the value functions for both players and outputs a policy pair by solving NEs based on the two value functions. Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which suggests that the data-dependent term in the upper bound is intrinsic. Our theoretical results also highlight a notion of "relative uncertainty", which characterizes the necessary and sufficient condition for achieving sample efficiency in offline MGs. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
Published: 2022

65. Joint Differentiable Optimization and Verification for Certified Reinforcement Learning

Author: Wang, Yixuan, Zhan, Simon, Wang, Zhilu, Huang, Chao, Wang, Zhaoran, Yang, Zhuoran, and Zhu, Qi
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control
Abstract: In model-based reinforcement learning for safety-critical control systems, it is important to formally certify system properties (e.g., safety, stability) under the learned controller. However, as existing methods typically apply formal verification \emph{after} the controller has been learned, it is sometimes difficult to obtain any certificate, even after many iterations between learning and verification. To address this challenge, we propose a framework that jointly conducts reinforcement learning and formal verification by formulating and solving a novel bilevel optimization problem, which is differentiable by the gradients from the value function and certificates. Experiments on a variety of examples demonstrate the significant advantages of our framework over the model-based stochastic value gradient (SVG) method and the model-free proximal policy optimization (PPO) method in finding feasible controllers with barrier functions and Lyapunov functions that ensure system safety and stability., Comment: This paper is accepted to International Conference on Cyber-Physical Systems
Published: 2022

66. Exponential Family Model-Based Reinforcement Learning via Score Matching

Author: Li, Gene, Li, Junbo, Kabra, Anmol, Srebro, Nathan, Wang, Zhaoran, and Yang, Zhuoran
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We propose an optimistic model-based algorithm, dubbed SMRL, for finite-horizon episodic reinforcement learning (RL) when the transition model is specified by exponential family distributions with $d$ parameters and the reward is bounded and known. SMRL uses score matching, an unnormalized density estimation technique that enables efficient estimation of the model parameter by ridge regression. Under standard regularity assumptions, SMRL achieves $\tilde O(d\sqrt{H^3T})$ online regret, where $H$ is the length of each episode and $T$ is the total number of interactions (ignoring polynomial dependence on structural scale parameters)., Comment: NeurIPS 2022
Published: 2021

67. Wasserstein Flow Meets Replicator Dynamics: A Mean-Field Analysis of Representation Learning in Actor-Critic

Author: Zhang, Yufeng, Chen, Siyu, Yang, Zhuoran, Jordan, Michael I., and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Actor-critic (AC) algorithms, empowered by neural networks, have had significant empirical success in recent years. However, most of the existing theoretical support for AC algorithms focuses on the case of linear function approximations, or linearized neural networks, where the feature representation is fixed throughout training. Such a limitation fails to capture the key aspect of representation learning in neural AC, which is pivotal in practical problems. In this work, we take a mean-field perspective on the evolution and convergence of feature-based neural AC. Specifically, we consider a version of AC where the actor and critic are represented by overparameterized two-layer neural networks and are updated with two-timescale learning rates. The critic is updated by temporal-difference (TD) learning with a larger stepsize while the actor is updated via proximal policy optimization (PPO) with a smaller stepsize. In the continuous-time and infinite-width limiting regime, when the timescales are properly separated, we prove that neural AC finds the globally optimal policy at a sublinear rate. Additionally, we prove that the feature representation induced by the critic network is allowed to evolve within a neighborhood of the initial one., Comment: 41 pages, accepted to NeurIPS 2021, add acknowledgement
Published: 2021

68. Can Reinforcement Learning Find Stackelberg-Nash Equilibria in General-Sum Markov Games with Myopic Followers?

Author: Zhong, Han, Yang, Zhuoran, Wang, Zhaoran, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning
Abstract: We study multi-player general-sum Markov games with one of the players designated as the leader and the other players regarded as followers. In particular, we focus on the class of games where the followers are myopic, i.e., they aim to maximize their instantaneous rewards. For such a game, our goal is to find a Stackelberg-Nash equilibrium (SNE), which is a policy pair $(\pi^*, \nu^*)$ such that (i) $\pi^*$ is the optimal policy for the leader when the followers always play their best response, and (ii) $\nu^*$ is the best response policy of the followers, which is a Nash equilibrium of the followers' game induced by $\pi^*$. We develop sample-efficient reinforcement learning (RL) algorithms for solving for an SNE in both online and offline settings. Our algorithms are optimistic and pessimistic variants of least-squares value iteration, and they are readily able to incorporate function approximation tools in the setting of large state spaces. Furthermore, for the case with linear function approximation, we prove that our algorithms achieve sublinear regret and suboptimality under online and offline setups respectively. To the best of our knowledge, we establish the first provably efficient RL algorithms for solving for SNEs in general-sum Markov games with myopic followers.
Published: 2021

69. ElegantRL-Podracer: Scalable and Elastic Library for Cloud-Native Deep Reinforcement Learning

Author: Liu, Xiao-Yang, Li, Zechu, Yang, Zhuoran, Zheng, Jiahao, Wang, Zhaoran, Walid, Anwar, Guo, Jian, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Deep reinforcement learning (DRL) has revolutionized learning and actuation in applications such as game playing and robotic control. The cost of data collection, i.e., generating transitions from agent-environment interactions, remains a major challenge for wider DRL adoption in complex real-world problems. Following a cloud-native paradigm to train DRL agents on a GPU cloud platform is a promising solution. In this paper, we present a scalable and elastic library ElegantRL-podracer for cloud-native deep reinforcement learning, which efficiently supports millions of GPU cores to carry out massively parallel training at multiple levels. At a high-level, ElegantRL-podracer employs a tournament-based ensemble scheme to orchestrate the training process on hundreds or even thousands of GPUs, scheduling the interactions between a leaderboard and a training pool with hundreds of pods. At a low-level, each pod simulates agent-environment interactions in parallel by fully utilizing nearly 7,000 GPU CUDA cores in a single GPU. Our ElegantRL-podracer library features high scalability, elasticity and accessibility by following the development principles of containerization, microservices and MLOps. Using an NVIDIA DGX SuperPOD cloud, we conduct extensive experiments on various tasks in locomotion and stock trading and show that ElegantRL-podracer substantially outperforms RLlib. Our codes are available on GitHub., Comment: 9 pages, 7 figures
Published: 2021

70. Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning

Author: Fei, Yingjie, Yang, Zhuoran, Chen, Yudong, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We study risk-sensitive reinforcement learning (RL) based on the entropic risk measure. Although existing works have established non-asymptotic regret guarantees for this problem, they leave open an exponential gap between the upper and lower bounds. We identify the deficiencies in existing algorithms and their analysis that result in such a gap. To remedy these deficiencies, we investigate a simple transformation of the risk-sensitive Bellman equations, which we call the exponential Bellman equation. The exponential Bellman equation inspires us to develop a novel analysis of Bellman backup procedures in risk-sensitive RL algorithms, and further motivates the design of a novel exploration mechanism. We show that these analytic and algorithmic innovations together lead to improved regret upper bounds over existing ones.
Published: 2021

71. False Correlation Reduction for Offline Reinforcement Learning

Author: Deng, Zhihong, Fu, Zuyue, Wang, Lingxiao, Yang, Zhuoran, Bai, Chenjia, Zhou, Tianyi, Wang, Zhaoran, and Jiang, Jing
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Offline reinforcement learning (RL) harnesses the power of massive datasets for resolving sequential decision problems. Most existing papers only discuss defending against out-of-distribution (OOD) actions while we investigate a broader issue, the false correlations between epistemic uncertainty and decision-making, an essential factor that causes suboptimality. In this paper, we propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm. We empirically show that SCORE achieves the SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL). The proposed algorithm introduces an annealing behavior cloning regularizer to help produce a high-quality estimation of uncertainty which is critical for eliminating false correlations from suboptimality. Theoretically, we justify the rationality of the proposed method and prove its convergence to the optimal policy with a sublinear rate under mild assumptions., Comment: 16 pages, 14 figures
Published: 2021

72. On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game

Author: Qiu, Shuang, Ye, Jieping, Wang, Zhaoran, and Yang, Zhuoran
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Statistics - Machine Learning
Abstract: To achieve sample efficiency in reinforcement learning (RL), it necessitates efficiently exploring the underlying environment. Under the offline setting, addressing the exploration challenge lies in collecting an offline dataset with sufficient coverage. Motivated by such a challenge, we study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. Then, given any extrinsic reward, the agent computes the policy via a planning algorithm with offline data collected in the exploration phase. Moreover, we tackle this problem under the context of function approximation, leveraging powerful function approximators. Specifically, we propose to explore via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximations, where we adopt the associated exploration bonus as the exploration reward. Moreover, we design exploration and planning algorithms for both single-agent MDPs and zero-sum Markov games and prove that our methods can achieve $\widetilde{\mathcal{O}}(1 /\varepsilon^2)$ sample complexity for generating a $\varepsilon$-suboptimal policy or $\varepsilon$-approximate Nash equilibrium when given an arbitrary extrinsic reward. To the best of our knowledge, we establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators., Comment: ICML 2021
Published: 2021

73. Optimistic Policy Optimization is Provably Efficient in Non-stationary MDPs

Author: Zhong, Han, Yang, Zhuoran, Wang, Zhaoran, and Szepesvári, Csaba
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We study episodic reinforcement learning (RL) in non-stationary linear kernel Markov decision processes (MDPs). In this setting, both the reward function and the transition kernel are linear with respect to the given feature maps and are allowed to vary over time, as long as their respective parameter variations do not exceed certain variation budgets. We propose the $\underline{\text{p}}$eriodically $\underline{\text{r}}$estarted $\underline{\text{o}}$ptimistic $\underline{\text{p}}$olicy $\underline{\text{o}}$ptimization algorithm (PROPO), which is an optimistic policy optimization algorithm with linear function approximation. PROPO features two mechanisms: sliding-window-based policy evaluation and periodic-restart-based policy improvement, which are tailored for policy optimization in a non-stationary environment. In addition, only utilizing the technique of sliding window, we propose a value-iteration algorithm. We establish dynamic upper bounds for the proposed methods and a matching minimax lower bound which shows the (near-) optimality of the proposed methods. To our best knowledge, PROPO is the first provably efficient policy optimization algorithm that handles non-stationarity.
Published: 2021

74. Inducing Equilibria via Incentives: Simultaneous Design-and-Play Ensures Global Convergence

Author: Liu, Boyi, Li, Jiayang, Yang, Zhuoran, Wai, Hoi-To, Hong, Mingyi, Nie, Yu Marco, and Wang, Zhaoran
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Machine Learning
Abstract: To regulate a social system comprised of self-interested agents, economic incentives are often required to induce a desirable outcome. This incentive design problem naturally possesses a bilevel structure, in which a designer modifies the rewards of the agents with incentives while anticipating the response of the agents, who play a non-cooperative game that converges to an equilibrium. The existing bilevel optimization algorithms raise a dilemma when applied to this problem: anticipating how incentives affect the agents at equilibrium requires solving the equilibrium problem repeatedly, which is computationally inefficient; bypassing the time-consuming step of equilibrium-finding can reduce the computational cost, but may lead the designer to a sub-optimal solution. To address such a dilemma, we propose a method that tackles the designer's and agents' problems simultaneously in a single loop. Specifically, at each iteration, both the designer and the agents only move one step. Nevertheless, we allow the designer to gradually learn the overall influence of the incentives on the agents, which guarantees optimality after convergence. The convergence rate of the proposed scheme is also established for a broad class of games., Comment: Accepted by the Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS 2022)
Published: 2021

75. Group VIII (Fe, Co, and Ni) based water evaporators for solar driven interfacial water evaporation: Performance optimization, applications and recent advances

Author: Yang, Zhuoran, Li, Wenting, Liu, Menghan, Meng, Yao, Yang, Chunyu, Qu, Fengyu, and Guo, Wei
Published: 2024
Full Text: View/download PDF

76. Fatigue failure of soft adhesive systems: A state-of-the-art review

Author: Yao, Chengbin, Xia, Yan, Yang, Zhuoran, Zhu, Zhongmeng, Li, Zheyu, and Jiang, Han
Published: 2024
Full Text: View/download PDF

77. Fractional cyclic cohesive zone model for time-dependent fatigue behavior of soft adhesives under mode-II loading

Author: Yang, Zhuoran, Xia, Yan, Yao, Chengbin, Zhu, Zhongmeng, Liu, Zhanli, Jiang, Han, and Wang, Kaijie
Published: 2024
Full Text: View/download PDF

78. Provably Efficient Generative Adversarial Imitation Learning for Online and Offline Setting with Linear Function Approximation

Author: Liu, Zhihan, Zhang, Yufeng, Fu, Zuyue, Yang, Zhuoran, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: In generative adversarial imitation learning (GAIL), the agent aims to learn a policy from an expert demonstration so that its performance cannot be discriminated from the expert policy on a certain predefined reward set. In this paper, we study GAIL in both online and offline settings with linear function approximation, where both the transition and reward function are linear in the feature maps. Besides the expert demonstration, in the online setting the agent can interact with the environment, while in the offline setting the agent only accesses an additional dataset collected by a prior. For online GAIL, we propose an optimistic generative adversarial policy optimization algorithm (OGAP) and prove that OGAP achieves $\widetilde{\mathcal{O}}(H^2 d^{3/2}K^{1/2}+KH^{3/2}dN_1^{-1/2})$ regret. Here $N_1$ represents the number of trajectories of the expert demonstration, $d$ is the feature dimension, and $K$ is the number of episodes. For offline GAIL, we propose a pessimistic generative adversarial policy optimization algorithm (PGAP). For an arbitrary additional dataset, we obtain the optimality gap of PGAP, achieving the minimax lower bound in the utilization of the additional dataset. Assuming sufficient coverage on the additional dataset, we show that PGAP achieves $\widetilde{\mathcal{O}}(H^{2}dK^{-1/2} +H^2d^{3/2}N_2^{-1/2}+H^{3/2}dN_1^{-1/2} \ )$ optimality gap. Here $N_2$ represents the number of trajectories of the additional dataset with sufficient coverage., Comment: 54 pages, in submission
Published: 2021

79. Online Bootstrap Inference For Policy Evaluation in Reinforcement Learning

Author: Ramprasad, Pratik, Li, Yuantong, Yang, Zhuoran, Wang, Zhaoran, Sun, Will Wei, and Cheng, Guang
Subjects: Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Mathematics - Statistics Theory
Abstract: The recent emergence of reinforcement learning has created a demand for robust statistical inference methods for the parameter estimates computed using these algorithms. Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations, while existing statistical inference methods in reinforcement learning (RL) are limited to the batch setting. The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise, such as RL, has yet to be explored. In this paper, we study the use of the online bootstrap method for statistical inference in RL. In particular, we focus on the temporal difference (TD) learning and Gradient TD (GTD) learning algorithms, which are themselves special instances of linear stochastic approximation under Markov noise. The method is shown to be distributionally consistent for statistical inference in policy evaluation, and numerical experiments are included to demonstrate the effectiveness of this algorithm at statistical inference tasks across a range of real RL environments., Comment: To Appear in Journal of the American Statistical Association
Published: 2021

80. Towards General Function Approximation in Zero-Sum Markov Games

Author: Huang, Baihe, Lee, Jason D., Wang, Zhaoran, and Yang, Zhuoran
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves. The study focuses on the challenging settings where the value function or the model is parameterized by general function classes. Provably efficient algorithms for both decoupled and {coordinated} settings are developed. In the {decoupled} setting where the agent controls a single player and plays against an arbitrary opponent, we propose a new model-free algorithm. The sample complexity is governed by the Minimax Eluder dimension -- a new dimension of the function class in Markov games. As a special case, this method improves the state-of-the-art algorithm by a $\sqrt{d}$ factor in the regret when the reward function and transition kernel are parameterized with $d$-dimensional linear features. In the {coordinated} setting where both players are controlled by the agent, we propose a model-based algorithm and a model-free algorithm. In the model-based algorithm, we prove that sample complexity can be bounded by a generalization of Witness rank to Markov games. The model-free algorithm enjoys a $\sqrt{K}$-regret upper bound where $K$ is the number of episodes.
Published: 2021

81. A Unified Off-Policy Evaluation Approach for General Value Function

Author: Xu, Tengyu, Yang, Zhuoran, Wang, Zhaoran, and Liang, Yingbin
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Statistics Theory
Abstract: General Value Function (GVF) is a powerful tool to represent both the {\em predictive} and {\em retrospective} knowledge in reinforcement learning (RL). In practice, often multiple interrelated GVFs need to be evaluated jointly with pre-collected off-policy samples. In the literature, the gradient temporal difference (GTD) learning method has been adopted to evaluate GVFs in the off-policy setting, but such an approach may suffer from a large estimation error even if the function approximation class is sufficiently expressive. Moreover, none of the previous work have formally established the convergence guarantee to the ground truth GVFs under the function approximation settings. In this paper, we address both issues through the lens of a class of GVFs with causal filtering, which cover a wide range of RL applications such as reward variance, value gradient, cost in anomaly detection, stationary distribution gradient, etc. We propose a new algorithm called GenTD for off-policy GVFs evaluation and show that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function. We further show that unlike GTD, the learned GVFs by GenTD are guaranteed to converge to the ground truth GVFs as long as the function approximation power is sufficiently large. To our best knowledge, GenTD is the first off-policy GVF evaluation algorithm that has global optimality guarantee., Comment: submitted for publication
Published: 2021

82. Gap-Dependent Bounds for Two-Player Markov Games

Author: Dou, Zehao, Yang, Zhuoran, Wang, Zhaoran, and Du, Simon S.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: As one of the most popular methods in the field of reinforcement learning, Q-learning has received increasing attention. Recently, there have been more theoretical works on the regret bound of algorithms that belong to the Q-learning class in different settings. In this paper, we analyze the cumulative regret when conducting Nash Q-learning algorithm on 2-player turn-based stochastic Markov games (2-TBSG), and propose the very first gap dependent logarithmic upper bounds in the episodic tabular setting. This bound matches the theoretical lower bound only up to a logarithmic term. Furthermore, we extend the conclusion to the discounted game setting with infinite horizon and propose a similar gap dependent logarithmic regret bound. Also, under the linear MDP assumption, we obtain another logarithmic regret for 2-TBSG, in both centralized and independent settings., Comment: 34 pages
Published: 2021

83. A unified framework for assessing interaction effects among environmental exposures in epidemiologic studies: A case study on temperature, air pollution, and kidney-related conditions in New York state

Author: Chu, Lingzhi, Chen, Kai, Yang, Zhuoran, Crowley, Susan, and Dubrow, Robert
Published: 2024
Full Text: View/download PDF

84. Revealing the shear fatigue damage mechanism of soft adhesive: Coexistence of viscoelastic and damage dissipation

Author: Xia, Yan, Jiang, Han, Yao, Chengbin, Yang, Zhuoran, Zhu, Zhongmeng, and Kang, Guozheng
Published: 2024
Full Text: View/download PDF

85. Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Author: Ishfaq, Haque, Cui, Qiwen, Nguyen, Viet, Ayoub, Alex, Yang, Zhuoran, Wang, Zhaoran, Precup, Doina, and Yang, Lin F.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class $\mathcal{F}$, our algorithm achieves a worst-case regret bound of $\widetilde{O}(\mathrm{poly}(d_EH)\sqrt{T})$ where $T$ is the time elapsed, $H$ is the planning horizon and $d_E$ is the $\textit{eluder dimension}$ of $\mathcal{F}$. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an $\widetilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret. We complement the theory with an empirical evaluation across known difficult exploration tasks., Comment: 32 page, 5 figures, in Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021
Published: 2021

86. Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality

Author: Xu, Tengyu, Yang, Zhuoran, Wang, Zhaoran, and Liang, Yingbin
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Designing off-policy reinforcement learning algorithms is typically a very challenging task, because a desirable iteration update often involves an expectation over an on-policy distribution. Prior off-policy actor-critic (AC) algorithms have introduced a new critic that uses the density ratio for adjusting the distribution mismatch in order to stabilize the convergence, but at the cost of potentially introducing high biases due to the estimation errors of both the density ratio and value function. In this paper, we develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP, which can take advantage of learned nuisance functions to reduce estimation errors. Moreover, DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize, and is thus more sample efficient than prior algorithms that adopt either two timescale or nested-loop structure. We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy. We also show that the overall convergence of DR-Off-PAC is doubly robust to the approximation errors that depend only on the expressive power of approximation functions. To the best of our knowledge, our study establishes the first overall sample complexity analysis for a single time-scale off-policy AC algorithm., Comment: Published in ICML 2021
Published: 2021

87. Instrumental Variable Value Iteration for Causal Offline Reinforcement Learning

Author: Liao, Luofeng, Fu, Zuyue, Yang, Zhuoran, Wang, Yixin, Kolar, Mladen, and Wang, Zhaoran
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: In offline reinforcement learning (RL) an optimal policy is learned solely from a priori collected observational data. However, in observational data, actions are often confounded by unobserved variables. Instrumental variables (IVs), in the context of RL, are the variables whose influence on the state variables is all mediated by the action. When a valid instrument is present, we can recover the confounded transition dynamics through observational data. We study a confounded Markov decision process where the transition dynamics admit an additive nonlinear functional form. Using IVs, we derive a conditional moment restriction through which we can identify transition dynamics based on observational data. We propose a provably efficient IV-aided Value Iteration (IVVI) algorithm based on a primal-dual reformulation of the conditional moment restriction. To our knowledge, this is the first provably efficient algorithm for instrument-aided offline RL., Comment: under review
Published: 2021

88. A Near-Optimal Algorithm for Stochastic Bilevel Optimization via Double-Momentum

Author: Khanduri, Prashant, Zeng, Siliang, Hong, Mingyi, Wai, Hoi-To, Wang, Zhaoran, and Yang, Zhuoran
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: This paper proposes a new algorithm -- the \underline{S}ingle-timescale Do\underline{u}ble-momentum \underline{St}ochastic \underline{A}pprox\underline{i}matio\underline{n} (SUSTAIN) -- for tackling stochastic unconstrained bilevel optimization problems. We focus on bilevel problems where the lower level subproblem is strongly-convex and the upper level objective function is smooth. Unlike prior works which rely on \emph{two-timescale} or \emph{double loop} techniques, we design a stochastic momentum-assisted gradient estimator for both the upper and lower level updates. The latter allows us to control the error in the stochastic gradient updates due to inaccurate solution to both subproblems. If the upper objective function is smooth but possibly non-convex, we show that {\aname}~requires $\mathcal{O}(\epsilon^{-3/2})$ iterations (each using ${\cal O}(1)$ samples) to find an $\epsilon$-stationary solution. The $\epsilon$-stationary solution is defined as the point whose squared norm of the gradient of the outer function is less than or equal to $\epsilon$. The total number of stochastic gradient samples required for the upper and lower level objective functions matches the best-known complexity for single-level stochastic gradient algorithms. We also analyze the case when the upper level objective function is strongly-convex., Comment: 36 Pages, 10 Figures
Published: 2021

89. Is Pessimism Provably Efficient for Offline RL?

Author: Jin, Ying, Yang, Zhuoran, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Optimization and Control, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximators. Without assuming the sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the "best effort" among all policies, as no other policies can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the "irrelevant" trajectories that are less covered by the dataset and not informative for the optimal policy., Comment: This version adds results on RKHS, and a data-splitting algorithm
Published: 2020

90. Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy

Author: Zhong, Han, Deng, Xun, Fang, Ethan X., Yang, Zhuoran, Wang, Zhaoran, and Li, Runze
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: While deep reinforcement learning has achieved tremendous successes in various applications, most existing works only focus on maximizing the expected value of total return and thus ignore its inherent stochasticity. Such stochasticity is also known as the aleatoric uncertainty and is closely related to the notion of risk. In this work, we make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria. In particular, we focus on a variance-constrained policy optimization problem where the goal is to find a policy that maximizes the expected value of the long-run average reward, subject to a constraint that the long-run variance of the average reward is upper bounded by a threshold. Utilizing Lagrangian and Fenchel dualities, we transform the original problem into an unconstrained saddle-point policy optimization problem, and propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable. When both the value and policy functions are represented by multi-layer overparameterized neural networks, we prove that our actor-critic algorithm generates a sequence of policies that finds a globally optimal policy at a sublinear rate. Further, We provide numerical studies of the proposed method using two real datasets to back up the theoretical results., Comment: 59 pages
Published: 2020

91. Variational Transport: A Convergent Particle-BasedAlgorithm for Distributional Optimization

Author: Yang, Zhuoran, Zhang, Yufeng, Chen, Yongxin, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: We consider the optimization problem of minimizing a functional defined over a family of probability distributions, where the objective functional is assumed to possess a variational form. Such a distributional optimization problem arises widely in machine learning and statistics, with Monte-Carlo sampling, variational inference, policy optimization, and generative adversarial network as examples. For this problem, we propose a novel particle-based algorithm, dubbed as variational transport, which approximately performs Wasserstein gradient descent over the manifold of probability distributions via iteratively pushing a set of particles. Specifically, we prove that moving along the geodesic in the direction of functional gradient with respect to the second-order Wasserstein distance is equivalent to applying a pushforward mapping to a probability distribution, which can be approximated accurately by pushing a set of particles. Specifically, in each iteration of variational transport, we first solve the variational problem associated with the objective functional using the particles, whose solution yields the Wasserstein gradient direction. Then we update the current distribution by pushing each particle along the direction specified by such a solution. By characterizing both the statistical error incurred in estimating the Wasserstein gradient and the progress of the optimization algorithm, we prove that when the objective function satisfies a functional version of the Polyak-\L{}ojasiewicz (PL) (Polyak, 1963) and smoothness conditions, variational transport converges linearly to the global minimum of the objective functional up to a certain statistical error, which decays to zero sublinearly as the number of particles goes to infinity., Comment: 58 pages, add acknowledgement
Published: 2020

92. On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces

Author: Yang, Zhuoran, Jin, Chi, Wang, Zhaoran, Wang, Mengdi, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Optimization and Control, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: The classical theory of reinforcement learning (RL) has focused on tabular and linear representations of value functions. Further progress hinges on combining RL with modern function approximators such as kernel functions and deep neural networks, and indeed there have been many empirical successes that have exploited such combinations in large-scale applications. There are profound challenges, however, in developing a theory to support this enterprise, most notably the need to take into consideration the exploration-exploitation tradeoff at the core of RL in conjunction with the computational and statistical tradeoffs that arise in modern function-approximation-based learning systems. We approach these challenges by studying an optimistic modification of the least-squares value iteration algorithm, in the context of the action-value function represented by a kernel function or an overparameterized neural network. We establish both polynomial runtime complexity and polynomial sample complexity for this algorithm, without additional assumptions on the data-generating model. In particular, we prove that the algorithm incurs an $\tilde{\mathcal{O}}(\delta_{\mathcal{F}} H^2 \sqrt{T})$ regret, where $\delta_{\mathcal{F}}$ characterizes the intrinsic complexity of the function class $\mathcal{F}$, $H$ is the length of each episode, and $T$ is the total number of episodes. Our regret bounds are independent of the number of states, a result which exhibits clearly the benefit of function approximation in RL., Comment: 76 pages. The short version of this work appears in NeurIPS 2020
Published: 2020

93. Provable Fictitious Play for General Mean-Field Games

Author: Xie, Qiaomin, Yang, Zhuoran, Wang, Zhaoran, and Minca, Andreea
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory
Abstract: We propose a reinforcement learning algorithm for stationary mean-field games, where the goal is to learn a pair of mean-field state and stationary policy that constitutes the Nash equilibrium. When viewing the mean-field state and the policy as two players, we propose a fictitious play algorithm which alternatively updates the mean-field state and the policy via gradient-descent and proximal policy optimization, respectively. Our algorithm is in stark contrast with previous literature which solves each single-agent reinforcement learning problem induced by the iterates mean-field states to the optimum. Furthermore, we prove that our fictitious play algorithm converges to the Nash equilibrium at a sublinear rate. To the best of our knowledge, this seems the first provably convergent single-loop reinforcement learning algorithm for mean-field games based on iterative updates of both mean-field state and policy.
Published: 2020

94. Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning

Author: Qiu, Shuang, Yang, Zhuoran, Wei, Xiaohan, Ye, Jieping, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Temporal-Difference (TD) learning with nonlinear smooth function approximation for policy evaluation has achieved great success in modern reinforcement learning. It is shown that such a problem can be reformulated as a stochastic nonconvex-strongly-concave optimization problem, which is challenging as naive stochastic gradient descent-ascent algorithm suffers from slow convergence. Existing approaches for this problem are based on two-timescale or double-loop stochastic gradient algorithms, which may also require sampling large-batch data. However, in practice, a single-timescale single-loop stochastic algorithm is preferred due to its simplicity and also because its step-size is easier to tune. In this paper, we propose two single-timescale single-loop algorithms which require only one data point each step. Our first algorithm implements momentum updates on both primal and dual variables achieving an $O(\varepsilon^{-4})$ sample complexity, which shows the important role of momentum in obtaining a single-timescale algorithm. Our second algorithm improves upon the first one by applying variance reduction on top of momentum, which matches the best known $O(\varepsilon^{-3})$ sample complexity in existing works. Furthermore, our variance-reduction algorithm does not require a large-batch checkpoint. Moreover, our theoretical results for both algorithms are expressed in a tighter form of simultaneous primal and dual side convergence., Comment: 45 pages; initial draft submitted in Feb, 2020
Published: 2020

95. Global Convergence of Policy Gradient for Linear-Quadratic Mean-Field Control/Game in Continuous Time

Author: Wang, Weichen, Han, Jiequn, Yang, Zhuoran, and Wang, Zhaoran
Subjects: Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Reinforcement learning is a powerful tool to learn the optimal policy of possibly multiple agents by interacting with the environment. As the number of agents grow to be very large, the system can be approximated by a mean-field problem. Therefore, it has motivated new research directions for mean-field control (MFC) and mean-field game (MFG). In this paper, we study the policy gradient method for the linear-quadratic mean-field control and game, where we assume each agent has identical linear state transitions and quadratic cost functions. While most of the recent works on policy gradient for MFC and MFG are based on discrete-time models, we focus on the continuous-time models where some analyzing techniques can be interesting to the readers. For both MFC and MFG, we provide policy gradient update and show that it converges to the optimal solution at a linear rate, which is verified by a synthetic simulation. For MFG, we also provide sufficient conditions for the existence and uniqueness of the Nash equilibrium., Comment: 28 pages, 3 figures
Published: 2020

96. Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy

Author: Fu, Zuyue, Yang, Zhuoran, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms. While most existing works on actor-critic employ bi-level or two-timescale updates, we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously. Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once while the actor is updated in the policy gradient direction computed using the critic. Moreover, we consider two function approximation settings where both the actor and critic are represented by linear or deep neural networks. For both cases, we prove that the actor sequence converges to a globally optimal policy at a sublinear $O(K^{-1/2})$ rate, where $K$ is the number of iterations. To the best of our knowledge, we establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time. Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove that actor-critic with deep neural network finds the globally optimal policy at a sublinear rate for the first time.
Published: 2020

97. Understanding Implicit Regularization in Over-Parameterized Single Index Model

Author: Fan, Jianqing, Yang, Zhuoran, and Yu, Mengxin
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Methodology
Abstract: In this paper, we leverage over-parameterization to design regularization-free algorithms for the high-dimensional single index model and provide theoretical guarantees for the induced implicit regularization phenomenon. Specifically, we study both vector and matrix single index models where the link function is nonlinear and unknown, the signal parameter is either a sparse vector or a low-rank symmetric matrix, and the response variable can be heavy-tailed. To gain a better understanding of the role played by implicit regularization without excess technicality, we assume that the distribution of the covariates is known a priori. For both the vector and matrix settings, we construct an over-parameterized least-squares loss function by employing the score function transform and a robust truncation step designed specifically for heavy-tailed data. We propose to estimate the true parameter by applying regularization-free gradient descent to the loss function. When the initialization is close to the origin and the stepsize is sufficiently small, we prove that the obtained solution achieves minimax optimal statistical rates of convergence in both the vector and matrix cases. In addition, our experimental results support our theoretical findings and also demonstrate that our methods empirically outperform classical methods with explicit regularization in terms of both $\ell_2$-statistical rate and variable selection consistency., Comment: major revision
Published: 2020

98. A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic

Author: Hong, Mingyi, Wai, Hoi-To, Wang, Zhaoran, and Yang, Zhuoran
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning
Abstract: This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibit a two-level structure, and its goal is to minimize an outer objective function with variables which are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates for the TTSA algorithm under various settings: when the outer problem is strongly convex (resp.~weakly convex), the TTSA algorithm finds an $\mathcal{O}(K^{-2/3})$-optimal (resp.~$\mathcal{O}(K^{-2/5})$-stationary) solution, where $K$ is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward compared to a global optimal policy., Comment: Minor revision
Published: 2020

99. Provably Efficient Neural Estimation of Structural Equation Model: An Adversarial Approach

Author: Liao, Luofeng, Chen, You-Lin, Yang, Zhuoran, Dai, Bo, Wang, Zhaoran, and Kolar, Mladen
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Structural equation models (SEMs) are widely used in sciences, ranging from economics to psychology, to uncover causal relationships underlying a complex system under consideration and estimate structural parameters of interest. We study estimation in a class of generalized SEMs where the object of interest is defined as the solution to a linear operator equation. We formulate the linear operator equation as a min-max game, where both players are parameterized by neural networks (NNs), and learn the parameters of these neural networks using the stochastic gradient descent. We consider both 2-layer and multi-layer NNs with ReLU activation functions and prove global convergence in an overparametrized regime, where the number of neurons is diverging. The results are established using techniques from online learning and local linearization of NNs, and improve in several aspects the current state-of-the-art. For the first time we provide a tractable estimation procedure for SEMs based on NNs with provable convergence and without the need for sample splitting., Comment: - v1: Submitted to NeurIPS 2020. Under review - v2: Revised after NeurIPS reviews. Major updates: (i) clean presentation of consistency results; (ii) more references for conditional moment problems - v3: Add references
Published: 2020

100. Dynamic Regret of Policy Optimization in Non-stationary Environments

Author: Fei, Yingjie, Yang, Zhuoran, Wang, Zhaoran, and Xie, Qiaomin
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We consider reinforcement learning (RL) in episodic MDPs with adversarial full-information reward feedback and unknown fixed transition kernels. We propose two model-free policy optimization algorithms, POWER and POWER++, and establish guarantees for their dynamic regret. Compared with the classical notion of static regret, dynamic regret is a stronger notion as it explicitly accounts for the non-stationarity of environments. The dynamic regret attained by the proposed algorithms interpolates between different regimes of non-stationarity, and moreover satisfies a notion of adaptive (near-)optimality, in the sense that it matches the (near-)optimal static regret under slow-changing environments. The dynamic regret bound features two components, one arising from exploration, which deals with the uncertainty of transition kernels, and the other arising from adaptation, which deals with non-stationary environments. Specifically, we show that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction. To the best of our knowledge, our work is the first dynamic regret analysis of model-free RL algorithms in non-stationary environments.
Published: 2020

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Region

Database

Publisher

647 results on '"Yang, Zhuoran"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources