Author: "Zhang, Kaiqing" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhang, Kaiqing"' showing total 349 results

Start Over Author "Zhang, Kaiqing"

349 results on '"Zhang, Kaiqing"'

1. Last-Iterate Convergence of Payoff-Based Independent Learning in Zero-Sum Stochastic Games

Author: Chen, Zaiwei, Zhang, Kaiqing, Mazumdar, Eric, Ozdaglar, Asuman, and Wierman, Adam
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory
Abstract: In this paper, we consider two-player zero-sum matrix and stochastic games and develop learning dynamics that are payoff-based, convergent, rational, and symmetric between the two players. Specifically, the learning dynamics for matrix games are based on the smoothed best-response dynamics, while the learning dynamics for stochastic games build upon those for matrix games, with additional incorporation of the minimax value iteration. To our knowledge, our theoretical results present the first finite-sample analysis of such learning dynamics with last-iterate guarantees. In the matrix game setting, the results imply a sample complexity of $O(\epsilon^{-1})$ to find the Nash distribution and a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. In the stochastic game setting, the results also imply a sample complexity of $O(\epsilon^{-8})$ to find a Nash equilibrium. To establish these results, the main challenge is to handle stochastic approximation algorithms with multiple sets of coupled and stochastic iterates that evolve on (possibly) different time scales. To overcome this challenge, we developed a coupled Lyapunov-based approach, which may be of independent interest to the broader community studying the convergence behavior of stochastic approximation algorithms., Comment: A preliminary version [arXiv:2303.03100] of this paper, with a subset of the results that are presented here, was presented at NeurIPS 2023
Published: 2024

2. RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation

Author: Park, Chanwoo, Liu, Mingyang, Kong, Dingwen, Zhang, Kaiqing, and Ozdaglar, Asuman
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values, with remarkable successes in fine-tuning large-language models recently. Most existing RLHF paradigms make the underlying assumption that human preferences are relatively homogeneous, and can be encoded by a single reward model. In this paper, we focus on addressing the issues due to the inherent heterogeneity in human preferences, as well as their potential strategic behavior in providing feedback. Specifically, we propose two frameworks to address heterogeneous human feedback in principled ways: personalization-based one and aggregation-based one. For the former, we propose two approaches based on representation learning and clustering, respectively, for learning multiple reward models that trades off the bias (due to preference heterogeneity) and variance (due to the use of fewer data for learning each model by personalization). We then establish sample complexity guarantees for both approaches. For the latter, we aim to adhere to the single-model framework, as already deployed in the current RLHF paradigm, by carefully aggregating diverse and truthful preferences from humans. We propose two approaches based on reward and preference aggregation, respectively: the former utilizes both utilitarianism and Leximin approaches to aggregate individual reward models, with sample complexity guarantees; the latter directly aggregates the human feedback in the form of probabilistic opinions. Under the probabilistic-opinion-feedback model, we also develop an approach to handle strategic human labelers who may bias and manipulate the aggregated preferences with untruthful feedback. Based on the ideas in mechanism design, our approach ensures truthful preference reporting, with the induced aggregation rule maximizing social welfare functions., Comment: Added experiments
Published: 2024

3. Do LLM Agents Have Regret? A Case Study in Online Learning and Games

Author: Park, Chanwoo, Liu, Xiangyu, Ozdaglar, Asuman, and Zhang, Kaiqing
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory
Abstract: Large language models (LLMs) have been increasingly employed for (interactive) decision-making, via the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting when they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings in online learning and game theory, through the performance metric of \emph{regret}. We first empirically study the {no-regret} behaviors of LLMs in canonical (non-stationary) online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To promote the no-regret behaviors, we propose a novel \emph{unsupervised} training loss of \emph{regret-loss}, which, in contrast to the supervised pre-training loss, does not require the labels of (optimal) actions. We then establish the statistical guarantee of generalization bound for regret-loss minimization, followed by the optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above ``regrettable'' cases., Comment: Added experimental results for open-source models
Published: 2024

4. Orthorhombic metal carbide-borides MeC$_2$B$_{12}$ (Me=Mg, Ca, Sr) from first principles: structure, stability and mechanical properties

Author: Bystrenko, Oleksiy, Zhang, Jingxian, Sun, Tianxing, Ruan, Hu, Duan, Yusen, Zhang, Kaiqing, and Li, Xiaoguang
Subjects: Condensed Matter - Materials Science
Abstract: First principle DFT simulations are employed to study structural and mechanical properties of orthorhombic B12-based metal carbide-borides. The simulations predict the existence of Ca- and Sr- based phases with the structure similar to that of experimentally observed earlier compound MeC$_2$B$_{12}$. Dynamical stability of both phases is demonstrated, and the phase MeC$_2$B$_{12}$ is found to be thermodynamically stable. According to simulations, Ca- and Sr- based phases have significantly enhanced mechanical characteristics, which suggest their potential application as superhard materials. Calculated shear and Young moduli of these phases are nearly 250 and 540 GPa, respectively, and estimated Vickers hardness is 45-55 GPa., Comment: 11 pages, 4 figures, 3 tables
Published: 2024

5. Two-Timescale Q-Learning with Function Approximation in Zero-Sum Stochastic Games

Author: Chen, Zaiwei, Zhang, Kaiqing, Mazumdar, Eric, Ozdaglar, Asuman, and Wierman, Adam
Subjects: Computer Science - Machine Learning, Computer Science - Multiagent Systems
Abstract: We consider two-player zero-sum stochastic games and propose a two-timescale $Q$-learning algorithm with function approximation that is payoff-based, convergent, rational, and symmetric between the two players. In two-timescale $Q$-learning, the fast-timescale iterates are updated in spirit to the stochastic gradient descent and the slow-timescale iterates (which we use to compute the policies) are updated by taking a convex combination between its previous iterate and the latest fast-timescale iterate. Introducing the slow timescale as well as its update equation marks as our main algorithmic novelty. In the special case of linear function approximation, we establish, to the best of our knowledge, the first last-iterate finite-sample bound for payoff-based independent learning dynamics of these types. The result implies a polynomial sample complexity to find a Nash equilibrium in such stochastic games. To establish the results, we model our proposed algorithm as a two-timescale stochastic approximation and derive the finite-sample bound through a Lyapunov-based approach. The key novelty lies in constructing a valid Lyapunov function to capture the evolution of the slow-timescale iterates. Specifically, through a change of variable, we show that the update equation of the slow-timescale iterates resembles the classical smoothed best-response dynamics, where the regularized Nash gap serves as a valid Lyapunov function. This insight enables us to construct a valid Lyapunov function via a generalized variant of the Moreau envelope of the regularized Nash gap. The construction of our Lyapunov function might be of broad independent interest in studying the behavior of stochastic approximation algorithms.
Published: 2023

6. Robot Fleet Learning via Policy Merging

Author: Wang, Lirui, Zhang, Kaiqing, Zhou, Allan, Simchowitz, Max, and Tedrake, Russ
Subjects: Computer Science - Robotics, Computer Science - Machine Learning
Abstract: Fleets of robots ingest massive amounts of heterogeneous streaming data silos generated by interacting with their environments, far more than what can be stored or transmitted with ease. At the same time, teams of robots should co-acquire diverse skills through their heterogeneous experiences in varied settings. How can we enable such fleet-level learning without having to transmit or centralize fleet-scale data? In this paper, we investigate policy merging (PoMe) from such distributed heterogeneous datasets as a potential solution. To efficiently merge policies in the fleet setting, we propose FLEET-MERGE, an instantiation of distributed learning that accounts for the permutation invariance that arises when parameterizing the control policies with recurrent neural networks. We show that FLEET-MERGE consolidates the behavior of policies trained on 50 tasks in the Meta-World environment, with good performance on nearly all training tasks at test time. Moreover, we introduce a novel robotic tool-use benchmark, FLEET-TOOLS, for fleet policy learning in compositional and contact-rich robot manipulation tasks, to validate the efficacy of FLEET-MERGE on the benchmark., Comment: See the code https://github.com/liruiw/Fleet-Tools for more details
Published: 2023

7. Partially Observable Multi-Agent Reinforcement Learning with Information Sharing

Author: Liu, Xiangyu and Zhang, Kaiqing
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Computer Science - Multiagent Systems
Abstract: We study provable multi-agent reinforcement learning (RL) in the general framework of partially observable stochastic games (POSGs). To circumvent the known hardness results and the use of computationally intractable oracles, we advocate leveraging the potential \emph{information-sharing} among agents, a common practice in empirical multi-agent RL, and a standard model for multi-agent control systems with communications. We first establish several computational complexity results to justify the necessity of information-sharing, as well as the observability assumption that has enabled quasi-efficient single-agent RL with partial observations, for efficiently solving POSGs. {Inspired by the inefficiency of planning in the ground-truth model,} we then propose to further \emph{approximate} the shared common information to construct an {approximate model} of the POSG, in which planning an approximate \emph{equilibrium} (in terms of solving the original POSG) can be quasi-efficient, i.e., of quasi-polynomial-time, under the aforementioned assumptions. Furthermore, we develop a partially observable multi-agent RL algorithm that is \emph{both} statistically and computationally quasi-efficient. {Finally, beyond equilibrium learning, we extend our algorithmic framework to finding the \emph{team-optimal solution} in cooperative POSGs, i.e., decentralized partially observable Markov decision processes, a much more challenging goal. We establish concrete computational and sample complexities under several common structural assumptions of the model.} We hope our study could open up the possibilities of leveraging and even designing different \emph{information structures}, a well-studied notion in control theory, for developing both sample- and computation-efficient partially observable multi-agent RL., Comment: Journal extension of the conference version at ICML 2023. Changed to the more general reward function form, added new results for learning in Dec-POMDPs, and streamlined proof outlines
Published: 2023

8. Multi-Player Zero-Sum Markov Games with Networked Separable Interactions

Author: Park, Chanwoo, Zhang, Kaiqing, and Ozdaglar, Asuman
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Machine Learning
Abstract: We study a new class of Markov games, \emph(multi-player) zero-sum Markov Games} with \emph{Networked separable interactions} (zero-sum NMGs), to model the local interaction structure in non-cooperative multi-agent sequential decision-making. We define a zero-sum NMG as a model where {the payoffs of the auxiliary games associated with each state are zero-sum and} have some separable (i.e., polymatrix) structure across the neighbors over some interaction network. We first identify the necessary and sufficient conditions under which an MG can be presented as a zero-sum NMG, and show that the set of Markov coarse correlated equilibrium (CCE) collapses to the set of Markov Nash equilibrium (NE) in these games, in that the product of per-state marginalization of the former for all players yields the latter. Furthermore, we show that finding approximate Markov \emph{stationary} CCE in infinite-horizon discounted zero-sum NMGs is \texttt{PPAD}-hard, unless the underlying network has a ``star topology''. Then, we propose fictitious-play-type dynamics, the classical learning dynamics in normal-form games, for zero-sum NMGs, and establish convergence guarantees to Markov stationary NE under a star-shaped network structure. Finally, in light of the hardness result, we focus on computing a Markov \emph{non-stationary} NE and provide finite-iteration guarantees for a series of value-iteration-based algorithms. We also provide numerical experiments to corroborate our theoretical results.
Published: 2023

9. Tackling Combinatorial Distribution Shift: A Matrix Completion Perspective

Author: Simchowitz, Max, Gupta, Abhishek, and Zhang, Kaiqing
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Statistics - Machine Learning
Abstract: Obtaining rigorous statistical guarantees for generalization under distribution shift remains an open and active research area. We study a setting we call combinatorial distribution shift, where (a) under the test- and training-distributions, the labels $z$ are determined by pairs of features $(x,y)$, (b) the training distribution has coverage of certain marginal distributions over $x$ and $y$ separately, but (c) the test distribution involves examples from a product distribution over $(x,y)$ that is {not} covered by the training distribution. Focusing on the special case where the labels are given by bilinear embeddings into a Hilbert space $H$: $\mathbb{E}[z \mid x,y ]=\langle f_{\star}(x),g_{\star}(y)\rangle_{{H}}$, we aim to extrapolate to a test distribution domain that is $not$ covered in training, i.e., achieving bilinear combinatorial extrapolation. Our setting generalizes a special case of matrix completion from missing-not-at-random data, for which all existing results require the ground-truth matrices to be either exactly low-rank, or to exhibit very sharp spectral cutoffs. In this work, we develop a series of theoretical results that enable bilinear combinatorial extrapolation under gradual spectral decay as observed in typical high-dimensional data, including novel algorithms, generalization guarantees, and linear-algebraic results. A key tool is a novel perturbation bound for the rank-$k$ singular value decomposition approximations between two matrices that depends on the relative spectral gap rather than the absolute spectral gap, a result that may be of broader independent interest., Comment: The 36th Annual Conference on Learning Theory (COLT 2023)
Published: 2023

10. Last-Iterate Convergent Policy Gradient Primal-Dual Methods for Constrained MDPs

Author: Ding, Dongsheng, Wei, Chen-Yu, Zhang, Kaiqing, and Ribeiro, Alejandro
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control
Abstract: We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP). Despite the popularity of Lagrangian-based policy search methods used in practice, the oscillation of policy iterates in these methods has not been fully understood, bringing out issues such as violation of constraints and sensitivity to hyper-parameters. To fill this gap, we employ the Lagrangian method to cast a constrained MDP into a constrained saddle-point problem in which max/min players correspond to primal/dual variables, respectively, and develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy. Specifically, we first propose a regularized policy gradient primal-dual (RPG-PD) method that updates the policy using an entropy-regularized policy gradient, and the dual variable via a quadratic-regularized gradient ascent, simultaneously. We prove that the policy primal-dual iterates of RPG-PD converge to a regularized saddle point with a sublinear rate, while the policy iterates converge sublinearly to an optimal constrained policy. We further instantiate RPG-PD in large state or action spaces by including function approximation in policy parametrization, and establish similar sublinear last-iterate policy convergence. Second, we propose an optimistic policy gradient primal-dual (OPG-PD) method that employs the optimistic gradient method to update primal/dual variables, simultaneously. We prove that the policy primal-dual iterates of OPG-PD converge to a saddle point that contains an optimal constrained policy, with a linear rate. To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs., Comment: 65 pages, 17 figures, and 1 table; NeurIPS 2023
Published: 2023

11. Self-Supervised Reinforcement Learning that Transfers using Random Features

Author: Chen, Boyuan, Zhu, Chuning, Agrawal, Pulkit, Zhang, Kaiqing, and Gupta, Abhishek
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Model-free reinforcement learning algorithms have exhibited great potential in solving single-task sequential decision-making problems with high-dimensional observations and long horizons, but are known to be hard to generalize across tasks. Model-based RL, on the other hand, learns task-agnostic models of the world that naturally enables transfer across different reward functions, but struggles to scale to complex environments due to the compounding error. To get the best of both worlds, we propose a self-supervised reinforcement learning method that enables the transfer of behaviors across tasks with different rewards, while circumventing the challenges of model-based RL. In particular, we show self-supervised pre-training of model-free reinforcement learning with a number of random features as rewards allows implicit modeling of long-horizon environment dynamics. Then, planning techniques like model-predictive control using these implicit models enable fast adaptation to problems with new reward functions. Our method is self-supervised in that it can be trained on offline datasets without reward labels, but can then be quickly deployed on new tasks. We validate that our proposed method enables transfer across tasks on a variety of manipulation and locomotion domains in simulation, opening the door to generalist decision-making agents.
Published: 2023

12. Learning to Extrapolate: A Transductive Approach

Author: Netanyahu, Aviv, Gupta, Abhishek, Simchowitz, Max, Zhang, Kaiqing, and Agrawal, Pulkit
Subjects: Computer Science - Machine Learning
Abstract: Machine learning systems, especially with overparameterized deep neural networks, can generalize to novel test instances drawn from the same distribution as the training data. However, they fare poorly when evaluated on out-of-support test points. In this work, we tackle the problem of developing machine learning systems that retain the power of overparameterized function approximators while enabling extrapolation to out-of-support test points when possible. This is accomplished by noting that under certain conditions, a "transductive" reparameterization can convert an out-of-support extrapolation problem into a problem of within-support combinatorial generalization. We propose a simple strategy based on bilinear embeddings to enable this type of combinatorial generalization, thereby addressing the out-of-support extrapolation problem under certain conditions. We instantiate a simple, practical algorithm applicable to various supervised learning and imitation learning tasks.
Published: 2023

13. A Finite-Sample Analysis of Payoff-Based Independent Learning in Zero-Sum Stochastic Games

Author: Chen, Zaiwei, Zhang, Kaiqing, Mazumdar, Eric, Ozdaglar, Asuman, and Wierman, Adam
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Machine Learning
Abstract: We study two-player zero-sum stochastic games, and propose a form of independent learning dynamics called Doubly Smoothed Best-Response dynamics, which integrates a discrete and doubly smoothed variant of the best-response dynamics into temporal-difference (TD)-learning and minimax value iteration. The resulting dynamics are payoff-based, convergent, rational, and symmetric among players. Our main results provide finite-sample guarantees. In particular, we prove the first-known $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity bound for payoff-based independent learning dynamics, up to a smoothing bias. In the special case where the stochastic game has only one state (i.e., matrix games), we provide a sharper $\tilde{\mathcal{O}}(1/\epsilon)$ sample complexity. Our analysis uses a novel coupled Lyapunov drift approach to capture the evolution of multiple sets of coupled and stochastic iterates, which might be of independent interest.
Published: 2023

14. Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation

Author: Cui, Qiwen, Zhang, Kaiqing, and Du, Simon S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, Computer Science - Multiagent Systems, Statistics - Machine Learning
Abstract: We propose a new model, independent linear Markov game, for multi-agent reinforcement learning with a large state space and a large number of agents. This is a class of Markov games with independent linear function approximation, where each agent has its own function approximation for the state-action value functions that are marginalized by other players' policies. We design new algorithms for learning the Markov coarse correlated equilibria (CCE) and Markov correlated equilibria (CE) with sample complexity bounds that only scale polynomially with each agent's own function class complexity, thus breaking the curse of multiagents. In contrast, existing works for Markov games with function approximation have sample complexity bounds scale with the size of the \emph{joint action space} when specialized to the canonical tabular Markov game setting, which is exponentially large in the number of agents. Our algorithms rely on two key technical innovations: (1) utilizing policy replay to tackle non-stationarity incurred by multiple agents and the use of function approximation; (2) separating learning Markov equilibria and exploration in the Markov games, which allows us to use the full-information no-regret learning oracle instead of the stronger bandit-feedback no-regret learning oracle used in the tabular setting. Furthermore, we propose an iterative-best-response type algorithm that can learn pure Markov Nash equilibria in independent linear Markov potential games. In the tabular case, by adapting the policy replay mechanism for independent linear Markov games, we propose an algorithm with $\widetilde{O}(\epsilon^{-2})$ sample complexity to learn Markov CCE, which improves the state-of-the-art result $\widetilde{O}(\epsilon^{-3})$ in Daskalakis et al. 2022, where $\epsilon$ is the desired accuracy, and also significantly improves other problem parameters., Comment: 51 pages. Update: Accepted for presentation at the Conference on Learning Theory (COLT) 2023
Published: 2023

15. Can Direct Latent Model Learning Solve Linear Quadratic Gaussian Control?

Author: Tian, Yi, Zhang, Kaiqing, Tedrake, Russ, and Sra, Suvrit
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We study the task of learning state representations from potentially high-dimensional observations, with the goal of controlling an unknown partially observable system. We pursue a direct latent model learning approach, where a dynamic model in some latent state space is learned by predicting quantities directly related to planning (e.g., costs) without reconstructing the observations. In particular, we focus on an intuitive cost-driven state representation learning method for solving Linear Quadratic Gaussian (LQG) control, one of the most fundamental partially observable control problems. As our main results, we establish finite-sample guarantees of finding a near-optimal state representation function and a near-optimal controller using the directly learned latent model. To the best of our knowledge, despite various empirical successes, prior to this work it was unclear if such a cost-driven latent model learner enjoys finite-sample guarantees. Our work underscores the value of predicting multi-step costs, an idea that is key to our theory, and notably also an idea that is known to be empirically valuable for learning state representations., Comment: 37 pages; Updated structure and proofs
Published: 2022

16. Revisiting the Linear-Programming Framework for Offline RL with General Function Approximation

Author: Ozdaglar, Asuman, Pattathil, Sarath, Zhang, Jiawei, and Zhang, Kaiqing
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Offline reinforcement learning (RL) aims to find an optimal policy for sequential decision-making using a pre-collected dataset, without further interaction with the environment. Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators, especially to handle the case with excessively large state-action spaces. Among them, the framework based on the linear-programming (LP) reformulation of Markov decision processes has shown promise: it enables sample-efficient offline RL with function approximation, under only partial data coverage and realizability assumptions on the function classes, with favorable computational tractability. In this work, we revisit the LP framework for offline RL, and provide a new reformulation that advances the existing results in several aspects, relaxing certain assumptions and achieving optimal statistical rates in terms of sample size. Our key enabler is to introduce proper constraints in the reformulation, instead of using any regularization as in the literature, also with careful choices of the function classes and initial state distributions. We hope our insights bring into light the use of LP formulations and the induced primal-dual minimax optimization, in offline RL., Comment: 35 pages
Published: 2022

17. An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural Policy Gradient Methods

Author: Liu, Yanli, Zhang, Kaiqing, Başar, Tamer, and Yin, Wotao
Subjects: Computer Science - Machine Learning
Abstract: In this paper, we revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants, under general smooth policy parametrizations. More specifically, with the Fisher information matrix of the policy being positive definite: i) we show that a state-of-the-art variance-reduced PG method, which has only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error due to policy parametrization; ii) we show that NPG enjoys a lower sample complexity; iii) we propose SRVR-NPG, which incorporates variance-reduction into the NPG update. Our improvements follow from an observation that the convergence of (variance-reduced) PG and NPG methods can improve each other: the stationary convergence analysis of PG can be applied to NPG as well, and the global convergence analysis of NPG can help to establish the global convergence of (variance-reduced) PG methods. Our analysis carefully integrates the advantages of these two lines of works. Thanks to this improvement, we have also made variance-reduction for NPG possible, with both global convergence and an efficient finite-sample complexity., Comment: NeurIPS 2020 (improve the proof of Lemma B.1 and Proposition G.1.)
Published: 2022

18. Mechanical properties of AlMgB14-related boron carbide structures. A first principle study

Author: Bystrenko, Oleksiy, Zhang, Jingxian, Fangdong, Dong, Li, Xiaoguang, Tang, Weiyu, Zhang, Kaiqing, and Liu, Jianjun
Subjects: Condensed Matter - Materials Science
Abstract: We examine the effects produced by replacing B-B interlayer bonds by C-C bonds in AlMgB14-related boron network on its mechanical properties. The elastic constants, Vickers hardness and shear strength are evaluated by means of first principle computer simulations on the basis of density functional theory. The results of simulations suggest a possibility of existence of several orthorhombic boron carbide phases with strongly enhanced mechanical properties with the Young modulus and Vickers hardness being within the range of 550-600 GPa and 43-50 GPa, respectively, Comment: 9 pages, 6 figures, 2 tables
Published: 2022
Full Text: View/download PDF

19. Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence

Author: Pattathil, Sarath, Zhang, Kaiqing, and Ozdaglar, Asuman
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Computer Science - Multiagent Systems, Statistics - Machine Learning
Abstract: Multi-agent interactions are increasingly important in the context of reinforcement learning, and the theoretical foundations of policy gradient methods have attracted surging research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning. We first show that vanilla NPG may not have parameter convergence, i.e., the convergence of the vector that parameterizes the policy, even when the costs are regularized (which enabled strong convergence guarantees in the policy space in the literature). This non-convergence of parameters leads to stability issues in learning, which becomes especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters, instead of the high-dimensional policy. We then propose variants of the NPG algorithm, for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees. We also generalize the results to certain function approximation settings. Note that in our algorithms, the agents take symmetric roles. Our results might also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. Simulations are also provided to corroborate our theoretical findings., Comment: Initially submitted for publication in January 2022
Published: 2022

20. Does Learning from Decentralized Non-IID Unlabeled Data Benefit from Self Supervision?

Author: Wang, Lirui, Zhang, Kaiqing, Li, Yunzhu, Tian, Yonglong, and Tedrake, Russ
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Decentralized learning has been advocated and widely deployed to make efficient use of distributed datasets, with an extensive focus on supervised learning (SL) problems. Unfortunately, the majority of real-world data are unlabeled and can be highly heterogeneous across sources. In this work, we carefully study decentralized learning with unlabeled data through the lens of self-supervised learning (SSL), specifically contrastive visual representation learning. We study the effectiveness of a range of contrastive learning algorithms under decentralized learning settings, on relatively large-scale datasets including ImageNet-100, MS-COCO, and a new real-world robotic warehouse dataset. Our experiments show that the decentralized SSL (Dec-SSL) approach is robust to the heterogeneity of decentralized datasets, and learns useful representation for object classification, detection, and segmentation tasks. This robustness makes it possible to significantly reduce communication and reduce the participation ratio of data sources with only minimal drops in performance. Interestingly, using the same amount of data, the representation learned by Dec-SSL can not only perform on par with that learned by centralized SSL which requires communication and excessive data storage costs, but also sometimes outperform representations extracted from decentralized SL which requires extra knowledge about the data labels. Finally, we provide theoretical insights into understanding why data heterogeneity is less of a concern for Dec-SSL objectives, and introduce feature alignment and clustering techniques to develop a new Dec-SSL algorithm that further improves the performance, in the face of highly non-IID data. Our study presents positive evidence to embrace unlabeled data in decentralized learning, and we hope to provide new insights into whether and why decentralized SSL is effective.
Published: 2022

21. Towards a Theoretical Foundation of Policy Optimization for Learning Control Policies

Author: Hu, Bin, Zhang, Kaiqing, Li, Na, Mesbahi, Mehran, Fazel, Maryam, and Başar, Tamer
Subjects: Mathematics - Optimization and Control, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Gradient-based methods have been widely used for system design and optimization in diverse application domains. Recently, there has been a renewed interest in studying theoretical properties of these methods in the context of control and reinforcement learning. This article surveys some of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis, popularized by successes of reinforcement learning. We take an interdisciplinary perspective in our exposition that connects control theory, reinforcement learning, and large-scale optimization. We review a number of recently-developed theoretical results on the optimization landscape, global convergence, and sample complexity of gradient-based methods for various continuous control problems such as the linear quadratic regulator (LQR), $\mathcal{H}_\infty$ control, risk-sensitive control, linear quadratic Gaussian (LQG) control, and output feedback synthesis. In conjunction with these optimization results, we also discuss how direct policy optimization handles stability and robustness concerns in learning-based control, two main desiderata in control engineering. We conclude the survey by pointing out several challenges and opportunities at the intersection of learning and control., Comment: To Appear in Annual Review of Control, Robotics, and Autonomous Systems
Published: 2022

22. The Power of Regularization in Solving Extensive-Form Games

Author: Liu, Mingyang, Ozdaglar, Asuman, Yu, Tiancheng, and Zhang, Kaiqing
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Machine Learning
Abstract: In this paper, we investigate the power of {\it regularization}, a common technique in reinforcement learning and optimization, in solving extensive-form games (EFGs). We propose a series of new algorithms based on regularizing the payoff functions of the game, and establish a set of convergence results that strictly improve over the existing ones, with either weaker assumptions or stronger convergence guarantees. In particular, we first show that dilated optimistic mirror descent (DOMD), an efficient variant of OMD for solving EFGs, with adaptive regularization can achieve a fast $\tilde O(1/T)$ last-iterate convergence in terms of duality gap and distance to the set of Nash equilibrium (NE) without uniqueness assumption of the NE. Second, we show that regularized counterfactual regret minimization (\texttt{Reg-CFR}), with a variant of optimistic mirror descent algorithm as regret-minimizer, can achieve $O(1/T^{1/4})$ best-iterate, and $O(1/T^{3/4})$ average-iterate convergence rate for finding NE in EFGs. Finally, we show that \texttt{Reg-CFR} can achieve asymptotic last-iterate convergence, and optimal $O(1/T)$ average-iterate convergence rate, for finding the NE of perturbed EFGs, which is useful for finding approximate extensive-form perfect equilibria (EFPE). To the best of our knowledge, they constitute the first last-iterate convergence results for CFR-type algorithms, while matching the state-of-the-art average-iterate convergence rate in finding NE for non-perturbed EFGs. We also provide numerical results to corroborate the advantages of our algorithms.
Published: 2022

23. What is a Good Metric to Study Generalization of Minimax Learners?

Author: Ozdaglar, Asuman, Pattathil, Sarath, Zhang, Jiawei, and Zhang, Kaiqing
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Optimization and Control
Abstract: Minimax optimization has served as the backbone of many machine learning (ML) problems. Although the convergence behavior of optimization algorithms has been extensively studied in the minimax settings, their generalization guarantees in stochastic minimax optimization problems, i.e., how the solution trained on empirical data performs on unseen testing data, have been relatively underexplored. A fundamental question remains elusive: What is a good metric to study generalization of minimax learners? In this paper, we aim to answer this question by first showing that primal risk, a universal metric to study generalization in minimization problems, which has also been adopted recently to study generalization in minimax ones, fails in simple examples. We thus propose a new metric to study generalization of minimax learners: the primal gap, defined as the difference between the primal risk and its minimum over all models, to circumvent the issues. Next, we derive generalization error bounds for the primal gap in nonconvex-concave settings. As byproducts of our analysis, we also solve two open questions: establishing generalization error bounds for primal risk and primal-dual risk, another existing metric that is only well-defined when the global saddle-point exists, in the strong sense, i.e., without strong concavity or assuming that the maximization and expectation can be interchanged, while either of these assumptions was needed in the literature. Finally, we leverage this new metric to compare the generalization behavior of two popular algorithms -- gradient descent-ascent (GDA) and gradient descent-max (GDMax) in stochastic minimax optimization., Comment: 34 pages, 2 figures
Published: 2022

24. Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs

Author: Ding, Dongsheng, Zhang, Kaiqing, Duan, Jiali, Başar, Tamer, and Jovanović, Mihailo R.
Subjects: Mathematics - Optimization and Control, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control
Abstract: We study sequential decision making problems aimed at maximizing the expected total reward while satisfying a constraint on the expected total utility. We employ the natural policy gradient method to solve the discounted infinite-horizon optimal control problem for Constrained Markov Decision Processes (constrained MDPs). Specifically, we propose a new Natural Policy Gradient Primal-Dual (NPG-PD) method that updates the primal variable via natural policy gradient ascent and the dual variable via projected sub-gradient descent. Although the underlying maximization involves a nonconcave objective function and a nonconvex constraint set, under the softmax policy parametrization we prove that our method achieves global convergence with sublinear rates regarding both the optimality gap and the constraint violation. Such convergence is independent of the size of the state-action space, i.e., it is~dimension-free. Furthermore, for log-linear and general smooth policy parametrizations, we establish sublinear convergence rates up to a function approximation error caused by restricted policy parametrization. We also provide convergence and finite-sample complexity guarantees for two sample-based NPG-PD algorithms. Finally, we use computational experiments to showcase the merits and the effectiveness of our approach., Comment: 74 pages, 4 figures, 2 tables
Published: 2022

25. Byzantine-Robust Online and Offline Distributed Reinforcement Learning

Author: Chen, Yiding, Zhang, Xuezhou, Zhang, Kaiqing, Wang, Mengdi, and Zhu, Xiaojin
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning
Abstract: We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, $\alpha$-fraction of agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude and their fake data can be of any sizes. We desire to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the robust mean estimation from batches problem, that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting, we design a Byzantine-robust distributed pessimistic value iteration algorithm; in the online setting, we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve superior robustness guarantee than prior works.
Published: 2022

26. Fictitious Play in Markov Games with Single Controller

Author: Sayin, Muhammed O., Zhang, Kaiqing, and Ozdaglar, Asuman
Subjects: Computer Science - Computer Science and Game Theory
Abstract: Certain but important classes of strategic-form games, including zero-sum and identical-interest games, have the fictitious-play-property (FPP), i.e., beliefs formed in fictitious play dynamics always converge to a Nash equilibrium (NE) in the repeated play of these games. Such convergence results are seen as a (behavioral) justification for the game-theoretical equilibrium analysis. Markov games (MGs), also known as stochastic games, generalize the repeated play of strategic-form games to dynamic multi-state settings with Markovian state transitions. In particular, MGs are standard models for multi-agent reinforcement learning -- a reviving research area in learning and games, and their game-theoretical equilibrium analyses have also been conducted extensively. However, whether certain classes of MGs have the FPP or not (i.e., whether there is a behavioral justification for equilibrium analysis or not) remains largely elusive. In this paper, we study a new variant of fictitious play dynamics for MGs and show its convergence to an NE in n-player identical-interest MGs in which a single player controls the state transitions. Such games are of interest in communications, control, and economics applications. Our result together with the recent results in [Sayin et al. 2020] establishes the FPP of two-player zero-sum MGs and n-player identical-interest MGs with a single controller (standing at two different ends of the MG spectrum from fully competitive to fully cooperative)., Comment: Accepted to ACM Conference on Economics and Computation (EC) 2022
Published: 2022

27. Design and Commissioning of A Beam Distribution System for Multiple Undulator Line Operation of the SXFEL-UF

Author: Chen, Si, Zhang, Kaiqing, Liu, Tao, Feng, Chao, Deng, Haixiao, Liu, Bo, and Zhao, Zhentang
Subjects: Physics - Accelerator Physics
Abstract: As an important measure of improving the efficiency and usability of X-ray free electron laser facilities, simultaneous operation of multiple undulator lines realized by a beam distribution system has become a standard configuration in the recent built XFEL facilities. In Shanghai, SXFEL-UF, the first soft X-ray free electron laser user facility in China, has finished construction and started commissioning recently. Electron beam from linac is alternately distributed between the two parallel undulator beam lines by a beam distribution system with a 6{\deg} deflection line. The beam distribution system is designed to keep the beam properties like low emittance, high peak charge and small bunch length from being spoiled. Beam collective effects such as the dispersion, coherent synchrotron radiation and micro-bunching instability should be well suppressed to guarantee the beam quality. In this work, the detailed physics design of the beam distribution system is described and the recent commissioning result is reported.
Published: 2022

28. The Complexity of Markov Equilibrium in Stochastic Games

Author: Daskalakis, Constantinos, Golowich, Noah, and Zhang, Kaiqing
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory
Abstract: We show that computing approximate stationary Markov coarse correlated equilibria (CCE) in general-sum stochastic games is computationally intractable, even when there are two players, the game is turn-based, the discount factor is an absolute constant, and the approximation is an absolute constant. Our intractability results stand in sharp contrast to normal-form games where exact CCEs are efficiently computable. A fortiori, our results imply that there are no efficient algorithms for learning stationary Markov CCE policies in multi-agent reinforcement learning (MARL), even when the interaction is two-player and turn-based, and both the discount factor and the desired approximation of the learned policies is an absolute constant. In turn, these results stand in sharp contrast to single-agent reinforcement learning (RL) where near-optimal stationary Markov policies can be efficiently learned. Complementing our intractability results for stationary Markov CCEs, we provide a decentralized algorithm (assuming shared randomness among players) for learning a nonstationary Markov CCE policy with polynomial time and sample complexity in all problem parameters. Previous work for learning Markov CCE policies all required exponential time and sample complexity in the number of players., Comment: 50 pages
Published: 2022

29. Sequential conservative integer programming method for multi-constrained discrete variable structure topology optimization

Author: Sun, Kai, Cheng, Gengdong, Zhang, Kaiqing, and Liang, Yuan
Published: 2024
Full Text: View/download PDF

30. Globally Convergent Policy Search over Dynamic Filters for Output Estimation

Author: Umenberger, Jack, Simchowitz, Max, Perdomo, Juan C., Zhang, Kaiqing, and Tedrake, Russ
Subjects: Mathematics - Optimization and Control, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We introduce the first direct policy search algorithm which provably converges to the globally optimal $\textit{dynamic}$ filter for the classical problem of predicting the outputs of a linear dynamical system, given noisy, partial observations. Despite the ubiquity of partial observability in practice, theoretical guarantees for direct policy search algorithms, one of the backbones of modern reinforcement learning, have proven difficult to achieve. This is primarily due to the degeneracies which arise when optimizing over filters that maintain internal state. In this paper, we provide a new perspective on this challenging problem based on the notion of $\textit{informativity}$, which intuitively requires that all components of a filter's internal state are representative of the true state of the underlying dynamical system. We show that informativity overcomes the aforementioned degeneracy. Specifically, we propose a $\textit{regularizer}$ which explicitly enforces informativity, and establish that gradient descent on this regularized objective - combined with a ``reconditioning step'' - converges to the globally optimal cost a $\mathcal{O}(1/T)$ rate. Our analysis relies on several new results which may be of independent interest, including a new framework for analyzing non-convex gradient descent via convex reformulation, and novel bounds on the solution to linear Lyapunov equations in terms of (our quantitative measure of) informativity.
Published: 2022

31. Independent Policy Gradient for Large-Scale Markov Potential Games: Sharper Rates, Function Approximation, and Game-Agnostic Convergence

Author: Ding, Dongsheng, Wei, Chen-Yu, Zhang, Kaiqing, and Jovanović, Mihailo R.
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Computer Science - Multiagent Systems, Mathematics - Optimization and Control
Abstract: We examine global non-asymptotic convergence properties of policy gradient methods for multi-agent reinforcement learning (RL) problems in Markov potential games (MPG). To learn a Nash equilibrium of an MPG in which the size of state space and/or the number of players can be very large, we propose new independent policy gradient algorithms that are run by all players in tandem. When there is no uncertainty in the gradient evaluation, we show that our algorithm finds an $\epsilon$-Nash equilibrium with $O(1/\epsilon^2)$ iteration complexity which does not explicitly depend on the state space size. When the exact gradient is not available, we establish $O(1/\epsilon^5)$ sample complexity bound in a potentially infinitely large state space for a sample-based algorithm that utilizes function approximation. Moreover, we identify a class of independent policy gradient algorithms that enjoys convergence for both zero-sum Markov games and Markov cooperative games with the players that are oblivious to the types of games being played. Finally, we provide computational experiments to corroborate the merits and the effectiveness of our theoretical developments., Comment: 55 pages, 6 figures; Revised comments in Sections 3, 5 of published ICML 2022 version, results unchanged
Published: 2022

32. Do Differentiable Simulators Give Better Policy Gradients?

Author: Suh, H. J. Terry, Simchowitz, Max, Zhang, Kaiqing, and Tedrake, Russ
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: Differentiable simulators promise faster computation time for reinforcement learning by replacing zeroth-order gradient estimates of a stochastic objective with an estimate based on first-order gradients. However, it is yet unclear what factors decide the performance of the two estimators on complex landscapes that involve long-horizon planning and control on physical systems, despite the crucial relevance of this question for the utility of differentiable simulators. We show that characteristics of certain physical systems, such as stiffness or discontinuities, may compromise the efficacy of the first-order estimator, and analyze this phenomenon through the lens of bias and variance. We additionally propose an $\alpha$-order gradient estimator, with $\alpha \in [0,1]$, which correctly utilizes exact gradients to combine the efficiency of first-order estimates with the robustness of zero-order methods. We demonstrate the pitfalls of traditional estimators and the advantages of the $\alpha$-order estimator on some numerical examples., Comment: Accepted to ICML 2022
Published: 2022

33. Remote optogenetic control of the enteric nervous system and brain-gut axis in freely-behaving mice enabled by a wireless, battery-free optoelectronic device

Author: Efimov, Andrew I., Hibberd, Timothy J., Wang, Yue, Wu, Mingzheng, Zhang, Kaiqing, Ting, Kaila, Madhvapathy, Surabhi, Lee, Min-Kyu, Kim, Joohee, Kang, Jiheon, Riahi, Mohammad, Zhang, Haohui, Travis, Lee, Govier, Emily J., Yang, Lianye, Kelly, Nigel, Huang, Yonggang, Vázquez-Guardado, Abraham, Spencer, Nick J., and Rogers, John A.
Published: 2024
Full Text: View/download PDF

34. Independent Learning in Stochastic Games

Author: Ozdaglar, Asuman, Sayin, Muhammed O., and Zhang, Kaiqing
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Machine Learning, Mathematics - Dynamical Systems
Abstract: Reinforcement learning (RL) has recently achieved tremendous successes in many artificial intelligence applications. Many of the forefront applications of RL involve multiple agents, e.g., playing chess and Go games, autonomous driving, and robotics. Unfortunately, the framework upon which classical RL builds is inappropriate for multi-agent learning, as it assumes an agent's environment is stationary and does not take into account the adaptivity of other agents. In this review paper, we present the model of stochastic games for multi-agent learning in dynamic environments. We focus on the development of simple and independent learning dynamics for stochastic games: each agent is myopic and chooses best-response type actions to other agents' strategy without any coordination with her opponent. There has been limited progress on developing convergent best-response type independent learning dynamics for stochastic games. We present our recently proposed simple and independent learning dynamics that guarantee convergence in zero-sum stochastic games, together with a review of other contemporaneous algorithms for dynamic multi-agent learning in this setting. Along the way, we also reexamine some classical results from both the game theory and RL literature, to situate both the conceptual contributions of our independent learning dynamics, and the mathematical novelties of our analysis. We hope this review paper serves as an impetus for the resurgence of studying independent and natural learning dynamics in game theory, for the more challenging settings with a dynamic environment., Comment: An invited chapter for the International Congress of Mathematicians 2022 (ICM 2022)
Published: 2021

35. On Improving Model-Free Algorithms for Decentralized Multi-Agent Reinforcement Learning

Author: Mao, Weichao, Yang, Lin F., Zhang, Kaiqing, and Başar, Tamer
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems
Abstract: Multi-agent reinforcement learning (MARL) algorithms often suffer from an exponential sample complexity dependence on the number of agents, a phenomenon known as \emph{the curse of multiagents}. In this paper, we address this challenge by investigating sample-efficient model-free algorithms in \emph{decentralized} MARL, and aim to improve existing algorithms along this line. For learning (coarse) correlated equilibria in general-sum Markov games, we propose \emph{stage-based} V-learning algorithms that significantly simplify the algorithmic design and analysis of recent works, and circumvent a rather complicated no-\emph{weighted}-regret bandit subroutine. For learning Nash equilibria in Markov potential games, we propose an independent policy gradient algorithm with a decentralized momentum-based variance reduction technique. All our algorithms are decentralized in that each agent can make decisions based on only its local information. Neither communication nor centralized coordination is required during learning, leading to a natural generalization to a large number of agents. We also provide numerical simulations to corroborate our theoretical findings.
Published: 2021

36. A critical review on new and efficient adsorbents for CO2 capture

Author: Zhang, Kaiqing and Wang, Rui
Published: 2024
Full Text: View/download PDF

37. An analytical model of reactive diffusion for transient electronics with thick encapsulation layer

Author: Zhang, Haohui, Zhang, Kaiqing, Rogers, John A., and Huang, Yonggang
Published: 2024
Full Text: View/download PDF

38. Mucociliary Wnt signaling promotes cilia biogenesis and beating

Author: Seidl, Carina, Da Silva, Fabio, Zhang, Kaiqing, Wohlgemuth, Kai, Omran, Heymut, and Niehrs, Christof
Published: 2023
Full Text: View/download PDF

39. Decentralized Q-Learning in Zero-sum Markov Games

Author: Sayin, Muhammed O., Zhang, Kaiqing, Leslie, David S., Basar, Tamer, and Ozdaglar, Asuman
Subjects: Computer Science - Computer Science and Game Theory, Computer Science - Machine Learning, Computer Science - Multiagent Systems, Mathematics - Dynamical Systems
Abstract: We study multi-agent reinforcement learning (MARL) in infinite-horizon discounted zero-sum Markov games. We focus on the practical but challenging setting of decentralized MARL, where agents make decisions without coordination by a centralized controller, but only based on their own payoffs and local actions executed. The agents need not observe the opponent's actions or payoffs, possibly being even oblivious to the presence of the opponent, nor be aware of the zero-sum structure of the underlying game, a setting also referred to as radically uncoupled in the literature of learning in games. In this paper, we develop a radically uncoupled Q-learning dynamics that is both rational and convergent: the learning dynamics converges to the best response to the opponent's strategy when the opponent follows an asymptotically stationary strategy; when both agents adopt the learning dynamics, they converge to the Nash equilibrium of the game. The key challenge in this decentralized setting is the non-stationarity of the environment from an agent's perspective, since both her own payoffs and the system evolution depend on the actions of other agents, and each agent adapts her policies simultaneously and independently. To address this issue, we develop a two-timescale learning dynamics where each agent updates her local Q-function and value function estimates concurrently, with the latter happening at a slower timescale., Comment: To appear at NeurIPS 2021. Strengthened the results in Theorem 1 and Corollary 1
Published: 2021

40. Crosstalk between lung cancer cells and macrophages contributes to traditional Chinese medicine Feiyanning-induced anti-tumor activities by suppressing M2 macrophage polarization

Author: Zhang Kaiqing, Peng Yonglin, Wang Qingting, Tan Sheng, Wang Bin, Tian Huaqin, Zhao Xiaodong, and Zhang Ming
Subjects: Biochemistry, QD415-436, Genetics, QH426-470
Published: 2023
Full Text: View/download PDF

41. Learning Safe Multi-Agent Control with Decentralized Neural Barrier Certificates

Author: Qin, Zengyi, Zhang, Kaiqing, Chen, Yuxiao, Chen, Jingkai, and Fan, Chuchu
Subjects: Computer Science - Multiagent Systems, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Systems and Control
Abstract: We study the multi-agent safe control problem where agents should avoid collisions to static obstacles and collisions with each other while reaching their goals. Our core idea is to learn the multi-agent control policy jointly with learning the control barrier functions as safety certificates. We propose a novel joint-learning framework that can be implemented in a decentralized fashion, with generalization guarantees for certain function classes. Such a decentralized framework can adapt to an arbitrarily large number of agents. Building upon this framework, we further improve the scalability by incorporating neural network architectures that are invariant to the quantity and permutation of neighboring agents. In addition, we propose a new spontaneous policy refinement method to further enforce the certificate condition during testing. We provide extensive experiments to demonstrate that our method significantly outperforms other leading multi-agent control approaches in terms of maintaining safety and completing original tasks. Our approach also shows exceptional generalization capability in that the control policy can be trained with 8 agents in one scenario, while being used on other scenarios with up to 1024 agents in complex multi-agent environments and dynamics., Comment: Published at ICLR 2021
Published: 2021

42. Derivative-Free Policy Optimization for Linear Risk-Sensitive and Robust Control Design: Implicit Regularization and Sample Complexity

Author: Zhang, Kaiqing, Zhang, Xiangyuan, Hu, Bin, and Başar, Tamer
Subjects: Mathematics - Optimization and Control, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Systems and Control
Abstract: Direct policy search serves as one of the workhorses in modern reinforcement learning (RL), and its applications in continuous control tasks have recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning the linear risk-sensitive and robust controller. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and establish both global convergence and sample complexity results in the solutions of two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian, and the finite-horizon linear-quadratic disturbance attenuation problems. As a by-product, our results also provide the first sample complexity for the global convergence of PG methods on solving zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved, which we termed as the implicit regularization property, and is an essential requirement in safety-critical control systems.
Published: 2021

43. Towards Understanding Asynchronous Advantage Actor-critic: Convergence and Linear Speedup

Author: Shen, Han, Zhang, Kaiqing, Hong, Mingyi, and Chen, Tianyi
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control
Abstract: Asynchronous and parallel implementation of standard reinforcement learning (RL) algorithms is a key enabler of the tremendous success of modern RL. Among many asynchronous RL algorithms, arguably the most popular and effective one is the asynchronous advantage actor-critic (A3C) algorithm. Although A3C is becoming the workhorse of RL, its theoretical properties are still not well-understood, including its non-asymptotic analysis and the performance gain of parallelism (a.k.a. linear speedup). This paper revisits the A3C algorithm and establishes its non-asymptotic convergence guarantees. Under both i.i.d. and Markovian sampling, we establish the local convergence guarantee for A3C in the general policy approximation case and the global convergence guarantee in softmax policy parameterization. Under i.i.d. sampling, A3C obtains sample complexity of $\mathcal{O}(\epsilon^{-2.5}/N)$ per worker to achieve $\epsilon$ accuracy, where $N$ is the number of workers. Compared to the best-known sample complexity of $\mathcal{O}(\epsilon^{-2.5})$ for two-timescale AC, A3C achieves \emph{linear speedup}, which justifies the advantage of parallelism and asynchrony in AC algorithms theoretically for the first time. Numerical tests on synthetic environment, OpenAI Gym environments and Atari games have been provided to verify our theoretical analysis.
Published: 2020
Full Text: View/download PDF

44. Self-Amplification of Coherent Energy Modulation in Seeded Free-Electron Lasers

Author: Yan, Jiawei, Gao, Zhangfeng, Qi, Zheng, Zhang, Kaiqing, Zhou, Kaishang, Liu, Tao, Chen, Si, Feng, Chao, Li, Chunlei, Feng, Lie, Lan, Taihe, Zhang, Wenyan, Wang, Xingtao, Li, Xuan, Jiang, Zenggong, Wang, Baoliang, Wang, Zhen, Gu, Duan, Zhang, Meng, Deng, Haixiao, Gu, Qiang, Leng, Yongbin, Yin, Lixin, Liu, Bo, Wang, Dong, and Zhao, Zhentang
Subjects: Physics - Accelerator Physics, Physics - Optics
Abstract: The spectroscopic techniques for time-resolved fine analysis of matter require coherent X-ray radiation with femtosecond duration and high average brightness. Seeded free-electron lasers (FELs), which use the frequency up-conversion of an external seed laser to improve temporal coherence, are ideal for providing fully coherent soft X-ray pulses. However, it is difficult to operate seeded FELs at a high repetition rate due to the limitations of present state-of-the-art laser systems. Here, we report the novel self-modulation method for enhancing laser-induced energy modulation, thereby significantly reducing the requirement of an external laser system. Driven by this scheme, we experimentally realize high harmonic generation in a seeded FEL using an unprecedentedly small energy modulation. An electron beam with a laser-induced energy modulation as small as 1.8 times the slice energy spread is used for lasing at the 7th harmonic of a 266-nm seed laser in a single-stage high-gain harmonic generation (HGHG) setup and the 30th harmonic of the seed laser in a two-stage HGHG setup. The results mark a major step towards a high-repetition-rate, fully coherent X-ray FEL.
Published: 2020
Full Text: View/download PDF

45. Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in Multi-Agent RL and Inventory Control

Author: Mao, Weichao, Zhang, Kaiqing, Zhu, Ruihao, Simchi-Levi, David, and Başar, Tamer
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: We consider model-free reinforcement learning (RL) in non-stationary Markov decision processes. Both the reward functions and the state transition functions are allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain variation budgets. We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for non-stationary RL, and show that it outperforms existing solutions in terms of dynamic regret. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret bound of $\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions, respectively, $\Delta>0$ is the variation budget, $H$ is the number of time steps per episode, and $T$ is the total number of time steps. We further present a parameter-free algorithm named Double-Restart Q-UCB that does not require prior knowledge of the variation budget. We show that our algorithms are \emph{nearly optimal} by establishing an information-theoretical lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL. Numerical experiments validate the advantages of RestartQ-UCB in terms of both cumulative rewards and computational efficiency. We demonstrate the power of our results in examples of multi-agent RL and inventory control across related products., Comment: A preliminary version of this work has appeared in ICML 2021
Published: 2020

46. Reinforcement Learning in Non-Stationary Discrete-Time Linear-Quadratic Mean-Field Games

Author: Zaman, Muhammad Aneeq uz, Zhang, Kaiqing, Miehling, Erik, and Başar, Tamer
Subjects: Electrical Engineering and Systems Science - Systems and Control, Computer Science - Computer Science and Game Theory, Computer Science - Machine Learning
Abstract: In this paper, we study large population multi-agent reinforcement learning (RL) in the context of discrete-time linear-quadratic mean-field games (LQ-MFGs). Our setting differs from most existing work on RL for MFGs, in that we consider a non-stationary MFG over an infinite horizon. We propose an actor-critic algorithm to iteratively compute the mean-field equilibrium (MFE) of the LQ-MFG. There are two primary challenges: i) the non-stationarity of the MFG induces a linear-quadratic tracking problem, which requires solving a backwards-in-time (non-causal) equation that cannot be solved by standard (causal) RL algorithms; ii) Many RL algorithms assume that the states are sampled from the stationary distribution of a Markov chain (MC), that is, the chain is already mixed, an assumption that is not satisfied for real data sources. We first identify that the mean-field trajectory follows linear dynamics, allowing the problem to be reformulated as a linear quadratic Gaussian problem. Under this reformulation, we propose an actor-critic algorithm that allows samples to be drawn from an unmixed MC. Finite-sample convergence guarantees for the algorithm are then provided. To characterize the performance of our algorithm in multi-agent RL, we have developed an error bound with respect to the Nash equilibrium of the finite-population game., Comment: To appear in CDC 2020
Published: 2020

47. Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity

Author: Zhang, Kaiqing, Kakade, Sham M., Başar, Tamer, and Yang, Lin F.
Subjects: Computer Science - Machine Learning, Computer Science - Computer Science and Game Theory, Computer Science - Multiagent Systems, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the corner stones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widely-used, the sample complexity of model-based MARL algorithms has not been fully investigated. In this paper, our goal is to address the fundamental question about its sample complexity. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model. We show that model-based MARL achieves a sample complexity of $\tilde O(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error, and the $\epsilon$-NE policies with a smooth planning oracle, where $\gamma$ is the discount factor, and $S,A,B$ denote the state space, and the action spaces for the two agents. We further show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge, by establishing a matching lower bound. This is in contrast to the usual reward-aware setting, with a $\tilde\Omega(|S|(|A|+|B|)(1-\gamma)^{-3}\epsilon^{-2})$ lower bound, where this model-based approach is near-optimal with only a gap on the $|A|,|B|$ dependence. Our results not only demonstrate the sample-efficiency of this basic model-based approach in MARL, but also elaborate on the fundamental tradeoff between its power (easily handling the more challenging reward-agnostic case) and limitation (less adaptive and suboptimal in $|A|,|B|$), particularly arises in the multi-agent context., Comment: Updated version accepted to Journal of Machine Learning Research (JMLR)
Published: 2020

48. POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with Non-Asymptotic Analysis

Author: Mao, Weichao, Zhang, Kaiqing, Xie, Qiaomin, and Başar, Tamer
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), has demonstrated remarkable performance in applications with finite spaces. In this paper, we consider Monte-Carlo planning in an environment with continuous state-action spaces, a much less understood problem with important applications in control and robotics. We introduce POLY-HOOT, an algorithm that augments MCTS with a continuous armed bandit strategy named Hierarchical Optimistic Optimization (HOO) (Bubeck et al., 2011). Specifically, we enhance HOO by using an appropriate polynomial, rather than logarithmic, bonus term in the upper confidence bounds. Such a polynomial bonus is motivated by its empirical successes in AlphaGo Zero (Silver et al., 2017b), as well as its significant role in achieving theoretical guarantees of finite space MCTS (Shah et al., 2019). We investigate, for the first time, the regret of the enhanced HOO algorithm in non-stationary bandit problems. Using this result as a building block, we establish non-asymptotic convergence guarantees for POLY-HOOT: the value estimate converges to an arbitrarily small neighborhood of the optimal value function at a polynomial rate. We further provide experimental results that corroborate our theoretical findings., Comment: NeurIPS 2020
Published: 2020

49. Information State Embedding in Partially Observable Cooperative Multi-Agent Reinforcement Learning

Author: Mao, Weichao, Zhang, Kaiqing, Miehling, Erik, and Başar, Tamer
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multiagent Systems
Abstract: Multi-agent reinforcement learning (MARL) under partial observability has long been considered challenging, primarily due to the requirement for each agent to maintain a belief over all other agents' local histories -- a domain that generally grows exponentially over time. In this work, we investigate a partially observable MARL problem in which agents are cooperative. To enable the development of tractable algorithms, we introduce the concept of an information state embedding that serves to compress agents' histories. We quantify how the compression error influences the resulting value functions for decentralized control. Furthermore, we propose an instance of the embedding based on recurrent neural networks (RNNs). The embedding is then used as an approximate information state, and can be fed into any MARL algorithm. The proposed embed-then-learn pipeline opens the black-box of existing (partially observable) MARL algorithms, allowing us to establish some theoretical guarantees (error bounds of value functions) while still achieving competitive performance with many end-to-end approaches., Comment: Accepted to CDC 2020
Published: 2020

50. Approximate Equilibrium Computation for Discrete-Time Linear-Quadratic Mean-Field Games

Author: Zaman, Muhammad Aneeq uz, Zhang, Kaiqing, Miehling, Erik, and Başar, Tamer
Subjects: Electrical Engineering and Systems Science - Systems and Control, Computer Science - Multiagent Systems
Abstract: While the topic of mean-field games (MFGs) has a relatively long history, heretofore there has been limited work concerning algorithms for the computation of equilibrium control policies. In this paper, we develop a computable policy iteration algorithm for approximating the mean-field equilibrium in linear-quadratic MFGs with discounted cost. Given the mean-field, each agent faces a linear-quadratic tracking problem, the solution of which involves a dynamical system evolving in retrograde time. This makes the development of forward-in-time algorithm updates challenging. By identifying a structural property of the mean-field update operator, namely that it preserves sequences of a particular form, we develop a forward-in-time equilibrium computation algorithm. Bounds that quantify the accuracy of the computed mean-field equilibrium as a function of the algorithm's stopping condition are provided. The optimality of the computed equilibrium is validated numerically. In contrast to the most recent/concurrent results, our algorithm appears to be the first to study infinite-horizon MFGs with non-stationary mean-field equilibria, though with focus on the linear quadratic setting., Comment: This paper has been accepted in ACC 2020
Published: 2020

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

349 results on '"Zhang, Kaiqing"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources