Author: "Bai, Chenjia" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Bai, Chenjia"' showing total 110 results

1. ODRL: A Benchmark for Off-Dynamics Reinforcement Learning

Author: Lyu, Jiafei, Xu, Kang, Xu, Jiacheng, Yan, Mengbei, Yang, Jingwen, Zhang, Zongzhang, Bai, Chenjia, Lu, Zongqing, and Li, Xiu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: We consider off-dynamics reinforcement learning (RL) where one needs to transfer policies across different domains with dynamics mismatch. Despite the focus on developing dynamics-aware algorithms, this field is hindered due to the lack of a standard benchmark. To bridge this gap, we introduce ODRL, the first benchmark tailored for evaluating off-dynamics RL methods. ODRL contains four experimental settings where the source and target domains can be either online or offline, and provides diverse tasks and a broad spectrum of dynamics shifts, making it a reliable platform to comprehensively evaluate the agent's adaptation ability to the target domain. Furthermore, ODRL includes recent off-dynamics RL algorithms in a unified framework and introduces some extra baselines for different settings, all implemented in a single-file manner. To unpack the true adaptation capability of existing methods, we conduct extensive benchmarking experiments, which show that no method has universal advantages across varied dynamics shifts. We hope this benchmark can serve as a cornerstone for future research endeavors. Our code is publicly available at https://github.com/OffDynamicsRL/off-dynamics-rl., Comment: NeurIPS 2024 D&B Track
Published: 2024

2. Preference Aligned Diffusion Planner for Quadrupedal Locomotion Control

Author: Yuan, Xinyi, Shang, Zhiwei, Wang, Zifan, Wang, Chenkai, Shan, Zhao, Qi, Zhenchao, Zhu, Meixin, Bai, Chenjia, and Li, Xuelong
Subjects: Computer Science - Robotics
Abstract: Diffusion models demonstrate superior performance in capturing complex distributions from large-scale datasets, providing a promising solution for quadrupedal locomotion control. However, offline policies are sensitive to Out-of-Distribution (OOD) states due to the limited state coverage in the datasets. In this work, we propose a two-stage learning framework combining offline learning and online preference alignment for legged locomotion control. In the offline stage, the diffusion planner learns the joint distribution of state-action sequences from expert datasets without using reward labels. Subsequently, we perform online interaction in the simulation environment based on the trained offline planner, which significantly alleviates the OOD issues and improves robustness. Specifically, we propose a novel weak preference labeling method that requires neither the ground-truth reward nor human preferences. The proposed method exhibits superior stability and velocity tracking accuracy in pacing, trotting, and bounding gaits under both low- and high-speed scenarios and can perform zero-shot transfer to real Unitree Go1 robots. The project website for this paper is at https://shangjaven.github.io/preference-aligned-diffusion-legged/.
Published: 2024

3. Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner

Author: Fan, Chenyou, Bai, Chenjia, Shan, Zhao, He, Haoran, Zhang, Yang, and Wang, Zhen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Diffusion models have demonstrated their capabilities in modeling multi-task trajectories. However, existing multi-task planners or policies typically rely on task-specific demonstrations via multi-task imitation, or require task-specific reward labels to facilitate policy optimization via Reinforcement Learning (RL). To address these challenges, we aim to develop a versatile diffusion planner that can leverage large-scale inferior data containing task-agnostic sub-optimal trajectories, with the ability to adapt quickly to specific tasks. In this paper, we propose SODP, a two-stage framework that leverages Sub-Optimal data to learn a Diffusion Planner, which is generalizable to various downstream tasks. Specifically, in the pre-training stage, we train a foundation diffusion planner that extracts general planning capabilities by modeling the versatile distribution of multi-task trajectories, which can be sub-optimal and have wide data coverage. Then for downstream tasks, we adopt RL-based fine-tuning with task-specific rewards to quickly refine the diffusion planner, which aims to generate action sequences with higher task-specific returns. Experimental results from multi-task domains including Meta-World and Adroit demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning.
Published: 2024

4. Forward KL Regularized Preference Optimization for Aligning Diffusion Policies

Author: Shan, Zhao, Fan, Chenyou, Qiu, Shuang, Shi, Jiyuan, and Bai, Chenjia
Subjects: Computer Science - Machine Learning
Abstract: Diffusion models have achieved remarkable success in sequential decision-making by leveraging the highly expressive model capabilities in policy learning. A central problem for learning diffusion policies is to align the policy output with human intents in various tasks. To achieve this, previous methods conduct return-conditioned policy generation or Reinforcement Learning (RL)-based policy optimization, but both rely on pre-defined reward functions. In this work, we propose a novel framework, Forward KL regularized Preference optimization for aligning Diffusion policies, to align the diffusion policy with preferences directly. We first train a diffusion policy from the offline dataset without considering the preference, and then align the policy to the preference data via direct preference optimization. During the alignment phase, we formulate direct preference learning in a diffusion policy, where the forward KL regularization is employed in preference optimization to avoid generating out-of-distribution actions. We conduct extensive experiments on MetaWorld manipulation and D4RL tasks. The results show our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms.
Published: 2024

5. SelfBC: Self Behavior Cloning for Offline Reinforcement Learning

Author: Liu, Shirong, Bai, Chenjia, Guo, Zixian, Zhang, Hao, Sharma, Gaurav, and Liu, Yang
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Policy constraint methods in offline reinforcement learning employ additional regularization techniques to constrain the discrepancy between the learned policy and the offline dataset. However, these methods tend to result in overly conservative policies that resemble the behavior policy, thus limiting their performance. We investigate this limitation and attribute it to the static nature of traditional constraints. In this paper, we propose a novel dynamic policy constraint that restricts the learned policy on the samples generated by the exponential moving average of previously learned policies. By integrating this self-constraint mechanism into off-policy methods, our method facilitates the learning of non-conservative policies while avoiding policy collapse in the offline setting. Theoretical results show that our approach results in a nearly monotonically improved reference policy. Extensive experiments on the D4RL MuJoCo domain demonstrate that our proposed method achieves state-of-the-art performance among the policy constraint methods.
Published: 2024
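
Note on record 5: the abstract describes a reference policy formed as an exponential moving average of previously learned policies. The following is a minimal sketch of such an update on plain parameter lists, not the authors' implementation. The function name `ema_update`, the decay parameter `tau`, and the list-of-floats representation are all assumptions made for illustration.

```python
def ema_update(reference, current, tau=0.005):
    """Move each reference parameter a fraction `tau` toward the current one.

    `reference` and `current` are equal-length lists of floats standing in
    for policy parameters (a hypothetical simplification).
    """
    if len(reference) != len(current):
        raise ValueError("parameter lists must have the same length")
    return [(1.0 - tau) * r + tau * c for r, c in zip(reference, current)]


# Example: blend a reference policy halfway toward the newest policy.
reference_params = [0.0, 1.0]
current_params = [1.0, 1.0]
reference_params = ema_update(reference_params, current_params, tau=0.5)
```

A small `tau` makes the reference policy change slowly, so a constraint toward it behaves like a slowly moving target rather than a static one.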

6. Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

Author: Zhang, Yang, Bai, Chenjia, Zhao, Bin, Yan, Junchi, Li, Xiu, and Li, Xuelong
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems
Abstract: Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve the sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue in a centralized architecture arising from a large number of agents, and also the non-stationarity issue in a decentralized architecture stemming from the inter-dependency among agents. To address both challenges, we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents. We cast the dynamics learning as an auto-regressive sequence modeling problem over discrete tokens by leveraging the expressive Transformer architecture, in order to model complex local dynamics across different agents and provide accurate and consistent long-term imaginations. For this first Transformer-based world model for multi-agent systems, we introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation. Results on the StarCraft Multi-Agent Challenge (SMAC) show that it outperforms strong model-free approaches and existing model-based methods in both sample efficiency and overall performance.
Published: 2024

7. SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

Author: Zhang, Junjie, Bai, Chenjia, He, Haoran, Xia, Wenke, Wang, Zhigang, Zhao, Bin, Li, Xiu, and Li, Xuelong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks., Comment: ICML 2024. Project page: https://sam-embodied.github.io
Published: 2024

8. Constrained Ensemble Exploration for Unsupervised Skill Discovery

Author: Bai, Chenjia, Yang, Rushuai, Zhang, Qiaosheng, Xu, Kang, Chen, Yi, Xiao, Ting, and Li, Xuelong
Subjects: Computer Science - Machine Learning
Abstract: Unsupervised Reinforcement Learning (RL) provides a promising paradigm for learning useful behaviors via reward-free pre-training. Existing methods for unsupervised RL mainly conduct empowerment-driven skill discovery or entropy-based exploration. However, empowerment often leads to static skills, and pure exploration only maximizes the state coverage rather than learning useful behaviors. In this paper, we propose a novel unsupervised RL framework via an ensemble of skills, where each skill performs partition exploration based on the state prototypes. Thus, each skill can explore the clustered area locally, and the ensemble skills maximize the overall state coverage. We adopt state-distribution constraints for the skill occupancy and the desired cluster for learning distinguishable skills. Theoretical analysis is provided for the state entropy and the resulting skill distributions. Based on extensive experiments on several challenging tasks, we find our method learns well-explored ensemble skills and achieves superior performance in various downstream tasks compared to previous methods., Comment: Accepted by ICML 2024
Published: 2024

9. Cross-Domain Policy Adaptation by Capturing Representation Mismatch

Author: Lyu, Jiafei, Bai, Chenjia, Yang, Jingwen, Lu, Zongqing, and Li, Xiu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: It is vital to learn effective policies that can be transferred to different domains with dynamics discrepancies in reinforcement learning (RL). In this paper, we consider dynamics adaptation settings where there exists a dynamics mismatch between the source domain and the target domain, and one has access to sufficient source domain data but only limited interactions with the target domain. Existing methods address this problem by learning domain classifiers, performing data filtering from a value discrepancy perspective, etc. Instead, we tackle this challenge from a decoupled representation learning perspective. We perform representation learning only in the target domain and measure the representation deviations on the transitions from the source domain, which we show can be a signal of dynamics mismatch. We also show that representation deviation upper-bounds the performance difference of a given policy between the source domain and the target domain, which motivates us to adopt representation deviation as a reward penalty. The produced representations are not involved in either the policy or the value function, but only serve as a reward penalizer. We conduct extensive experiments on environments with kinematic and morphology mismatch, and the results show that our method exhibits strong performance on many tasks. Our code is publicly available at https://github.com/dmksjfl/PAR., Comment: ICML 2024
Published: 2024

10. Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration

Author: Zhang, Yang, Yang, Shixin, Bai, Chenjia, Wu, Fei, Li, Xiu, Wang, Zhen, and Li, Xuelong
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multiagent Systems, Computer Science - Robotics
Abstract: Grounding the reasoning ability of large language models (LLMs) for embodied tasks is challenging due to the complexity of the physical world. In particular, LLM planning for multi-agent collaboration requires communication among agents or credit assignment as feedback to re-adjust the proposed plans and achieve effective coordination. However, existing methods that overly rely on physical verification or self-reflection suffer from excessive and inefficient querying of LLMs. In this paper, we propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans. Specifically, we perform critic regression to learn a sequential advantage function from LLM-planned data, and then treat the LLM planner as an optimizer to generate actions that maximize the advantage function. This endows the LLM with the foresight to discern whether an action contributes to accomplishing the final task. We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems. Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate, and also significantly decreases the interaction steps of agents and query rounds of LLMs, demonstrating its high efficiency for grounding LLMs. More results are given at https://read-llm.github.io/., Comment: The first two authors contributed equally
Published: 2024

11. Ensemble Successor Representations for Task Generalization in Offline-to-Online Reinforcement Learning

Author: Wang, Changhong, Yu, Xudong, Bai, Chenjia, Zhang, Qiaosheng, and Wang, Zhen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: In Reinforcement Learning (RL), training a policy from scratch with online experiences can be inefficient because of the difficulties in exploration. Recently, offline RL has provided a promising solution by giving an initialized offline policy, which can be refined through online interactions. However, existing approaches primarily perform offline and online learning in the same task, without considering the task generalization problem in offline-to-online adaptation. In real-world applications, it is common that we only have an offline dataset from a specific task while aiming for fast online adaptation to several tasks. To address this problem, our work builds upon the investigation of successor representations for task generalization in online RL and extends the framework to incorporate offline-to-online learning. We demonstrate that the conventional paradigm using successor features cannot effectively utilize offline data and improve the performance for the new task by online fine-tuning. To mitigate this, we introduce a novel methodology that leverages offline data to acquire an ensemble of successor representations and subsequently constructs ensemble Q functions. This approach enables robust representation learning from datasets with different coverage and facilitates fast adaptation of Q functions towards new tasks during the online fine-tuning phase. Extensive empirical evaluations provide compelling evidence showcasing the superior performance of our method in generalizing to diverse or even unseen tasks., Comment: Accepted by Science China Information Sciences
Published: 2024

12. Contrastive Representation for Data Filtering in Cross-Domain Offline Reinforcement Learning

Author: Wen, Xiaoyu, Bai, Chenjia, Xu, Kang, Yu, Xudong, Zhang, Yang, Li, Xuelong, and Wang, Zhen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Cross-domain offline reinforcement learning leverages source domain data with diverse transition dynamics to alleviate the data requirement for the target domain. However, simply merging the data of two domains leads to performance degradation due to the dynamics mismatch. Existing methods address this problem by measuring the dynamics gap via domain classifiers while relying on the assumptions of the transferability of paired domains. In this paper, we propose a novel representation-based approach to measure the domain gap, where the representation is learned through a contrastive objective by sampling transitions from different domains. We show that such an objective recovers the mutual-information gap of transition functions in two domains without suffering from the unbounded issue of the dynamics gap in handling significantly different domains. Based on the representations, we introduce a data filtering algorithm that selectively shares transitions from the source domain according to the contrastive score functions. Empirical results on various tasks demonstrate that our method achieves superior performance, using only 10% of the target data to achieve 89.2% of the performance on 100% target dataset with state-of-the-art methods., Comment: This paper has been accepted by ICML2024
Published: 2024

13. Pessimistic Value Iteration for Multi-Task Data Sharing in Offline Reinforcement Learning

Author: Bai, Chenjia, Wang, Lingxiao, Hao, Jianye, Yang, Zhuoran, Zhao, Bin, Wang, Zhen, and Li, Xuelong
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Offline Reinforcement Learning (RL) has shown promising results in learning a task-specific policy from a fixed dataset. However, successful offline RL often relies heavily on the coverage and quality of the given dataset. In scenarios where the dataset for a specific task is limited, a natural approach is to improve offline RL with datasets from other tasks, namely, to conduct Multi-Task Data Sharing (MTDS). Nevertheless, directly sharing datasets from other tasks exacerbates the distribution shift in offline RL. In this paper, we propose an uncertainty-based MTDS approach that shares the entire dataset without data selection. Given ensemble-based uncertainty quantification, we perform pessimistic value iteration on the shared offline dataset, which provides a unified framework for single- and multi-task offline RL. We further provide theoretical analysis, which shows that the optimality gap of our method is only related to the expected data coverage of the shared dataset, thus resolving the distribution shift issue in data sharing. Empirically, we release an MTDS benchmark and collect datasets from three challenging domains. The experimental results show our algorithm outperforms the previous state-of-the-art methods in challenging MTDS problems. See https://github.com/Baichenjia/UTDS for the datasets and code., Comment: Accepted by Artificial Intelligence (AIJ)
Published: 2024

14. Provably Efficient Information-Directed Sampling Algorithms for Multi-Agent Reinforcement Learning

Author: Zhang, Qiaosheng, Bai, Chenjia, Hu, Shuyue, Wang, Zhen, and Li, Xuelong
Subjects: Computer Science - Information Theory, Computer Science - Machine Learning, Computer Science - Multiagent Systems, Statistics - Machine Learning
Abstract: This work designs and analyzes a novel set of algorithms for multi-agent reinforcement learning (MARL) based on the principle of information-directed sampling (IDS). These algorithms draw inspiration from foundational concepts in information theory, and are proven to be sample efficient in MARL settings such as two-player zero-sum Markov games (MGs) and multi-player general-sum MGs. For episodic two-player zero-sum MGs, we present three sample-efficient algorithms for learning Nash equilibrium. The basic algorithm, referred to as MAIDS, employs an asymmetric learning structure where the max-player first solves a minimax optimization problem based on the joint information ratio of the joint policy, and the min-player then minimizes the marginal information ratio with the max-player's policy fixed. Theoretical analyses show that it achieves a Bayesian regret of $\tilde{O}(\sqrt{K})$ for $K$ episodes. To reduce the computational load of MAIDS, we develop an improved algorithm called Reg-MAIDS, which has the same Bayesian regret bound while enjoying less computational complexity. Moreover, by leveraging the flexibility of the IDS principle in choosing the learning target, we propose two methods for constructing compressed environments based on rate-distortion theory, upon which we develop an algorithm Compressed-MAIDS wherein the learning target is a compressed environment. Finally, we extend Reg-MAIDS to multi-player general-sum MGs and prove that it can learn either the Nash equilibrium or coarse correlated equilibrium in a sample efficient manner.
Published: 2024

15. Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement Learning

Author: Yu, Xudong, Bai, Chenjia, Guo, Hongyi, Wang, Changhong, and Wang, Zhen
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Offline Reinforcement Learning (RL) faces distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. To address this, existing uncertainty-based methods penalize the value function with uncertainty quantification and demand numerous ensemble networks, posing computational challenges and suboptimal outcomes. In this paper, we introduce a novel strategy employing diverse randomized value functions to estimate the posterior distribution of $Q$-values. It provides robust uncertainty quantification and estimates lower confidence bounds (LCB) of $Q$-values. By applying moderate value penalties for OOD actions, our method fosters a provably pessimistic approach. We also emphasize diversity within randomized value functions and enhance efficiency by introducing a diversity regularization method, reducing the requisite number of networks. These modules lead to reliable value estimation and efficient policy learning from offline data. Theoretical analysis shows that our method recovers the provably efficient LCB-penalty under linear MDP assumptions. Extensive empirical results also demonstrate that our proposed method significantly outperforms baseline methods in terms of performance and parametric efficiency.
Published: 2024
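
Note on record 15: the abstract refers to lower confidence bounds (LCB) of Q-values obtained from several value functions. The sketch below shows one generic way to form such a bound, the ensemble mean minus a multiple of the ensemble standard deviation. It is an illustrative assumption, not the estimator from the paper; the function name `lower_confidence_bound` and the coefficient `beta` are hypothetical.

```python
import statistics


def lower_confidence_bound(q_estimates, beta=1.0):
    """Return mean(q) - beta * std(q) over an ensemble of Q-value estimates.

    `q_estimates` holds one scalar estimate per ensemble member for the same
    state-action pair; `beta` (an assumed knob) sets how pessimistic we are.
    """
    mean = statistics.fmean(q_estimates)
    std = statistics.pstdev(q_estimates)  # population standard deviation
    return mean - beta * std


# Disagreement among members lowers the bound; agreement leaves the mean.
disagreeing = lower_confidence_bound([1.0, 3.0], beta=1.0)
agreeing = lower_confidence_bound([2.0, 2.0, 2.0], beta=1.0)
```

Actions whose estimates disagree strongly, as is typical for out-of-distribution actions, receive a larger penalty under this kind of bound.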

16. Regularized Conditional Diffusion Model for Multi-Task Preference Alignment

Author: Yu, Xudong, Bai, Chenjia, He, Haoran, Wang, Changhong, and Li, Xuelong
Subjects: Computer Science - Machine Learning
Abstract: Sequential decision-making is desired to align with human intents and exhibit versatility across various tasks. Previous methods formulate it as a conditional generation process, utilizing return-conditioned diffusion models to directly model trajectory distributions. Nevertheless, the return-conditioned paradigm relies on pre-defined reward functions, facing challenges when applied in multi-task settings characterized by varying reward functions (versatility) and showing limited controllability concerning human preferences (alignment). In this work, we adopt multi-task preferences as a unified condition for both single- and multi-task decision-making, and propose preference representations aligned with preference labels. The learned representations are used to guide the conditional generation process of diffusion models, and we introduce an auxiliary objective to maximize the mutual information between representations and corresponding generated trajectories, improving alignment between trajectories and preferences. Extensive experiments in D4RL and Meta-World demonstrate that our method presents favorable performance in single- and multi-task scenarios, and exhibits superior alignment with preferences., Comment: Accepted by NeurIPS 2024, 23 pages
Published: 2024

17. Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training

Author: He, Haoran, Bai, Chenjia, Pan, Ling, Zhang, Weinan, Zhao, Bin, and Li, Xuelong
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast number of human videos exist, capturing intricate tasks and interactions with the physical world. Promising prospects arise for utilizing actionless human videos for pre-training and transferring the knowledge to facilitate robot policy learning through limited robot demonstrations. However, it remains a challenge due to the domain gap between humans and robots. Moreover, it is difficult to extract useful information representing the dynamic world from human videos, because of their noisy and multimodal data structure. In this paper, we introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. We start by compressing both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning with a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches with superior performance. Our project website is available at https://video-diff.github.io/., Comment: Accepted by NeurIPS 2024. 24 pages
Published: 2024

18. OVD-Explorer: Optimism Should Not Be the Sole Pursuit of Exploration in Noisy Environments

Author: Liu, Jinyi, Wang, Zhi, Zheng, Yan, Hao, Jianye, Bai, Chenjia, Ye, Junjie, Wang, Zhen, Piao, Haiyin, and Sun, Yang
Subjects: Computer Science - Machine Learning
Abstract: In reinforcement learning, the optimism in the face of uncertainty (OFU) is a mainstream principle for directing exploration towards less explored areas, characterized by higher uncertainty. However, in the presence of environmental stochasticity (noise), purely optimistic exploration may lead to excessive probing of high-noise areas, consequently impeding exploration efficiency. Hence, in exploring noisy environments, while optimism-driven exploration serves as a foundation, prudent attention to alleviating unnecessary over-exploration in high-noise areas becomes beneficial. In this work, we propose Optimistic Value Distribution Explorer (OVD-Explorer) to achieve a noise-aware optimistic exploration for continuous control. OVD-Explorer proposes a new measurement of the policy's exploration ability considering noise in optimistic perspectives, and leverages gradient ascent to drive exploration. Practically, OVD-Explorer can be easily integrated with continuous control RL algorithms. Extensive evaluations on the MuJoCo and GridChaos tasks demonstrate the superiority of OVD-Explorer in achieving noise-aware optimistic exploration., Comment: Accepted by AAAI 2024, with appendix
Published: 2023

19. Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness

Author: Wen, Xiaoyu, Yu, Xudong, Yang, Rui, Chen, Haoyuan, Bai, Chenjia, and Wang, Zhen
Subjects: Computer Science - Machine Learning
Abstract: To obtain a near-optimal policy with fewer interactions in Reinforcement Learning (RL), a promising approach involves the combination of offline RL, which enhances sample efficiency by leveraging offline datasets, and online RL, which explores informative transitions by interacting with the environment. Offline-to-Online (O2O) RL provides a paradigm for improving an offline trained agent within limited online interactions. However, due to the significant distribution shift between online experiences and offline data, most offline RL algorithms suffer from performance drops and fail to achieve stable policy improvement in O2O adaptation. To address this problem, we propose the Robust Offline-to-Online (RO2O) algorithm, designed to enhance offline policies through uncertainty and smoothness, and to mitigate the performance drop in online adaptation. Specifically, RO2O incorporates Q-ensemble for uncertainty penalty and adversarial samples for policy and value smoothness, which enable RO2O to maintain a consistent learning procedure in online adaptation without requiring special changes to the learning objective. Theoretical analyses in linear MDPs demonstrate that the uncertainty and smoothness lead to a tighter optimality bound in O2O against distribution shift. Experimental results illustrate the superiority of RO2O in facilitating stable offline-to-online learning and achieving significant improvement with limited online interactions., Comment: This paper has been accepted by Journal of Artificial Intelligence Research (JAIR). arXiv admin note: text overlap with arXiv:2306.06871 by other authors
Published: 2023

20. Robust Quadrupedal Locomotion via Risk-Averse Policy Learning

Author: Shi, Jiyuan, Bai, Chenjia, He, Haoran, Han, Lei, Wang, Dong, Zhao, Bin, Zhao, Mingguo, Li, Xiu, and Li, Xuelong
Subjects: Computer Science - Robotics
Abstract: The robustness of legged locomotion is crucial for quadrupedal robots in challenging terrains. Recently, Reinforcement Learning (RL) has shown promising results in legged locomotion and various methods try to integrate privileged distillation, scene modeling, and external sensors to improve the generalization and robustness of locomotion policies. However, these methods struggle to handle uncertain scenarios such as abrupt terrain changes or unexpected external forces. In this paper, we consider a novel risk-sensitive perspective to enhance the robustness of legged locomotion. Specifically, we employ a distributional value function learned by quantile regression to model the aleatoric uncertainty of environments, and perform risk-averse policy learning by optimizing the worst-case scenarios via a risk distortion measure. Extensive experiments in both simulation environments and a real Aliengo robot demonstrate that our method is efficient in handling various external disturbances, and the resulting policy exhibits improved robustness in harsh and uncertain situations in legged locomotion. Videos are available at https://risk-averse-locomotion.github.io/., Comment: 8 pages, 5 figures
Published: 2023

21. Bridging the Sim-to-Real Gap from the Information Bottleneck Perspective

Author: He, Haoran, Wu, Peilin, Bai, Chenjia, Lai, Hang, Wang, Lingxiao, Pan, Ling, Hu, Xiaolin, and Zhang, Weinan
Subjects: Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Reinforcement Learning (RL) has recently achieved remarkable success in robotic control. However, most works in RL operate in simulated environments where privileged knowledge (e.g., dynamics, surroundings, terrains) is readily available. Conversely, in real-world scenarios, robot agents usually rely solely on local states (e.g., proprioceptive feedback of robot joints) to select actions, leading to a significant sim-to-real gap. Existing methods address this gap by either gradually reducing the reliance on privileged knowledge or performing a two-stage policy imitation. However, we argue that these methods are limited in their ability to fully leverage the available privileged knowledge, resulting in suboptimal performance. In this paper, we formulate the sim-to-real gap as an information bottleneck problem and therefore propose a novel privileged knowledge distillation method called the Historical Information Bottleneck (HIB). In particular, HIB learns a privileged knowledge representation from historical trajectories by capturing the underlying changeable dynamic information. Theoretical analysis shows that the learned privileged knowledge representation helps reduce the value discrepancy between the oracle and learned policies. Empirical experiments on both simulated and real-world tasks demonstrate that HIB yields improved generalizability compared to previous methods. Videos of real-world experiments are available at https://sites.google.com/view/history-ib ., Comment: Accepted by CoRL 2024
Published: 2023

22. Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning

Author: He, Haoran, Bai, Chenjia, Xu, Kang, Yang, Zhuoran, Zhang, Weinan, Wang, Dong, Zhao, Bin, and Li, Xuelong
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Diffusion models have demonstrated highly-expressive generative capabilities in vision and NLP. Recent studies in reinforcement learning (RL) have shown that diffusion models are also powerful in modeling complex policies or trajectories in offline datasets. However, these works have been limited to single-task settings where a generalist agent capable of addressing multi-task predicaments is absent. In this paper, we aim to investigate the effectiveness of a single diffusion model in modeling large-scale multi-task offline data, which can be challenging due to diverse and multimodal data distribution. Specifically, we propose Multi-Task Diffusion Model (MTDiff), a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis in multi-task offline settings. MTDiff leverages vast amounts of knowledge available in multi-task data and performs implicit knowledge sharing among tasks. For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D. For data synthesis, MTDiff generates high-quality data for testing tasks given a single demonstration as a prompt, which enhances the low-quality datasets for even unseen tasks., Comment: Accepted by NeurIPS 2023. 22 pages
Published: 2023

23. Cross-Domain Policy Adaptation via Value-Guided Data Filtering

Author: Xu, Kang, Bai, Chenjia, Ma, Xiaoteng, Wang, Dong, Zhao, Bin, Wang, Zhen, Li, Xuelong, and Li, Wei
Subjects: Computer Science - Machine Learning
Abstract: Generalizing policies across different domains with dynamics mismatch poses a significant challenge in reinforcement learning. For example, a robot learns the policy in a simulator, but when it is deployed in the real world, the dynamics of the environment may be different. Given the source and target domain with dynamics mismatch, we consider the online dynamics adaptation problem, in which case the agent can access sufficient source domain data while online interactions with the target domain are limited. Existing research has attempted to solve the problem from the dynamics discrepancy perspective. In this work, we reveal the limitations of these methods and explore the problem from the value difference perspective via a novel insight on the value consistency across domains. Specifically, we present the Value-Guided Data Filtering (VGDF) algorithm, which selectively shares transitions from the source domain based on the proximity of paired value targets across the two domains. Empirical results on various environments with kinematic and morphology shifts demonstrate that our method achieves superior performance compared to prior approaches., Comment: 27 pages, 15 figures
Published: 2023

24. On the Value of Myopic Behavior in Policy Reuse

Author: Xu, Kang, Bai, Chenjia, Qiu, Shuang, He, Haoran, Zhao, Bin, Wang, Zhen, Li, Wei, and Li, Xuelong
Subjects: Computer Science - Machine Learning
Abstract: Leveraging learned strategies in unfamiliar scenarios is fundamental to human intelligence. In reinforcement learning, rationally reusing the policies acquired from other tasks or human experts is critical for tackling problems that are difficult to learn from scratch. In this work, we present a framework called Selective Myopic bEhavior Control (SMEC), which results from the insight that the short-term behaviors of prior policies are sharable across tasks. By evaluating the behaviors of prior policies via a hybrid value function architecture, SMEC adaptively aggregates the sharable short-term behaviors of prior policies and the long-term behaviors of the task policy, leading to coordinated decisions. Empirical results on a collection of manipulation and locomotion tasks demonstrate that SMEC outperforms existing methods, and validate the ability of SMEC to leverage related prior policies., Comment: 28 pages, 25 figures
Published: 2023

25. Behavior Contrastive Learning for Unsupervised Skill Discovery

Author: Yang, Rushuai, Bai, Chenjia, Guo, Hongyi, Li, Siyuan, Zhao, Bin, Wang, Zhen, Liu, Peng, and Li, Xuelong
Subjects: Computer Science - Machine Learning
Abstract: In reinforcement learning, unsupervised skill discovery aims to learn diverse skills without extrinsic rewards. Previous methods discover skills by maximizing the mutual information (MI) between states and skills. However, such an MI objective tends to learn simple and static skills and may hinder exploration. In this paper, we propose a novel unsupervised skill discovery method through contrastive learning among behaviors, which makes the agent produce similar behaviors for the same skill and diverse behaviors for different skills. Under mild assumptions, our objective maximizes the MI between different behaviors based on the same skill, which serves as an upper bound of the previous MI objective. Meanwhile, our method implicitly increases the state entropy to obtain better state coverage. We evaluate our method on challenging mazes and continuous control tasks. The results show that our method generates diverse and far-reaching skills, and also obtains competitive performance in downstream tasks compared to the state-of-the-art methods., Comment: Accepted at the 40th International Conference on Machine Learning (ICML 2023)
Published: 2023

26. Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning

Author: Qiu, Shuang, Wang, Lingxiao, Bai, Chenjia, Yang, Zhuoran, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: In view of its power in extracting feature representation, contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL), leading to efficient policy learning in various applications. Despite its tremendous empirical successes, the understanding of contrastive learning for RL remains elusive. To narrow such a gap, we study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions. For both models, we propose to extract the correct feature representations of the low-rank model by minimizing a contrastive loss. Moreover, under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs. We further theoretically prove that our algorithm recovers the true representations and simultaneously achieves sample efficiency in learning the optimal policy and Nash equilibrium in MDPs and MGs. We also provide empirical studies to demonstrate the efficacy of the UCB-based contrastive learning method for RL. To the best of our knowledge, we provide the first provably efficient online RL algorithm that incorporates contrastive learning for representation learning. Our codes are available at https://github.com/Baichenjia/Contrastive-UCB., Comment: ICML 2022
Published: 2022

27. RORL: Robust Offline Reinforcement Learning via Conservative Smoothing

Author: Yang, Rui, Bai, Chenjia, Ma, Xiaoteng, Wang, Zhaoran, Zhang, Chongjie, and Han, Lei
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: Offline reinforcement learning (RL) provides a promising direction to exploit massive amounts of offline data for complex decision-making tasks. Due to the distribution shift issue, current offline RL algorithms are generally designed to be conservative in value estimation and action selection. However, such conservatism can impair the robustness of learned policies when encountering observation deviation under realistic conditions, such as sensor errors and adversarial attacks. To trade off robustness and conservatism, we propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique. In RORL, we explicitly introduce regularization on the policy and the value function for states near the dataset, as well as additional conservative value estimation on these states. Theoretically, we show RORL enjoys a tighter suboptimality bound than recent theoretical results in linear MDPs. We demonstrate that RORL can achieve state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbations., Comment: Accepted by Advances in Neural Information Processing Systems (NeurIPS) 2022
Published: 2022

28. Skill matters: Dynamic skill learning for multi-agent cooperative reinforcement learning

Author: Li, Tong, Bai, Chenjia, Xu, Kang, Chu, Chen, Zhu, Peican, and Wang, Zhen
Published: 2025
Full Text: View/download PDF

29. Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning

Author: Bai, Chenjia, Wang, Lingxiao, Yang, Zhuoran, Deng, Zhihong, Garg, Animesh, Liu, Peng, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning
Abstract: Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment. Directly applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions. Previous methods tackle this problem by penalizing the Q-values of OOD actions or constraining the trained policy to be close to the behavior policy. Nevertheless, such methods typically prevent the generalization of value functions beyond the offline data and also lack precise characterization of OOD data. In this paper, we propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints. Specifically, PBRL conducts uncertainty quantification via the disagreement of bootstrapped Q-functions, and performs pessimistic updates by penalizing the value function based on the estimated uncertainty. To tackle the extrapolation error, we further propose a novel OOD sampling method. We show that such OOD sampling and pessimistic bootstrapping yield a provable uncertainty quantifier in linear MDPs, thus providing the theoretical underpinning for PBRL. Extensive experiments on the D4RL benchmark show that PBRL has better performance compared to the state-of-the-art algorithms., Comment: ICLR 2022
Published: 2022

30. False Correlation Reduction for Offline Reinforcement Learning

Author: Deng, Zhihong, Fu, Zuyue, Wang, Lingxiao, Yang, Zhuoran, Bai, Chenjia, Zhou, Tianyi, Wang, Zhaoran, and Jiang, Jing
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Offline reinforcement learning (RL) harnesses the power of massive datasets for resolving sequential decision problems. Most existing papers only discuss defending against out-of-distribution (OOD) actions while we investigate a broader issue, the false correlations between epistemic uncertainty and decision-making, an essential factor that causes suboptimality. In this paper, we propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm. We empirically show that SCORE achieves the SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL). The proposed algorithm introduces an annealing behavior cloning regularizer to help produce a high-quality estimation of uncertainty which is critical for eliminating false correlations from suboptimality. Theoretically, we justify the rationality of the proposed method and prove its convergence to the optimal policy with a sublinear rate under mild assumptions., Comment: 16 pages, 14 figures
Published: 2021

31. Dynamic Bottleneck for Robust Self-Supervised Exploration

Author: Bai, Chenjia, Wang, Lingxiao, Han, Lei, Garg, Animesh, Hao, Jianye, Liu, Peng, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Exploration methods based on pseudo-count of transitions or curiosity of dynamics have achieved promising results in solving reinforcement learning with sparse rewards. However, such methods are usually sensitive to environmental dynamics-irrelevant information, e.g., white-noise. To handle such dynamics-irrelevant information, we propose a Dynamic Bottleneck (DB) model, which attains a dynamics-relevant representation based on the information-bottleneck principle. Based on the DB model, we further propose DB-bonus, which encourages the agent to explore state-action pairs with high information gain. We establish theoretical connections between the proposed DB-bonus, the upper confidence bound (UCB) for linear case, and the visiting count for tabular case. We evaluate the proposed method on the Atari suite with dynamics-irrelevant noises. Our experiments show that exploration with DB bonus outperforms several state-of-the-art exploration methods in noisy environments., Comment: NeurIPS 2021
Published: 2021

32. Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain

Author: Hao, Jianye, Yang, Tianpei, Tang, Hongyao, Bai, Chenjia, Liu, Jinyi, Meng, Zhaopeng, Liu, Peng, and Wang, Zhen
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multiagent Systems
Abstract: Deep Reinforcement Learning (DRL) and Deep Multi-agent Reinforcement Learning (MARL) have achieved significant successes across a wide range of domains, including game AI, autonomous vehicles, robotics, and so on. However, DRL and deep MARL agents are widely known to be sample-inefficient, in that millions of interactions are usually needed even for relatively simple problem settings, thus preventing the wide application and deployment in real-industry scenarios. One bottleneck behind this is the well-known exploration problem, i.e., how to efficiently explore the environment and collect informative experiences that could benefit policy learning towards the optimal ones. This problem becomes more challenging in complex environments with sparse rewards, noisy distractions, long horizons, and non-stationary co-learners. In this paper, we conduct a comprehensive survey on existing exploration methods for both single-agent and multi-agent RL. We start the survey by identifying several key challenges to efficient exploration. Then, we provide a systematic survey of existing approaches by classifying them into two major categories: uncertainty-oriented exploration and intrinsic motivation-oriented exploration. Beyond the above two main branches, we also include other notable exploration methods with different ideas and techniques. In addition to algorithmic analysis, we provide a comprehensive and unified empirical comparison of different exploration methods for DRL on a set of commonly used benchmarks. According to our algorithmic and empirical investigation, we finally summarize the open problems of exploration in DRL and deep MARL and point out a few future directions., Comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
Published: 2021
Full Text: View/download PDF

33. Principled Exploration via Optimistic Bootstrapping and Backward Induction

Author: Bai, Chenjia, Wang, Lingxiao, Han, Lei, Hao, Jianye, Garg, Animesh, Liu, Peng, and Wang, Zhaoran
Subjects: Computer Science - Machine Learning
Abstract: One principled approach for provably efficient exploration is incorporating the upper confidence bound (UCB) into the value function as a bonus. However, UCB is specified to deal with linear and tabular settings and is incompatible with Deep Reinforcement Learning (DRL). In this paper, we propose a principled exploration method for DRL through Optimistic Bootstrapping and Backward Induction (OB2I). OB2I constructs a general-purpose UCB-bonus through non-parametric bootstrap in DRL. The UCB-bonus estimates the epistemic uncertainty of state-action pairs for optimistic exploration. We build theoretical connections between the proposed UCB-bonus and the LSVI-UCB in a linear setting. We propagate future uncertainty in a time-consistent manner through episodic backward update, which exploits the theoretical advantage and empirically improves the sample-efficiency. Our experiments in the MNIST maze and Atari suite suggest that OB2I outperforms several state-of-the-art exploration approaches., Comment: ICML 2021
Published: 2021

34. Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning

Author: Bai, Chenjia, Liu, Peng, Liu, Kaiyu, Wang, Lingxiao, Zhao, Yingnan, and Han, Lei
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from environments are sparse or even totally disregarded. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on the conditional variational inference to model the multimodality and stochasticity. We consider the environmental state-action transition as a conditional generative process by generating the next-state prediction under the condition of the current state, action, and latent variable, which provides a better understanding of the dynamics and leads to better performance in exploration. We derive an upper bound of the negative log-likelihood of the environmental transition and use such an upper bound as the intrinsic reward for exploration, which allows the agent to learn skills by self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulating task. Our method outperforms several state-of-the-art environment model-based exploration approaches., Comment: IEEE Transactions on Neural Networks and Learning Systems (TNNLS) 2021
Published: 2020
Full Text: View/download PDF

35. Elucidation of the using condition and working mechanism of tertiary lactose in dry powder formulations for inhalation

Author: Ye, Yuqing, Shi, Tingting, Fan, Ziyi, Bai, Chenjia, Ma, Ying, and Zhu, Jesse
Published: 2023
Full Text: View/download PDF

36. Large-Scale Actionless Video Pre-Training via Discrete Diffusion for Efficient Policy Learning

Author: He, Haoran, Bai, Chenjia, Pan, Ling, Zhang, Weinan, Zhao, Bin, and Li, Xuelong
Abstract: Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast amount of human videos exist, capturing intricate tasks and interactions with the physical world. Promising prospects arise for utilizing actionless human videos for pre-training and transferring the knowledge to facilitate robot policy learning through limited robot demonstrations. In this paper, we introduce a novel framework that leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. We start by compressing both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning trained on a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches with superior generalization ability. Our project website is available at https://video-diff.github.io/., Comment: 21 pages
Published: 2024

37. Generating attentive goals for prioritized hindsight reinforcement learning

Author: Liu, Peng, Bai, Chenjia, Zhao, Yingnan, Bai, Chenyao, Zhao, Wei, and Tang, Xianglong
Published: 2020
Full Text: View/download PDF

38. Obtaining accurate estimated action values in categorical distributional reinforcement learning

Author: Zhao, Yingnan, Liu, Peng, Bai, Chenjia, Zhao, Wei, and Tang, Xianglong
Published: 2020
Full Text: View/download PDF

39. Pessimistic value iteration for multi-task data sharing in Offline Reinforcement Learning

Author: Bai, Chenjia, primary, Wang, Lingxiao, additional, Hao, Jianye, additional, Yang, Zhuoran, additional, Zhao, Bin, additional, Wang, Zhen, additional, and Li, Xuelong, additional
Published: 2024
Full Text: View/download PDF

40. Skill Matters: Dynamic Skill Learning for Multi-Agent Cooperative Reinforcement Learning

Author: Li, Tong, primary, Bai, Chenjia, additional, Xu, Kang, additional, Chu, Chen, additional, Zhu, Peican, additional, and Wang, Zhen, additional
Published: 2024
Full Text: View/download PDF

41. Guided goal generation for hindsight multi-goal reinforcement learning

Author: Bai, Chenjia, Liu, Peng, Zhao, Wei, and Tang, Xianglong
Published: 2019
Full Text: View/download PDF

42. Self-Supervised Imitation for Offline Reinforcement Learning With Hindsight Relabeling

Author: Yu, Xudong, primary, Bai, Chenjia, additional, Wang, Changhong, additional, Yu, Dengxiu, additional, Chen, C. L. Philip, additional, and Wang, Zhen, additional
Published: 2023
Full Text: View/download PDF

43. Privileged Knowledge Distillation for Sim-to-Real Policy Generalization

Author: He, Haoran, Bai, Chenjia, Lai, Hang, Wang, Lingxiao, and Zhang, Weinan
Abstract: Reinforcement Learning (RL) has recently achieved remarkable success in robotic control. However, most RL methods operate in simulated environments where privileged knowledge (e.g., dynamics, surroundings, terrains) is readily available. Conversely, in real-world scenarios, robot agents usually rely solely on local states (e.g., proprioceptive feedback of robot joints) to select actions, leading to a significant sim-to-real gap. Existing methods address this gap by either gradually reducing the reliance on privileged knowledge or performing a two-stage policy imitation. However, we argue that these methods are limited in their ability to fully leverage the privileged knowledge, resulting in suboptimal performance. In this paper, we propose a novel single-stage privileged knowledge distillation method called the Historical Information Bottleneck (HIB) to narrow the sim-to-real gap. In particular, HIB learns a privileged knowledge representation from historical trajectories by capturing the underlying changeable dynamic information. Theoretical analysis shows that the learned privileged knowledge representation helps reduce the value discrepancy between the oracle and learned policies. Empirical experiments on both simulated and real-world tasks demonstrate that HIB yields improved generalizability compared to previous methods., Comment: 22 pages
Published: 2023

44. False Correlation Reduction for Offline Reinforcement Learning

Author: Deng, Zhihong, Fu, Zuyue, Wang, Lingxiao, Yang, Zhuoran, Bai, Chenjia, Zhou, Tianyi, Wang, Zhaoran, and Jiang, Jing
Abstract: Offline reinforcement learning (RL) harnesses the power of massive datasets for resolving sequential decision problems. Most existing papers only discuss defending against out-of-distribution (OOD) actions while we investigate a broader issue, the false correlations between epistemic uncertainty and decision-making, an essential factor that causes suboptimality. In this paper, we propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm. We empirically show that SCORE achieves the SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL). The proposed algorithm introduces an annealing behavior cloning regularizer to help produce a high-quality estimation of uncertainty which is critical for eliminating false correlations from suboptimality. Theoretically, we justify the rationality of the proposed method and prove its convergence to the optimal policy with a sublinear rate under mild assumptions.
Published: 2024
Full Text: View/download PDF

45. Monotonic Quantile Network for Worst-Case Offline Reinforcement Learning

Author: Bai, Chenjia, Xiao, Ting, Zhu, Zhoufan, Wang, Lingxiao, Zhou, Fan, Garg, Animesh, He, Bin, Liu, Peng, and Wang, Zhaoran
Abstract: A key challenge in offline reinforcement learning (RL) is how to ensure the learned offline policy is safe, especially in safety-critical domains. In this article, we focus on learning a distributional value function in offline RL and optimizing a worst-case criterion of returns. However, optimizing a distributional value function in offline RL can be hard, since the crossing quantile issue is serious, and the distribution shift problem needs to be addressed. To this end, we propose monotonic quantile network (MQN) with conservative quantile regression (CQR) for risk-averse policy learning. First, we propose an MQN to learn the distribution over returns with non-crossing guarantees of the quantiles. Then, we perform CQR by penalizing the quantile estimation for out-of-distribution (OOD) actions to address the distribution shift in offline RL. Finally, we learn a worst-case policy by optimizing the conditional value-at-risk (CVaR) of the distributional value function. Furthermore, we provide theoretical analysis of the fixed-point convergence in our method. We conduct experiments in both risk-neutral and risk-sensitive offline settings, and the results show that our method obtains safe and conservative behaviors in robotic locomotion tasks.
Published: 2024
Full Text: View/download PDF

46. Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain

Author: Hao, Jianye, Yang, Tianpei, Tang, Hongyao, Bai, Chenjia, Liu, Jinyi, Meng, Zhaopeng, Liu, Peng, and Wang, Zhen
Abstract: Deep reinforcement learning (DRL) and deep multiagent reinforcement learning (MARL) have achieved significant success across a wide range of domains, including game artificial intelligence (AI), autonomous vehicles, and robotics. However, DRL and deep MARL agents are widely known to be sample-inefficient, in that millions of interactions are usually needed even for relatively simple problem settings, thus preventing the wide application and deployment in real-industry scenarios. One bottleneck behind this is the well-known exploration problem, i.e., how to efficiently explore the environment and collect informative experiences that could benefit policy learning toward the optimal ones. This problem becomes more challenging in complex environments with sparse rewards, noisy distractions, long horizons, and nonstationary co-learners. In this article, we conduct a comprehensive survey on existing exploration methods for both single-agent RL and multiagent RL. We start the survey by identifying several key challenges to efficient exploration. Then, we provide a systematic survey of existing approaches by classifying them into two major categories: uncertainty-oriented exploration and intrinsic motivation-oriented exploration. Beyond the above two main branches, we also include other notable exploration methods with different ideas and techniques. In addition to algorithmic analysis, we provide a comprehensive and unified empirical comparison of different exploration methods for DRL on a set of commonly used benchmarks. According to our algorithmic and empirical investigation, we finally summarize the open problems of exploration in DRL and deep MARL and point out a few future directions.
Published: 2024
Full Text: View/download PDF

47. False Correlation Reduction for Offline Reinforcement Learning

Author: Deng, Zhihong, primary, Fu, Zuyue, additional, Wang, Lingxiao, additional, Yang, Zhuoran, additional, Bai, Chenjia, additional, Zhou, Tianyi, additional, Wang, Zhaoran, additional, and Jiang, Jing, additional
Published: 2023
Full Text: View/download PDF

48. Exploration in Deep Reinforcement Learning: From Single-Agent to Multiagent Domain

Author: Hao, Jianye, primary, Yang, Tianpei, additional, Tang, Hongyao, additional, Bai, Chenjia, additional, Liu, Jinyi, additional, Meng, Zhaopeng, additional, Liu, Peng, additional, and Wang, Zhen, additional
Published: 2023
Full Text: View/download PDF

49. Addressing Hindsight Bias in Multigoal Reinforcement Learning

Author: Bai, Chenjia, primary, Wang, Lingxiao, additional, Wang, Yixin, additional, Wang, Zhaoran, additional, Zhao, Rui, additional, Bai, Chenyao, additional, and Liu, Peng, additional
Published: 2023
Full Text: View/download PDF

50. Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning

Author: Bai, Chenjia, Liu, Peng, Liu, Kaiyu, Wang, Lingxiao, Zhao, Yingnan, Han, Lei, and Wang, Zhaoran
Abstract: Efficient exploration remains a challenging problem in reinforcement learning, especially for tasks where extrinsic rewards from environments are sparse or even totally disregarded. Significant advances based on intrinsic motivation show promising results in simple environments but often get stuck in environments with multimodal and stochastic dynamics. In this work, we propose a variational dynamic model based on conditional variational inference to model the multimodality and stochasticity. We treat the environmental state–action transition as a conditional generative process that generates the next-state prediction conditioned on the current state, action, and latent variable, which provides a better understanding of the dynamics and leads to better performance in exploration. We derive an upper bound of the negative log likelihood of the environmental transition and use this upper bound as the intrinsic reward for exploration, which allows the agent to learn skills by self-supervised exploration without observing extrinsic rewards. We evaluate the proposed method on several image-based simulation tasks and a real robotic manipulation task. Our method outperforms several state-of-the-art environment model-based exploration approaches.
Published: 2023
Full Text: View/download PDF
