Author: "Akrour, Riad" / Publication Year Range: Last 50 years - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Akrour, Riad"' showing total 43 results


1. Augmented Bayesian Policy Search

Author: Kallel, Mahdi, Basu, Debabrota, Akrour, Riad, and D'Eramo, Carlo
Subjects: Computer Science - Machine Learning
Abstract: Deterministic policies are often preferred over stochastic ones when implemented on physical systems. They can prevent erratic and harmful behaviors while being easier to implement and interpret. However, in practice, exploration is largely performed by stochastic policies. First-order Bayesian Optimization (BO) methods offer a principled way of performing exploration using deterministic policies. This is done through a learned probabilistic model of the objective function and its gradient. Nonetheless, such approaches treat policy search as a black-box problem, and thus, neglect the reinforcement learning nature of the problem. In this work, we leverage the performance difference lemma to introduce a novel mean function for the probabilistic model. This results in augmenting BO methods with the action-value function. Hence, we call our method Augmented Bayesian Search (ABS). Interestingly, this new mean function enhances the posterior gradient with the deterministic policy gradient, effectively bridging the gap between BO and policy gradient methods. The resulting algorithm combines the convenience of the direct policy search with the scalability of reinforcement learning. We validate ABS on high-dimensional locomotion problems and demonstrate competitive performance compared to existing direct policy search schemes. Comment: Accepted to the International Conference on Learning Representations (ICLR) 2024
Published: 2024

2. Interpretable and Editable Programmatic Tree Policies for Reinforcement Learning

Author: Kohler, Hector, Delfosse, Quentin, Akrour, Riad, Kersting, Kristian, and Preux, Philippe
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Deep reinforcement learning agents are prone to goal misalignments. The black-box nature of their policies hinders the detection and correction of such misalignments, and the trust necessary for real-world deployment. So far, solutions learning interpretable policies are inefficient or require many human priors. We propose INTERPRETER, a fast distillation method producing INTerpretable Editable tRee Programs for ReinforcEmenT lEaRning. We empirically demonstrate that INTERPRETER compact tree programs match oracles across a diverse set of sequential decision tasks and evaluate the impact of our design choices on interpretability and performances. We show that our policies can be interpreted and edited to correct misalignments on Atari games and to explain real farming strategies.
Published: 2024

3. Limits of Actor-Critic Algorithms for Decision Tree Policies Learning in IBMDPs

Author: Kohler, Hector, Akrour, Riad, and Preux, Philippe
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Interpretability of AI models allows for user safety checks to build trust in such AIs. In particular, Decision Trees (DTs) provide a global look at the learned model and transparently reveal which features of the input are critical for making a decision. However, interpretability is hindered if the DT is too large. To learn compact trees, a recent Reinforcement Learning (RL) framework has been proposed to explore the space of DTs using deep RL. This framework augments a decision problem (e.g. a supervised classification task) with additional actions that gather information about the features of an otherwise hidden input. By appropriately penalizing these actions, the agent learns to optimally trade off size and performance of DTs. In practice, a reactive policy for a partially observable Markov decision process (MDP) needs to be learned, which is still an open problem. We show in this paper that deep RL can fail even on simple toy tasks of this class. However, when the underlying decision problem is a supervised classification task, we show that finding the optimal tree can be cast as a fully observable Markov decision problem and be solved efficiently, giving rise to a new family of algorithms for learning DTs that go beyond the classical greedy maximization ones. Comment: To be included in another submission. arXiv admin note: text overlap with arXiv:2304.05839
Published: 2023

4. Interpretable Decision Tree Search as a Markov Decision Process

Author: Kohler, Hector, Akrour, Riad, and Preux, Philippe
Subjects: Computer Science - Machine Learning
Abstract: Finding an optimal decision tree for a supervised learning task is a challenging combinatorial problem to solve at scale. It was recently proposed to frame the problem as a Markov Decision Problem (MDP) and use deep reinforcement learning to tackle scaling. Unfortunately, these methods are not competitive with the current branch-and-bound state-of-the-art. We propose instead to scale the resolution of such MDPs using an information-theoretic tests generating function that heuristically, and dynamically for every state, limits the set of admissible test actions to a few good candidates. As a solver, we show empirically that our algorithm is at the very least competitive with branch-and-bound alternatives. As a machine learning tool, a key advantage of our approach is to solve for multiple complexity-performance trade-offs at virtually no additional cost. With such a set of solutions, a user can then select the tree that generalizes best and which has the interpretability level that best suits their needs, which no current branch-and-bound method allows.
Published: 2023

5. Optimal Interpretability-Performance Trade-off of Classification Trees with Black-Box Reinforcement Learning

Author: Kohler, Hector, Akrour, Riad, and Preux, Philippe
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Interpretability of AI models allows for user safety checks to build trust in these models. In particular, decision trees (DTs) provide a global view of the learned model and clearly outline the role of the features that are critical to classifying a given data point. However, interpretability is hindered if the DT is too large. To learn compact trees, a Reinforcement Learning (RL) framework has been recently proposed to explore the space of DTs. A given supervised classification task is modeled as a Markov decision problem (MDP) and then augmented with additional actions that gather information about the features, equivalent to building a DT. By appropriately penalizing these actions, the RL agent learns to optimally trade off size and performance of a DT. However, to do so, this RL agent has to solve a partially observable MDP. The main contribution of this paper is to prove that it is sufficient to solve a fully observable problem to learn a DT optimizing the interpretability-performance trade-off. As such, any planning or RL algorithm can be used. We demonstrate the effectiveness of this approach on a set of classical supervised classification datasets and compare our approach with other interpretability-performance optimizing methods.
Published: 2023

6. Entropy Regularized Reinforcement Learning with Cascading Networks

Author: Della Vecchia, Riccardo, Shilova, Alena, Preux, Philippe, and Akrour, Riad
Subjects: Computer Science - Machine Learning
Abstract: Deep Reinforcement Learning (Deep RL) has had incredible achievements on high dimensional problems, yet its learning process remains unstable even on the simplest tasks. Deep RL uses neural networks as function approximators. These neural models are largely inspired by developments in the (un)supervised machine learning community. Compared to these learning frameworks, one of the major difficulties of RL is the absence of i.i.d. data. One way to cope with this difficulty is to control the rate of change of the policy at every iteration. In this work, we challenge the common practices of the (un)supervised learning community of using a fixed neural architecture, by having a neural model that grows in size at each policy update. This allows a closed form entropy regularized policy update, which leads to a better control of the rate of change of the policy at each iteration and helps cope with the non-i.i.d. nature of RL. Initial experiments on classical RL benchmarks show promising results with remarkable convergence on some RL tasks when compared to other deep RL baselines, while exhibiting limitations on others.
Published: 2022

7. Convex Optimization with an Interpolation-based Projection and its Application to Deep Learning

Author: Akrour, Riad, Atamna, Asma, and Peters, Jan
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Convex optimizers have known many applications as differentiable layers within deep neural architectures. One application of these convex layers is to project points into a convex set. However, both forward and backward passes of these convex layers are significantly more expensive to compute than those of a typical neural network. We investigate in this paper whether an inexact, but cheaper projection, can drive a descent algorithm to an optimum. Specifically, we propose an interpolation-based projection that is computationally cheap and easy to compute given a convex, domain defining, function. We then propose an optimization algorithm that follows the gradient of the composition of the objective and the projection and prove its convergence for linear objectives and arbitrary convex and Lipschitz domain defining inequality constraints. In addition to the theoretical contributions, we demonstrate empirically the practical interest of the interpolation projection when used in conjunction with neural networks in a reinforcement learning and a supervised learning setting.
Published: 2020

8. Continuous Action Reinforcement Learning from a Mixture of Interpretable Experts

Author: Akrour, Riad, Tateo, Davide, and Peters, Jan
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Reinforcement learning (RL) has demonstrated its ability to solve high dimensional tasks by leveraging non-linear function approximators. However, these successes are mostly achieved by 'black-box' policies in simulated domains. When deploying RL to the real world, several concerns regarding the use of a 'black-box' policy might be raised. In order to make the learned policies more transparent, we propose in this paper a policy iteration scheme that retains a complex function approximator for its internal value predictions but constrains the policy to have a concise, hierarchical, and human-readable structure, based on a mixture of interpretable experts. Each expert selects a primitive action according to a distance to a prototypical state. A key design decision to keep such experts interpretable is to select the prototypical states from trajectory data. The main technical contribution of the paper is to address the challenges introduced by this non-differentiable prototypical state selection procedure. Experimentally, we show that our proposed algorithm can learn compelling policies on continuous action deep RL benchmarks, matching the performance of neural network based policies, but returning policies that are more amenable to human inspection than neural network or linear-in-feature policies.
Published: 2020

9. An Upper Bound of the Bias of Nadaraya-Watson Kernel Regression under Lipschitz Assumptions

Author: Tosatto, Samuele, Akrour, Riad, and Peters, Jan
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: The Nadaraya-Watson kernel estimator is among the most popular nonparametric regression techniques thanks to its simplicity. Its asymptotic bias has been studied by Rosenblatt in 1969 and has been reported in a number of related works. However, Rosenblatt's analysis is only valid for infinitesimal bandwidth. In contrast, we propose in this paper an upper bound of the bias which holds for finite bandwidths. Moreover, contrary to the classic analysis, we allow for discontinuous first order derivative of the regression function, we extend our bounds for multidimensional domains and we include the knowledge of the bound of the regression function when it exists and if it is known, to obtain a tighter bound. We believe that this work has potential applications in those fields where some hard guarantees on the error are needed.
Published: 2020

10. Compatible Natural Gradient Policy Search

Author: Pajarinen, Joni, Thai, Hong Linh, Akrour, Riad, Peters, Jan, and Neumann, Gerhard
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
Published: 2019

11. Model-Free Trajectory-based Policy Optimization with Monotonic Improvement

Author: Akrour, Riad, Abdolmaleki, Abbas, Abdulsamad, Hany, Peters, Jan, and Neumann, Gerhard
Subjects: Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Many of the recent trajectory optimization algorithms alternate between linear approximation of the system dynamics around the mean trajectory and conservative policy update. One way of constraining the policy change is by bounding the Kullback-Leibler (KL) divergence between successive policies. These approaches already demonstrated great experimental success in challenging problems such as end-to-end control of physical systems. However, the linear approximation of the system dynamics can introduce a bias in the policy update and prevent convergence to the optimal policy. In this article, we propose a new model-free trajectory-based policy optimization algorithm with guaranteed monotonic improvement. The algorithm backpropagates a local, quadratic and time-dependent Q-Function learned from trajectory data instead of a model of the system dynamics. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics. We experimentally demonstrate on highly non-linear control tasks the improvement in performance of our algorithm in comparison to approaches linearizing the system dynamics. In order to show the monotonic improvement of our algorithm, we additionally conduct a theoretical analysis of our policy update scheme to derive a lower bound of the change in policy return between successive iterations.
Published: 2016

12. Convex optimization with an interpolation-based projection and its application to deep learning

Author: Akrour, Riad, Atamna, Asma, and Peters, Jan
Published: 2021
Full Text: View/download PDF

13. An Upper Bound of the Bias of Nadaraya-Watson Kernel Regression under Lipschitz Assumptions

Author: Tosatto, Samuele, Akrour, Riad, and Peters, Jan
Abstract: The Nadaraya-Watson kernel estimator is among the most popular nonparametric regression techniques thanks to its simplicity. Its asymptotic bias has been studied by Rosenblatt in 1969 and has been reported in several related works. However, given its asymptotic nature, it gives no access to a hard bound. The increasing popularity of predictive tools for automated decision-making increases the need for hard (non-probabilistic) guarantees. To alleviate this issue, we propose an upper bound of the bias which holds for finite bandwidths using Lipschitz assumptions and mitigating some of the prerequisites of Rosenblatt’s analysis. Our bound has potential applications in fields like surgical robots or self-driving cars, where some hard guarantees on the prediction-error are needed.
Published: 2024

14. Hierarchical Tactile-Based Control Decomposition of Dexterous In-Hand Manipulation Tasks

Author: Veiga, Filipe, Akrour, Riad, and Peters, Jan
Abstract: In-hand manipulation and grasp adjustment with dexterous robotic hands is a complex problem that not only requires highly coordinated finger movements but also deals with interaction variability. The control problem becomes even more complex when introducing tactile information into the feedback loop. Traditional approaches do not consider tactile feedback and attempt to solve the problem either by relying on complex models that are not always readily available or by constraining the problem in order to make it more tractable. In this paper, we propose a hierarchical control approach where a higher level policy is learned through reinforcement learning, while low level controllers ensure grip stability throughout the manipulation action. The low level controllers are independent grip stabilization controllers based on tactile feedback. The independent controllers allow reinforcement learning approaches to explore the manipulation tasks state-action space in a more structured manner. We show that this structure allows learning the unconstrained task with RL methods that cannot learn it in a non-hierarchical setting. The low level controllers also provide an abstraction to the tactile sensors input, allowing transfer to real robot platforms. We show preliminary results of the transfer of policies trained in simulation to the real robot hand.
Published: 2024

15. APRIL: Active Preference-learning based Reinforcement Learning

Author: Akrour, Riad, Schoenauer, Marc, and Sebag, Michèle
Subjects: Computer Science - Learning
Abstract: This paper focuses on reinforcement learning (RL) with limited prior knowledge. In the domain of swarm robotics for instance, the expert can hardly design a reward function or demonstrate the target behavior, forbidding the use of both standard RL and inverse reinforcement learning. Although with a limited expertise, the human expert is still often able to emit preferences and rank the agent demonstrations. Earlier work has presented an iterative preference-based RL framework: expert preferences are exploited to learn an approximate policy return, thus enabling the agent to achieve direct policy search. Iteratively, the agent selects a new candidate policy and demonstrates it; the expert ranks the new demonstration comparatively to the previous best one; the expert's ranking feedback enables the agent to refine the approximate policy return, and the process is iterated. In this paper, preference-based reinforcement learning is combined with active ranking in order to decrease the number of ranking queries to the expert needed to yield a satisfactory policy. Experiments on the mountain car and the cancer treatment testbeds witness that a couple of dozen rankings enable to learn a competent policy.
Published: 2012

16. Anti Imitation-Based Policy Learning

Author: Sebag, Michèle, Akrour, Riad, Mayeur, Basile, Schoenauer, Marc, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Frasconi, Paolo, editor, Landwehr, Niels, editor, Manco, Giuseppe, editor, and Vreeken, Jilles, editor
Published: 2016
Full Text: View/download PDF

17. Preference-Based Policy Learning

Author: Akrour, Riad, Schoenauer, Marc, Sebag, Michele, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Gunopulos, Dimitrios, editor, Hofmann, Thomas, editor, Malerba, Donato, editor, and Vazirgiannis, Michalis, editor
Published: 2011
Full Text: View/download PDF

18. Reinforcement Learning Based Underwater Wireless Optical Communication Alignment for Autonomous Underwater Vehicles

Author: Weng, Yang, primary, Pajarinen, Joni, additional, Akrour, Riad, additional, Matsuda, Takumi, additional, Peters, Jan, additional, and Maki, Toshihiro, additional
Published: 2022
Full Text: View/download PDF

19. Compatible natural gradient policy search

Author: Pajarinen, Joni, Thai, Hong Linh, Akrour, Riad, Peters, Jan, and Neumann, Gerhard
Abstract: Trust-region methods have yielded state-of-the-art results in policy search. A common approach is to use KL-divergence to bound the region of trust resulting in a natural gradient policy update. We show that the natural gradient and trust region optimization are equivalent if we use the natural parameterization of a standard exponential policy distribution in combination with compatible value function approximation. Moreover, we show that standard natural gradient updates may reduce the entropy of the policy according to a wrong schedule leading to premature convergence. To control entropy reduction we introduce a new policy search method called compatible policy search (COPOS) which bounds entropy loss. The experimental results show that COPOS yields state-of-the-art results in challenging continuous control tasks and in discrete partially observable tasks.
Published: 2022

20. Projections for Approximate Policy Iteration Algorithms

Author: Akrour, Riad, Pajarinen, Joni, Peters, Jan, and Neumann, Gerhard
Abstract: Approximate policy iteration is a class of reinforcement learning (RL) algorithms where the policy is encoded using a function approximator and which has been especially prominent in RL with continuous action spaces. In this class of RL algorithms, ensuring increase of the policy return during policy update often requires to constrain the change in action distribution. Several approximations exist in the literature to solve this constrained policy update problem. In this paper, we propose to improve over such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one which is then solved by standard gradient descent. Using these projections, we empirically demonstrate that our approach can improve the policy update solution and the control over exploration of existing approximate policy iteration algorithms.
Published: 2022

21. Anti Imitation-Based Policy Learning

Author: Sebag, Michèle, primary, Akrour, Riad, additional, Mayeur, Basile, additional, and Schoenauer, Marc, additional
Published: 2016
Full Text: View/download PDF

22. Continuous Action Reinforcement Learning From a Mixture of Interpretable Experts.

Author: Akrour, Riad, Tateo, Davide, and Peters, Jan
Subjects: *ACTIVE learning, *NONLINEAR functions, *MACHINE learning, *REINFORCEMENT learning, *APPROXIMATION algorithms
Abstract: Reinforcement learning (RL) has demonstrated its ability to solve high dimensional tasks by leveraging non-linear function approximators. However, these successes are mostly achieved by 'black-box' policies in simulated domains. When deploying RL to the real world, several concerns regarding the use of a 'black-box' policy might be raised. In order to make the learned policies more transparent, we propose in this paper a policy iteration scheme that retains a complex function approximator for its internal value predictions but constrains the policy to have a concise, hierarchical, and human-readable structure, based on a mixture of interpretable experts. Each expert selects a primitive action according to a distance to a prototypical state. A key design decision to keep such experts interpretable is to select the prototypical states from trajectory data. The main technical contribution of the paper is to address the challenges introduced by this non-differentiable prototypical state selection procedure. Experimentally, we show that our proposed algorithm can learn compelling policies on continuous action deep RL benchmarks, matching the performance of neural network based policies, but returning policies that are more amenable to human inspection than neural network or linear-in-feature policies.
Published: 2022
Full Text: View/download PDF

23. Hierarchical Tactile-Based Control Decomposition of Dexterous In-Hand Manipulation Tasks

Author: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory, Fernandes Veiga, Filipe, Akrour, Riad, and Peters, Jan
Abstract: In-hand manipulation and grasp adjustment with dexterous robotic hands is a complex problem that not only requires highly coordinated finger movements but also deals with interaction variability. The control problem becomes even more complex when introducing tactile information into the feedback loop. Traditional approaches do not consider tactile feedback and attempt to solve the problem either by relying on complex models that are not always readily available or by constraining the problem in order to make it more tractable. In this paper, we propose a hierarchical control approach where a higher level policy is learned through reinforcement learning, while low level controllers ensure grip stability throughout the manipulation action. The low level controllers are independent grip stabilization controllers based on tactile feedback. The independent controllers allow reinforcement learning approaches to explore the manipulation tasks state-action space in a more structured manner. We show that this structure allows learning the unconstrained task with RL methods that cannot learn it in a non-hierarchical setting. The low level controllers also provide an abstraction to the tactile sensors input, allowing transfer to real robot platforms. We show preliminary results of the transfer of policies trained in simulation to the real robot hand.
Published: 2020

24. Continuous Action Reinforcement Learning from a Mixture of Interpretable Experts

Author: Akrour, Riad, primary, Tateo, Davide, additional, and Peters, Jan, additional
Published: 2021
Full Text: View/download PDF

25. An Upper Bound of the Bias of Nadaraya-Watson Kernel Regression under Lipschitz Assumptions

Author: Tosatto, Samuele, primary, Akrour, Riad, additional, and Peters, Jan, additional
Published: 2020
Full Text: View/download PDF

26. APRIL: Active Preference Learning-Based Reinforcement Learning

Author: Akrour, Riad, primary, Schoenauer, Marc, additional, and Sebag, Michèle, additional
Published: 2012
Full Text: View/download PDF

27. Preference-Based Policy Learning

Author: Akrour, Riad, primary, Schoenauer, Marc, additional, and Sebag, Michele, additional
Published: 2011
Full Text: View/download PDF

28. Hierarchical Tactile-Based Control Decomposition of Dexterous In-Hand Manipulation Tasks

Author: Veiga, Filipe, primary, Akrour, Riad, additional, and Peters, Jan, additional
Published: 2020
Full Text: View/download PDF

29. Model-Free Trajectory-based Policy Optimization with Monotonic Improvement

Author: Akrour, Riad, Abdolmaleki, Abbas, Abdulsamad, Hany, Peters, Jan, and Neumann, Gerhard
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Robotics, Policy Optimization, Trajectory Optimization, DATA processing & computer science, Robotics, ddc:004, G760 Machine Learning, Robotics (cs.RO), Reinforcement Learning, Machine Learning (cs.LG)
Abstract: Many of the recent trajectory optimization algorithms alternate between linear approximation of the system dynamics around the mean trajectory and conservative policy update. One way of constraining the policy change is by bounding the Kullback-Leibler (KL) divergence between successive policies. These approaches already demonstrated great experimental success in challenging problems such as end-to-end control of physical systems. However, these approaches lack any improvement guarantee as the linear approximation of the system dynamics can introduce a bias in the policy update and prevent convergence to the optimal policy. In this article, we propose a new model-free trajectory-based policy optimization algorithm with guaranteed monotonic improvement. The algorithm backpropagates a local, quadratic and time-dependent Q-Function learned from trajectory data instead of a model of the system dynamics. Our policy update ensures exact KL-constraint satisfaction without simplifying assumptions on the system dynamics. We experimentally demonstrate on highly non-linear control tasks the improvement in performance of our algorithm in comparison to approaches linearizing the system dynamics. In order to show the monotonic improvement of our algorithm, we additionally conduct a theoretical analysis of our policy update scheme to derive a lower bound of the change in policy return between successive iterations.
Published: 2018

30. Compatible natural gradient policy search

Author: Pajarinen, Joni, primary, Thai, Hong Linh, additional, Akrour, Riad, additional, Peters, Jan, additional, and Neumann, Gerhard, additional
Published: 2019
Full Text: View/download PDF

31. Learning Replanning Policies With Direct Policy Search

Author: Brandherm, Florian, primary, Peters, Jan, additional, Neumann, Gerhard, additional, and Akrour, Riad, additional
Published: 2019
Full Text: View/download PDF

32. Regularizing Reinforcement Learning with State Abstraction

Author: Akrour, Riad, primary, Veiga, Filipe, additional, Peters, Jan, additional, and Neumann, Gerhard, additional
Published: 2018
Full Text: View/download PDF

33. Sample and Feedback Efficient Hierarchical Reinforcement Learning from Human Preferences

Author: Pinsler, Robert, primary, Akrour, Riad, additional, Osa, Takayuki, additional, Peters, Jan, additional, and Neumann, Gerhard, additional
Published: 2018
Full Text: View/download PDF

34. Direct Value Learning: Reinforcement Learning and Anti-Imitation

Author: Akrour, Riad, Mayeur, Basile, Sebag, Michèle, Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), Machine Learning and Optimisation (TAO), Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris-Sud - Paris 11 (UP11)-Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-CentraleSupélec, INRIA, CNRS, and Université Paris-Sud 11
Subjects: value-based RL, ranking, Reinforcement learning, inverse reinforcement learning, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: The value function, at the core of the Bellmanian Reinforcement Learning framework, associates with each state the discounted expected cumulative reward which can be gathered after visiting this state. Given an (optimal) value function, an (optimal) policy is most simply derived by "greedification", heading in each time step toward the neighbor state with maximal value. Following the Bellman equations, the value function can be built by (approximate) dynamic programming, albeit facing severe scalability limitations in large state and action spaces. An alternative, inspired by the Energy-based learning framework (LeCun et al. 2006), is investigated in this paper, searching for a pseudo-value function such that it induces the same local order on the state space as a (nearly) optimal value function. By construction, the greedification of such a pseudo-value induces the same policy as the value function itself. The presented Direct Value Learning (DiVa) approach proceeds by directly learning the pseudo-value, taking some inspiration from the Inverse Reinforcement Learning (IRL) approach. In IRL, expert demonstrations are used to infer the reward function. Quite the contrary, DiVa uses bad demonstrations to infer the pseudo-value. Bad demonstrations are notoriously easier to generate than expert ones; typically, applying a random policy from a good initial state (e.g., a bicycle in equilibrium) will on average lead to visiting states with decreasing values (the bicycle ultimately falls down). DiVa thus uses bad demonstrations, generated from weak prior knowledge, to learn a pseudo-value with a standard learning-to-rank approach. The derived pseudo-value directly induces a policy in the model-based RL framework, when the transition function is known. In the model-free RL setting, the state pseudo-value is exploited using off-policy learning, to infer a state-action pseudo-value and induce a policy. The proposed DiVa approach and the use of bad demonstrations to achieve direct value learning are original to the best of our knowledge. The loss of optimality of the pseudo-value-based policy is analyzed and shown to be bounded under mild assumptions. Finally, the experimental validation of DiVa on the mountain car, the bicycle and the swing-up pendulum problems comparatively demonstrates the simplicity and the merits of the approach.
Published: 2015

35. Apprentissage direct de fonction de valeur : Renforcement par Anti-Imitation

Author: Akrour, Riad, Mayeur, Basile, Sebag, Michèle, Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), Machine Learning and Optimisation (TAO), Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris-Sud - Paris 11 (UP11)-Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-CentraleSupélec, INRIA, CNRS, Université Paris-Sud 11, Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, and Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)
Subjects: value-based RL, ranking, Reinforcement learning, inverse reinforcement learning, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: The value function, at the core of the Bellmanian Reinforcement Learning framework, associates with each state the discounted expected cumulative reward which can be gathered after visiting this state. Given an (optimal) value function, an (optimal) policy is most simply derived by "greedification", heading in each time step toward the neighbor state with maximal value. Following the Bellman equations, the value function can be built by (approximate) dynamic programming, albeit facing severe scalability limitations in large state and action spaces. An alternative, inspired by the Energy-based learning framework (LeCun et al. 2006), is investigated in this paper, searching for a pseudo-value function such that it induces the same local order on the state space as a (nearly) optimal value function. By construction, the greedification of such a pseudo-value induces the same policy as the value function itself. The presented Direct Value Learning (DiVa) approach proceeds by directly learning the pseudo-value, taking some inspiration from the Inverse Reinforcement Learning (IRL) approach. In IRL, expert demonstrations are used to infer the reward function. Quite the contrary, DiVa uses bad demonstrations to infer the pseudo-value. Bad demonstrations are notoriously easier to generate than expert ones; typically, applying a random policy from a good initial state (e.g., a bicycle in equilibrium) will on average lead to visiting states with decreasing values (the bicycle ultimately falls down). DiVa thus uses bad demonstrations, generated from weak prior knowledge, to learn a pseudo-value with a standard learning-to-rank approach. The derived pseudo-value directly induces a policy in the model-based RL framework, when the transition function is known. In the model-free RL setting, the state pseudo-value is exploited using off-policy learning, to infer a state-action pseudo-value and induce a policy. The proposed DiVa approach and the use of bad demonstrations to achieve direct value learning are original to the best of our knowledge. The loss of optimality of the pseudo-value-based policy is analyzed and shown to be bounded under mild assumptions. Finally, the experimental validation of DiVa on the mountain car, the bicycle and the swing-up pendulum problems comparatively demonstrates the simplicity and the merits of the approach.
Published: 2015

36. Layered direct policy search for learning hierarchical skills

Author: End, Felix, primary, Akrour, Riad, additional, Peters, Jan, additional, and Neumann, Gerhard, additional
Published: 2017
Full Text: View/download PDF

37. Empowered skills

Author: Gabriel, Alexander, primary, Akrour, Riad, additional, Peters, Jan, additional, and Neumann, Gerhard, additional
Published: 2017
Full Text: View/download PDF

38. Direct Value Learning: a Rank-Invariant Approach to Reinforcement Learning

Author: Mayeur, Basile, Akrour, Riad, Sebag, Michèle, Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), Machine Learning and Optimisation (TAO), Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris-Sud - Paris 11 (UP11)-Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-CentraleSupélec, Centre National de la Recherche Scientifique (CNRS), Gerhard Neumann (TU-Darmstadt) and Joelle Pineau (McGill University) and Peter Auer (Uni Leoben) and Marc Toussaint (Uni Stuttgart), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, and Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)
Subjects: ranking SVM, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-RB]Computer Science [cs]/Robotics [cs.RO], Reinforcement Learning, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: International audience; Taking inspiration from inverse reinforcement learning, the proposed Direct Value Learning for Reinforcement Learning (DIVA) approach uses light priors to generate inappropriate behaviors, and uses the corresponding state sequences to directly learn a value function. When the transition model is known, this value function directly defines a (nearly) optimal controller. Otherwise, the value function is extended to the state-action space using off-policy learning. The experimental validation of DIVA on the mountain car problem shows the robustness of the approach compared with SARSA, under the assumption that the target state is known. The experimental validation on the bicycle problem shows that DIVA still finds good policies when this assumption is relaxed.
Published: 2014

39. Apprentissage par renforcement robuste reposant sur l'apprentissage par préférences

Author: Akrour, Riad, Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), Université Paris Sud - Paris XI, Michèle Sebag, and Marc Schoenauer
Subjects: Human-Computer Interaction, Preference Learning, Apprentissage par renforcement, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Interaction homme-machine, [INFO.INFO-RB]Computer Science [cs]/Robotics [cs.RO], Robotics, Robotique, Reinforcement Learning, Apprentissage par préférences, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: The thesis contributions revolve around sequential decision making, and more precisely Reinforcement Learning (RL). Rooted in Machine Learning in the same way as supervised and unsupervised learning, RL has quickly grown in popularity over the last two decades thanks to a handful of achievements on both the theoretical and applied fronts. RL supposes that the learning agent and its environment follow a stochastic Markovian decision process over a state and action space. It is called a decision process because the agent is asked to choose an action at each time step. It is stochastic because selecting a given action in a given state does not systematically lead to the same state but rather defines a distribution over the state space. It is Markovian because this distribution depends only on the current state-action pair. As a consequence of choosing an action, the agent receives a reward. The goal of RL is then to solve the underlying optimization problem of finding the behaviour that maximizes the sum of rewards throughout the agent's interaction with its environment. From an applied point of view, a wide range of problems can be cast as RL problems, from Backgammon (TD-Gammon, one of the first major successes of Machine Learning, gave rise to a world-class player) to decision problems in the industrial and medical worlds. However, the optimization problem solved by RL depends on the prior definition of a reward function, which requires a certain level of domain expertise as well as knowledge of the internal quirks of RL algorithms. Accordingly, the first contribution of the thesis is a learning framework that lightens the requirements placed on the user. The user no longer needs to know the exact solution of the problem, but only to be able to choose, between two behaviours exhibited by the agent, the one that more closely matches the solution. Learning is interactive between the agent and the user and revolves around three main steps: i) the agent demonstrates a behaviour; ii) the user compares it with the current best one; iii) the agent uses this feedback to update its model of the user's preferences and uses that model to find the next behaviour to demonstrate. To reduce the number of interactions required before finding the optimal behaviour, the second contribution of the thesis is a theoretically sound criterion that trades off the sometimes contradictory goals of complying with the user's preferences and demonstrating behaviours sufficiently different from those already proposed. The last contribution is to ensure the robustness of the algorithm with respect to the feedback errors that the user might make, which happen more often than not in practice, especially in the initial phase of the interaction, when all the behaviours are far from the expected solution.
Published: 2014

40. Robust Preference Learning-based Reinforcement Learning

Author: Akrour, Riad, Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), Université Paris Sud - Paris XI, Michèle Sebag, Marc Schoenauer, and Briot, Brigitte
Subjects: Preference Learning, Robotics, Reinforcement Learning, Apprentissage par préférences, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Human-Computer Interaction, Apprentissage par renforcement, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Interaction homme-machine, [INFO.INFO-RB]Computer Science [cs]/Robotics [cs.RO], Robotique
Abstract: The thesis contributions revolve around sequential decision making, and more precisely Reinforcement Learning (RL). Rooted in Machine Learning in the same way as supervised and unsupervised learning, RL has quickly grown in popularity over the last two decades thanks to a handful of achievements on both the theoretical and applied fronts. RL supposes that the learning agent and its environment follow a stochastic Markovian decision process over a state and action space. It is called a decision process because the agent is asked to choose an action at each time step. It is stochastic because selecting a given action in a given state does not systematically lead to the same state but rather defines a distribution over the state space. It is Markovian because this distribution depends only on the current state-action pair. As a consequence of choosing an action, the agent receives a reward. The goal of RL is then to solve the underlying optimization problem of finding the behaviour that maximizes the sum of rewards throughout the agent's interaction with its environment. From an applied point of view, a wide range of problems can be cast as RL problems, from Backgammon (TD-Gammon, one of the first major successes of Machine Learning, gave rise to a world-class player) to decision problems in the industrial and medical worlds. However, the optimization problem solved by RL depends on the prior definition of a reward function, which requires a certain level of domain expertise as well as knowledge of the internal quirks of RL algorithms. Accordingly, the first contribution of the thesis is a learning framework that lightens the requirements placed on the user. The user no longer needs to know the exact solution of the problem, but only to be able to choose, between two behaviours exhibited by the agent, the one that more closely matches the solution. Learning is interactive between the agent and the user and revolves around three main steps: i) the agent demonstrates a behaviour; ii) the user compares it with the current best one; iii) the agent uses this feedback to update its model of the user's preferences and uses that model to find the next behaviour to demonstrate. To reduce the number of interactions required before finding the optimal behaviour, the second contribution of the thesis is a theoretically sound criterion that trades off the sometimes contradictory goals of complying with the user's preferences and demonstrating behaviours sufficiently different from those already proposed. The last contribution is to ensure the robustness of the algorithm with respect to the feedback errors that the user might make, which happen more often than not in practice, especially in the initial phase of the interaction, when all the behaviours are far from the expected solution.
Published: 2014

41. Interactive Robot Education

Author: Akrour, Riad, Schoenauer, Marc, Sebag, Michèle, Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), Machine Learning and Optimisation (TAO), Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris-Sud - Paris 11 (UP11)-Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-CentraleSupélec, Johannes Fuernkranz and Eyke Hüllermeier, European Project, Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Schoenauer, Marc, and SYMBRION - INCOMING
Subjects: [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: International audience; Aimed at on-board robot training, an approach hybridizing active preference learning and reinforcement learning is presented: Interactive Bayesian Policy Search (IBPS) builds a robotic controller through direct and frugal interaction with the human expert, who iteratively expresses preferences among a few behaviors demonstrated by the robot. These preferences allow the robot to gradually refine its policy utility estimate and to select a new policy to be demonstrated, according to an Expected Utility of Selection criterion. The paper's contribution is on handling the preference noise, due to the expert's mistakes or disinterest when demonstrated behaviors are equally unsatisfactory. A noise model is proposed, enabling a resource-limited robot to soundly estimate the preference noise and maintain a robust interaction with the expert, thus enforcing a low sample complexity. A proof of principle of the IBPS approach, in simulation and on-board, is presented.
Published: 2013

42. A Survey of Preference-Based Reinforcement Learning Methods.

Author: Wirth, Christian, Akrour, Riad, Neumann, Gerhard, and Fürnkranz, Johannes
Subjects: *REINFORCEMENT learning, *MACHINE learning, *REINFORCEMENT (Psychology), *REWARD (Psychology), *ALGORITHMS, *MATHEMATICAL optimization
Abstract: Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires a lot of task-specific prior knowledge. The designer needs to consider different objectives that influence not only the learned behavior but also the learning progress. To alleviate these issues, preference-based reinforcement learning algorithms (PbRL) have been proposed that can directly learn from an expert's preferences instead of a hand-designed numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, its ability to learn from non-numeric rewards and the possibility to reduce the dependence on expert knowledge. We provide a unified framework for PbRL that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity. The design principles include the type of feedback that is assumed, the representation that is learned to capture the preferences, the optimization problem that has to be solved as well as how the exploration/exploitation problem is tackled. Furthermore, we point out shortcomings of current algorithms, propose open research questions and briefly survey practical tasks that have been solved using PbRL. [ABSTRACT FROM AUTHOR]
Published: 2017

43. Direct Value Learning: a Preference-based Approach to Reinforcement Learning

Author: Meunier, David, Deguchi, Yutaka, Akrour, Riad, Suzuki, Enoshin, Schoenauer, Marc, Sebag, Michèle, Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS), Machine Learning and Optimisation (TAO), Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris-Sud - Paris 11 (UP11)-Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-CentraleSupélec, Dept. Informatics, ISEE, Kyushu University [Fukuoka], Johannes Fürnkranz and Eyke Hüllermeier, Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria), Kyushu University, and Schoenauer, Marc
Subjects: ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.6: Learning, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]
Abstract: International audience; Learning by imitation, among the most promising techniques for reinforcement learning in complex domains, critically depends on the human designer's ability to provide sufficiently many demonstrations of satisfactory quality. The approach presented in this paper, referred to as DIVA (Direct Value Learning for Reinforcement Learning), aims at addressing both of the above limitations by exploiting simple experiments. The approach stems from a straightforward remark: while it is rather easy to set a robot in a target situation, the quality of its situation will naturally deteriorate under the action of naive controllers. The demonstrations of such naive controllers can thus be used to directly learn a value function, through a preference learning approach. Under some conditions on the transition model, this value function makes it possible to define an optimal controller. The DIVA approach is experimentally demonstrated by teaching a robot to follow another robot. Importantly, the approach does not require any robotic simulator to be available, nor does it require any pattern-recognition primitive (e.g. seeing the other robot) to be provided.
Published: 2012
