Anti Imitation-Based Policy Learning

Authors :: Michèle Sebag
Marc Schoenauer
Riad Akrour
Basile Mayeur
Editors :: Paolo Frasconi
Niels Landwehr
Giuseppe Manco
Jilles Vreeken
Affiliations :: Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud - Paris 11 (UP11)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)
Machine Learning and Optimisation (TAO), Centre National de la Recherche Scientifique (CNRS)-Inria Saclay - Ile de France, Institut National de Recherche en Informatique et en Automatique (Inria)-Université Paris-Sud - Paris 11 (UP11)-Laboratoire de Recherche en Informatique (LRI)
Source :: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML-PKDD 2016, Sep 2016, Riva del Garda, Italy. pp.559-575, ⟨10.1007/978-3-319-46227-1_35⟩, ISBN: 9783319462264, ECML/PKDD (2)
Publication Year :: 2016
Publisher :: HAL CCSD, 2016.
Abstract: The Anti Imitation-based Policy Learning (AIPoL) approach, taking inspiration from the energy-based learning framework (LeCun et al. 2006), aims to learn a pseudo-value function that induces the same order on the state space as a (nearly optimal) value function. By construction, the greedification of such a pseudo-value induces the same policy as the value function itself. The approach assumes that, thanks to prior knowledge, not-to-be-imitated demonstrations can easily be generated. For instance, applying a random policy from a good initial state (e.g., a bicycle in equilibrium) will on average lead to states with decreasing values (the bicycle ultimately falls down). Such a demonstration, that is, a sequence of states with decreasing values, is used along with a standard learning-to-rank approach to define a pseudo-value function. If the model of the environment is known, this pseudo-value directly induces a policy by greedification. Otherwise, the bad demonstrations are exploited together with off-policy learning to learn a pseudo-Q-value function, from which a policy is likewise derived by greedification. To the best of our knowledge, the use of bad demonstrations to achieve policy learning is original. The theoretical analysis shows that the loss of optimality of the pseudo-value-based policy is bounded under mild assumptions, and the empirical validation of AIPoL on the mountain car, the bicycle and the swing-up pendulum problems demonstrates the simplicity and the merits of the approach.
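
The known-model case described in the abstract (rank the states of a bad demonstration, then act greedily on the learned pseudo-value) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 1-D toy state, the quadratic feature map, the hand-built drifting trajectories standing in for random-policy rollouts, the pairwise hinge loss used as the learning-to-rank step, and all function names and hyperparameters.

```python
# Toy sketch: learn a pseudo-value from "bad" demonstrations by pairwise
# ranking, then act greedily with a known model. All names and settings are
# illustrative assumptions, not the authors' implementation.

ACTIONS = (-0.1, 0.0, 0.1)  # hypothetical discrete action set


def features(x):
    # Assumed feature map for a 1-D state x.
    return (x, x * x)


def pseudo_value(w, x):
    # Linear pseudo-value: only the order it induces on states matters.
    return sum(wi * fi for wi, fi in zip(w, features(x)))


def bad_demonstration(direction, steps=10, step=0.1):
    # Stand-in for a random-policy rollout: the state drifts away from the
    # good initial state x = 0, so later states are assumed to be worse.
    states = [0.0]
    for _ in range(steps):
        states.append(states[-1] + direction * step)
    return states


def fit_pseudo_value(demos, epochs=200, lr=0.1, margin=0.1):
    # Batch subgradient steps on a pairwise hinge loss: each state should
    # score at least `margin` above the state that follows it in a demo.
    w = [0.0, 0.0]
    pairs = [(d[t], d[t + 1]) for d in demos for t in range(len(d) - 1)]
    for _ in range(epochs):
        update = [0.0, 0.0]
        for better, worse in pairs:
            diff = [fb - fw for fb, fw in zip(features(better), features(worse))]
            if sum(wi * di for wi, di in zip(w, diff)) < margin:
                update = [u + di for u, di in zip(update, diff)]
        w = [wi + lr * u for wi, u in zip(w, update)]
    return w


def greedy_action(w, x):
    # Greedification with a known (here trivial) model: next state = x + a.
    return max(ACTIONS, key=lambda a: pseudo_value(w, x + a))


demos = [bad_demonstration(+1.0), bad_demonstration(-1.0)]
w = fit_pseudo_value(demos)
print(greedy_action(w, 0.5), greedy_action(w, -0.5))  # prints: -0.1 0.1
```

In this toy setting the learned weights rank states closer to x = 0 higher, so the greedy policy steps back toward the good state from either side. The model-free case mentioned in the abstract (a pseudo-Q-value learned off-policy) is not covered by this sketch.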

Details

Language :: English
ISBN :: 978-3-319-46226-4
Database :: OpenAIRE
Journal :: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML-PKDD 2016, Sep 2016, Riva del Garda, Italy. pp.559-575, ⟨10.1007/978-3-319-46227-1_35⟩, ISBN: 9783319462264, ECML/PKDD (2)
Accession number :: edsair.doi.dedup.....0c65f8772e7e4d2aca79a2d48291c5bb
Full Text :: https://doi.org/10.1007/978-3-319-46227-1_35