Demystifying the Curse of Horizon in Offline Reinforcement Learning in Order to Break It

Offline reinforcement learning (RL), where we evaluate and learn new policies using existing off-policy data, is crucial in applications where experimentation is challenging and simulation unreliable, such as medicine. It is also notoriously difficult because the similarity (density ratio) between observed trajectories and those generated by any new policy diminishes exponentially as the horizon grows, a phenomenon known as the curse of horizon, which severely limits the application of offline RL whenever horizons are moderately long or even infinite. In "Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning," Kallus and Uehara set out to understand these limits and when they can be broken. They precisely characterize the curse by deriving the semiparametric efficiency lower bounds for the policy-value estimation problem in different models. On the one hand, this shows why the curse necessarily plagues standard estimators: they work even in non-Markov models and therefore must be limited by the corresponding bound. On the other hand, greater efficiency is possible in certain Markovian models, and they give the first estimator achieving these much lower efficiency bounds in infinite-horizon Markov decision processes.

Abstract: Off-policy evaluation (OPE) in reinforcement learning is notoriously difficult in long- and infinite-horizon settings due to diminishing overlap between behavior and target policies. In this paper, we study the role of Markovian and time-invariant structure in efficient OPE. We first derive the efficiency bounds and efficient influence functions for OPE when one assumes each of these structures. This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar. But, in time-invariant Markov decision processes, our bounds show that truly off-policy evaluation is feasible, even with just one dependent trajectory, and they provide the limits of how well we could hope to do. We develop a new estimator based on double reinforcement learning (DRL) that leverages this structure for OPE. Our DRL estimator simultaneously uses estimated stationary density ratios and q-functions; it remains efficient when both are estimated at slow, nonparametric rates and remains consistent when either is estimated consistently. We investigate these properties and the performance benefits of leveraging the problem structure for more efficient OPE.

Funding: This work was supported by the National Science Foundation Division of Information and Intelligent Systems [1846210] and by the Masason Foundation.

Supplemental Material: The online appendices are available at https://doi.org/10.1287/opre.2021.2249.
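
To make the estimator described in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of a DRL-style doubly robust value estimate for a discounted, time-invariant Markov decision process. The names drl_estimate, w_hat, q_hat, and pi are illustrative, and the nuisances w_hat (stationary density ratio) and q_hat (q-function) are assumed to have been estimated elsewhere, possibly with cross-fitting.

```python
import numpy as np

def drl_estimate(transitions, w_hat, q_hat, pi, d0_samples, gamma):
    """Sketch of a doubly robust OPE estimate of the normalized discounted
    value (1 - gamma) * E[sum_t gamma^t r_t] under the target policy.

    transitions: iterable of (s, a, r, s_next) tuples from behavior data
    w_hat(s, a): estimated stationary density ratio d_pi(s, a) / d_behavior(s, a)
    q_hat(s, a): estimated q-function of the target policy
    pi(s):       returns [(action, probability), ...] under the target policy
    d0_samples:  samples of initial states
    """
    def v_hat(s):
        # v(s) = E_{a ~ pi(.|s)}[q(s, a)]
        return sum(p * q_hat(s, a) for a, p in pi(s))

    # Plug-in term: (1 - gamma) * E_{s0 ~ d0}[v(s0)]
    plug_in = (1.0 - gamma) * np.mean([v_hat(s0) for s0 in d0_samples])

    # Density-ratio-weighted temporal-difference correction term
    corrections = [
        w_hat(s, a) * (r + gamma * v_hat(s_next) - q_hat(s, a))
        for (s, a, r, s_next) in transitions
    ]
    return plug_in + np.mean(corrections)
```

In this form the estimate combines the two nuisances as the abstract describes: if q_hat is correct, the correction term has mean zero, and if w_hat is correct, it removes the bias of the plug-in term, which is the double robustness property referred to above.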