Bandits atop Reinforcement Learning: Tackling Online Inventory Models with Cyclic Demands.

Authors :: Gong, Xiao-Yue; Simchi-Levi, David
Source :: Management Science; Sep 2024, Vol. 70, Issue 9, pp. 6139-6157 (19 pages)
Publication Year :: 2024
Abstract: Motivated by a long-standing gap between inventory theory and practice, we study online inventory models with unknown cyclic demand distributions. We design provably efficient reinforcement learning (RL) algorithms that leverage the structure of inventory problems to achieve optimal theoretical guarantees that surpass existing results. We apply the standard performance measure in online learning literature, regret, which is defined as the difference between the total expected cost of our policy and the total expected cost of the clairvoyant optimal policy that has full knowledge of the demand distributions a priori. This paper analyzes, in the presence of unknown cyclic demands, both the lost-sales model with zero lead time and the multiproduct backlogging model with positive lead times, fixed joint-ordering costs and order limits. For both models, we first introduce episodic models where inventory is discarded at the end of every cycle, and then build upon these results to analyze the nondiscarding models. Our RL policies HQL and FQL achieve $\tilde{O}(\sqrt{T})$ regret for the episodic lost-sales model and the episodic multiproduct backlogging model, matching the regret lower bound that we prove in this paper. For the nondiscarding models, we construct a bandit learning algorithm on top that governs multiple copies of the previous RL algorithms, named Meta-HQL. Meta-HQL achieves $\tilde{O}(\sqrt{T})$ regret for the nondiscarding lost-sales model with zero lead time, again matching the regret lower bound. For the nondiscarding multiproduct backlogging model, our policy Mimic-QL achieves $\tilde{O}(T^{5/6})$ regret. Our policies remove the regret dependence on the cardinality of the state-action space for inventory problems, which is an improvement over existing RL algorithms. We conducted experiments with a real sales data set from Rossmann, one of the largest drugstore chains in Europe, and also with a synthetic data set. For both sets of experiments, our policy converges rapidly to the optimal policy and dramatically outperforms the best policy that models demand as independent and identically distributed instead of cyclic.

This paper was accepted by J. George Shanthikumar, data science.
Funding: X.-Y. Gong was partially supported by an Accenture Fellowship. The work of X.-Y. Gong and D. Simchi-Levi was partially supported by the MIT Data Science Lab.
Supplemental Material: The data and online appendices are available at https://doi.org/10.1287/mnsc.2023.4947.
[ABSTRACT FROM AUTHOR]
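
Note on the regret measure: the abstract defines regret verbally as the gap between the total expected cost of the learning policy and that of the clairvoyant optimal policy. A minimal way to write this down is sketched below. The symbols are assumptions introduced here for illustration and are not taken from the paper: $\pi$ denotes the learning policy, $\pi^{*}$ the clairvoyant optimal policy, $C_t^{\pi}$ the cost incurred in period $t$ under policy $\pi$, and $T$ the number of periods.

```latex
\[
\mathrm{Regret}(T)
  \;=\;
  \mathbb{E}\!\left[\sum_{t=1}^{T} C_t^{\pi}\right]
  \;-\;
  \mathbb{E}\!\left[\sum_{t=1}^{T} C_t^{\pi^{*}}\right]
\]
```

Under this reading, any regret bound that grows sublinearly in $T$ implies that the average per-period cost gap, $\mathrm{Regret}(T)/T$, vanishes as $T$ grows.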

Subjects :: MACHINE learning
INVENTORY theory
REINFORCEMENT learning
LEAD time (Supply chain management)
ONLINE education

Details

Language :: English
ISSN :: 0025-1909
Volume :: 70
Issue :: 9
Database :: Complementary Index
Journal :: Management Science
Publication Type :: Academic Journal
Accession number :: 179339498
Full Text :: https://doi.org/10.1287/mnsc.2023.4947

