414 results for "sequential decision making"
Search Results
2. Rate-Optimal Bayesian Simple Regret in Best Arm Identification.
- Author
-
Komiyama, Junpei, Ariu, Kaito, Kato, Masahiro, and Qin, Chao
- Subjects
MULTI-armed bandit problem (Probability theory), BAYESIAN analysis
- Abstract
We consider best arm identification in the multiarmed bandit problem. Assuming certain continuity conditions of the prior, we characterize the rate of the Bayesian simple regret. Differing from Bayesian regret minimization, the leading term in the Bayesian simple regret derives from the region in which the gap between optimal and suboptimal arms is smaller than (log T)/T. We propose a simple and easy-to-compute algorithm whose leading term matches the lower bound up to a constant factor; simulation results support our theoretical findings. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
3. Tiered Assortment: Optimization and Online Learning.
- Author
-
Cao, Junyu and Sun, Wei
- Subjects
ONLINE algorithms, CONSUMER preferences, ONLINE education, CONSUMERS, INTERNET stores
- Abstract
Due to the sheer number of available choices, online retailers frequently use tiered assortment to present products. In this case, groups of products are arranged across multiple pages or stages, and a customer clicks on "next" or "load more" to access them sequentially. Despite the prevalence of such assortments in practice, this topic has not received much attention in the existing literature. In this work, we focus on a sequential choice model that captures customers' behavior when product recommendations are presented in tiers. We analyze multiple variants of tiered assortment optimization (TAO) problems by imposing "no-duplication" and capacity constraints. For the offline setting involving known customers' preferences, we characterize the properties of the optimal tiered assortment and propose algorithms that improve the computational efficiency over the existing benchmarks. To the best of our knowledge, we are the first to study the online setting of the TAO problem. A unique characteristic of our online setting, absent from the one-tier MNL bandit, is partial feedback. Products in the lower priority tiers are not shown when a customer has purchased a product or has chosen to exit at an earlier tier. Such partial feedback, along with product interdependence across tiers, increases the learning complexity. For both the noncontextual and contextual problems, we propose online algorithms and quantify their respective regrets. Moreover, we are able to construct tighter uncertainty sets for model parameters in the contextual case and thus improve the performance. We demonstrate the efficacy of our proposed algorithms through numerical experiments. This paper was accepted by J. George Shanthikumar, data science. Supplemental Material: The data files and e-companion are available at https://doi.org/10.1287/mnsc.2023.4940. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
4. Price Optimization for a Multistage Choice Model.
- Author
-
Shi, Jiaqi, Ke, Ginger Y, Wang, Zizhuo, and Zhang, Lianmin
- Subjects
PRICES, EXPECTED utility, CONSUMERS, PURCHASING, NUMERICAL analysis
- Abstract
Considering the real-world situations where a customer's purchase choices in previous stages can influence the prices she encounters in subsequent stages, this research examines the multiproduct price optimization problem under a multistage choice model. Particularly, the seller commits to a multistage pricing policy and determines product prices based on the customer's purchase history, and the customer makes purchase decisions such that the total expected utility is maximized. We show that the pricing problem has a unique optimal solution under some mild conditions and the optimal solution satisfies a modified equal adjusted markup property. Based on the property, the problem can be solved efficiently by reducing it to a single-dimensional search problem. Moreover, the optimal pricing policy has an important property, namely, the product with a higher adjusted markup in earlier stages should always lead to lower prices in subsequent stages. We also show that compared to customers who are myopic, the seller should offer higher first-stage prices and lower second-stage prices to forward-looking customers, which will lead to a higher profit. Numerical analyses are also conducted to demonstrate the above results. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
5. Adaptive Algorithm for Multi-Armed Bandit Problem with High-Dimensional Covariates.
- Author
-
Qian, Wei, Ing, Ching-Kang, and Liu, Ji
- Subjects
STATISTICAL decision making, ALGORITHMS, DECISION making
- Abstract
This article studies an important sequential decision making problem known as the multi-armed stochastic bandit problem with covariates. Under a linear bandit framework with high-dimensional covariates, we propose a general multi-stage arm allocation algorithm that integrates both arm elimination and randomized assignment strategies. By employing a class of high-dimensional regression methods for coefficient estimation, the proposed algorithm is shown to have near-optimal finite-time regret performance under a new study scope that requires neither a margin condition nor a reward gap condition for competitive arms. Based on the synergistically verified benefit of the margin, our algorithm exhibits adaptive performance that automatically adapts to the margin and gap conditions, and attains optimal regret rates simultaneously for both study scopes, without or with the margin, up to a logarithmic factor. Besides the desirable regret performance, the proposed algorithm simultaneously generates useful coefficient estimation output for competitive arms and is shown to achieve both estimation consistency and variable selection consistency. Promising empirical performance is demonstrated through extensive simulation and two real data evaluation examples. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
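The arm-elimination component mentioned in the abstract above can be illustrated with a minimal sketch. This is a generic successive-elimination routine for a plain (non-contextual) stochastic bandit, not the paper's high-dimensional covariate algorithm; the function name, confidence-radius constant, and unit-variance Gaussian noise are all illustrative assumptions.

```python
import numpy as np

def successive_elimination(means, horizon, seed=0):
    # Sample every surviving arm once per round, then drop any arm whose
    # upper confidence bound falls below the best lower confidence bound.
    rng = np.random.default_rng(seed)
    k = len(means)
    active = list(range(k))
    counts = np.zeros(k)
    sums = np.zeros(k)
    t = 0
    while t < horizon and len(active) > 1:
        for a in active:
            sums[a] += means[a] + rng.normal()   # noisy unit-variance reward
            counts[a] += 1
            t += 1
        emp = sums / np.maximum(counts, 1)
        rad = np.sqrt(2 * np.log(horizon) / np.maximum(counts, 1))
        best_lcb = max(emp[a] - rad[a] for a in active)
        active = [a for a in active if emp[a] + rad[a] >= best_lcb]
    emp = sums / np.maximum(counts, 1)
    return max(active, key=lambda a: emp[a])   # empirically best survivor
```

With well-separated arms, suboptimal arms are eliminated after a handful of rounds, so almost all remaining pulls go to the best arm.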
6. Order Matters: Rating Service Professionals First Reduces Tipping Amount.
- Author
-
Chen, Jinjie, Xu, Alison Jing, Rodas, Maria A., and Liu, Xuefeng
- Subjects
TIPS & tipping (Gratuities), CONSUMER attitudes, CONSUMER behavior, CONSUMER behavior research, RATING, MASS media influence, DECISION making, MANNERS & customs
- Abstract
As customer ratings have become ubiquitous and digital platforms can directly request ratings and tips from customers, it is important to understand how a customer rating influences tipping. The authors investigate whether, how, why, and when the order of rating and tipping affects both consumer behaviors in seven studies, including one quasi-field experiment, one archival data analysis, one randomized field experiment, and four randomized lab experiments. They show that asking customers to rate a service professional before tipping negatively impacts the tip amount but that tipping first does not affect subsequent rating scores. The authors propose that the negative effect of rating on tipping occurs because, when rating a service professional first, customers categorize their feedback as a reward for the service professional, which partially alleviates the felt obligation to tip, resulting in a smaller tip. This negative effect is more evident when customers (1) tip from their own pocket, (2) have higher categorization flexibility, or (3) perceive that the service professional benefits from the rating. Moreover, highlighting the consistency motivation after rating but before tipping can attenuate this effect. These boundary conditions not only support the proposed mechanism and evaluate alternative processes but also have significant practical implications. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
7. Monte Carlo Tree Search Algorithm for SSPs Under the GUBS Criterion
- Author
-
Gabriel Nunes Crispino, Valdinei Freire, and Karina Valdivia Delgado
- Subjects
Markov Decision Processes, Stochastic Shortest Path, Sequential decision making, Probabilistic planning, Monte Carlo tree search, Electronic computers. Computer science, QA75.5-76.95
- Abstract
The Stochastic Shortest Path (SSP) is a formalism widely used for modeling goal-oriented probabilistic planning problems. When dead ends, which are states from which goal states cannot be reached, are present in the problem and cannot be avoided, the standard criterion for solving SSPs is not well defined. Because of that, several alternate criteria for solving SSPs with unavoidable dead ends have been proposed in the literature. One of these criteria is GUBS (Goals with Utility-Based Semantics), a criterion that makes trade-offs between probability-to-goal and cost by combining goal prioritization with Expected Utility Theory. GUBS is a good choice for these problems because it is one of the only criteria known to maintain the ε-strong probability-to-goal priority property, a property that provides guarantees on how a decision criterion can choose policies without having to preprocess any specific SSP problem. Although there already exist two exact algorithms for solving GUBS, eGUBS-VI and eGUBS-AO*, both are offline and there is no algorithm for solving GUBS in an online manner. In this paper we propose UCT-GUBS, an online approximate algorithm based on UCT (a Monte Carlo tree search algorithm) that solves SSPs under the GUBS criterion. We provide an analysis of an empirical evaluation performed on two probabilistic planning domains (Triangle Tireworld and Navigation) to observe how the probability-to-goal and utility values of the resulting policies compare to the optimal values, and also how the time performance of UCT-GUBS compares to those of eGUBS-VI and eGUBS-AO*. Our conclusion is that, like other algorithms, the usage of UCT-GUBS has to be evaluated considering the application requirements and the characteristics of the problem being solved. Depending on these factors, it can be a good alternative for obtaining policies in an online fashion while, for some problems, also achieving better time performance than other algorithms.
- Published
- 2024
- Full Text
- View/download PDF
8. Variety Effects in Mobile Advertising.
- Author
-
Rafieian, Omid and Yoganarasimhan, Hema
- Subjects
MOBILE apps, ADVERTISING, ATTENTION, DECISION making, DIGITIZATION, CONSUMER behavior, SMARTPHONES, TARGETED advertising, TECHNOLOGICAL innovations
- Abstract
Mobile app users are often exposed to a sequence of short-lived marketing interventions (e.g., ads) within each usage session. This study examines how an increase in the variety of ads shown in a session affects a user's response to the next ad. The authors leverage the quasi-experimental variation in ad assignment in their data and propose an empirical framework that accounts for different types of confounding to isolate the effects of a unit increase in variety. Across a series of models, the authors consistently show that an increase in ad variety in a session results in a higher response rate to the next ad: holding all else fixed, a unit increase in variety of the prior sequence of ads can increase the click-through rate on the next ad by approximately 13%. The authors then explore the underlying mechanism and document empirical evidence for an attention-based account. The article offers important managerial implications by identifying a source of interdependence across ad exposures that is often ignored in the design of advertising auctions. Furthermore, the attention-based mechanism suggests that platforms can incorporate real-time attention measures to help advertisers with targeting dynamics. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
9. Multi-armed bandits with dependent arms.
- Author
-
Singh, Rahul, Liu, Fang, Sun, Yin, and Shroff, Ness
- Subjects
MACHINE learning
- Abstract
We study a variant of the multi-armed bandit problem (MABP) which we call MABs with dependent arms. Multiple arms are grouped together to form a cluster, and the reward distributions of arms in the same cluster are known functions of an unknown parameter that is a characteristic of the cluster. Thus, pulling an arm i not only reveals information about its own reward distribution, but also about all arms belonging to the same cluster. This "correlation" among the arms complicates the exploration–exploitation trade-off that is encountered in the MABP, because the observation dependencies allow us to simultaneously test multiple hypotheses regarding the optimality of an arm. We develop learning algorithms based on the principle of optimism in the face of uncertainty (Lattimore and Szepesvári in Bandit algorithms, Cambridge University Press, 2020), which know the clusters and hence utilize these additional side observations appropriately while performing the exploration–exploitation trade-off. We show that the regret of our algorithms grows as O(K log T), where K is the number of clusters. In contrast, for an algorithm such as the vanilla UCB that does not utilize these dependencies, the regret scales as O(M log T), where M is the number of arms. When K ≪ M, i.e., there are many dependencies among arms, our proposed algorithm drastically reduces the dependence of the regret on the number of arms. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
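The side-observation idea in the abstract above can be sketched as follows. This is a deliberately simplified stand-in, assuming arms in a cluster share one reward mean so a single estimate per cluster suffices; the paper's actual model has arm rewards as known functions of an unknown cluster parameter, and all names and constants here are illustrative.

```python
import math
import random

def cluster_ucb(cluster_of, arm_mean, horizon, seed=0):
    # UCB over cluster-level estimates: pulling any arm in a cluster
    # refines that cluster's shared estimate, so the learner keeps
    # K estimates instead of M (one per arm).
    random.seed(seed)
    clusters = sorted(set(cluster_of))
    n = {c: 0 for c in clusters}      # pulls credited to each cluster
    s = {c: 0.0 for c in clusters}    # reward sums per cluster
    arms = list(range(len(cluster_of)))
    total = 0.0
    for t in range(1, horizon + 1):
        def ucb(a):
            c = cluster_of[a]
            if n[c] == 0:
                return float("inf")   # force one pull per cluster first
            return s[c] / n[c] + math.sqrt(2 * math.log(t) / n[c])
        a = max(arms, key=ucb)
        r = arm_mean[a] + random.gauss(0, 1)   # noisy reward
        c = cluster_of[a]
        n[c] += 1
        s[c] += r
        total += r
    return total / horizon            # average per-round reward
```

Because exploration cost here scales with the number of clusters K rather than the number of arms M, the sketch mirrors the O(K log T) versus O(M log T) comparison made in the abstract.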
10. Task offloading in mobile edge computing using cost-based discounted optimal stopping
- Author
-
ALFahad Saleh, Wang Qiyuan, Anagnostopoulos Christos, and Kolomvatsos Kostas
- Subjects
service management, sequential decision making, task offloading, mobile edge computing, optimal stopping theory, Electronic computers. Computer science, QA75.5-76.95
- Abstract
The mobile edge computing (MEC) paradigm has emerged to improve the quality of service & experience of applications deployed in close proximity to end-users. Due to their restricted computational and communication resources, MEC nodes can provide access to a portion of the entire set of services and data gathered. Therefore, there are several obstacles to their management. Keeping track of all the services offered by the MEC nodes is challenging, particularly if their demand rates change over time. Received tasks (such as analytics queries, classification tasks, and model learning) require services to be invoked in real MEC use-case scenarios, e.g., smart cities. It is not unusual for a node to lack the necessary services or part of them. Undeniably, not all the requested services may be locally available; thus, MEC nodes must deal with the timely and appropriate choice of whether to carry out service replication (pull action) or task offloading (push action) to peer nodes in a MEC environment. In this study, we contribute a novel time-optimized mechanism based on optimal stopping theory, built on the cost-based decreasing service demand rates evidenced in various service management situations. Our mechanism tries to optimally solve the decision-making dilemma between the pull and push actions. The experimental findings of our mechanism and its comparative assessment with other methods found in the literature showcase the achieved optimal decisions with respect to certain cost-based objective functions over dynamic service demand rates.
- Published
- 2024
- Full Text
- View/download PDF
11. Solving Complex Sequential Decision-Making Problems by Deep Reinforcement Learning with Heuristic Rules
- Author
-
Nguyen, Thanh Thi, Nguyen, Cuong M., Huynh-The, Thien, Pham, Quoc-Viet, Nguyen, Quoc Viet Hung, Razzak, Imran, Reddi, Vijay Janapa, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Mikyška, Jiří, editor, de Mulatier, Clélia, editor, Paszynski, Maciej, editor, Krzhizhanovskaya, Valeria V., editor, Dongarra, Jack J., editor, and Sloot, Peter M.A., editor
- Published
- 2023
- Full Text
- View/download PDF
12. Statistical Matching Model in Centralized Two-sided Markets With Contexts, Constraints, and Incentive Compatibility Consideration
- Author
-
Li, Yuantong
- Subjects
Statistics, Incentive Compatible, Sequential Decision Making, Two-sided Matching
- Abstract
Two-sided online matching is a crucial aspect of optimizing social welfare sequentially within economic frameworks, achieved through pairing participants via third-party platforms. These platforms are utilized across various marketplaces such as college admissions, ride-sharing, doctor placement, dating, and job applications. Typically, these markets allocate an indivisible "good" to multiple agents based on mutual compatibility, with preferences often being unknown due to the large participant volume, which makes matching explicitly challenging. Moreover, matching markets inherently involve scarcity due to limited resources on both sides. This dissertation presents significant advances in statistical sequential modeling for two-sided online matching markets, considering dynamic markets, quota constraints, and participants' incentive compatibility. Situated at the intersection of sequential decision-making algorithm design and economics, this work introduces new algorithms, theories, and insights with applications spanning economics, statistics, and machine learning. Part I establishes foundational concepts of statistical sequential decision making and relevant economic terminology. Chapter 1 explores bandit algorithms, probability theory, and concentration inequalities, while Chapter 2 elucidates essential concepts of two-sided matching markets from an economic perspective, laying the groundwork for subsequent applications. Part II presents a theoretical framework for multi-agent competitive two-sided matching markets, crucial for online recommendation systems in job markets. The first project, detailed in Chapter 3, introduces an online statistical ridge estimation method for the dynamic matching problem (DMP), with an application to LinkedIn text data.
The second project, discussed in Chapter 4, presents an online statistical sequential decision-making method for the competing matching under complementary preferences recommendation problem (CMCPR), along with a novel algorithm addressing both complementary preferences and quota constraints simultaneously.
- Published
- 2024
13. Sequential decision making with feature-linear models
- Author
-
Janz, David, Hernández-Lobato, José Miguel, and Ghahramani, Zoubin
- Subjects
Kernel methods, Machine learning, Sequential decision making, Reinforcement learning, Bandit algorithms, Feature-linear models, Gaussian processes
- Abstract
This thesis is concerned with the problem of sequential decision making, where an agent interacts sequentially with an unknown environment and aims to maximise the sum of the rewards it receives. Our focus is on methods that model the reward as linear in some feature space. We consider a bandit problem, where the rewards are linear in a reproducing kernel Hilbert space, and a reinforcement learning setting with features given by a neural network. The thesis is split into two parts accordingly. In part I, we introduce a new algorithm for the optimisation of continuous functions with a known number of derivatives under noisy bandit feedback. The algorithm, Partitioned GP-UCB, combines ideas from classic kernel-based bandit algorithms and hierarchical optimisation methods to provide near-tight confidence intervals and strong guarantees on computational complexity. As part of our analysis, we develop a novel bound on the effective dimension of kernel ridge regression with the Matérn kernel with respect to the volume of the domain. Part I is mainly concerned with the derivation of this bound. In part II we tackle practical exploration in deep reinforcement learning, with a focus on methods that combine linear uncertainty estimates with feature embeddings given by deep Q-function networks. We observe a flaw within previous such work: while these methods enforce a temporal difference relationship on the mean state-action values given by the linear model, they do not constrain its inputs---the neural network embeddings---and these determine the uncertainty estimates of linear models. We show that such embeddings need to follow a certain successor feature structure and develop a model based on this insight: Successor Uncertainties. We demonstrate that our model is capable of solving hard tabular exploration challenges and that it scales to the Atari Learning Environment, where it attains strong scores relative to competing methods.
- Published
- 2021
- Full Text
- View/download PDF
14. Space-constrained autonomous reversing of articulated vehicles
- Author
-
Liu, Xuan Zuo and Cebon, David
- Subjects
629.2, Autonomous Driving, Articulated Vehicles, Model Predictive Control, Sequential Decision Making, Reinforcement Learning
- Abstract
This dissertation presents several modern control methods for autonomous reversing of long combination vehicles (LCVs). These approaches not only significantly improve the performance of previous autonomous reversing systems, but also address a major gap in the reversing control literature for LCVs. The methods were validated by implementing them at full-scale on experimental articulated vehicles owned by the 'Cambridge Vehicle Dynamics Consortium' (CVDC). Experimental results were in very good agreement with simulation results. Previous path following control methods for autonomous reversing of LCVs have focussed on minimising the tracking error between the rear-end of the combination and a desired path, irrespective of the motion of the rest of the vehicle. A significant disadvantage of these strategies is that the other parts of the vehicle, particularly the tractor unit, can experience large excursions from the reference path, thus making their implementation infeasible for manoeuvres in spatially-limited areas, such as warehouse yards and normal roads. The 'Minimum Swept Path Control' (MSPC) method was devised to reduce this problem by relaxing the requirement for very accurate path following, while minimising the maximum excursion of the vehicle. This strategy weights both the path-following error at the rear end of the vehicle and the swept path of the front end of the vehicle. MSPC enables the swept path to be reduced by about 50%, compared with path following control, which gives this method more realistic applications. The 'Lane-bounded Reversing Control' (LBRC) method requires a vehicle to satisfy the reversing objectives while constraining the motion to be within a specified 'lane'. The LBRC controller is 'intelligent', which means it can pursue an optimum route without tracking a desired path generated by a path planner and can proactively avoid future potential clashes. 
Hence, this controller enables the autonomous reversing system to perform a precise, minimum-cost and collision-free manoeuvre to a specified terminal position by planning ahead and making optimal decisions. The controller performance was evaluated in numerous realistic scenarios, both simulated and in field tests at full-scale. The analysis of LBRC provides a solid foundation for the development of more advanced control methods. Adaptive Lane-bounded Reversing Control (ALBRC) and Adaptive Bi-directional Control (ABC) systems were designed to improve the LBRC method. The tuning of the LBRC controller was based on empirical experience and there are many weights to be tuned in the controller configuration. To offset this drawback, the ALBRC algorithm was developed by attaching 'virtual bumpers' to the vehicle system states, and allowing the controller weights to adapt to lane boundaries and obstacles. The ALBRC method simplifies the original tuning process significantly. The ALBRC controller performs well in most cases. However, a solution is not guaranteed if the preview and control horizons are not tuned properly or a vehicle reverses from an arbitrary position. Hence, the ABC algorithm incorporates a so-called 'cusp technology', which allows a vehicle to move forward and backward to realign its position and orientation between attempts at the reversing manoeuvre. In this case, the preview and control horizons do not need to change in different scenarios. The ABC method can significantly reduce computational time compared with using a long preview horizon.
- Published
- 2021
15. Optimal sequential decision making with probabilistic digital twins
- Author
-
Christian Agrell, Kristina Rognlien Dahl, and Andreas Hafver
- Subjects
Probabilistic digital twin, Epistemic uncertainty, Sequential decision making, Partially observable Markov decision process, Deep reinforcement learning, Science, Technology
- Abstract
In this study, we present a formal definition of the probabilistic digital twin (PDT). Digital twins are emerging in many industries, typically consisting of simulation models and data associated with a specific physical system. In order to define probabilistic digital twins, we discuss how epistemic uncertainty can be treated using measure theory, by modelling epistemic information via σ-algebras. A gentle introduction to the necessary mathematical theory is provided throughout the paper, together with a number of examples to illustrate the core concepts. We then introduce the problem of optimal sequential decision making, that is, when the outcome of each decision may inform the next. We discuss how this problem may be solved theoretically, and the current limitations that prohibit most practical applications. As a numerically tractable alternative, we propose a generic approximate solution using deep reinforcement learning together with neural networks defined on sets. We illustrate the method on a practical problem, considering optimal information gathering for the estimation of a failure probability.
- Published
- 2023
- Full Text
- View/download PDF
16. Multimodal embodied attribute learning by robots for object-centric action policies.
- Author
-
Zhang, Xiaohan, Amiri, Saeid, Sinapov, Jivko, Thomason, Jesse, Stone, Peter, and Zhang, Shiqi
- Subjects
ROBOTS, CURIOSITY, ROBOT control systems, COGNITIVE robotics, IDENTIFICATION, MACHINE learning, REINFORCEMENT learning
- Abstract
Robots frequently need to perceive object attributes, such as red, heavy, and empty, using multimodal exploratory behaviors, such as look, lift, and shake. One possible way for robots to do so is to learn a classifier for each perceivable attribute given an exploratory behavior. Once the attribute classifiers are learned, they can be used by robots to select actions and identify attributes of new objects, answering questions such as "Is this object red and empty?" In this article, we introduce a robot interactive perception problem, called Multimodal Embodied Attribute Learning (MEAL), and explore solutions to this new problem. Under different assumptions, there are two classes of MEAL problems. Offline-MEAL problems are defined in this article as learning attribute classifiers from pre-collected data, and sequencing actions towards attribute identification under the challenging trade-off between information gains and exploration action costs. For this purpose, we introduce Mixed Observability Robot Control (MORC), an algorithm for offline-MEAL problems, that dynamically constructs both fully and partially observable components of the state for multimodal attribute identification of objects. We further investigate a more challenging class of MEAL problems, called online-MEAL, where the robot assumes no pre-collected data, and works on both attribute classification and attribute identification at the same time. Based on MORC, we develop an algorithm called Information-Theoretic Reward Shaping (MORC-ITRS) that actively addresses the trade-off between exploration and exploitation in online-MEAL problems. MORC and MORC-ITRS are evaluated in comparison with competitive MEAL baselines, and results demonstrate the superiority of our methods in learning efficiency and identification accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
17. Optimal Real Estate Pricing and Offer Acceptance Strategy
- Author
-
Eugene Khmelnitsky and Gonen Singer
- Subjects
Housing market policy, dynamic pricing, sequential decision making, optimal stopping, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
We consider the problem of choosing the best of a set of sequential offers proposed by the market in a house-selling process. During each decision epoch, the seller sets a listing price, observes the offers and decides whether to accept the maximum one or to reject all of them. We model a fixed holding cost, which is the constant marketing cost of searching for buyers, and a variable cost that is proportional to the number of offers received during each epoch. The objective is to maximize the expected revenue. Most previous studies assume a stationary known distribution from which the buyers’ offers are generated and which reflects the market valuation of the house. In contrast, we assume that the number of incoming offers, and the distribution from which each individual offer is generated, are affected by the seller’s listing price (i.e., price-based demand response). Thus, we propose a new approach for the selling policy, which consists of the listing price and the offer acceptance threshold in each period. We derive the seller’s optimal selling policy and apply it to a scenario involving the sale of individual residential properties in Ames (Iowa), which yields results consistent with empirical observations.
- Published
- 2023
- Full Text
- View/download PDF
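The offer-acceptance side of the problem above has a classic backward-induction structure. The sketch below is a stripped-down version assuming i.i.d. Uniform(0, 1) offers, one offer per period, and a fixed holding cost; it omits the paper's listing-price feedback and variable costs, and the function name and numbers are illustrative.

```python
def acceptance_thresholds(horizon, holding_cost):
    # Finite-horizon optimal stopping with offers ~ Uniform(0, 1):
    # accept an offer iff it exceeds the value of continuing to search.
    # V[t] is the expected revenue of entering period t with no sale,
    # which doubles as the acceptance threshold for that period.
    V = [0.0] * (horizon + 1)          # V[horizon] = 0: no periods left
    for t in range(horizon - 1, -1, -1):
        c = V[t + 1]
        # E[max(offer, c)] for offer ~ U(0, 1) equals c^2 + (1 - c^2)/2
        V[t] = max(c * c + (1.0 - c * c) / 2.0 - holding_cost, 0.0)
    return V
```

The thresholds fall as the deadline approaches: with many periods left, the seller can afford to wait for a high offer, while in the final period any nonnegative offer beats walking away with nothing.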
18. Sequential Nature of Recommender Systems Disrupts the Evaluation Process
- Author
-
Shirali, Ali, Filipe, Joaquim, Editorial Board Member, Ghosh, Ashish, Editorial Board Member, Prates, Raquel Oliveira, Editorial Board Member, Zhou, Lizhu, Editorial Board Member, Boratto, Ludovico, editor, Faralli, Stefano, editor, Marras, Mirko, editor, and Stilo, Giovanni, editor
- Published
- 2022
- Full Text
- View/download PDF
19. A review of online supervised learning.
- Author
-
Singh, Charanjeet and Sharma, Anuj
- Abstract
Online learning emerged as a promising solution to handle large data problems. The high-level performance witnessed in real-life applications of online learning established dramatic advances in this field. The varying nature of the data needs special attention from a research point of view, as it has emerged as a common challenge in many domains. Interestingly, online learning's response to this varying nature of the data is one of the promising solutions. We continue in this direction by covering successful algorithms in the literature and their complexities to meet new challenges in this field. In particular, we have covered the working of online supervised learning algorithms and their bounds on mistake rate. A suitable and systematic review of online supervised learning algorithms is crucial for domain understanding and a step toward a solution to meet future challenges in this field. We cover online supervised learning with its common framework, describe algorithms in ascending order of their development and real-life application, and discuss their theoretical analysis. The present paper also includes an experimental comparison to understand how online learning algorithms respond to benchmarked datasets, as well as future challenges in this field. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
20. Drafting strategies in fantasy football: A study of competitive sequential human decision making
- Author
-
Michael D. Lee and Siqi Liu
- Subjects
decision strategies ,competitive decision making ,sequential decision making ,sports drafts ,fantasy football ,Social Sciences ,Psychology ,BF1-990 - Abstract
Drafting is a competitive task in which a set of decision makers choose from a set of resources sequentially, with each resource becoming unavailable once selected. How people make these choices raises basic questions about human decision making, including people's sensitivity to the statistical regularities of the resource environment, their ability to reason about the behavior of their competitors, and their ability to execute and adapt sophisticated strategies in dynamic situations involving uncertainty. Sports provides one real-world example of drafting behavior, in which a set of teams draft players from an available pool in a well-regulated way. Fantasy sport competitions provide potentially large data sets of drafting behavior. We study fantasy football drafting behavior from the 2017 National Football League (NFL) season based on 1350 leagues hosted by the http://sleeper.app platform. We find that people are sensitive to some important environmental regularities in the order in which they draft players, but also present evidence that they use a narrower range of strategies than is likely optimal in terms of team composition. We find little to no evidence for the use of the complicated but well-documented strategy known as handcuffing, and no evidence of irrational influence from individual-level biases for different NFL teams. We do, however, identify a set of circumstances for which there is clear evidence that people's choices are strongly influenced by the immediately preceding choice made by a competitor.
- Published
- 2022
- Full Text
- View/download PDF
21. Dynamic path learning in decision trees using contextual bandits.
- Author
-
Ju, Weiyu, Yuan, Dong, Bao, Wei, Ge, Liming, and Zhou, Bing Bing
- Subjects
- *
DECISION trees , *MACHINE learning , *ROBBERS , *DECISION making , *CONTEXTUAL analysis , *LEARNING , *REINFORCEMENT learning - Abstract
We present a novel online decision-making solution, where the optimal path of a given decision tree is dynamically found based on contextual bandit analysis. At each round, the learner finds a path in the decision tree by making a sequence of decisions following the tree structure and receives an outcome when a terminal node is reached. At each decision node, the environment information is observed to hint at which child node to visit, resulting in a better outcome. The objective is to learn the context-specific optimal decision for each decision node to maximize the accumulated outcome. In this paper, we propose the Dynamic Path Identifier (DPI), a learning algorithm where a contextual bandit is applied to every decision node, and the observed outcome is used as the reward of the previous decisions of the same round. The technical difficulty of DPI is the high exploration challenge caused by the width (i.e., the number of paths) of the tree as well as the large context space. We mathematically prove that DPI's regret per round approaches zero as the number of rounds approaches infinity. We also prove that the regret is not a function of the number of paths in the tree. Numerical evaluations are provided to complement the theoretical analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
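A hedged sketch of the idea in this abstract: put a bandit at each tree level and feed the terminal outcome back to every decision of the round. This is a toy stand-in, not the DPI algorithm; DPI uses contextual bandits with per-node learners, whereas here the context is a discrete key, one epsilon-greedy bandit is shared per level, and the names (`NodeBandit`, `run_round`) are hypothetical.

```python
import random

class NodeBandit:
    """Epsilon-greedy bandit for one level of the decision tree.

    A simplified stand-in for a per-node contextual bandit: the context
    is assumed discrete, and each (context, child) pair keeps a running
    mean of the end-of-round outcome.
    """
    def __init__(self, n_children, eps=0.1):
        self.n, self.eps = n_children, eps
        self.counts, self.means = {}, {}

    def choose(self, context):
        if random.random() < self.eps:          # explore uniformly
            return random.randrange(self.n)
        return max(range(self.n),
                   key=lambda c: self.means.get((context, c), 0.0))

    def update(self, context, child, reward):
        k = (context, child)
        n = self.counts[k] = self.counts.get(k, 0) + 1
        m = self.means.get(k, 0.0)
        self.means[k] = m + (reward - m) / n    # incremental mean

def run_round(bandits, depth, context, reward_fn):
    """One round: choose a child at every level, reach a terminal node,
    then feed the observed outcome back to every decision of the round."""
    path = [(d, bandits[d].choose(context)) for d in range(depth)]
    reward = reward_fn(context, path)
    for d, child in path:
        bandits[d].update(context, child, reward)
    return reward
```

The delayed credit assignment (every node on the path receives the same terminal reward) is the feature this sketch shares with the paper's setup.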
22. Knowledge Gradient: Capturing Value of Information in Iterative Decisions under Uncertainty.
- Author
-
Lee, Donghun
- Subjects
- *
VALUE capture , *ALGORITHMS - Abstract
Many real-life problems that involve decisions under uncertainty are often sequentially repeated and can be approached iteratively. Knowledge Gradient (KG) formulates the decision-under-uncertainty problem as repeatedly estimating the value of information observed from each possible decision and then committing to the decision with the highest estimated value. This paper aims to provide a multi-faceted overview of modern research on KG: firstly, how the KG algorithm was originally formulated, with an example of its most frequently used implementation; secondly, how KG algorithms relate to other problems and iterative algorithms, in particular Bayesian optimization; thirdly, the significant trends found in modern theoretical research on KG; and lastly, the diverse examples of applications that use KG in their key decision-making step. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
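For concreteness, here is a minimal rendering of the most frequently used KG variant: the knowledge-gradient policy for independent normal beliefs with known measurement-noise variance, using the standard factor σ̃ᵢ·f(−|μᵢ − max_{j≠i} μⱼ|/σ̃ᵢ) with f(z) = zΦ(z) + φ(z). This is a textbook sketch (assuming at least two alternatives), not code from the paper.

```python
import math

def kg_choose(mu, sigma2, noise2):
    """Knowledge-gradient choice for independent normal beliefs.

    mu[i], sigma2[i]: posterior mean/variance of alternative i;
    noise2: known measurement-noise variance. Returns the index with
    the largest KG factor sigma_tilde * f(z), where
    z = -|mu_i - best_other| / sigma_tilde and f(z) = z*Phi(z) + phi(z).
    """
    def phi(z):  # standard normal pdf
        return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    def Phi(z):  # standard normal cdf
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))
    def f(z):
        return z * Phi(z) + phi(z)

    kg = []
    for i in range(len(mu)):
        s_tilde = sigma2[i] / math.sqrt(sigma2[i] + noise2)
        best_other = max(mu[j] for j in range(len(mu)) if j != i)
        z = -abs(mu[i] - best_other) / s_tilde
        kg.append(s_tilde * f(z))
    return max(range(len(mu)), key=lambda i: kg[i])
```

With equal means, the policy measures the most uncertain alternative, which is exactly the value-of-information behavior the survey describes.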
23. Stochastic control of geological carbon storage operations using geophysical monitoring and deep reinforcement learning.
- Author
-
Noh, Kyubo and Swidinsky, Andrei
- Subjects
DEEP reinforcement learning ,GREENHOUSE gas mitigation ,ARTIFICIAL intelligence ,MACHINE learning ,SEQUENTIAL learning - Abstract
• We explore the application of deep reinforcement learning for optimizing geological carbon storage operations. • We assess the efficacy of utilizing geoscientific monitoring data as input for the optimization process. • The impact of subsurface uncertainty on the optimization process is investigated. • The method demonstrates an increase of 1–8% in profit compared to the benchmark. • The optimization process is found to be significantly influenced by subsurface uncertainty. Geological carbon storage (GCS) is the process of injecting and storing carbon dioxide (CO2) in the subsurface to reduce greenhouse gas emissions. Safe and profitable GCS operations require effective decision-making in the presence of uncertain geological models, a process which can often be facilitated with geophysical monitoring. In this study, we examine how sequential decision-making algorithms can be combined with geophysical measurements for the optimal control of GCS operations. Specifically, we develop an artificial intelligence model using deep reinforcement learning (DRL) that takes geophysical time-lapse gravity and well pressure monitoring data as input and delivers an optimal CO2 injection policy. The objective of the problem at hand is to maximize the profit of a hypothetical GCS operation while mitigating the potential for induced seismicity, by training DRL agents using combined geostatistical, reservoir and geophysical simulation. Comparisons against two benchmarks – a constant injection strategy and an injection schedule optimized using a commercial reservoir simulator toolbox – show that the stochastic control of such operations from subsurface monitoring data using deep reinforcement learning is feasible. Evaluation results show that DRL agent behavior generates profits which are on average 1 to 8 percent higher than what is possible through a constant injection approach.
Furthermore, we show that DRL can generate optimal injection policies applicable to the true (yet previously unseen) subsurface given carefully managed levels of uncertainty. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
24. Evaluating the Behavioural Impact of Risk Exposure and Quality of Services Attributes on Preferences of Airline Customers: A Choice Behaviour Experiment
- Author
-
D’Alonzo, Luca, Chiara Leva, M., Bucciarelli, Edgardo, Mattoscio, Nicola, Kacprzyk, Janusz, Series Editor, Bucciarelli, Edgardo, editor, Chen, Shu-Heng, editor, Corchado, Juan M., editor, and Parra D., Javier, editor
- Published
- 2021
- Full Text
- View/download PDF
25. Optimization of Mitigation Strategies During Epidemics Using Offline Reinforcement Learning
- Author
-
Vereshchaka, Alina, Kulkarni, Nitin, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Thomson, Robert, editor, Hussain, Muhammad Nihal, editor, Dancy, Christopher, editor, and Pyke, Aryn, editor
- Published
- 2021
- Full Text
- View/download PDF
26. Explaining Human Decision Making in Optimal Stopping Tasks
- Author
-
Baumann, Christiane, Singmann, Henrik, Kaxiras, Vassilios E, Gershman, Samuel J, and von Helversen, Bettina
- Subjects
optimal stopping ,cognitive modeling ,sequential decision making ,probabilistic choice behavior - Abstract
In an optimal stopping problem, people encounter a sequence of options and are tasked with choosing the best one; once an option is rejected, it is no longer available. Recent studies of optimal stopping suggest that people compare the current option with an internal threshold and accept it when the option exceeds the threshold. In contrast, we propose that humans decide to accept or reject an option based on an estimate of the probability that a better option will be observed in the future. We develop a computational model that formalizes this idea, and compare the model to the optimal policy in two experiments. Our model provides a better account of the data than the optimal model. In particular, our model explains how the distributional structure of option values affects stopping behavior, providing a step towards a more complete psychological theory of optimal stopping.
- Published
- 2018
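A minimal sketch of the decision rule this abstract proposes, assuming a known value distribution F: accept the current option when the probability that any later option beats it drops below a cutoff. The function name and the fixed cutoff are illustrative assumptions, not the authors' fitted model.

```python
def accept(x, remaining, cdf, cutoff=0.5):
    """Probability-based stopping rule: accept option x when the chance
    that at least one of the `remaining` future draws exceeds it,
    1 - F(x)**remaining, falls below `cutoff`.

    `cdf` is the (assumed known) distribution of option values.
    """
    p_better_later = 1.0 - cdf(x) ** remaining
    return p_better_later < cutoff
```

Unlike a fixed internal threshold, this rule automatically loosens as fewer options remain, which is the distribution-sensitivity the abstract highlights.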
27. An optimal stopping policy for car rental businesses with purchasing customers.
- Author
-
Fontem, Belleh
- Subjects
- *
RENTAL automobiles , *AUTOMOBILE leasing & renting , *CONSUMERS , *MARKET value , *THEORY of the firm , *RENT - Abstract
We analyze decisions for a car rental firm using a common pool of cars to serve both rental, and purchasing customers. The rental customers arrive successively, and rent out cars for random durations while effectuating random incremental mileages on them. This stochastic rental behavior makes the decision of when to sell a rental car quite a crucial one for the firm because it involves a certain amount of risk. On one hand, selling a car when its mileage is low proactively avoids a huge decline in the car's residual market value (even though it could also cause the firm to forfeit income from future rental customers who intend to rent that car for long durations while driving it sparingly). On the other hand, delaying selling is equally risky because the sample of early rental customers whom that car serves may only successively keep the car for short durations while significantly adding to its mileage. Such opportunistic customers would therefore have the effect of drastically diminishing the car's market value without providing sufficient rental income to compensate. We use optimal stopping theory to derive the firm's optimal selling policy algorithm which unfortunately comes with a practical implementation challenge. To address this issue, we propose three heuristic selling policies, one of which is based on the restart-in formulation introduced by Katehakis and Veinott (Math Oper Res 12(2):262–268, 1987). Numerical experiments using real and artificial parameter settings (1) reveal conditions under which the proposed heuristic policies outperform the firm's current (suboptimal) policy, and (2) demonstrate how all four suboptimal policies compare to the optimal policy. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
28. Stopping Criterion Design for Recursive Bayesian Classification: Analysis and Decision Geometry.
- Author
-
Kocanaogullari, Aziz, Akcakaya, Murat, and Erdogmus, Deniz
- Subjects
- *
BAYESIAN analysis , *DECISION making , *COMPUTER interfaces , *GEOMETRY , *THRESHOLDING algorithms , *TRACKING radar - Abstract
Systems that are based on recursive Bayesian updates for classification limit the cost of evidence collection through certain stopping/termination criteria and accordingly enforce decision making. Conventionally, two termination criteria based on pre-defined thresholds over (i) the maximum of the state posterior distribution and (ii) the state posterior uncertainty are commonly used. In this paper, we propose a geometric interpretation of the state posterior progression and accordingly provide a point-by-point analysis of the disadvantages of using such conventional termination criteria. For example, through the proposed geometric interpretation we show that confidence thresholds defined over the maximum of the state posteriors suffer from stiffness that results in unnecessary evidence collection, whereas uncertainty-based thresholding methods are fragile to the number of categories and terminate prematurely if some state candidates are already discovered to be unfavorable. Moreover, both types of termination methods neglect the evolution of posterior updates. We then propose a new stopping/termination criterion with a geometrical insight to overcome the limitations of these conventional methods and provide a comparison in terms of decision accuracy and speed. We validate our claims using simulations and real experimental data obtained through a brain-computer-interfaced typing system. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
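The two conventional termination criteria that this paper analyzes (and argues against) are easy to state in code. The sketch below shows a recursive Bayesian update together with (i) a MAP-confidence threshold and (ii) an entropy threshold; the paper's proposed geometric criterion is not reproduced here, and the function names are illustrative.

```python
import math

def bayes_update(prior, likelihoods):
    """One recursive Bayesian update: multiply the prior by per-class
    evidence likelihoods and renormalize."""
    post = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(post)
    return [p / z for p in post]

def stop_map(posterior, tau=0.9):
    """Criterion (i): stop once the MAP probability clears a threshold."""
    return max(posterior) >= tau

def stop_entropy(posterior, h_max=0.5):
    """Criterion (ii): stop once the posterior Shannon entropy (nats)
    drops below a threshold."""
    h = -sum(p * math.log(p) for p in posterior if p > 0)
    return h <= h_max
```

The paper's critique in code terms: `stop_map` ignores how the non-MAP mass is spread, while `stop_entropy`'s scale depends on the number of categories; neither looks at the trajectory of successive posteriors.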
29. Online Linear Programming: Dual Convergence, New Algorithms, and Regret Bounds.
- Author
-
Li, Xiaocheng and Ye, Yinyu
- Subjects
LINEAR programming ,MACHINE learning ,ONLINE algorithms ,STOCHASTIC approximation ,ALGORITHMS ,COMPUTER programming education - Abstract
We study an online linear programming (OLP) problem under a random input model in which the columns of the constraint matrix along with the corresponding coefficients in the objective function are independently and identically drawn from an unknown distribution and revealed sequentially over time. Virtually all existing online algorithms were based on learning the dual optimal solutions/prices of the linear programs (LPs), and their analyses were focused on the aggregate objective value and solving the packing LP, where all coefficients in the constraint matrix and objective are nonnegative. However, two major open questions were as follows. (i) Does the set of LP optimal dual prices learned in the existing algorithms converge to those of the "offline" LP? (ii) Could the results be extended to general LP problems where the coefficients can be either positive or negative? We resolve these two questions by establishing convergence results for the dual prices under moderate regularity conditions for general LP problems. Specifically, we identify an equivalent form of the dual problem that relates the dual LP with a sample average approximation to a stochastic program. Furthermore, we propose a new type of OLP algorithm, the action-history-dependent learning algorithm, which improves previous algorithm performances by taking into account the past input data and the past decisions/actions. We derive an O(log n log log n) regret bound (under a locally strong convexity and smoothness condition) for the proposed algorithm, against the O(n) bound for typical dual-price learning algorithms, where n is the number of decision variables. Numerical experiments demonstrate the effectiveness of the proposed algorithm and the action-history-dependent design. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
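To make the dual-price idea concrete, here is a deliberately simplified single-resource, unit-consumption version: the dual price is a sample quantile of past rewards chosen so that accepting whenever the reward beats the price spends the budget at the right rate. This is a toy under those stated assumptions, not the paper's action-history-dependent algorithm, and all names are hypothetical.

```python
def dual_price(past_rewards, budget_rate):
    """Toy dual price for the unit-consumption online LP
    max sum r_t x_t  s.t.  sum x_t <= b, x_t in [0, 1]:
    take the (1 - b/n) sample quantile of past rewards, so that
    accepting whenever r_t > p consumes the budget at rate b/n."""
    rs = sorted(past_rewards)
    k = int(len(rs) * (1.0 - budget_rate))
    k = min(max(k, 0), len(rs) - 1)
    return rs[k]

def run_olp(rewards, budget, warmup=50):
    """Accept/reject a reward stream, relearning the dual price at every
    step from the history so far (the history-dependence of the paper's
    algorithm is simplified to price re-estimation)."""
    n = len(rewards)
    accepted, used = [], 0
    for t, r in enumerate(rewards):
        if used >= budget:
            break
        if t < warmup:
            continue                    # pure observation phase
        p = dual_price(rewards[:t], budget / n)
        if r > p:
            accepted.append(r)
            used += 1
    return accepted
```

The single scalar price replaces solving the dual LP; with many resources one would learn a price vector instead.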
30. Learning in Safety-critical, Lifelong, and Multi-agent Systems: Bandits and RL Approaches
- Author
-
Amani Geshnigani, Sanae
- Subjects
Electrical engineering ,Computer science ,Computer engineering ,Markov decision process ,Multi-armed bandit ,Optimization ,Reinforcement learning ,Sequential decision making - Abstract
Sequential decision-making problems arise at every occasion that agents repeatedly interact with an unknown environment in an effort to maximize a certain notion of reward gained from interactions with this environment. Examples are abundant in online advertising, online gaming, robotics, deep learning, dynamic pricing, network routing, etc. In particular, multi-armed bandits (MAB) model the interaction between the agent and the unknown environment as follows. The agent repeatedly acts by pulling arms and after an arm is pulled, she receives a stochastic reward; the goal at the end of this process is to select actions that maximize the expected cumulative reward without knowledge of the arms' distributions. Albeit simple, this model is widely applicable. On the other hand, many sequential decision making occasions deal with more complicated environments modeled through Markov Decision Processes (MDPs), where the environment's status constantly changes as a result of taking actions and makes learning even more challenging. The field of reinforcement learning (RL) defines a principled foundation for this methodology, based on classical dynamic programming algorithms for solving MDPs. Our research goal is to expand the applicability of bandit and RL algorithms to new application domains: specifically, safety-critical, lifelong and distributed physical systems, such as robotics, wireless networks, the power grid and medical trials. One distinguishing feature of many of such "new" potential applications of bandits and RL is their safety-critical nature. Specifically, the algorithm's chosen policies must satisfy certain system constraints that if violated can lead to catastrophic results for the system. Importantly, the specifics of these constraints often change based on the interactions with the unknown environment; thus, they are often unknown themselves.
This leads to the new challenge of balancing the goal of reward maximization with the restriction of playing policies that are safe. We modeled this problem through bandits and RL frameworks with linear reward and constraint structures. It turns out that even these seemingly simple safe linear bandit and RL formulations are more intricate than the original setting without safety constraints. In particular, simple variations of existing algorithms can be shown to be highly suboptimal. Using appropriate tools from high-dimensional probability and the exploration-exploitation dilemma, we were able to design novel algorithms and to guarantee that they not only respect the safety constraints, but also have performance comparable to the setting without safety constraints. Recently, there has been a surging interest in designing lifelong learning agents that can continuously learn to solve multiple sequential decision making problems in their lifetimes. This scenario is in particular motivated by building multi-purpose embodied intelligence, such as robots working in a weakly structured environment. Typically, curating all tasks beforehand for such problems is nearly infeasible, and the problems the agent is tasked with may be adaptively selected based on the agent's past behaviors. Consider a household robot as an example. Since each household is unique, it is difficult to anticipate upfront all scenarios the robot would encounter. In this direction, we theoretically study lifelong RL in a regret minimization setting, where the agent needs to solve a sequence of tasks using rewards in an unknown environment while balancing exploration and exploitation.
Motivated by the embodied intelligence scenario, we suppose that tasks differ in rewards, but share the same state and action spaces and transition dynamics. Another distinguishing feature of the envisioned applications of bandit algorithms is that interactions involve multiple distributed agents/learners (e.g., wireless/sensor networks). This calls for extensions of the traditional bandit setting to networked systems. In many such systems, it is critical to maintain efficient communication among the network while achieving a good performance in terms of accumulated reward, usually measured as the network's regret. In view of this, for the problem of distributed contextual linear bandits, we prove a minimax lower bound on the communication cost of any distributed contextual linear bandit algorithm with stochastic contexts that is optimal in terms of regret. We further propose an algorithm whose regret is optimal and whose communication rate matches this lower bound, and therefore it is provably optimal in terms of both regret and communication rate.
- Published
- 2023
31. Abstraction-Based Decision Making for Statistical Properties (Invited Talk)
- Author
-
Cano, Filip, Henzinger, Thomas A., Könighofer, Bettina, Kueffner, Konstantin, and Mallik, Kaushik
- Abstract
Sequential decision-making in probabilistic environments is a fundamental problem with many applications in AI and economics. In this paper, we present an algorithm for synthesizing sequential decision-making agents that optimize statistical properties such as maximum and average response times. In the general setting of sequential decision-making, the environment is modeled as a random process that generates inputs. The agent responds to each input, aiming to maximize rewards and minimize costs within a specified time horizon. The corresponding synthesis problem is known to be PSPACE-hard. We consider the special case where the input distribution, reward, and cost depend on input-output statistics specified by counter automata. For such problems, this paper presents the first PTIME synthesis algorithms. We introduce the notion of statistical abstraction, which clusters statistically indistinguishable input-output sequences into equivalence classes. This abstraction allows for a dynamic programming algorithm whose complexity grows polynomially with the considered horizon, making the statistical case exponentially more efficient than the general case. We evaluate our algorithm on three different application scenarios of a client-server protocol, where multiple clients compete via bidding to gain access to the service offered by the server. The synthesized policies optimize profit while guaranteeing that none of the server’s clients is disproportionately starved of the service.
- Published
- 2024
- Full Text
- View/download PDF
32. Towards a Causal Decision-Making Framework for Recommender Systems
- Author
-
Cavenaghi, E., Zanga, A., Stella, F., and Zanker, M.
- Abstract
Causality is gaining more and more attention in the machine learning community and consequently also in recommender systems research. The limitations of learning offline from observed data are widely recognized; however, applying debiasing strategies like Inverse Propensity Weighting does not always solve the problem of making wrong estimates. This concept paper contributes a summary of debiasing strategies in recommender systems and the design of several toy examples demonstrating the limits of these commonly applied approaches. We therefore propose to map the causality frameworks of potential outcomes and structural causal models onto the recommender systems domain in order to foster future research and development, for instance by applying causal discovery strategies to offline data to learn the causal graph, so as to compute counterfactuals or improve debiasing strategies.
- Published
- 2024
33. General Tree Evaluation for AlphaZero
- Author
-
Jaldevik, Albin
- Abstract
Over the last decade, there have been significant advances in model-based deep reinforcement learning. One of the most successful such algorithms is AlphaZero, which combines Monte Carlo Tree Search with deep learning. AlphaZero and its successors commonly describe a unified framework for tree construction and acting: for instance, build the tree with PUCT and act according to visitation counts. Policies based on visitation counts inherently make assumptions about the tree construction. This is problematic since it constrains the construction algorithm; for example, breadth-first tree construction yields a uniform visitation policy. To address this, we investigate the goals when extracting policies from decision trees and propose novel construction-decoupled policies. Furthermore, we use these to modify how decision nodes are evaluated and utilize this during tree construction. We support the claim that our novel policies can benefit AlphaZero with theoretical analysis and empirical evidence. Our results on classical Gym environments show that the benefits are especially prominent for limited simulation budgets. The code is available through GitHub: https://github.com/albinjal/GeneralAlphaZero
- Published
- 2024
34. Beyond risk preferences in sequential decision-making: How probability representation, sequential structure and choice perseverance bias optimal search.
- Author
-
Baumann C, Schlegelmilch R, and Helversen BV
- Abstract
Sequential decision-making, where choices are made one after the other, is an important aspect of our daily lives. For example, when searching for a job or an apartment, or deciding when to buy or sell a stock, people often have to make decisions without knowing what future opportunities might arise. These situations, which are known as optimal stopping problems, involve a risk associated with the decision to either stop or continue searching. However, previous research has not consistently found a clear connection between individuals' search behavior in these tasks and their risk preferences as measured in controlled experimental settings. In this paper, we explore how particular characteristics of optimal stopping tasks affect people's choices, extending beyond their stable risk preferences. We find that (1) the way the underlying sampling distribution is presented (whether it is based on experience or description), (2) the sequential presentation of options, and (3) the unequal frequencies of choices to reject versus to accept significantly bias people's choices. These results shed light on the complex nature of decisions that unfold sequentially and emphasize the importance of incorporating context factors when studying human decision behavior. (Copyright © 2024 Elsevier B.V. All rights reserved.)
- Published
- 2024
- Full Text
- View/download PDF
35. Proximal Policy Optimization for Radiation Source Search
- Author
-
Philippe Proctor, Christof Teuscher, Adam Hecht, and Marek Osiński
- Subjects
deep reinforcement learning ,source search and localization ,active search ,gamma radiation ,source parameter estimation ,sequential decision making ,Nuclear engineering. Atomic power ,TK9001-9401 - Abstract
Rapid search and localization for nuclear sources can be an important aspect in preventing human harm from illicit material in dirty bombs or from contamination. In the case of a single mobile radiation detector, there are numerous challenges to overcome, such as weak source intensity, multiple sources, background radiation, and the presence of obstructions, i.e., a non-convex environment. In this work, we investigate the sequential decision making capability of deep reinforcement learning in the nuclear source search context. A novel neural network architecture (RAD-A2C) based on the advantage actor critic (A2C) framework and a particle filter gated recurrent unit for localization is proposed. Performance is studied in a randomized 20×20 m convex and non-convex simulation environment across a range of signal-to-noise ratios (SNRs) for a single detector and single source. RAD-A2C performance is compared to both an information-driven controller that uses a bootstrap particle filter and a gradient search (GS) algorithm. We find that RAD-A2C has comparable performance to the information-driven controller across SNRs in a convex environment. RAD-A2C far outperforms the GS algorithm in the non-convex environment, with a greater than 95% median completion rate for up to seven obstructions.
- Published
- 2021
- Full Text
- View/download PDF
36. ORDER MATTERS: RATING SERVICE PROFESSIONALS FIRST REDUCES TIPPING AMOUNT.
- Author
-
Chen, Jinjie, Xu, Alison Jing, Rodas, Maria A., and Liu, Xuefeng
- Subjects
RATING agencies (Finance) ,TIPS & tipping (Gratuities) - Published
- 2022
37. A consecutive hybrid spiking-convolutional (CHSC) neural controller for sequential decision making in robots.
- Author
-
Azimirad, Vahid, Ramezanlou, Mohammad Tayefe, Sotubadi, Saleh Valizadeh, and Janabi-Sharifi, Farrokh
- Subjects
- *
DECISION making , *CONVOLUTIONAL neural networks , *ROBOTS , *REINFORCEMENT learning , *INFORMATION networks - Abstract
[Display omitted] • Situation detection and control action generation occur in the trained CHSC network. • Each learned policy in the proposed method has an address in the SNN. • Information between the networks is transferred through a mediator algorithm. • The proposed SNN is capable of expanding itself after observing new objects. • The CHSC system can adapt to new conditions and forget policies learned in the past. In this paper, a Consecutive Hybrid Spiking-Convolutional (CHSC) neural controller is proposed by integrating Convolutional Neural Networks (CNNs) and Spiking Neural Networks (SNNs). The proposed controller is mainly developed around an image-based control strategy. For this purpose, the CNN processes the input images and detects objects. The Neuronal Data Transfer Interface (NDTI) algorithm receives information from the CNN, calculates the position of the objects, and expands the SNN based on the input data. Finally, the SNN generates the proper motor commands according to the detected object. The applicability of the proposed controller in responding to different situations is shown. The expansion of the network (i.e., creating new neurons and connections during training) is also examined. The CHSC makes the SNN grow upon learning from interaction with new objects. The adaptability of the method is also evaluated. The results reveal that the SNN can adapt itself to changes in the environment without any supervisor. This characteristic allows the robot to forget learned policies and learn new ones without changing the controller's whole structure. Each learned policy has its own address in the SNN. Unlike in second-generation networks (e.g., the Perceptron), a change in a feature of the object only alters the synaptic connections associated with that object, so a small change in the environment does not affect the entire network. The method is verified experimentally using a two-degrees-of-freedom robotic arm. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
38. Multi-Task Offloading Based on Optimal Stopping Theory in Edge Computing Empowered Internet of Vehicles.
- Author
-
Mu, Liting, Ge, Bin, Xia, Chenxing, and Wu, Cai
- Subjects
- *
EDGE computing , *MOBILE computing , *POWER (Social sciences) , *INTERNET , *SENSITIVITY analysis - Abstract
Vehicular edge computing is a new computing paradigm. By introducing edge computing into the Internet of Vehicles (IoV), service providers are able to serve users with low-latency services, as edge computing deploys resources (e.g., computation, storage, and bandwidth) at the side close to the IoV users. When mobile nodes are moving and generating structured tasks, they can connect with the roadside units (RSUs) and then choose a proper time and several suitable Mobile Edge Computing (MEC) servers to offload the tasks. However, how to offload tasks in sequence efficiently is challenging. In response to this problem, in this paper, we propose a time-optimized, multi-task-offloading model adopting the principles of Optimal Stopping Theory (OST) with the objective of maximizing the probability of offloading to the optimal servers. When the server utilization is close to uniformly distributed, we propose another OST-based model with the objective of minimizing the total offloading delay. The proposed models are experimentally compared and evaluated with related OST models using simulated data sets and real data sets, and sensitivity analysis is performed. The results show that the proposed offloading models can be efficiently implemented in the mobile nodes and significantly reduce the total expected processing time of the tasks. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
39. Dependent Choices in Employee Selection: Modeling Choice Compensation and Consistency
- Author
-
Krefeld-Schwalb, Antonia, Scheibehenne, Benjamin, Rieskamp, Jörg, and Berkowitsch, Nicolas
- Subjects
sequential sampling model ,preferential choice ,sequential decision making ,employee selection - Abstract
Past choices can influence subsequent choices in employee selection. Previous approaches have described similar sequential effects with feedback learning or the misperception of randomness. However, in the selection of job candidates, the accumulated moral impact of previous choices might also influence subsequent choices. We investigated that question by making two major contributions to the literature. First, we developed an experimental paradigm for measuring sequential choices in employee selection, and second, we implemented a widely applicable computational model, the Dependent Sequential Sampling Model, for explaining sequential effects in choices. Using this methodological approach, we uncovered sequential effects in employee selection. Participants (N=600) were especially motivated to compensate for morally dubious choices, with some participants showing consistent choice behavior if their previous choices had been morally virtuous. These results support the assumption of asymmetric compensation of morally dubious choices, sometimes referred to as the moral cleansing hypothesis.
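Sequential sampling models of the kind extended here build on drift-diffusion dynamics: evidence accumulates noisily until it crosses a decision boundary. A minimal single-trial sketch is shown below; the paper's Dependent Sequential Sampling Model additionally lets earlier choices shift the accumulation process, which this toy version omits:

```python
import random

def ddm_trial(drift, threshold=1.0, dt=0.01, noise=1.0, rng=random):
    """Simulate one drift-diffusion trial: evidence x accumulates at mean
    rate `drift` plus Gaussian noise until it crosses +/- threshold.
    Returns the chosen option (1 or 0) and the decision time."""
    x, t = 0.0, 0.0
    while abs(x) < threshold:
        x += drift * dt + noise * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        t += dt
    return (1 if x > 0 else 0), t
```

A larger drift models stronger preference for one candidate, producing faster and more consistent choices.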
- Published
- 2017
40. Forward planning driven by context-dependent conflict processing in anterior cingulate cortex
- Author
-
Florian Ott, Eric Legler, and Stefan J. Kiebel
- Subjects
Anterior cingulate cortex ,Forward planning ,Cognitive control ,Generalization ,Sequential decision making ,Metacontrol ,Neurosciences. Biological psychiatry. Neuropsychiatry ,RC321-571 - Abstract
Cognitive control, and forward planning in particular, is costly and must therefore be regulated such that the amount of cognitive resources invested is adequate to the current situation. However, knowing in advance how beneficial forward planning will be in a given situation is hard. A way to know the exact value of planning would be to actually do it, which would ab initio defeat the purpose of regulating planning, i.e. the reduction of computational and time costs. One possible solution to this dilemma is that planning is regulated by learned associations between stimuli and the expected demand for planning. Such learning might be based on generalisation processes that cluster together stimulus states with similar control-relevant properties into more general control contexts. In this way, the brain could infer the demand for planning based on previous experience with situations that share some structural properties with the current situation. Here, we used a novel sequential task to test the hypothesis that people use control contexts to efficiently regulate their forward planning, using behavioural and functional magnetic resonance imaging data. Consistent with our hypothesis, reaction times increased with trial-by-trial conflict, where this increase was more pronounced in a context with a learned high demand for planning. Similarly, we found that fMRI activity in the dorsal anterior cingulate cortex (dACC) increased with conflict, and this increase was more pronounced in a context with generally high demand for planning. Taken together, the results indicate that the dACC integrates representations of planning demand at different levels of abstraction to regulate planning in an efficient and situation-appropriate way.
- Published
- 2022
- Full Text
- View/download PDF
41. SWOAM: Swarm optimized agents for energy management in grid-interactive connected buildings.
- Author
-
Tungom, Chia E., Wang, Hong, Beata, Kamuya, and Niu, Ben
- Subjects
- *
ENERGY management , *MACHINE learning , *SWARM intelligence , *CARBON emissions , *RENEWABLE energy sources , *REINFORCEMENT learning - Abstract
With the increasing complexity of smart city energy systems and rising energy demand, effective energy management solutions are crucial. Buildings now incorporate renewable energy sources and battery storage for efficient energy utilization, making optimal control strategies important. Compared to rule-based controllers and model-based methods, swarm and evolutionary algorithms offer cost-effective, stable, and scalable alternatives. However, their potential in data-rich environments and multi-building energy systems remains underexplored. This research bridges this gap by demonstrating the cooperative capabilities of population-based optimization agents for efficient energy management. Specifically, a novel algorithm and framework, named Swarm Optimized Agents for Sequential Decision Making (SWOAM), is proposed. It combines an online k-means classifier and a swarm optimizer for optimal control of multi-building energy systems. By designing an online k-means learning strategy and an agent-based sequential episodic sampling approach, it is feasible to train a swarm of agents to find an optimal energy management policy for buildings in a district. For training and evaluation, we use the standardized CityLearn building energy management environment. The agents are evaluated on three key metrics: electricity cost, carbon dioxide emissions, and grid ramping. SWOAM delivers state-of-the-art performance and outperforms modern reinforcement learning and rule-based controllers. • Introduced a novel framework for integrating swarm intelligence into sequential decision-making processes. • Designed and implemented a hybrid algorithm combining machine learning techniques with swarm optimization for efficient energy management in grid-connected buildings.
• Conducted empirical evaluations demonstrating the effectiveness of swarm optimization in sequential decision-making tasks, achieving comparable performance to state-of-the-art reinforcement learning algorithms. [ABSTRACT FROM AUTHOR]
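The online k-means component can be sketched as a streaming centroid update: each new observation nudges its nearest centroid with a step size that shrinks as the cluster grows. This is a minimal illustration under assumed Euclidean features, not the authors' exact classifier:

```python
def online_kmeans_step(centroids, counts, x):
    """One streaming k-means update: assign x to the nearest centroid,
    then move that centroid toward x with step size 1/count (the standard
    sequential k-means learning rate)."""
    j = min(range(len(centroids)),
            key=lambda i: sum((xi - ci) ** 2
                              for xi, ci in zip(x, centroids[i])))
    counts[j] += 1
    eta = 1.0 / counts[j]
    centroids[j] = [ci + eta * (xi - ci) for ci, xi in zip(centroids[j], x)]
    return j  # cluster assignment for the incoming observation
```

In a SWOAM-like pipeline, the cluster assignment would route the current system state to the appropriate control agent.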
- Published
- 2024
- Full Text
- View/download PDF
42. Search Under Accumulated Pressure.
- Author
-
Alizamir, Saed, de Véricourt, Francis, and Sun, Peng
- Subjects
PARTIALLY observable Markov decision processes ,STATISTICAL decision making - Abstract
The paper "Search Under Accumulated Pressure" explores how time pressure in the form of task accumulation might affect the information-gathering process of a decision maker. Decision makers often need to balance the benefit of improving their belief about the best choice against the cost of acquiring more information. In many instances, the cost to acquire additional information depends on the size of the accumulated workload. The paper formulates this problem as a partially observable Markov decision process and characterizes the optimal control policy. The analysis reveals that the highest workload that the decision maker is willing to sustain may increase in the middle of the information search. This runs counter to the more intuitive argument that, as the search for information on a task progresses, the decision maker becomes less willing to sustain a high level of workload. Arrow et al. [Arrow K, Blackwell D, Girshick M (1949) Bayes and minimax solutions of sequential decision problems. Econometrica 17(3/4):213–244.] introduced the first sequential search problem "where at each stage the options available are to stop and take a definite action or to continue sampling for more information." We study how time pressure in the form of task accumulation may affect this decision problem. To that end, we consider a search problem where the decision maker (DM) faces a stream of random decision tasks to be treated one at a time that accumulate when not attended to. We formulate the problem of managing this form of pressure as a partially observable Markov decision process and characterize the corresponding optimal policy. We find that the DM needs to alleviate this pressure very differently depending on how the search on the current task has unfolded thus far. 
As the search progresses, the DM is less and less willing to sustain high levels of workload at the beginning and end of the search but actually increases the maximum workload that she is willing to handle in the middle of the process. The DM manages this workload first by making a priori decisions to release some accumulated tasks and later by aborting the current search and deciding based on her updated belief. This novel search strategy critically depends on the DM's prior belief about the tasks and stems, in part, from an effect related to decision ambivalence. These findings are robust to various extensions of our basic setup. [ABSTRACT FROM AUTHOR]
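The belief dynamics at the core of such a partially observable formulation can be illustrated with a standard Bayes update over a binary hypothesis, where each sampled signal refines the DM's belief about the task. This is a generic sketch, not the authors' full workload-dependent model:

```python
def bayes_update(belief, likelihood_pos, likelihood_neg, signal):
    """Posterior probability of hypothesis H after observing a binary
    signal, where likelihood_pos = P(signal=1 | H) and
    likelihood_neg = P(signal=1 | not H)."""
    like_h = likelihood_pos if signal else 1 - likelihood_pos
    like_not = likelihood_neg if signal else 1 - likelihood_neg
    num = belief * like_h
    return num / (num + (1 - belief) * like_not)
```

In the paper's setting, the stopping decision compares the value of continuing to sample (and updating this belief) against the workload cost of the accumulating task queue.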
- Published
- 2022
- Full Text
- View/download PDF
43. Multiplayer Bandits Without Observing Collision Information.
- Author
-
Lugosi, Gábor and Mehrabian, Abbas
- Subjects
MULTI-armed bandit problem (Probability theory) ,ROBBERS ,NASH equilibrium - Abstract
We study multiplayer stochastic multiarmed bandit problems in which the players cannot communicate, and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward. We consider two feedback models: a model in which the players can observe whether a collision has occurred and a more difficult setup in which no collision information is available. We give the first theoretical guarantees for the second model: an algorithm with a logarithmic regret and an algorithm with a square-root regret that does not depend on the gaps between the means. For the first model, we give the first square-root regret bounds that do not depend on the gaps. Building on these ideas, we also give an algorithm for reaching approximate Nash equilibria quickly in stochastic anticoordination games. [ABSTRACT FROM AUTHOR]
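The collision feedback model can be simulated directly: each player runs a bandit algorithm on its own observations, and simultaneous pulls of the same arm yield zero reward. The sketch below uses plain UCB1 per player purely to illustrate the reward model; it does not implement the paper's algorithms, which coordinate implicitly without observing collisions:

```python
import math, random

def ucb_multiplayer(means, n_players=2, horizon=2000, seed=0):
    """Toy simulation of the multiplayer collision model: each player runs
    UCB1 independently on Bernoulli arms; when two or more players pull
    the same arm in a round, all involved players receive zero reward."""
    rng = random.Random(seed)
    K = len(means)
    counts = [[0] * K for _ in range(n_players)]
    sums = [[0.0] * K for _ in range(n_players)]
    total_reward = 0.0
    for t in range(1, horizon + 1):
        pulls = []
        for p in range(n_players):
            if t <= K:
                arm = (t - 1 + p) % K  # staggered initialization round
            else:
                arm = max(range(K),
                          key=lambda a: sums[p][a] / counts[p][a]
                          + math.sqrt(2 * math.log(t) / counts[p][a]))
            pulls.append(arm)
        for p, arm in enumerate(pulls):
            collided = pulls.count(arm) > 1
            r = 0.0 if collided else (1.0 if rng.random() < means[arm] else 0.0)
            counts[p][arm] += 1
            sums[p][arm] += r
            total_reward += r
    return total_reward
```

Running this shows why naive independent learning is problematic: both players gravitate toward the best arm and collide, which is exactly the coordination problem the paper's algorithms address.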
- Published
- 2022
- Full Text
- View/download PDF
44. The practical value of structural health information for time dependence in bridge maintenance.
- Author
-
Larsson Ivanov, Oskar, Björnsson, Ivar, Honfi, Dániel, and Leander, John
- Subjects
- *
BRIDGE maintenance & repair , *DECISION making , *MAINTENANCE costs , *BRIDGES , *INFORMATION modeling , *DECISION theory , *CONCRETE bridges , *INFORMATION needs - Abstract
For practical decisions on common recurring maintenance actions, the information from routine inspections forms a decision basis for the bridge manager. It is often difficult to assess whether this information is sufficient for deciding on a repair action, or if more information is needed. For many bridges the information for supporting decisions may be limited, although those bridges cause large yearly maintenance costs for society. The purpose of the study presented in this paper is to show how two different models for decision making based on Bayesian decision theory, a point-in-time decision model and a sequential updating decision model, can be used to improve the decision-making process for common maintenance decisions. The models use information from routine inspections and incorporate time-dependent aspects such as material degradation and the time value of money to improve the decision-making process. The focus is on presenting the methodology with a case study of a concrete bridge in Sweden where the edge beams may have to be replaced. Three assessment approaches are considered: (i) no assessment, (ii) desktop evaluation, and (iii) measurements. The main finding is that sequential updating decision making provides a higher benefit than a point-in-time decision, and thus gives a higher Value of Information. This value becomes even higher when measurements are selected for the assessment. The results also show that the edge beams should be replaced. The general approach presented can be applicable to many decision scenarios related to maintenance of deteriorating structures. [ABSTRACT FROM AUTHOR]
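The Value of Information idea in such Bayesian decision models can be sketched with a two-action example: repair now, or do nothing and risk failure. The costs and the perfect-information simplification below are hypothetical illustrations, not the paper's degradation model:

```python
def expected_cost(p_deteriorated, action, repair_cost=1.0, failure_cost=10.0):
    """Expected cost of acting now: repairing always costs repair_cost,
    while doing nothing incurs failure_cost with the deterioration
    probability."""
    return repair_cost if action == "repair" else p_deteriorated * failure_cost

def value_of_perfect_information(p, repair_cost=1.0, failure_cost=10.0):
    """VoPI = expected cost of the best a-priori action minus the expected
    cost when the true condition is revealed before deciding (repair only
    if actually deteriorated)."""
    prior_cost = min(repair_cost, p * failure_cost)
    posterior_cost = p * repair_cost + (1 - p) * 0.0
    return prior_cost - posterior_cost
```

An assessment (desktop evaluation or measurement) is worth commissioning only if its value of information exceeds its cost; sequential updating, as the paper shows, extracts more of this value than a single point-in-time decision.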
- Published
- 2022
- Full Text
- View/download PDF
45. Best-Arm Identification Using Extreme Value Theory Estimates of the CVaR.
- Author
-
Troop, Dylan, Godin, Frédéric, and Yu, Jia Yuan
- Abstract
We consider a risk-aware multi-armed bandit framework with the goal of avoiding catastrophic risk. Such a framework has multiple applications in financial risk management. We introduce a new conditional value-at-risk (CVaR) estimation procedure combining extreme value theory with automated threshold selection by ordered goodness-of-fit tests, and we apply this procedure to a pure exploration best-arm identification problem under a fixed budget. We empirically compare our results with the commonly used sample average estimator of the CVaR, and we show a significant performance improvement when the underlying arm distributions are heavy-tailed. [ABSTRACT FROM AUTHOR]
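The sample-average CVaR estimator used as the baseline comparison can be written in a few lines: it averages the losses at or beyond the empirical alpha-quantile. This sketch shows only the baseline; the paper's estimator instead fits an extreme-value (generalized Pareto) tail above an automatically selected threshold, which matters when arms are heavy-tailed:

```python
def cvar_sample_average(losses, alpha=0.95):
    """Empirical CVaR at level alpha: the mean of the losses that lie at
    or beyond the empirical alpha-quantile (the VaR)."""
    xs = sorted(losses)
    k = int(len(xs) * alpha)
    tail = xs[k:]
    return sum(tail) / len(tail)
```

In a fixed-budget best-arm identification loop, each arm's CVaR estimate would be recomputed from its observed samples and the arm with the lowest estimated CVaR recommended.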
- Published
- 2022
- Full Text
- View/download PDF
46. A Multi-armed Bandit Algorithm Available in Stationary or Non-stationary Environments Using Self-organizing Maps
- Author
-
Manome, Nobuhito, Shinohara, Shuji, Suzuki, Kouta, Tomonaga, Kosuke, Mitsuyoshi, Shunji, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Tetko, Igor V., editor, Kůrková, Věra, editor, Karpov, Pavel, editor, and Theis, Fabian, editor
- Published
- 2019
- Full Text
- View/download PDF
47. Multi-Step Look-Ahead Optimization Methods for Dynamic Pricing With Demand Learning
- Author
-
Dina Elreedy, Amir F. Atiya, and Samir I. Shaheen
- Subjects
Demand learning ,dynamic pricing ,optimization ,revenue management ,sequential decision making ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Dynamic pricing is a beneficial strategy for firms seeking to achieve high revenues. It has been widely applied to various domains such as the airline industry, the hotel industry, and e-services. Dynamic pricing is basically the problem of setting time-varying prices for a certain product or service for the purpose of optimizing revenue. However, a major challenge encountered when applying dynamic pricing is the lack of knowledge of the demand-price curve, which defines the customer's response to price changes. In this work, we address the dynamic pricing problem in the case of an unknown demand-price relation. This work introduces a less myopic pricing approach based on looking ahead one or several future steps in our quest to optimize revenue. Specifically, the proposed formulation maximizes the sum of the immediate revenue and the expected future revenues of one or multiple look-ahead steps. A key benefit of the proposed approach is that it automatically strikes a balance between the conflicting goals of revenue maximization and demand learning by producing less myopic and more profitable prices. We provide a formulation for the presented look-ahead pricing approach, and we implement two variants of it: one-step and two-step look-ahead methods. Experiments are conducted on synthetic and real datasets to compare the proposed pricing methods to other pricing strategies in the literature. The experimental results indicate that the proposed look-ahead methods outperform their counterparts in terms of the achieved revenue gain.
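The myopic baseline that look-ahead pricing improves upon can be sketched concretely: fit a linear demand curve to observed (price, demand) pairs and price greedily against the fit. The linear demand form is an assumption for illustration; the paper's methods add the expected future revenues of one or more look-ahead steps to this objective, which rewards prices that are also informative about demand:

```python
def fit_linear_demand(prices, demands):
    """Least-squares fit of demand = a - b * price from observed pairs."""
    n = len(prices)
    mp = sum(prices) / n
    md = sum(demands) / n
    cov = sum((p - mp) * (d - md) for p, d in zip(prices, demands))
    var = sum((p - mp) ** 2 for p in prices)
    b = -cov / var  # slope is negative for a downward-sloping demand curve
    a = md + b * mp
    return a, b

def myopic_price(a, b):
    """Revenue p * (a - b * p) is a concave quadratic in p, maximized
    at p = a / (2b)."""
    return a / (2 * b)
```

For example, observations generated by demand = 10 - 2p recover a = 10, b = 2, and a greedy price of 2.5; a look-ahead variant would perturb this price when the expected value of the extra demand information outweighs the immediate revenue loss.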
- Published
- 2021
- Full Text
- View/download PDF
48. Non-asymptotic Properties of Individualized Treatment Rules from Sequentially Rule-Adaptive Trials.
- Author
-
DAIQI GAO, Yufeng Liu, and Donglin Zeng
- Subjects
- *
UBIQUINONES , *STATISTICAL learning , *MACHINE learning , *STATISTICAL decision making , *RANDOMIZED controlled trials , *INDIVIDUALIZED medicine , *EMPIRICAL research - Abstract
Learning optimal individualized treatment rules (ITRs) has become increasingly important in the modern era of precision medicine. Many statistical and machine learning methods for learning optimal ITRs have been developed in the literature. However, most existing methods are based on data collected from traditional randomized controlled trials and thus cannot take advantage of the accumulating evidence as patients enter the trials sequentially. It is also ethically important that future patients have a high probability of being treated optimally based on the knowledge updated so far. In this work, we propose a new design called sequentially rule-adaptive trials to learn optimal ITRs based on the contextual bandit framework, in contrast to the response-adaptive design in traditional adaptive trials. In our design, each entering patient will be allocated with high probability to the current best treatment for this patient, which is estimated using the past data based on some machine learning algorithm (for example, outcome weighted learning in our implementation). We explore the tradeoff between training and test values of the estimated ITR in single-stage problems by proving theoretically that for a higher probability of following the estimated ITR, the training value converges to the optimal value at a faster rate, while the test value converges at a slower rate. This problem differs from traditional decision problems in that the training data are generated sequentially and are dependent. We also develop a tool that combines martingale theory with empirical process techniques to tackle problems that cannot be solved by previous methods for i.i.d. data. We show by numerical examples that, without much loss of the test value, our proposed algorithm can improve the training value significantly compared with existing methods. Finally, we use a real data study to illustrate the performance of the proposed method. [ABSTRACT FROM AUTHOR]
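The allocation rule driving the training/test tradeoff can be sketched for a binary treatment: follow the currently estimated-best treatment with a chosen probability, otherwise assign the other arm. The fitted rule is abstracted here as a plain function (the paper estimates it with outcome weighted learning); the scheme and parameter names are illustrative:

```python
import random

def allocate(patient_x, fitted_itr, p_follow=0.9, rng=random):
    """Sequentially rule-adaptive allocation for binary treatments {0, 1}:
    assign the estimated-best treatment for this patient with probability
    p_follow, otherwise assign the other arm. A larger p_follow raises
    the training value (patients in the trial) at the cost of slower
    test-value convergence (future patients)."""
    best = fitted_itr(patient_x)
    if rng.random() < p_follow:
        return best
    return 1 - best
```

After each allocation and observed outcome, the ITR would be refit on the accumulated (dependent) data before the next patient enters.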
- Published
- 2022
49. Proximal Policy Optimization for Radiation Source Search.
- Author
-
Proctor, Philippe, Teuscher, Christof, Hecht, Adam, and Osiński, Marek
- Subjects
- *
RADIATION , *NUCLEAR fusion , *NUCLEAR reactors , *NUCLEAR reactions , *DECISION making , *ARTIFICIAL neural networks - Abstract
Rapid search and localization of nuclear sources can be an important aspect in preventing human harm from illicit material in dirty bombs or from contamination. In the case of a single mobile radiation detector, there are numerous challenges to overcome, such as weak source intensity, multiple sources, background radiation, and the presence of obstructions, i.e., a non-convex environment. In this work, we investigate the sequential decision making capability of deep reinforcement learning in the nuclear source search context. A novel neural network architecture (RAD-A2C) based on the advantage actor critic (A2C) framework and a particle filter gated recurrent unit for localization is proposed. Performance is studied in a randomized 20 × 20 m convex and non-convex simulation environment across a range of signal-to-noise ratios (SNRs) for a single detector and single source. RAD-A2C performance is compared to both an information-driven controller that uses a bootstrap particle filter and a gradient search (GS) algorithm. We find that RAD-A2C has comparable performance to the information-driven controller across SNRs in a convex environment. RAD-A2C far outperforms the GS algorithm in the non-convex environment, with a greater than 95% median completion rate for up to seven obstructions. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
50. Popularity, novelty and relevance in point of interest recommendation: an experimental analysis.
- Author
-
Massimo, David and Ricci, Francesco
- Subjects
RECOMMENDER systems ,REINFORCEMENT learning ,POPULARITY - Abstract
Recommender Systems (RSs) are often assessed in off-line settings by measuring the system's precision in predicting the observed user's ratings or choices. But when a precise RS is on-line, the generated recommendations can be perceived as marginally useful because they lack novelty. The underlying problem is that it is hard to build an RS that can correctly generalise from the analysis of the user's observed behaviour and identify the essential characteristics of novel and yet relevant recommendations. In this paper we address the above-mentioned issue by considering four RSs that try to excel on different target criteria: precision, relevance and novelty. Two state-of-the-art RSs called SKNN and s-SKNN follow a classical Nearest Neighbour approach, while the other two, Q-BASE and Q-POP PUSH, are based on Inverse Reinforcement Learning. SKNN and s-SKNN optimise precision, Q-BASE tries to identify the characteristics of POIs that make them relevant, and Q-POP PUSH, a novel RS introduced here, is similar to Q-BASE but also tries to recommend popular POIs. In an off-line experiment we discover that the recommendations produced by SKNN and s-SKNN optimise precision essentially by recommending quite popular POIs. Q-POP PUSH can be tuned to achieve a desired level of precision at the cost of losing part of Q-BASE's best capability: generating novel and yet relevant recommendations. In the on-line study we discover that the recommendations of SKNN and Q-POP PUSH are liked more than those produced by Q-BASE. The rationale for that was found in the large percentage of novel recommendations produced by Q-BASE, which are difficult to appreciate. However, Q-BASE excels in recommending items that are both novel and liked by the users. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF