Descriptor: "HA Statistics" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"HA Statistics"' showing total 4,098 results


1. Network and school variations in adolescents' health behaviour and educational attainment : a multilevel analysis of US data

Author: Gerogiannis, George
Subjects: HA Statistics, HM Sociology, L Education (General), RA Public aspects of medicine
Abstract: This thesis develops a statistical methodology for an important area of social network modelling, that of the effects that an individual's social network can have on the individual's propensity to engage in an array of different acts, which has been a public health concern in many societies and is increasingly becoming important in the commercial world as access to such data is becoming increasingly available and can be used to maximise profits. The majority of studies that investigate this phenomenon estimate the fixed effects of network statistics on an individual's propensity to engage in a certain behaviour and are based on network health data. The process thought to generate this phenomenon is typically modelled with a univariate Bernoulli generalised linear model, which simplifies the network component present in the process by summarising it with statistics, a procedure which induces a loss of information. Over the past 20 years, statistical methodology has been developed to remedy this issue with the use of a Bernoulli generalised linear mixed model which explicitly accounts for the network components by modelling them as random effects. The work presented in this thesis provides several novel contributions to these approaches. The first is the development of a multivariate model that extends the multiple membership multiple classification model proposed by Browne et al. (2001). The second is the development of a multivariate model that considers a spatio-network interaction involving the sets of spatial and network random effects, as it may be of interest to study whether friendship effects differ depending on the areal unit in which an individual lives. The third concerns the development of a software package that will enable researchers to implement the models developed in this thesis.
These novel contributions are achieved through the use of Bayesian hierarchical models with estimation performed with Markov chain Monte Carlo simulation.
Published: 2023
Full Text: View/download PDF

2. Exploring the effect of stepwise-multiple-round bidding on willingness-to-pay in contingent valuation study : when can we trust respondents' preferences?

Author: Wang, Yixin
Subjects: H Social Sciences (General), HA Statistics
Abstract: The contingent valuation method (CVM) has been widely used by economists and statisticians to measure the benefits of non-market goods or services since the 1990s. The framework for CVM is derived from the utility function of welfare economics. CVM asks people directly about their willingness to pay (WTP) for the value of specific goods or services, or willingness to accept to give up the value of goods or services. China has achieved rapid economic growth over the past three decades; during this period, however, it has also faced serious environmental challenges and problems, for example, air pollution. China and the United States are jointly responsible for 40% of the world's carbon emissions. This thesis analyses people's concerns about environmental issues using CVM. Specifically, we implement the classical maximum likelihood estimation (MLE) method and the Markov chain Monte Carlo (MCMC) simulation-based method to evaluate people's willingness to pay (WTP) for the improvement of environmental quality via support of a "geo-engineering" project. The data used in this thesis were collected through face-to-face interviewing in four cities in China: Harbin (northeast and inland), Zhengzhou (north and inland), Changsha (central-south and inland) and Zhuhai (southeast and coastal). We interviewed 1,044 participants, asked them to answer a CVM survey questionnaire and collected their responses. The CVM questionnaire included six aspects of information that could affect respondents' WTP. In addition to the socio-demographics related to the respondents' preferences, e.g. gender, age and household income, we also asked about respondents' health conditions, social connections, and awareness of political issues, governmental support, and the risks of human activities to the environment.
Using this sample, we initially employed the step-wise and logistic regression models to identify the significant factors, then we applied the classical MLE to model the single- and double-bounded CVM answers and the WTP values, and expanded the modelling procedures to multiple-round bidding processes. Furthermore, we used MCMC to analyse the mean WTP values through multiple rounds of the bidding process. MLE results suggested that the fitted mean WTP values from single-, double- and triple-bounded MLE models were CNY816.56, CNY565.79 and CNY539.27, respectively. The gap between the single- and the double-bounded estimates showed that the WTP estimates from the commonly used single-bounded approach could lead to unreliable results. We also discovered that the more respondents believed that they gained benefit from the "geo-engineering" project, the more they were willing to agree to the given bid, and were likely to pay a greater price; the more respondents were prepared to spend on pollution reduction products, which equated to their awareness of the harm of pollution, the more they were willing to pay. Results also supported that being admitted to hospital was positively related to the value of WTP; being interested in news and public affairs had a negative effect on mean WTP. On the other hand, the estimated mean WTP values from the MCMC approach for the single-, double- and triple-bounded models were CNY810.82, CNY566.10 and CNY510.22, respectively, largely consistent with the results in MLE. MCMC improved WTP models because it produced more significant variables and narrower confidence intervals.
Published: 2023

3. The integration of immigrants and their descendants across Europe : a multidimensional overview

Author: Strain-Fajth, Veronika
Subjects: HA Statistics, HM Sociology, HT Communities. Classes. Races, JV Colonies and colonization. Emigration and immigration. International migration
Abstract: This thesis aims to strengthen the state of European-level knowledge on the integration and well-being of immigrants and their descendants via helping to achieve a more comprehensive understanding of four key areas: (1) the concept of integration; (2) the multidimensionality of integration; (3) the relevance of immigrant parentage for the second generation; and (4) the role of the host country context in immigrant integration patterns. I address each of these four main knowledge gaps via four individual yet interconnected analyses. These consist, respectively, of a wide-ranging conceptual review and three quantitative studies using recent cross-European data (European Social Survey, 2012-2018) for a set of multidimensional analyses (observing indicators of economic, political, and social inclusion, as well as health and well-being). The first study examines the multidimensionality of integration via a factor analysis of 18 integration-related outcomes for first- and second-generation racial/ethnic minority immigrants across Europe (ESS7; N=1,066). The second study compares outcomes of second-generation immigrants and native-parentage natives along multiple dimensions, with a systematic analysis of parental class background, gender, and ethnic/racial minority status both alongside and intersected with migration background (ESS 6-9, N=130,117). The third study explores the linkages between individual immigrants' outcomes and host country's macro-level characteristics through a wide range of models (ESS 6-9, N1=9,175, N2=72 country-year contexts).
Findings overall highlight, first, the complexity of integration concepts; second, the variation in integration outcomes across different dimensions; third, the continued and complex associations of immigrant parentage, including a relative second-generation advantage within otherwise vulnerable groups; and fourth, the relevance of several host country features for immigrant integration, including economic conditions, attitudes towards immigrants, and migrant integration policies. The thesis thus makes several original contributions helping to broaden the body of empirical evidence on first- and second-generation multidimensional integration in Europe, along with some informative conceptual and methodological insights. With its cross-European, multidimensional, wide-ranging scope, the thesis helps fill conceptual and empirical gaps and inconsistencies within a so far rich but fragmented body of integration literature, thus helping to advance the field towards a broader, more comprehensive understanding of immigrant integration at a European level.
Published: 2023

4. A distributed and real-time machine learning framework for smart meter big data

Author: Dai, Shuang
Subjects: HA Statistics, QA75 Electronic computers. Computer science
Abstract: The advanced metering infrastructure allows smart meters to collect high-resolution consumption data, thereby enabling consumers and utilities to understand their energy usage at different levels, which has led to numerous smart grid applications. Smart meter data, however, poses different challenges to developing machine learning frameworks than classic theoretical frameworks due to their big data features and privacy limitations. Therefore, in this work, we aim to address the challenges of building machine learning frameworks for smart meter big data. Specifically, our work includes three parts: 1) We first analyze and compare different learning algorithms for multi-level smart meter big data. A daily activity pattern recognition model has been developed based on non-intrusive load monitoring for appliance-level smart meter data. Then, a consensus-based load profiling and forecasting system has been proposed for individual building level and higher aggregated level smart meter data analysis; 2) Following discussion of multi-level smart meter data analysis from an offline perspective, a universal online functional analysis model has been proposed for multi-level real-time smart meter big data analysis. The proposed model consists of a multi-scale load dynamic profiling unit based on functional clustering and a multi-scale online load forecasting unit based on functional deep neural networks. The two units enable online tracking of the dynamic cluster trajectories and online forecasting of daily multi-scale demand; 3) To enable smart meter data analysis in the distributed environment, FederatedNILM was proposed, which is then combined with differential privacy to provide privacy guarantees for the appliance-level distributed machine learning framework. 
Based on federated deep learning enhanced with two schemes, namely the utility optimization scheme and the privacy-preserving scheme, the proposed distributed and privacy-preserving machine learning framework enables electric utilities and service providers to offer smart meter services on a large scale.
Published: 2023

5. Drawing exact samples : rejection sampling, density fusion and constrained disaggregation

Author: Hu, Shenggang
Subjects: HA Statistics
Abstract: Sampling is an important topic in the area of computational statistics. Being able to draw samples from a designated distribution allows one to numerically compute various statistics without the need to solve for solutions analytically. A popular branch of sampling methods generates samples by evolving a Markov chain that admits the target distribution as its stationary distribution. The problem, however, is that one does not have a universal criterion to assess whether the chain is stationary. On the other hand, exact simulation methods, being the focus of this thesis, always produce samples that precisely follow the target distribution. We begin with path-space rejection sampling for the exact simulation of diffusion bridges and show how this rejection scheme can be further set up into an exact simulation method for sampling product densities. We provide guidance on how to tune the algorithm parameters in order to attain near-optimal performance and introduce the construction of an importance sampler/particle filter based on the same theoretical result for better efficiency. Finally, we show a variant of the sampler that deals with linear constraints which render most of the target distributions intractable. Two application studies are conducted at the end to demonstrate the effectiveness of the algorithm.
Published: 2023
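
Note: as a minimal illustration of what "exact" sampling by rejection means, the Python sketch below draws samples from a Beta(2, 2) target using a Uniform(0, 1) proposal. It is a generic textbook example, not the path-space or density-fusion algorithm described in the abstract; the target, the proposal, the bound of 1.5 and all function names are assumptions chosen for illustration.

```python
import random


def target_density(x):
    """Beta(2, 2) density on [0, 1]: f(x) = 6 x (1 - x). Illustrative target only."""
    return 6.0 * x * (1.0 - x)


def rejection_sample(n, bound=1.5, seed=0):
    """Draw n samples from target_density with a Uniform(0, 1) proposal.

    `bound` must satisfy f(x) <= bound * g(x) for every x, where g is the
    proposal density (g = 1 here). The target peaks at f(0.5) = 1.5.
    """
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        x = rng.random()  # candidate drawn from the proposal
        u = rng.random()  # uniform variate for the accept/reject step
        if u * bound <= target_density(x):
            samples.append(x)  # accepted draws follow the target exactly
    return samples


samples = rejection_sample(10_000)
sample_mean = sum(samples) / len(samples)
```

Accepted draws are distributed according to the target without any burn-in or convergence diagnostic, which is the property the abstract contrasts with Markov chain methods. The price is the acceptance rate, here 1/1.5, which falls as the bound gets looser.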

6. Essays on asynchronous time series and related multidimensional data

Author: Pellegrino, Filippo
Subjects: HA Statistics, QA Mathematics
Abstract: This thesis focusses on asynchronous time series and related multidimensional data: time-dependent measurements with varying publication delays. This class of data exists in a broad range of fields. In social sciences, most official time series and repeated surveys are indeed asynchronous in nature since statistical offices need time to collect and aggregate raw data. In STEM, statistical offices are generally less relevant and most publication delays are caused by more exotic factors. For instance, with series derived from technological networks, they are usually generated by a direct reference (digital or textual) of the past (e.g., publishing pictures of a trip done a week ago that was also photographed and posted in real time by a friend). As a result, the study of data releases is key for developing accurate real-time models and finds applications in forecasting, policy and risk management.
Published: 2022
Full Text: View/download PDF

7. Three essays on climate change and child welfare in sub-Saharan Africa

Author: Okpala, Chifumnanya Ngozi
Subjects: H Social Sciences (General), HA Statistics, HB Economic Theory
Abstract: This thesis presents three empirical essays that examine important aspects of child development and investigate the role of climate change in putting millions of children at risk in disadvantaged regions and in thus shaping their overall welfare and development. In the first essay, we examine the impact of extreme weather events arising from climate change (droughts) on child educational outcomes in Ethiopia. Overall, the findings from our child fixed-effect model point to the fact that children suffer greatly in terms of their educational outcomes when exposed to droughts. Our results also suggest that boys (in terms of their cognitive ability), younger children, and children from less educated households are the most vulnerable to the adverse impacts of droughts on educational outcomes. In the second essay, we combine satellite PM2.5 data with individual-level data to examine the impact of in-utero air pollution on child health outcomes in Ethiopia. Employing instrumental variable regression with wind speed as an instrument, we find mild evidence for the harmful effects of air pollution on child health. We show that within our preferred model specification, which incorporates monthly adjustments for seasonality in our pollution variable, exposure to ambient air pollution has little to no effect on child health. Whilst we find no significant impact on our child stunting measure (height-for-age), we find weak effects on our wasting measure, with children exposed to PM2.5 during the first trimester being smaller on average and weighing less than their peers of the same age and gender not exposed to polluted air. Our study also finds mild evidence for the existence of heterogeneity in the impact of air pollution on child health in our sample. In the final essay, we investigate the impact of extreme weather events (droughts) on child marriage and fertility outcomes for young girls in Kenya.
The findings from this paper provide evidence for the adverse effects of droughts on child marriage. The findings also show child fertility outcomes to be adversely impacted by droughts. Finally, our findings show that girls living in rural households with lower levels of income are more susceptible to the adverse effects of droughts on marriage and fertility.
Published: 2022

8. Faster socioeconomic indicators using novel data sources

Author: Miller, Sam
Subjects: H Social Sciences (General), HA Statistics, HB Economic Theory, QA Mathematics, QA76 Electronic computers. Computer science. Computer software
Abstract: Policymakers require up-to-date statistics to make good decisions. Most official statistics in economics and public health are released only after a significant delay. The goal of "nowcasting" (a combination of the words "now" and "forecasting") is to estimate these statistics before their official release. Better nowcasts would help policymakers respond to rapidly developing crises in a range of domains. The recent Covid-19 crisis has highlighted this issue: it is extremely challenging to make decisions in crisis without knowledge of either the current state of the economy or the incidence of disease in the population. Recent technological advances mean we now generate real-time data simply by going about our lives. This thesis shows how we can use novel data sources to improve nowcasts in economics and public health. We highlight how we are no longer constrained to traditional data sources, such as surveys. We first investigate whether high-frequency aircraft location data can generate faster GDP estimates. We also show that this dataset can help improve estimates of airport performance, particularly at the onset of the Covid-19 crisis. We next use a novel combination of Wikipedia page views and data scraped from online "darknet" drug markets to nowcast illicit drug demand. Better statistics on drug markets would be highly valuable to policymakers in both economics and public health. Finally, we use data from Google Trends to nowcast the incidence of chikungunya in Rio de Janeiro. Official disease data is delivered with long and variable delays. We show that including real-time Google Trends data allows earlier detection of epidemics. This thesis finds evidence that novel data sources can improve the speed and accuracy of official statistics in a range of domains. As the variety of novel data sources keeps growing, these may give policymakers more complete, real-time information when making crucial decisions.
Published: 2022

9. Health reform, public health, and public hospitals in China

Author: Zhu, Jingmin
Subjects: H Social Sciences (General), HA Statistics
Abstract: This thesis consists of three empirical studies on health reform, public health and public hospitals in China. The first study evaluates the health impact of a major public health programme implemented during the 2009-2011 New Health Reform in China. Using panel data from the China Health and Nutrition Survey from 1991 to 2015, Difference-in-Differences estimation and Propensity Score Matching methods are combined to identify a causal effect of the rural water improvement programme on adults' health. It is suggested that the rural water improvement programme is marginally effective in improving adults' health mainly through promoting water quality. Water improvement in rural areas has resulted in a reduction of diseases including diarrhea and stomachache in adults. There is no significant health inequality associated with individual education and household income. Public health equalization is achieved. The second study focuses on a basic public health programme implemented during the new health reform. It investigates the extent to which prenatal exposure to the new health reform improves children's health and the extent to which prenatal care utilisation can explain the resulting effects upon health. It is suggested that children's height-for-age z score is significantly higher in 2011 following prenatal exposure to the new health reform. However, the risk of stunting is not affected. Having at least five prenatal examinations is a plausible explanation for the positive effect of the new health reform on children's health. In addition, children in households with a better long-term wealth status experience significantly greater improvements in their health. The third study applies stochastic frontier analysis to estimate the technical efficiency of Chinese public hospitals over the period 2009-2018. The existence of technical inefficiency is supported. The average technical efficiency is about 0.73.
County-level public hospitals are found to be more efficient than city-level public hospitals. By contrast, tertiary public hospitals are not always more efficient than secondary public hospitals. Furthermore, the role of government subsidies in hospitals' total income is examined. It is reported that government subsidies have a negative effect on hospitals' technical efficiency.
Published: 2022
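
Note: the difference-in-differences idea named in the abstract can be illustrated with the basic two-group, two-period calculation below. This is only a sketch of the textbook estimator; the thesis combines it with propensity score matching on survey panel data, which is not reproduced here. The function name and the health scores are made up for illustration.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences on group means.

    Returns (change in treated mean) minus (change in control mean), which
    identifies the programme effect only under the parallel-trends assumption.
    """
    def mean(values):
        return sum(values) / len(values)

    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical health scores before and after a programme (not real data).
effect = did_estimate(
    treated_pre=[3.0, 3.2, 2.8],
    treated_post=[3.6, 3.4, 3.8],
    control_pre=[3.1, 2.9, 3.0],
    control_post=[3.2, 3.3, 3.1],
)
```

With these invented numbers the treated mean rises by 0.6 and the control mean by 0.2, so the estimate is about 0.4. Subtracting the control change is what removes the shared time trend.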

10. Essays on the effectiveness of air pollution control policies in China

Author: Liu, Bowen
Subjects: H Social Sciences (General), HA Statistics
Abstract: Air pollution has become an increasingly major concern across the world and is considered by many to be the single largest environmental risk that humans face today. China's rapid economic development in recent decades was accompanied by a serious deterioration in air quality. In response to the quality of the air and its potentially damaging health impacts, the Chinese government has employed a range of different strategies and policies to mitigate air pollution problems. As a result, it is of great interest and importance to policymakers and the general public to understand the efficacy of such policies. The primary objective of this thesis is to investigate the effect of selected Chinese government policies on air quality. The thesis contributes to the environmental economics literature by employing a set of cutting-edge interdisciplinary research methods from economics, atmospheric science, and computer science. By combining methods from different disciplines we are able to obtain results that are grounded in both science and social science so we are able to contribute to the literature across each of these disciplines. This thesis consists of six chapters. Chapter 1 provides a brief introduction, chapters 2, 3, 4 and 5 present the main body of the thesis and can be considered as four independent papers. Finally, chapter 6 concludes. Chapter 2 investigates the effectiveness of China's Environmental Protection Inspection program that started in 2016 and ran through to 2017, a policy unparalleled in terms of its authority, scope and stringency. Chapter 3 quantifies the impact of the Wuhan Covid-19 lockdown on concentrations of four air pollutants using a two-step approach. Chapter 4 returns to the recent Central Environmental Inspection Policy (CEIP) and examines how effective it was in reducing local air pollution at the regional and city level.
Methodologically, we combine weather normalisation techniques (a random forest-based machine learning model) from atmospheric sciences and the Augmented synthetic control method (ASCM) to provide a causal estimate of the impact of a province having a CEIP inspection on local air quality. Chapter 5 estimates the impact of the winter heating policy and the effectiveness of the clean winter heating plan on air quality in Chinese capital cities. After decoupling the impact of meteorological conditions on observed air pollutant concentrations, this chapter finds that, during the winter period in 2015 and 2016, the turning on of the winter heating system immediately increased the air pollution level across northern Chinese cities while the turning off of the winter heating system led to a sudden drop in air pollution.
Published: 2022

11. Essays on econometric methods

Author: Polselli, Annalivia
Subjects: H Social Sciences (General), HA Statistics
Abstract: This thesis consists of three chapters on econometric methods. In Chapter 1, I investigate the consequences of the simultaneous presence of small sample size, leveraged data, and heteroskedastic disturbances on the validity of the statistical inference in linear panel data models. I formalise the panel versions of two jackknife-type estimators and propose a new hybrid estimator. I derive their asymptotic distributions and analyse their finite sample properties with Monte Carlo simulations. I find that test statistics obtained with conventional robust standard errors are over-sized, upward biased, and with less power under heteroskedasticity and with good leveraged data and in small samples. In Chapter 2, I develop diagnostic methods for panel data to detect three types of anomalous units. I formalise statistical measures for quantifying the degree of leverage and outlyingness of units, and develop a method to visually detect the type of anomaly and its effect on other units. I use network analysis tools to show the total and bilateral influence. I then apply my method to four cross-country data sets used in published articles. Chapter 3 investigates the effect of gender sectoral segregation on employment contracts (part-time, permanent, remote work, number of weekly working hours) and hourly wages for both men and women. We use propensity score matching, the Kitagawa-Blinder-Oaxaca decomposition and Mincerian wage regressions to analyse the contribution of observable and unobservable factors on labour outcomes. We find that contractual features systematically chosen by a specific gender are more common in sectors dominated by that group and for both genders. Workers employed in female-dominated sectors are on average paid less but most of the gap is explained by the coefficient effect rather than differences in endowments in both gender dominated sectors. 
Women self-select into low-paid jobs where their skills are valued less, especially in female-dominated sectors.
Published: 2022

12. Investigating uncertainty and emulating process-based models with multivariate outputs, applied to aquaculture

Author: Currie, Michael
Subjects: HA Statistics
Abstract: There remain environmental challenges which can only accurately be assessed by process-based modelling. An example of this is the monitoring of the environmental impacts of aquaculture, where the logistical difficulty and cost of collecting data over large areas make mathematical modelling the more effective approach. Such approaches are computationally intensive and do not account for uncertainty. NewDEPOMOD is an example of a process-based model that is used within aquaculture to model the environmental impacts of aquaculture. This thesis provides an in-depth investigation of uncertainty in such a model using sensitivity analysis, and proposes a novel statistical emulation framework to approximate the output from NewDEPOMOD, reducing the computational cost. NewDEPOMOD is a complex mathematical model that was developed in order to estimate and predict the transportation of waste particles from fish farms to their deposition on the seabed. It features a number of different types of input, representing features such as the fish farm physical structure, flow speeds and waste transportation properties. In addition, the output produced by NewDEPOMOD provides a measure known as Solids Flux in grid cells across the domain, representing the environmental impact. This can be visualised as either a univariate or multivariate output. The univariate outputs produced by NewDEPOMOD are the Total Area Impacted and 99th Percentile of Solids Flux which provide a measure of the size and intensity of the impact on the seabed. In collaboration with the Scottish Environment Protection Agency ('SEPA'), with application to fish farm sites around the coast of Scotland, a set of inputs were identified as being of most importance for investigating the effect of their uncertainty on the NewDEPOMOD outputs. In this thesis, sensitivity analyses are conducted at multiple fish farm sites, classed as high and low energy based on their flow speeds, using random forest models.
Random forest models are proposed as they are flexible, efficient, and the importance values produced by the models can be used to rank the inputs based on their influence on the output data. To assess the impact of changing the input values on the output maps produced by NewDEPOMOD, traditional univariate sensitivity analysis techniques are expanded here to develop novel sensitivity analysis methods for considering multivariate model outputs. Three different approaches to investigating the output maps are considered: 1) shape analysis based on a landmark approach for identifying the main shape of the impact, 2) bivariate functional analysis where the output maps are considered as smooth surfaces, and 3) grid cell approach where the Solids Flux in each grid cell is considered individually. The performance of each approach was considered individually before developing a framework, using a subset of the approaches, that could be applied to multiple sites to assess parameter uncertainty, and hence the impact of altering the inputs on the output maps. The application of statistical emulation to model the univariate outputs from NewDEPOMOD reducing the computational cost is a novel approach. The methods proposed for the emulation are random forests and Gaussian processes which both provide flexibility and allow for fast predictions for new data in comparison to the time taken to run NewDEPOMOD. For each site, training data will be used to fit the emulation models for each approach before using a test set of data to assess their predictive performance. Root Mean Squared Error ('RMSE') and the Mean Absolute Error ('MAE') are both considered as measures of how well the emulators perform and allow for comparisons to be made between the approaches. Further investigation assesses the suitability of a single emulator to be used at all sites, or whether the emulators should be built individually for each site.
In practice, correlated outputs are more realistic in such a scenario and hence the emulation framework for the univariate outputs is expanded to consider the univariate outputs together as a correlated multivariate output. Extensions to the random forest and Gaussian process models are proposed which account for correlation between the outputs. The predictive performance for both approaches can be reviewed using RMSE and MAE to determine if there are improvements when modelling the univariate outputs together as a correlated output. This research provides a deeper understanding of NewDEPOMOD through the development of novel sensitivity analysis and emulation tools for computationally efficient analyses of data on the impact of fish farms. Remarks on the approaches used and their results are provided throughout this thesis, along with potential future extensions to the research.
Published: 2022
Full Text: View/download PDF
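
Note: the two error measures the abstract uses to compare emulators, RMSE and MAE, can be computed as in the short Python sketch below. The held-out and emulated values are invented placeholders, not NewDEPOMOD output, and the function names are assumptions made for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalises large misses more heavily."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in the output's units."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Placeholder test-set values for one univariate output (not real model runs).
held_out = [2.0, 4.0, 6.0, 8.0]
emulated = [2.5, 3.5, 6.0, 9.0]

emulator_rmse = rmse(held_out, emulated)
emulator_mae = mae(held_out, emulated)
```

RMSE is never smaller than MAE on the same data, and a large gap between the two suggests that a few test points are predicted much worse than the rest.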

13. Micro and macro indexes of economic activity : multiple indicators and multiple methods using Bangladesh as a test case

Author: Tahsin, Mariha
Subjects: HA Statistics, HB Economic Theory, QA Mathematics
Abstract: The first chapter explores the use of night-time lights as a proxy for estimating annual GDP per capita and subsequently the GDP per capita growth rate. It is observed that even though Bangladesh's GDP per capita is under-estimated, the annual growth rate is over-estimated. The second chapter explores the quality of the household surveys conducted in Bangladesh through the application of Benford's Law and triangulation against administrative data. Sampling errors are detected in all rounds of the household surveys. The results showed that the micro dataset over-sampled wealthier households. This indicated that the income and expenditure levels of the three lowest quartiles, estimated from the household surveys, are likely over-estimated. The results of the first two chapters are then combined to comment on the state of inequality in Bangladesh. It is observed that GDP per capita is higher than expected, while the income of the lowest three income quartiles is lower than estimated. Thus, true inequality is likely to be much higher than what is indicated by the published Gini coefficients. The fourth chapter assesses the accuracy of a proxy-means test, the Poverty Probability Index (PPI), in classifying household poverty, in the absence of sound data. Applicability of the PPI, over years and across population sub-groups, was tested. It was seen that the index overestimated poverty probability in both cases. In the last chapter, machine learning algorithms are used to develop alternatives to the Poverty Probability Index, in the absence of extensive domain knowledge. These models out-performed the PPI by 3 percentage points in terms of accuracy and ROC-AUC.
Published: 2022
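
Note on the method: the abstract above mentions applying Benford's Law to household survey data. As a minimal sketch of what such a first-digit check can look like (not the thesis's own procedure), the Python below compares observed leading-digit shares with the Benford expectation log10(1 + 1/d). The function names and the sample values are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d = 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading non-zero digit of a non-zero number."""
    # In scientific notation the first character is the leading digit.
    return int(f"{abs(x):.15e}"[0])


def first_digit_shares(values):
    """Observed share of each leading digit among the non-zero values."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def chi_square_stat(values):
    """Pearson chi-square distance between observed and Benford shares."""
    n = sum(1 for v in values if v != 0)
    expected = benford_expected()
    observed = first_digit_shares(values)
    return sum(n * (observed[d] - expected[d]) ** 2 / expected[d]
               for d in range(1, 10))


# Hypothetical reported expenditure values, for illustration only.
sample = [1, 10, 200, 35]
shares = first_digit_shares(sample)
```

A large chi-square statistic relative to a chi-square distribution with 8 degrees of freedom would suggest that the reported figures depart from the Benford pattern; how the thesis operationalises the test is not stated in the abstract.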

14. Risk estimation and discontinuity identification in Bayesian disease mapping

Author: Yin, Xueqing
Subjects: HA Statistics
Abstract: Disease mapping is the field of epidemiology that estimates the spatial or spatio-temporal pattern in disease risk. Approaches in this field are generally based on data collected on a set of non-overlapping areal units that comprise the study region, and typically utilise counts of the numbers of disease cases within each areal unit. Conditional autoregressive (CAR) models are commonly used to capture the spatial autocorrelation present in areal unit disease count data. The spatial correlation structure that is induced by these models is typically determined by a neighbourhood matrix based on geographical adjacency, which enforces spatial correlation between geographically neighbouring areas and assumes a spatially smooth risk surface. However this may not be realistic in practice, because some pairs of neighbouring areas are likely to exhibit vastly different disease risks. Therefore the aim of this thesis is to develop methodology that allows for discontinuities in the spatial risk pattern when estimating disease risk. The first two models proposed are in a purely spatial setting and account for discontinuities by identifying spatial clusters of areas that have higher or lower risks than their geographical neighbours, while the third proposed model extends this to the spatio-temporal domain to identify clusters/discontinuities and estimate the spatial pattern of disease risk over time. The final piece of work of this thesis allows for discontinuities by using a boundary analysis approach. This approach identifies the boundaries in the spatial risk surface that separate pairs of geographically adjacent areas that exhibit large differences between their risks. Each model is applied to hospital admissions data for respiratory disease from the Greater Glasgow and Clyde Health Board region. Overall, it has been found that the respiratory disease risk surface in Greater Glasgow is not globally spatially smooth. 
There are numerous pairs of neighbouring areas where a discontinuity in disease risk appears to exist. In addition, the respiratory disease risk in Glasgow appears to increase over time and people living in more deprived areas are at higher risk of respiratory hospital admissions than those living in more affluent areas.
Published: 2022
Full Text: View/download PDF

15. A novel statistical framework to detect complex 3D genome organisation patterns into topologically associated domains

Author: Mikheeva, Liudmila
Subjects: HA Statistics, QA76 Computer software, QH426 Genetics
Abstract: Topologically associated domains (TADs) are highly compacted regions of DNA that are suggested to be involved in proper gene regulation and cellular functioning. TADs maintain long-range interactions between distal enhancers and target genes, as well as restrict enhancers contacting genes that are not their target and, consequently, block their inappropriate regulation by these enhancers. The widely used TAD calling tools either restrict TAD borders to be allocated in a "head-to-tail" manner or allow hierarchical TAD folding to be detected. We propose an R-based TAD calling tool that detects start and end TAD border positions separately, so the partial overlapping of TADs as well as large gaps between TADs are also allowed. Using the ratio between the average upstream and downstream Hi-C interaction frequencies, our method detects where the difference between the inside-TAD and outside-TAD areas within the Hi-C matrix is most significant. The novel TAD allocation combined with various genomics data reveals the interplay between architectural proteins and active transcription in the establishment of the TAD border insulation strength and insulation imbalances between neighbouring TADs.
Published: 2022

16. Surrogate modelling of a patient-specific mathematical model of the left ventricle in diastole

Author: Lazarus, Alan
Subjects: HA Statistics
Abstract: Personalised medicine is a relatively new area of healthcare that uses patient-specific data at multiple scales, and different scientific models, to inform disease prognosis and treatment planning. Recently, there has been particular interest in the translation of mathematical models to the clinical setting. These models are usually implemented in the form of a computer code that relates a set of model parameters with a set of observable quantities. Often these parameters have a physiological meaning, and their estimation can provide information about the level of function or dysfunction of a particular physiological process. An important example is in modelling the behaviour of the left ventricle (LV) in diastole. This model relates cardiac tissue properties (the parameters) with the kinematic behaviour of the LV that can be observed from cardiac magnetic resonance images. The personalisation of this model to different patients depends not only on the parameters, but also on the geometry of the LV, which varies from patient to patient. Improved representation of the LV geometry, combined with improved modelling capabilities, has led to increasingly accurate and personalisable models that can better replicate the real world process. This increased model fidelity is accompanied by increased computational costs, which hinders the application of these models in the clinical setting. A natural solution to the problem posed by computational cost is to use statistical emulation. In emulation, we build a model that efficiently replicates the behaviour of the expensive simulator. Although conceptually a simple idea, the application of this methodology to mathematical models can be complicated. In the context of the LV model, this complexity is largely tied to the LV geometry. 
By its very principle, personalised medicine relies on the ability of the emulator to generalise to different LV geometries, meaning that the LV geometry itself must be treated as an input to the model. However, the high dimension of the LV geometry representation makes it incompatible with the statistical emulation framework. To resolve this issue, the work in this thesis uses a low-dimensional representation of the LV geometry to reduce the dimension of the input space of the model and construct a generalisable emulator of the LV model. Of primary interest is the efficient estimation of the parameters of the LV model, in a time frame compatible with the clinical setting. For this purpose, the generalisable emulator allows for the efficient use of Markov chain Monte Carlo, providing a measure of uncertainty in the parameters. A common problem in complex models, as is the case in the LV model, is the presence of weak practical identifiability. This manifests as large uncertainty in the posterior distributions of the parameters. In a Bayesian framework, this issue can be tackled using a more informative prior distribution. For the LV model, an informative prior that includes information from ex vivo studies is proposed, improving the estimation of the model parameters. Also motivated by the weak identifiability of the model, a new parameterisation of the model is considered. This involves a comprehensive sensitivity and inverse uncertainty quantification study that sheds extra light on the identifiability, both practical and structural, of the LV model. Finally, the problems posed by the measurement of clinical data, and the discrepancy between the model and reality, are considered and methods are proposed that account for these in the inference framework. Critically, the culmination of the work in this thesis highlights the problems that need to be resolved before the LV model can be applied in the clinical setting.
Published: 2022
Full Text: View/download PDF

17. Tracking and nowcasting directional changes in the Forex market

Author: Ma, Shuai
Subjects: HA Statistics, HG Finance, Q Science (General)
Abstract: Price changes in financial markets are typically summarized as time series (TS). Directional Change (DC) is an alternative, data-driven way to sample data points. The main objective of this thesis is to find new ways to extract new, useful information from the market. This is broken down into three directions: (1) to summarize price changes with DC, one must first determine the threshold to be used. We ask: could a threshold be too big or too small? If so, how could we determine the range of usable thresholds? (2) Could DC indicators extract volatility information from the market that is not observable under TS? (3) In DC, the start of a new trend is only confirmed in hindsight - to be precise, at the DC Confirmation (DCC) point when the price has reversed by the threshold specified. Could we detect that a new trend has begun before the DCC point? This is known as a nowcasting problem. This thesis has made three contributions. Firstly, we have created a guideline to determine the range of useable thresholds under DC. This supports the research that follows. Secondly, we have demonstrated how DC indicators could complement TS in tracking the market for volatility information. Thirdly, we have introduced new DC indicators; by using these indicators, we have proposed an algorithm and demonstrated how it could help us nowcast whether a new trend has begun in DC.
Published: 2022
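
Note on the method: the abstract above describes Directional Change (DC) sampling, in which a new trend is confirmed only once the price has reversed from its last extreme by a chosen threshold. The Python below is a minimal sketch of that confirmation rule under simple assumptions (a relative threshold `theta` and a plain list of prices); the function name and the price series are hypothetical, and the thesis's own indicators and nowcasting algorithm are not reproduced here.

```python
def dc_confirmations(prices, theta):
    """Return DC confirmation events as (confirm_index, direction, extreme_index).

    A downward DC is confirmed when the price has fallen by at least `theta`
    (a fraction, e.g. 0.05 for 5%) from the highest price seen in the current
    upward run; an upward DC is confirmed symmetrically from the lowest price.
    """
    events = []
    trend = None          # no trend is confirmed at the start of the series
    high_i = low_i = 0    # indices of the running high and running low
    for i in range(1, len(prices)):
        p = prices[i]
        if trend != "down" and p <= prices[high_i] * (1 - theta):
            events.append((i, "down", high_i))
            trend = "down"
            low_i = i     # start tracking the low of the new downward run
        elif trend != "up" and p >= prices[low_i] * (1 + theta):
            events.append((i, "up", low_i))
            trend = "up"
            high_i = i    # start tracking the high of the new upward run
        else:
            if p > prices[high_i]:
                high_i = i
            if p < prices[low_i]:
                low_i = i
    return events


# Hypothetical price series, for illustration only.
prices = [100, 103, 101, 97, 96, 99, 102]
events = dc_confirmations(prices, theta=0.05)
```

In each event the extreme index precedes the confirmation index, which is the hindsight gap the abstract refers to: the trend actually began at the extreme, but is only known at the confirmation point.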

18. High resolution air quality modelling and prediction

Author: Napier, Yoana Borisova
Subjects: HA Statistics, T Technology (General)
Abstract: Air pollution is one of the leading world problems. Across the world, many organizations are in charge of researching safe levels of air pollution, which do not affect people's health. This research has resulted in regulations, which in Scotland are set by the Scottish government. However, monitoring air pollution is very expensive, which leads to sparsity in the data. This thesis aims to address this issue by investigating miniature automated sensor (MAS) networks and the emulation of air quality model data. MAS are a cheaper alternative to the current air quality monitoring stations. Therefore, the quality of the measurements from MAS (in a realistic citizen-science application) is assessed using Bland-Altman analysis and compared to the air quality monitoring stations' recordings using linear regression models. It is found that the MAS do not have the required level of accuracy, although their recordings significantly capture the fluctuations in the pollutants' concentrations. Alternatively, in order to assess the effect of unobserved meteorological conditions on pollutants' concentrations, simulated data from ADMS-Urban for Scotland is used. Based on single station and multiple station Gaussian Process (GP) models, emulators for the NO2 annual average are produced and used to identify the meteorological conditions for which the regulations will be breached. Therefore, a variety of measures can be set in motion when such conditions occur to prevent a breach of the regulation. A quasi-Poisson generalised linear model (GLM) is used to emulate the number of NO2 hourly exceedances in a year over the regulatory limit of 200 µg m⁻³, thus identifying the meteorological conditions for which the regulations will be breached and allowing measures that prevent the breaches to be put in place. To emulate the yearly time series for NO2 hourly concentrations, a hyperspatial-temporal emulator with a block-design matrix is proposed. 
In order to improve the computational speed, the emulator is produced for overlapping blocks of data for periods of interest. The results from the emulator identified periods of possible high NO2 hourly pollutant concentrations and allowed the identification of the emission levels and meteorological conditions that lead to high hourly NO2 concentrations. Overall, all proposed emulators have very good out-of-sample performance in predicting the simulated data.
Published: 2022
Full Text: View/download PDF

19. Bayesian nonparametric methods for cyber security with applications to malware detection and classification

Author: Perusquia Cortes, Jose Antonio
Subjects: HA Statistics
Abstract: The statistical approach to cyber security has become an active and important area of research due to the growth in number and threat of cyber attacks perpetrated nowadays. In this thesis, we centre our attention on the Bayesian approach to cyber security, which provides several modelling advantages such as the flexibility achieved through the probabilistic quantification of uncertainty. In particular, we have found that Bayesian models have been mainly used to detect volume-traffic anomalies, network anomalies and malicious software. To provide a unifying view of these ideas, we first present a thorough review on Bayesian methods applied to cyber security. Bayesian models applied to detecting malware and classifying them into known malicious classes is one of the cyber security areas discussed in our review. However, and contrary to detecting traffic and network anomalies, this area has not been widely developed from a Bayesian perspective. That is why we have centred our attention on developing novel supervised learning Bayesian nonparametric models to detect and classify malware using binary features built directly from the executables' binary code. For these methods, important theoretical properties and simulation techniques are fully developed and for real malware data, we have compared their performance against well-known machine learning models which have been widely applied in this area. With respect to our methodologies, we first present a new discrete nonparametric prior specifically designed for binary data that builds on an elegant nonparametric hierarchical structure, which allows us to study the importance of each individual feature across the groups found in the data. Moreover, and due to the large, and possibly redundant, number of features, we have developed a generalised version of the model that allows the introduction of a feature selection step within the inferential learning. 
Finally, for a more complex modelling where there is a need to introduce dependence across the features, we have extended the capabilities of this new class of nonparametric priors by using it as the building block of a latent feature model.
Published: 2022
Full Text: View/download PDF

20. Statistical modelling on the severity of road accidents in Great Britain

Author: Hamdan, Nurhidayah
Subjects: HA Statistics
Abstract: Great Britain has a modern road network and is well known for its advanced technology in road engineering. Despite this excellent road infrastructure, road accidents remain one of the main concerns in the road safety literature among researchers and policymakers. One of the main strategies for improving road safety is to identify the contributing factors and then to develop countermeasures. There have been numerous studies that analyse road accident severity, including binary outcome models, ordered discrete outcome models, unordered multinomial discrete outcome models, and other data mining approaches. The aim of this thesis is to identify the contributing factors affecting road accident severity in Great Britain and estimate the accident cost for all types of accident severity. For the accident severity study, three statistical models are selected: the multinomial logistic regression (MNL) model, the log-linear graphical model and the multinomial logistic with random effects (MNLRE) model. Markov Chain Monte Carlo (MCMC) simulation with the random walk Metropolis-Hastings (M-H) algorithm is used for parameter estimation in the MNLRE model. Accident cost is investigated by applying three models: the Gamma, Weibull and Log-normal distributions.
Published: 2022

21. How does immigration relate to crime perceptions and crime rates? : evidence from Europe

Author: Bortoletto, Gianluca
Subjects: HA Statistics, HC Economic History and Conditions, HT Communities. Classes. Races
Abstract: This thesis analyses how immigration and migratory background relate to crime rates and crime perceptions. In order to analyse these links, two approaches have been employed. In the first approach, used in Chapters 2 and 3 of the present thesis, the link between immigration from fragile countries and crime rates has been examined for the European context with a cross-country analysis and for the Italian context by employing province-level data. The second approach has been explored in Chapter 4, where the link between country of birth and the probability of reporting crime as a problem in the neighbourhood of residency has been analysed. Chapter 2 has found no significant link between immigration from fragile countries and various types of crime rates in Europe, except for a negative link with robberies that, however, was not confirmed by the robustness checks. This result was contrary to the hypotheses outlined in the beginning of the chapter, for which a positive and significant association of this type of immigration with violent and property crimes would have been expected. On the other hand, Chapter 3 has found a positive, significant and robust link between immigration from fragile countries and mafia crimes. The finding does not support the hypotheses that immigration from fragile countries would have increased either violent or property crimes. However, it confirms the hypothesis, specific to the Italian case, that immigration from fragile countries is expected to be positively and significantly associated with mafia crimes. The most likely explanation for this result, albeit not confirmed by the available data, is that immigrants are exploited by mafia organisations. For migratory background and crime perceptions, Chapter 4 has shown that there is no significant relationship between being born in a foreign country and reporting crime as a problem in the area of residence. 
Moreover, the chapter has explored the link between the interactions of various measures of deprivation and concentrated disadvantage with the country of birth of the household head and the probability of self-reporting crime as a problem of the neighbourhood. Unlike the results of previous studies and contrary to the hypotheses formulated at the beginning of the chapter, the interaction of country of birth of the household head, namely born in an EU member country different from the country of residence, and the condition of being a single parent with children has been found to be associated with a lower probability of self-reporting crime as an issue of the neighbourhood compared to a native. The explanation of this result might relate to the perception of what is a crime for an EU migrant compared to a native and to the social support that the migrant might receive in the area where she or he resides, which might be an immigrant cluster. Further research is needed in order to explain the reasons for these results. It would also be interesting to explore in more detail the link between immigration and crime by looking, for instance, at the geographical macro areas of origin of the migrants.
Published: 2021

22. Essays on macroeconomic dynamics : learning, uncertainty, and heterogeneity in credit crises

Author: Yeromonahos, Mallory
Subjects: HA Statistics, HB Economic Theory, HG Finance
Abstract: This thesis covers two research topics. Chapter 2 is an investigation into the properties of the equity risk premium and its relationships with uncertainty and macroeconomic fluctuations. A large literature suggests that the expected equity risk premium is countercyclical. Using a variety of different measures for this risk premium, we document that it also exhibits growth asymmetry, i.e. the risk premium rises sharply in recessions and declines much more gradually during the following recoveries. We show that a model with recursive preferences, in which agents cannot perfectly observe the state of current productivity, can generate the observed asymmetry in the risk premium. Key for this result are endogenous fluctuations in uncertainty which induce procyclical variations in agents' nowcast accuracy. In addition to matching moments of the risk premium, the model is also successful in generating the growth asymmetry in macroeconomic aggregates observed in the data, and in matching the cyclical relation between quantities and the risk premium. Chapters 3 and 4 are an investigation into the distribution and dynamics of household debt. We present new empirical facts on the distributional dynamics of household debt around the Great Recession in the US using survey data from the Panel Study of Income Dynamics. We document that it is the 60% of households toward the middle of the income distribution that are responsible for the aggregate reduction of debt from the onset of the financial crisis in 2007 until 2015, not the 40% in the tails of the distribution. We extend the current class of heterogeneous-household models by explicitly tracking the distributions of gross debt and gross savings separately during a simulated credit crunch, instead of calculating exclusively the net financial positions of households as in the standard framework. 
The model successfully replicates the relative importance of the different income groups in the aggregate reduction of household debt. The results are driven by endogenous heterogeneity in the intertemporal utility cost of debt. In addition, the model provides new insights into the effectiveness of monetary policy when households are highly indebted. We show that collateralised debt is a stronger channel than liquid savings for the transmission of monetary policy.
Published: 2021

23. Determinants and consequence of entrepreneurship : evidence from China, the UK and Russia

Author: Wang, Chenyang
Subjects: H Social Sciences (General), HA Statistics
Abstract: This thesis presents three empirical studies on entrepreneurship. Specifically, the first study investigates liquidity constraints and entrepreneurship in China. The second study examines why hybrid entrepreneurs exist and their effect on full-time self-employment entry in the UK. The third study investigates whether Russian entrepreneurs are optimistic. All studies use micro-level survey data. Using the 2010, 2012, 2014, and 2016 waves of the China Family Panel Studies, I evaluate the extent to which the dynamic transition into entrepreneurship made by individuals is affected by liquidity constraints in China. In addition to analyzing the effect of wealth on entrepreneurial entry, I also use the housing value appreciation acquired by the individual as a proxy for wealth. Additionally, I explore whether wealth plays a more important role in self-employment choices in less financially developed provinces and rural areas compared with highly financially developed provinces and urban areas respectively. My results are robust to taking the endogeneity of wealth into account. Using the Harmonized British Household Panel Survey (BHPS) and Understanding Society datasets for the period from 1991 to 2018, I examine hybrid entrepreneurship in the UK for both males and females. After removing the heterogeneity of the individuals in our sample, I find that, for both males and females wishing to set up their own business, financial pressure and the desire for a career change drive them from employment into hybrid entrepreneurship. Protecting against any risk of uncertainty associated with the primary job is an additional driver for male paid employees. Furthermore, for both males and females, only those hybrid entrepreneurs who wish to establish their own business during their hybrid phase are more likely to transition into full-time self-employment than workers in full-time employment. 
Additionally, the good performance of the secondary self-employed job will motivate hybrid entrepreneurs to transition into full-time self-employment. However, this phenomenon is only applicable to those female hybrid entrepreneurs who wish to set up their own business in the hybrid phase. Lastly, I do not find that the age of hybrid entrepreneurs plays an important role in driving them into full-time self-employment. Using rounds 5 to 27 of the Russian Longitudinal Monitoring Survey, covering the period from 2000 to 2018, I investigate the association between entrepreneurship and optimism. I find that entrepreneurs are more likely to be optimistic than employed workers. Moreover, those who become entrepreneurs are more likely to become more optimistic than those who remain in employment. I do not find a significant association between entrepreneurship and overoptimism.
Published: 2021

24. Statistical modelling of lexical and syntactic complexity of postgraduate academic writing : a genre and corpus-based study of EFL, ESL, and English L1 M.A. dissertations

Author: Nasseri, Maryam
Subjects: AZ History of Scholarship The Humanities, HA Statistics, PE English
Abstract: This research is an interdisciplinary study that adopts the principles of corpus linguistics and the methods of quantitative linguistics and statistical modelling to analyse the rhetorical sections of MA dissertations written by EFL, ESL, and English L1 postgraduate students. A discipline-specific corpus was analysed for 22 lexical and 11 syntactic complexity measures using three natural language processing tools [LCA-AW, TAALED, Coh Metrix] to find differences of academic texts by English L1 vs. L2 and to investigate the relationship between these linguistic indices. Structural factor analyses as well as the two statistical modelling methods of linear mixed-effects modelling and the supervised machine learning predictive classification modelling were then employed to verify the existing classification of the complexity indices, to explore their further dimensions, to investigate the effects of English language background and rhetorical sections on the production of lexically and syntactically complex texts, and finally to predict models that can best classify the group membership and the membership to the rhetorical sections based on the values of these measures. This investigation resulted in more than 20 specific findings with important implications for academic writing assessment of English L1 vs. L2, for academic writing research on rhetorical sections of English academic texts, for academic writing instruction especially materials development and syllabus designs in the EFL contexts, and academic immersion programmes, for the measure-testing and selection processes, and for methodological aspects of statistical modelling in corpus-based academic studies.
Published: 2021

25. Random projection optimal trees ensemble

Author: Faiz, Nosheen
Subjects: 519.5, HA Statistics
Abstract: Ensemble classifiers, formed by the combination of multiple weak learners, have been shown to outperform ordinary classification methods in that the former decrease bias, variance and/or improve predictions. These classifiers, however, can still result in low prediction performance when used with the wrong choice of hyper-parameter values and/or when there are noisy features in the data. Thus, feature selection and fine-tuning of hyper-parameters could improve the predictive accuracy of ensemble classifiers. This thesis first investigates the effect of feature selection on three methods: Random Forest (RF), Optimal Trees Ensemble (OTE) and Random Projection Ensembles (RP) in high dimensional settings. To this end, LASSO has been considered for selecting the most important features based on training data for dimension reduction. Additionally, the influence of various hyper-parameters regulating the three methods has also been assessed. Secondly, this thesis proposes a novel idea to use the random projection method in conjunction with optimal tree selection to get an improved trees ensemble. This is achieved by randomly projecting the training data into a lower dimension and growing classification trees on bootstrap samples taken from the newly projected datasets. The best performing trees are selected based on out-of-bag error rate and combined to get the final Optimal Random Projection Trees Ensemble (ORPTE). The results of ORPTE are compared with those of Tree, RF, OTE, RP, k-NN, XGBoost and SVM. Analysis on several benchmark datasets is given to illustrate the effect of feature selection and hyper-parameter tuning on the methods and the efficiency of the proposed method. The results reveal that feature selection improves the predictive performance of the RP method in addition to reducing the computational burden on benchmark and example datasets. The performance of OTE and RF is less influenced by feature selection. 
Moreover, ORPTE has outperformed the other methods in terms of prediction accuracy in the majority of cases.
Published: 2021

26. Experimental studies on questionnaire design in political surveys

Author: Agalioti Sgompou, Vasiliki Maria
Subjects: 303.3, H Social Sciences (General), HA Statistics, JA Political science (General)
Abstract: This thesis is a collection of three studies that contribute to the area of survey methodology. The primary aim is to contribute to questionnaire design in political surveys. The research focus is on examining the interaction of questionnaire design characteristics and respondents' background on survey response. I focus on questions that were included in political surveys and experimentally randomised (American National Election Study 2008 and 2012). I use as dependent variables questions that evaluate Presidential candidates (first and third study) and self-reported ability on understanding politics (second study). I examine whether and how question order, response scale direction, response scale type and interviewers' background (partisanship) interact with respondents' background (partisanship or political knowledge) affecting response. I use regression analysis to examine the effect of the interaction terms on survey response. The findings are mixed but there are trends across the three studies. The questionnaire design characteristics usually do not affect response, but some of the interactions between questionnaire design characteristics and respondents' background do affect response. The results suggest that 1) response can be skewed by the interaction between questionnaire design characteristics and respondents' background, 2) the statistically significant findings are concentrated on specific groups of respondents and question topics, and, 3) overall direction of response is not affected by the interaction. The thesis discusses the importance of considering respondents' background when designing questionnaires and the implications for political researchers when using data from a survey study where randomisations were applied.
Published: 2021

27. Three essays in labor economics

Author: Eser, Eike Johannes
Subjects: 331, HA Statistics, HD Industries. Land use. Labor
Abstract: This dissertation consists of three essays in empirical labor economics. All essays analyze current issues in European labor markets. The first chapter provides a comprehensive analysis of labor-market evolutions in towns and rural areas from 1970 onwards. Data on the four most populous European countries (France, Germany, Italy, UK) imply that changes in the industry structure are fast, and industry turnover is positively associated with employment growth. Furthermore, successful rural areas and towns experience stronger employment growth in hospitality, and in the culture industry, respectively. The evidence also indicates that there are large differences in employment growth across towns and rural areas. The second chapter studies the labor-market effects of barriers for Romanian immigration to Spain, the worldwide seventh largest bilateral migration corridor between 2000 and 2010. Effects on native workers are identified by geographical differences in the exposure to the restrictions at the province level. The results imply that the employment-growth rate of natives is systematically higher in provinces that are more exposed to the restrictions, particularly for young and low-educated individuals. However, migration barriers also induce downward adjustments in the educational composition of occupations. Short-term effects on labor productivity are uniformly negative, but imprecisely estimated. The third chapter combines worker-level occupational task data for 1979-2012 with a novel data set for the pre-computer era to study the labor-market effects of deroutinization. I find that deroutinization occurs primarily between occupations in the pre-computer era, but within occupations afterwards. Because changes in occupational tasks occur over and above changes in educational attainment, they may explain shrinking wages and falling employment shares of middle-skilled workers that cannot be easily explained by mere educational upgrading. 
The results imply that initially more routine occupations experience relative employment-share declines, but no wage declines, between 1979 and 2012. Effects on labor shares of initially more routine industries are largely insignificant.
Published: 2021

28. Examining disparities in public support for knowledge about, and trust in, science

Author: Makarovs, Kirils
Subjects: H Social Sciences (General), HA Statistics, HM Sociology
Abstract: The thesis consists of three empirical articles, each focusing on a specific domain of the public understanding of science research field, namely racial disparities in civic scientific literacy, public support for pro-environmental governmental policies, and public perception of scientists' trustworthiness. Chapter 2 employs the General Social Survey data to study the role of racial social identity and racial ingroup evaluation in shaping the science literacy gap between whites and African Americans. In Chapter 3, the European Social Survey Round 8 data is used to explore the relationship between public support for the welfare state and the environmental state. Finally, Chapter 4 utilizes experimental data collected via Prolific to identify the attributes of scientists that make them preferable and more trustworthy in the public eye.
Published: 2021

29. Strategic behavior in risky competitive settings

Author: Sentana, Juan
Subjects: HA Statistics, HB Economic Theory
Abstract: In the first chapter of this dissertation, I propose a novel and tractable structural model for ascending auctions with both common and private value components in which heterogeneous bidders exhibit loss aversion. Importantly, I find that loss averse bidders bid noticeably lower than risk neutral ones. I also consider a more general framework in which bidders incorporate into their strategies the information of those bidders who are present but decide not to participate after observing the item put up for auction. This results in bidders reducing the aggressiveness of their bids even further. To empirically assess my model, I use data from storage locker auctions in the popular cable TV show Storage Wars, finding that the behavior of most of its bidders is consistent with loss aversion. Thus, I document for the first time the presence of loss aversion in actual ascending auctions. In Chapter 2, I report the results of a (quasi) field experiment in the training grounds of a professional soccer team to check if individuals, when repeatedly facing the same opponents, satisfy the main mixed strategy equilibrium predictions in soccer penalty kicks, a real-life example of strategic play. This is the first time that the implications of mixed strategy equilibria are tested in the field using repeated observations on specific heterogeneous pairs of players, a situation that rarely repeats in real life. In this respect, I also study the effects of the usual practice of treating heterogeneous rivals as if they all came from the same pair because of the lack of repeated observations for specific pairs. In particular, I show that false rejections may arise when heterogeneous pairs are treated as homogeneous and suggest valid aggregate tests that combine statistics from different opponents. 
My empirical results suggest that the behavior of most soccer players, when repeatedly facing the same opponents, is consistent with equal scoring probabilities across strategies except for the least professional kickers, as well as with serial independence of players' actions. However, I find dependence between the kicker's and goalkeeper's actions. I also find that the least professional goalkeepers tend to replicate each other's actions. In contrast, players do not seem to follow a reinforcement learning model. In the third chapter, I prove the numerical equivalence for general categorical variables between many seemingly unrelated independent tests. Specifically, I prove that the Pearson's independence test in a contingency table is numerically equivalent to the Lagrange Multiplier test in several popular linear and non-linear regression models: the multivariate linear probability model, the conditional and unconditional multinomial model, the multinomial logit and probit models; as well as the overidentifying restrictions test in GMM. Therefore, different researchers using different econometric procedures will reach exactly the same conclusions if they use any of those tests. Additionally, I show that the asymptotically equivalent Likelihood Ratio tests in the non-linear regression models are numerically identical, and that the heteroskedasticity-robust Wald tests in the multivariate linear probability model and GMM coincide with the Wald test in the conditional multinomial model. All these equivalences also apply to tests of serial independence in a discrete Markov chain, which can be regarded as a time series analogue of the multinomial model. Finally, I use these tests to analyze if professional soccer players follow optimal mixed strategies in penalty kicks. For some players, my empirical results are not consistent with equal scoring probabilities across strategies. In contrast, I find that players' actions are serially independent.
Published: 2021

30. Pseudo-continuous spatial and spatio-temporal modelling of disease risk

Author: Sanittham, Kamol
Subjects: HA Statistics
Abstract: Disease mapping is the statistical methodology used to estimate disease risk over time, and is generally based on areal unit data. Conditional autoregressive (CAR) models are the most common modelling approach for disease data at the areal unit level. Such approaches assume constant disease risk within each areal unit, which may not be realistic. Therefore this study aims to address this problem by creating a pseudo-continuous disease risk surface over the Greater Glasgow and Clyde Health Board. A set of regular grid squares is overlaid across the study region and the main focus of this study is to estimate disease risk in each grid square after removing grid squares with zero population. Areal unit data are transformed to the grid level via two novel approaches, multiple imputation and data augmentation, and the resulting grid data are used to fit the standard Leroux CAR model to estimate the spatial patterns in disease risk at the grid level. The multiple imputation approach generates multiple sets of disease counts at the grid level via multinomial sampling; each dataset is used to fit the CAR model, and the results are then combined to estimate the grid level disease risk. The data augmentation approach, by contrast, allows for uncertainty in the disease counts by updating them in the MCMC steps. Each method is applied to respiratory hospital admission data from the Greater Glasgow and Clyde Health Board area. The final piece of work of this thesis extends the spatial model to measure health inequality in Glasgow over time. Overall, it was found that disease risk is increasing over time and the areas with higher risk correspond to the deprived areas, while areas with lower risk tend to be the wealthier areas in Glasgow.
Published: 2021
Full Text: View/download PDF

31. Bayesian optimal experimental design for the study of natural phenomena

Author: Mavrogonatou, Lida
Subjects: 519.5, HA Statistics, QA Mathematics
Abstract: Modern science has been progressively moving towards the study of increasingly complex structures, investigating not only their individual components but also their interactions, dependencies and co-existence as a whole. This thesis is concerned with optimal experimental design methodology for the study of such phenomena. A decision-theoretic framework for optimal experimental design is adopted in this thesis. The employed methods operate based on an optimality criterion, quantifying the benefit incurred from each alternative experimental design - commonly known as the expected utility. An analytical expression is, in most studies of interest, not available for this quantity and so estimation techniques are typically required for its evaluation. Currently, existing estimation methods fail to adequately address issues arising in optimal design problems within a modern scientific framework. This is predominantly attributed to the considerable computational cost incurred by consideration of mathematical models sophisticated enough to adequately capture the complexity of the studied structures. In the face of this restriction, researchers often resort to consideration of rather simplistic models, hindering the progress towards a more realistic representation and better understanding of such systems. Efficient methodology for evaluation of the expected utility constitutes the first main contribution of this thesis. The presented approach adopts a flexible, non-parametric framework combined with variational approximation techniques that translate the initial evaluation problem to an alternative, more tractable problem, the solution of which is achieved through more efficient and computationally inexpensive procedures. A problem shift is thus achieved under which estimation of the expected utility is accomplished through its corresponding dual problem. 
This alternative representation is shown to incur considerable computational gains compared to traditionally adopted approaches without compromising the quality of the produced estimates. The proposed estimator paves the way to an autonomous, comprehensive framework for the optimal study of complex phenomena within a realistic time frame, currently posing an ongoing challenge. Establishment of such a setup composes the second main contribution of this thesis. The proposed framework attempts to emulate a typical research scheme of closed-loop data collection, knowledge update and optimal decision making which, combined with instrument control software, facilitates modern scientific studies under minimal human input. The class of Bayesian optimisation algorithms is finally considered, allowing for truly optimal decision making during the established procedure. This class of algorithms, although particularly well-suited to optimal experimental design problems, has been given little consideration in the relevant literature. Their integration to the proposed framework, thus, constitutes an additional contribution of this thesis. Application of the adopted experimental design framework is examined in three increasingly challenging case studies, addressing a broad range of issues typically encountered in optimal design problems. The first study explores the optimal experimental design for a model discrimination problem adopting a set of simpler, polynomial models. An initial assessment of the proposed estimator and a comparison with the currently adopted methodology is performed, under a setup where application of the latter is not hindered by the incurred computational complexity. 
The subsequent two cases represent real-life problems of optimal experimental design for model inference in Systems Biology and Spectroscopy, employing models under which traditionally adopted methods range from highly inefficient to intractable, so that alternative approaches are needed for the study of such phenomena.
Published: 2021
Full Text: View/download PDF

32. Risk-based inspection planning of rail infrastructure considering operational resilience

Author: Osman, Mohd Haniff
Subjects: HA Statistics, HE Transportation and Communications, QA Mathematics, TF Railroad engineering and operation
Abstract: This research proposes a response model for a disrupted railway track inspection plan. The proposed model takes the form of an active acceptance risk strategy while having been developed under the disruption risk management framework. The response model entails two components working in series: an integrated Nonlinear Autoregressive model with eXogenous input Neural Network (iNARXNN) alongside a risk-based value measure, for predicting track measurement data and valuing the output, respectively. The neural network fuses itself to Bayesian inference, risk aversion and a data-driven modelling approach, as a means of ensuring the utmost standard of prediction ability. Testing on a real dataset indicates that the iNARXNN model provides a mean prediction accuracy rate of 95%, while also successfully preserving data characteristics across both time and frequency domains. This research also proposes a network-based model that highlights the value of accepting iNARXNN's outputs. The value is formulated as the ratio of rescheduling cost to a change in the risk level from a missed opportunity to repair a defective track, i.e., late defect detection. The value model demonstrates how the resilience action is useful for determining a rescheduling strategy that has (negative) value when dealing with a disrupted track inspection plan.
Published: 2020

33. Travel choices, internet accessibility, and extreme weather : translating trends in space-time flexibility in the digital age

Author: Budnitz, Hannah Debra
Subjects: GE Environmental Sciences, GF Human ecology. Anthropogeography, HA Statistics, HE Transportation and Communications
Abstract: Extreme weather affects not only transport infrastructure, but also travel behaviour. Climate change is causing more frequent and intense severe weather events, and thus is increasing the risks to transport infrastructure, services, and travellers. Travel behaviour trends are also in flux due to shifting working and activity patterns, as space-time flexibility and accessibility choice increase, and standard commuting journeys decline. Information and communication technologies (ICT) are one reason for these changing trends in travel behaviour, and, like climate change, create uncertainty in predicting transport operations and travel choices. However, ICT also has the potential to make mobility and accessibility more sustainable and more responsive to climate change impacts. This thesis sets out to identify the opportunities that improving ICT and increasing space-time flexibility create for commuters and other travellers to maintain accessibility, particularly to work activities, so that they may better respond to severe weather, risk, and transport disruption, thereby boosting resilience. The research also concludes that through the integration of travel choices and Internet accessibility and by taking action to address spatial and temporal barriers, policy might better support both resilience and sustainability.
Published: 2020

34. Applications of optimal stopping in behavioural finance

Author: Muscat, Jonathan
Subjects: 332.6, HA Statistics, HB Economic Theory, HG Finance
Abstract: We study a set of optimal stopping problems arising from three branches within the field of Behavioural Finance. We first consider a problem of an investor having S-shaped reference-dependent preferences who wishes to liquidate a divisible asset position at times of their choosing. We prove that it may be optimal for the investor to partially liquidate the asset at distinct price thresholds above the reference level rather than liquidate all the position in one block sale. In the second part of our study we consider problems describing the behaviour of an investor who experiences realisation utility whenever they realise gains or losses after liquidating an asset. We build upon the work of Barberis and Xiong [2012] and propose two problems, which we solve by applying the methodology of Dayanik and Karatzas [2003]. The first problem considers an agent whose preferences are described by the classical Cumulative Prospect Theory S-shaped Utility proposed by Tversky and Kahneman [1992]. The second problem extends the first, and we propose a new utility function under which the agent compares their gains relative to the reference level not only linearly but also proportionally. As part of the solutions presented for these two problems, we provide explicit conditions differentiating between the optimal strategies arising under different parameter cases. In the final part of our study, we consider models of optimal stopping with regret. We provide a continuous time re-formulation and extension to the dynamic model presented in Strack and Viefers [2015]. This model describes an agent whose preference structure incorporates a Regret term, where Regret is defined in the context of the work of Loomes and Sugden [1982].
Published: 2020

35. The economic burden and determinants of healthcare costs of back pain in the UK : an empirical investigation using electronic health records

Author: Zemedikun, Dawit Tefra
Subjects: HA Statistics, HB Economic Theory
Abstract: Back pain is a common health problem globally and it imposes great costs associated with its treatment. In the UK, the healthcare costs associated with back pain were last estimated 20 years ago in a cost-of-illness (COI) study. Given the aging population worldwide, these costs are likely to rise, putting further pressure on healthcare services. The use of robust and more advanced econometric methods to exploit national electronic health records (EHRs) may present up-to-date and more precise estimates for the healthcare costs of back pain. This thesis utilises such data sources and methods to estimate the consultation and prescription costs of back pain and to investigate factors associated with these costs. The systematic review of the literature assessed the methodologies used in COI studies of back pain. The vast majority of studies used a direct method of summing up back pain related costs, which underestimated the true cost of back pain compared to an incremental cost approach. The latter approach, which is conducted using matched control or regression-based methods, is more accurate and comprehensive. This thesis further explored potential data sources that could be utilised for estimating healthcare resource use and costs of back pain in the UK. The Health Improvement Network (THIN), one of the two most widely used primary care databases in clinical research, was identified as providing complete records of consultations and prescriptions data. Matched control studies, using propensity score matching, were conducted to estimate annual healthcare costs (2011-2015), and to assess how estimates vary by socio-economic factors and over time. The incremental costs obtained in comparison with a similar reference population enabled the risk of confounding effects due to differences in baseline characteristics to be minimised. The thesis further evaluated several alternative econometric methods used in modelling healthcare cost data. 
A cross-sectional study was then conducted to assess factors associated with healthcare costs of back pain. Regression analysis methods applied included OLS, log transformed OLS, generalised linear models (GLMs), extended estimating equations (EEE) model, and a quantile regression approach. How well the alternative estimators performed in terms of bias, accuracy and goodness-of-fit was compared by examining regression diagnostics, and predictive performance of the models. The findings demonstrate the need for researchers to examine their assumptions about the most appropriate model for analysing healthcare cost data.
Published: 2020

36. Ethnicity, parenting styles, and adolescent health behaviours

Author: Cassidy, Aidan
Subjects: 362.1, H Social Sciences (General), HA Statistics, HT Communities. Classes. Races
Abstract: Background: There are ethnic variations in health behaviours in adolescence that track into adulthood and determine health outcomes. It is important to understand how these ethnic variations are influenced by factors such as the family environment, so this thesis aimed to investigate whether ethnic variations in adolescent substance use, diet, and physical activity are mediated or moderated by parenting styles. Ethnic variations in adolescent health behaviours may also be moderated in strength by acculturation, and any investigation of parenting styles as a mediator needs to account for intermediate confounding by structural inequalities. Methods: Data were taken from the second wave of the London-based UK DASH study. These data were collected from 4,779 adolescents, aged 14-16 years old, between 2005 and 2006. The ethnic diversity of the DASH study allows for investigation of differences between major UK ethnic groups. Outcome measures include substance use (smoking, alcohol, and illicit drug use), fruit and vegetable consumption, physical activity, body size, and clusters of health behaviours (identified by latent class analysis). Logistic regression analysis and marginal structural modelling were used to investigate whether ethnic variations in adolescent health behaviours were mediated or moderated by cultural values, or parenting styles. This approach allows for intermediate confounding by structural inequalities. Results: Adolescent health behaviours varied by ethnicity and some variations were moderated by cultural factors, tending to be weaker where adolescents were more acculturated. Ethnic minority adolescents were less likely than White UK adolescents to engage in substance use behaviours but tended to have more unhealthy diets. Structural inequalities did not fully explain these ethnic variations. 
Compared to White UK adolescents, ethnic minority adolescents were more likely to perceive Authoritative or Authoritarian styles of parenting, characterised by higher parental control. Adolescents who perceived more Authoritative or Permissive styles of parenting, characterised by higher parental care, tended to have healthier behaviours. In general, the results of marginal structural models indicate that intervening on parenting styles would not remove ethnic variations in adolescent health behaviours, though this may be because the effects of Authoritarian and Authoritative parenting would to some extent cancel each other out. Conclusion: Although intervening to modify parenting styles may improve adolescent health behaviours in general, further research is needed to better understand the role of cultural factors in influencing ethnic variations.
Published: 2020
Full Text: View/download PDF

37. Housing and mental well-being : evidence from China and the UK

Author: Yang, Siyao
Subjects: H Social Sciences (General), HA Statistics, HB Economic Theory
Abstract: This thesis consists of three empirical studies, and is motivated by the following social phenomena: the phenomenon that owning more than one home has become widespread in China over the past two decades, and the phenomenon that mental health has attracted increasing attention in both developed and developing countries. The first study investigates the extent to which acquiring multiple houses is affected by the presence of a son in the family in China. The second study focuses on the extent to which Chinese households’ mental well-being is determined by owning multiple homes. The third study analyses how housing tenure affects households’ mental ill-being in the UK in the presence of financial vulnerability. I use the 2011 and 2013 waves of the China Household Finance Survey (CHFS) to test the extent to which the acquisition of multiple houses is determined by the presence of male children in the family. I conjecture that, as a result of a very high sex ratio resulting from the one-child policy, Chinese families with male children may want to purchase additional houses to enhance the marriage chances of their sons. I also investigate whether this effect is larger for families with a son born in the year of the Dragon. In Chinese culture, the ‘dragon sons’ are believed to have better fortune throughout their lives. I find that families with male children aged 25 or older are most likely to acquire additional houses. This effect is highest in regions characterised by higher sex ratios. However, having a son born in the year of the Dragon does not make a difference. I use the 2011 and 2013 waves of the CHFS and the 2012 and 2014 waves of the China Family Panel Studies (CFPS) to investigate the extent to which owning multiple homes impacts households’ mental well-being. 
I then analyse whether the association between multiple-home ownership and mental well-being is stronger for married households who are supposed to derive more sense of stability from home ownership. Inspired by the theory that owning a house can enhance children’s marriage prospects, which may in turn enhance the parents’ mental well-being, I also investigate whether this association is stronger for households with children. Making use of four different measures of mental well-being, I find that owning multiple homes is strongly and positively associated with households’ mental well-being in the Chinese context. However, being married or having children makes no difference. I use eight waves of the English Longitudinal Survey of Ageing (ELSA) database to test the effects of housing tenure on mental health in England. Specifically, I investigate the extent to which owning one’s house with a mortgage, being a private renter, and being a public renter affect mental ill-being. In addition, I also study whether this association is stronger for private and social renters than for mortgagors, considering that mortgage payments may be particularly burdensome for financially vulnerable people. I then test whether the association between housing tenure and mental ill-being is stronger for households characterised by a higher financial vulnerability. I find that compared to households who own their home outright, those owning a house with a mortgage, private renters and public renters all show higher levels of mental ill-being, with renters showing the highest levels. In addition, I find that being unemployed and experiencing financial fragility strengthen the association between housing tenure and mental ill-being, whilst being 65 or older weakens the association.
Published: 2020

38. Disparity in precarity : measuring insecurity and inequality in youth transitions from education into and within the labour market

Author: Holcekova, Maria
Subjects: 331.12, HA Statistics, HM Sociology, HN Social history and conditions. Social problems. Social reform
Abstract: Young people have always been argued to be disadvantaged in labour market opportunities and in avoiding insecurity. Yet most of these arguments have been based on theoretical, anecdotal, and qualitative accounts, and they focus on aggregate measures of youth unemployment, which tend to hide inequality. The purpose of this thesis is to provide missing nationally representative empirical evidence on the extent of and inequality in insecurity in the contemporary English youth labour market, and in comparison to the past. After reviewing the existing literature (Chapter 1), the first analysis (Chapter 2), using data from the 1985 and 2015 Labour Force Survey, shows that there is a lot more nuance to the blanket claims of most types of insecurity increasing over time for most workers. The following two chapters (Chapters 3 and 4) investigate, using the longitudinal data from the Next Steps dataset, the mechanisms through which young people find themselves in insecure forms of employment for two groups: early-leavers from education, including those experiencing spells in NEET, and further-education graduates. My findings show that it is previous experiences of insecurity, and underlying structural factors, such as one’s socio-economic position, sex, and caring responsibilities, rather than the non-participation in education, employment, or training, that put young people in insecure jobs later in their labour market transitions. A major policy implication of these findings is that pushing young people into employment without considering its security, both in terms of career progression and stability, might potentially make youth transitions more chaotic and less advantageous. 
Furthermore, my findings put recent government strategies of shifting responsibility onto young people and away from the state, and increasingly conditional welfare support, into question, because they fail to address the structural inequalities in access to, and returns from, education for young people in different socio-economic positions.
Published: 2020

39. Text-mining in macroeconomics : the wealth of words

Author: Azqueta Gavaldon, Andres
Subjects: 330.285, HA Statistics, HB Economic Theory, HG Finance
Abstract: The coming to life of the Royal Society in 1660 surely represented an important milestone in the history of science, not least in Economics. Yet, its founding motto, "Nullius in verba", could be somewhat misleading. Words in fact may play an important role in Economics. In order to extract relevant information that words provide, this thesis relies on state-of-the-art methods from the information retrieval and computer science communities. Chapter 1 shows how policy uncertainty indices can be constructed via unsupervised machine learning models. Using unsupervised algorithms proves useful in terms of the time and resources needed to compute these indices. The unsupervised machine learning algorithm, called Latent Dirichlet Allocation (LDA), allows obtaining the different themes in documents without any prior information about their context. Given that this algorithm is widely used throughout this thesis, this chapter offers a detailed yet intuitive description of its underlying mechanics. Chapter 2 uses the LDA algorithm to categorize the political uncertainty embedded in the Scottish media. In particular, it models the uncertainty regarding Brexit and the Scottish referendum for independence. These referendum-related indices are compared with the Google search queries "Scottish independence" and "Brexit", showing strong similarities. The second part of the chapter examines the relationship between these indices and investment in a longitudinal panel dataset of 2,589 Scottish firms over the period 2008-2017. It presents evidence of greater sensitivity for firms that are financially constrained or whose investment is to a greater degree irreversible. Additionally, it is found that Scottish companies located on the border with England have a stronger negative correlation with Scottish political uncertainty than those operating in the rest of the country. 
Contrary to expectations, we notice that investment coming from manufacturing companies appears less sensitive to political uncertainty. Chapter 3 builds eight different policy-related uncertainty indicators for the four largest euro area countries using press-media in German, French, Italian and Spanish from January 2000 until May 2019. This is done in two steps. Firstly, a continuous bag-of-words model is used to obtain semantically similar words to "economy" and "uncertainty" across the four languages and contexts. This allows for the retrieval of all news-articles relevant to economic uncertainty. Secondly, LDA is again employed to model the different sources of uncertainty for each country, highlighting how easily LDA can adapt to different languages and contexts. Using a Bayesian Structural Vector Autoregressive setup (BSVAR), a strong heterogeneity in the relationship between uncertainty and investment in machinery and equipment is then documented. For example, while investment in France, Italy and Spain reacts heavily to political uncertainty shocks, in Germany it is more sensitive to trade uncertainty shocks. Finally, Chapter 4 analyses English language media from Europe, India and the United States, augmented by a sentiment analysis to study how different narratives concerning cryptocurrencies influence their prices. The time span ranges from April 2013 to December 2018, a period where cryptocurrency prices experienced a parabolic behaviour. In addition, this case study is motivated by Shiller's belief that narratives around cryptocurrencies might have led to this price behaviour. Nonetheless, the relationship between narratives and prices ought to be driven by complex interactions. For example, articles written in the media about a specific phenomenon will attract or detract new investors depending on their content and tone (sentiment). Moreover, the press might also react to price changes by increasing the coverage of a given topic. 
For this reason, a recent causal model, Convergent Cross Mapping (CCM), suited to discovering causal relationships in complex dynamical ecosystems, is used. I find bidirectional causal relationships between narratives concerning investment and regulation, while a mild unidirectional causal association exists in narratives that relate technology and security to prices.
Published: 2020
Full Text: View/download PDF

40. Scaling kNN queries using statistical learning

Author: Cahsai, Atoshum Samuel
Subjects: 006.3, HA Statistics, QA75 Electronic computers. Computer science
Abstract: The k-Nearest Neighbour (kNN) method is a fundamental building block for many sophisticated statistical learning models and has a wide application in different fields; for instance, in kNN regression, kNN classification, multi-dimensional items search, location-based services, spatial analytics, etc. However, now that the unprecedented spread of data generated by computing and communicating devices has resulted in a plethora of low-dimensional large-scale datasets and their user communities, the need for efficient and scalable kNN processing is pressing. To this end, several parallel and distributed approaches and methodologies for processing exact kNN in low-dimensional large-scale datasets have been proposed; for example, Hadoop-MapReduce-based kNN query processing approaches such as Spatial-Hadoop (SHadoop), and Spark-based approaches like Simba. This thesis contributes a variety of methodologies for kNN query processing based on statistical and machine learning techniques over large-scale datasets. This study investigates the exact kNN query performance behaviour of the well-known Big Data Systems, SHadoop and Simba, which propose building multi-dimensional Global and Local Indexes over low-dimensional large-scale datasets. The rationale behind such methods is that when executing an exact kNN query, the Global and Local Indexes access a small subset of a large-scale dataset stored in a distributed file system. The Global Index is used to prune out irrelevant subsets of the dataset, while the multiple distributed Local Indexes are used to prune out unnecessary data elements of a partition (subset). The kNN execution algorithm of SHadoop and Simba involves loading data elements that reside in the relevant partitions from disks/network points to memory. This leads to significantly high kNN query response times; so, such methods are not suitable for low-latency applications and services. 
An extensive literature review showed that not enough attention has been given to accessing relatively small-sized but relevant data using kNN queries only. Based on this limitation, departing from the traditional kNN query processing methods, this thesis contributes two novel solutions: the Coordinator With Index (COWI) and Coordinator With No Index (CONI) approaches. The essence of both approaches rests on adopting a coordinator-based distributed processing algorithm and a way to structure computation and index the stored datasets that ensures that only a very small number of pieces of data are retrieved from the underlying data centres, communicated over the network, and processed by the coordinator for every kNN query. The expected outcome is that scalability is ensured and kNN queries can be processed in just tens of milliseconds. Both approaches are implemented using a NoSQL Database (HBase), achieving up to three orders of magnitude of performance gain compared with state-of-the-art methods (SHadoop and Simba). It is common practice that the current state-of-the-art approaches for exact kNN query processing in low-dimensional space use Tree-based multi-dimensional Indexing methods to prune out irrelevant data during query processing. However, as data sizes continue to increase (nowadays it is not uncommon to reach several Petabytes), the storage cost of Tree-based Index methods becomes exceptionally high, especially when a dataset is partitioned into smaller chunks. In this context, this thesis contributes a novel perspective on how to organise low-dimensional large-scale datasets based on data space transformations deriving a Space Transformation Organisation Structure (STOS). STOS facilitates kNN query processing as if underlying datasets were uniformly distributed in the space. 
Such an approach bears significant advantages: first, STOS enjoys a minute memory footprint that is many orders of magnitude smaller than Index-based approaches found in the literature. Second, the required memory for such meta-data information over large-scale datasets, unlike related work, increases very slowly with dataset size. Hence, STOS enjoys significantly higher scalability. Third, STOS is relatively efficient to compute, outperforming traditional multivariate Index building times, and offers comparable, if not better, query response times. In the literature, the exact kNN query in a large-scale dataset was limited to low-dimensional space; this is because the query response time and memory space requirement of the Tree-based index methods increase with dimension. Unable to solve such exponential dependency on the dimension, researchers assume that no efficient solution exists and propose approximate kNN in high-dimensional space. Unlike the approximate kNN query that tries to retrieve approximate nearest neighbours from large-scale datasets, in this thesis a new type of kNN query referred to as ‘estimated kNN query’ is proposed. The estimated kNN query processing methodology attempts to estimate the nearest neighbours based on the marginal cumulative distribution of underlying data using statistical copulas. This thesis showcases the performance trade-off of exact kNN and the estimated kNN queries in terms of estimation error and scalability. In contrast, kNN regression predicts the value of a target variable based on kNN; but, particularly in a high-dimensional large-scale dataset, the query response time of kNN regression can be significantly high due to the curse of dimensionality. In an effort to tackle this issue, a new probabilistic kNN regression method is proposed. The proposed method statistically predicts the values of a target variable of kNN without computing distance. 
In a different context, kNN as a missing-value imputation algorithm in high-dimensional space is investigated in Pythia, a distributed/parallel missing value imputation framework. In Pythia, a different way of indexing a high-dimensional large-scale dataset is proposed by the group (not the work of the author of this thesis); by using such indexing methods, scaling-out of kNN in high-dimensional space was ensured. Pythia uses Adaptive Resonance Theory (ART), a machine learning clustering algorithm, for building a data digest (aka signatures) of large-scale datasets distributed across several data machines. The major idea is that given an input vector, Pythia predicts the most relevant data centres to get involved in processing, for example, kNN. Pythia does not retrieve exact kNN. To this end, instead of accessing the entire dataset that resides in a data-node, in this thesis, accessing only relevant clusters that reside in appropriate data-nodes is proposed. As we shall see later, such a method has comparable accuracy to that of the original design of Pythia but has lower imputation time. Moreover, the imputation time does not significantly grow with the size of the dataset that resides in a data node or with the number of data nodes in Pythia. Furthermore, as Pythia depends entirely on the data digest built by ART to predict relevant data centres, in this thesis, the performance of Pythia is investigated by comparing different signatures constructed by a different clustering algorithm, Self-Organising Maps. In this thesis, the performance advantages of the proposed approaches via extensive experimentation with multi-dimensional real and synthetic datasets of different sizes and context are substantiated and quantified.
Published: 2020
Full Text: View/download PDF
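
Note on record 40: the Global/Local Index pruning idea described in the abstract can be illustrated with a small single-machine sketch. This is not the SHadoop, Simba, COWI or CONI implementation; the uniform grid, the function names and the `cell_size` parameter are all assumptions made for illustration. Partitions are visited in order of their smallest possible distance to the query, and the search stops once no remaining partition can hold a closer point.

```python
import heapq
import math
from collections import defaultdict

def build_partitions(points, cell_size):
    """Toy 'global index' (assumed uniform grid): map each cell to its points."""
    cells = defaultdict(list)
    for p in points:
        key = (math.floor(p[0] / cell_size), math.floor(p[1] / cell_size))
        cells[key].append(p)
    return cells

def min_dist_to_cell(q, key, cell_size):
    """Smallest possible distance from query q to any point inside a cell."""
    dx = max(key[0] * cell_size - q[0], 0.0, q[0] - (key[0] + 1) * cell_size)
    dy = max(key[1] * cell_size - q[1], 0.0, q[1] - (key[1] + 1) * cell_size)
    return math.hypot(dx, dy)

def knn(q, cells, cell_size, k):
    """Exact kNN that skips partitions which cannot contain a closer point."""
    order = sorted(cells, key=lambda key: min_dist_to_cell(q, key, cell_size))
    best = []  # max-heap via negated distances: (-distance, point)
    visited = 0
    for key in order:
        bound = min_dist_to_cell(q, key, cell_size)
        if len(best) == k and bound >= -best[0][0]:
            break  # all remaining partitions are at least this far away
        visited += 1
        for p in cells[key]:  # stands in for a scan of one loaded partition
            d = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    neighbours = [p for _, p in sorted(best, key=lambda t: -t[0])]
    return neighbours, visited

if __name__ == "__main__":
    pts = [(float(x), float(y)) for x in range(10) for y in range(10)]
    grid = build_partitions(pts, 2.5)
    print(knn((1.2, 1.1), grid, 2.5, 3))
```

The second return value counts how many partitions were actually scanned, which is the quantity such index-based pruning tries to keep small.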

41. On explosive time series

Author: Korkos, Ioannis
Subjects: 519.5, HA Statistics, HG Finance
Abstract: The first chapter of this thesis discusses the characteristics of an asset bubble episode, outlining the reasons these episodes have attracted so much interest nowadays, and provides an overview of historical bubble episodes motivating the testing procedures proposed in Chapters 2-4. The second chapter proposes a right-tailed bootstrap implementation of the covariate Augmented Dickey-Fuller (CADF) unit root test of Hansen (1995), motivated by the work of Chang, Sickles and Song (2017). We apply the right-tailed bootstrap BCADF test in a recursive manner and provide evidence that the inclusion of relevant covariates offers significant power gains. An empirical application of the proposed methodology is conducted, utilising the Moody's Seasoned Aaa and Baa Corporate Bond Yields, the Ten-Year Treasury Rate and the Volatility Index (VXO) as covariates. The third chapter intends to examine the size and power properties of right-tailed Dickey-Fuller unit root test processes when testing for market efficiency in the commodity markets by applying a wild bootstrap approach to Phillips et al. (2015) tests. The simulation results show that the proposed wild bootstrap test offers better size control and power performance in finite samples. In the empirical exercise, our proposed test suggests periods of market inefficiency prior to the existence of the bubble episode as identified by the conventional tests during two periods of oil crises. The fourth chapter studies the hypothesis of an asset bubble in a rational expectations framework using a bivariate coexplosive vector autoregression as in Nielsen (2010). Firstly, we apply a co-explosive vector autoregression to model whether the WTI crude oil price run-up of 2007-2008 can be attributed to the existence of a bubble as well as whether the WTI crude oil collapse of 2014-2015 exhibits characteristics of bubble implosion. 
In the fifth and final chapter, concluding remarks are made and directions for future research are proposed.
Published: 2020

42. P-spline additive modeling and partial derivative estimation for environmental data

Author: Vazanellis, George
Subjects: 628.1, HA Statistics, QA Mathematics
Abstract: This thesis addresses the construction of complex additive mixed models for environmental data and the use of those models to estimate partial derivatives for the purpose of detecting impacts of known events. The methods developed are applied to a data set collected by the Scottish Environment Protection Agency in an effort to monitor the dissolved oxygen of the River Clyde. There are many metrics recorded along the River. Exploratory analysis is carried out to pinpoint some possible drivers of the dissolved oxygen. The River Clyde contains processes which are difficult to represent by conventional parametric models. P-splines offer a means of fitting a flexible model to this data set. There is also the possibility of the presence of interactions between some explanatory covariates. Because of the sampling regime, a random effects component is appropriate. An additive mixed model with interactions allows for all the above-mentioned components to be included in a representative model for the River Run data. The methodology for fitting such a model, along with descriptions of four information criteria which are intended to aid in smoothing parameter selection, is explained in this thesis. Two options for performing analysis of variance for additive models with interactions are considered: a simple F-test and a quadratic approach. The performance and computational expense of each is compared to a parametric bootstrap and to various other standard tests. A simple additive model with no interactions is initially fitted with varying degrees of freedom for each main effect. The four information criteria scores are calculated for every main effect across all degrees of freedom. The information criterion which performs best is then used to select the optimal smoothing parameter for every main effect in an additive model and an additive mixed model, both with no interactions. 
Before an additive mixed model with interactions is fitted, a simulation study is conducted to see if the order of optimization of the main effect degrees of freedom is of any importance. An additive mixed model with interactions is subsequently fitted and interpreted. One aim of this thesis is to determine if upgrades to two wastewater treatment facilities have had positive impacts on the levels of dissolved oxygen in the river. Partial derivatives with respect to time are discussed as a means of detecting subtle changes in a system which has shown gradual increases in dissolved oxygen over the past four decades. An argument is made for the use of P-splines with penalty orders other than 2 if the main goal is derivative estimation. A simulation study is conducted and the optimal penalty order is then used to construct a derivative additive mixed model with interactions for the River Run data. This model is used to see if there is evidence that the wastewater facility upgrades had a positive impact. One positive result of this research is that the quadratic forms method of analysis of variance for additive models with interactions was found to out-perform the simple F-test and was less computationally expensive than the parametric bootstrap. A second positive result was finding a preferred information criterion for smoothing parameter selection and using the optimal degrees of freedom to subsequently fit such a complex additive mixed model with interactions. A third positive result was finding that penalty order three outperformed penalty order two in estimating partial derivatives. Finally, the fourth positive result was constructing a derivative model and subsequently using it to provide evidence that the wastewater treatment facility upgrades had a positive impact on the dissolved oxygen.
Published: 2020
Full Text: View/download PDF
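
Note on record 42: the "penalty order" mentioned in the abstract refers, in the standard P-spline construction, to the order of the finite differences applied to adjacent B-spline coefficients. The sketch below is a generic illustration of that construction, not code from the thesis; the function names and the smoothing weight `lam` are hypothetical. It shows that an order-d penalty leaves polynomial trends of degree below d unpenalised, which is one reason the choice of order matters when derivatives are the target.

```python
def difference_matrix(n_coef, order):
    """Return the order-th difference matrix acting on n_coef coefficients."""
    if not 0 <= order < n_coef:
        raise ValueError("order must satisfy 0 <= order < n_coef")
    # Start from the identity and difference the rows `order` times.
    d = [[1.0 if i == j else 0.0 for j in range(n_coef)] for i in range(n_coef)]
    for _ in range(order):
        d = [[b - a for a, b in zip(d[i], d[i + 1])] for i in range(len(d) - 1)]
    return d

def roughness_penalty(coefs, order, lam):
    """Compute lam * ||D coefs||^2, the P-spline difference penalty."""
    d = difference_matrix(len(coefs), order)
    diffs = [sum(w * c for w, c in zip(row, coefs)) for row in d]
    return lam * sum(v * v for v in diffs)

if __name__ == "__main__":
    quadratic = [float(i * i) for i in range(6)]
    # A quadratic coefficient sequence is penalised at order 2 but not at order 3.
    print(roughness_penalty(quadratic, 2, 0.5))
    print(roughness_penalty(quadratic, 3, 0.5))
```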

43. "You're either part of the solution or you're part of the problem" : exploring the view of practitioners from a local authority educational psychology service, of a socio-political approach within UK educational psychology

Author: Chase, Julie
Subjects: 370.15, BF Psychology, H Social Sciences (General), HA Statistics, HN Social history and conditions. Social problems. Social reform, HT Communities. Classes. Races, HX Socialism. Communism. Anarchism, L Education (General)
Abstract: Empirical literature on educational psychologists’ (EPs) views of socio-political or critical community psychology (CCP) focuses on single-issue aspects of oppression such as sexuality or racism. Some research examined EPs’ views of psychology from a broader ideological perspective, examining individualism, neo-liberal austerity, colonialist practices within educational psychology, and social justice. Having identified a gap in the empirical literature, the research was modelled on Thompson (2007), with an emancipatory aim of contributing to EPs’ socio-political conscientisation. Critical realist-based, discursive, Q-methodology involved 16 UK local authority EP service participants ranking 51 expertly updated socio-political statements by relevance to the future of EP practice. Following three-Factor resolution from Factor analysis, interpretation was supported by qualitative data. Findings were considered theoretically and alongside current literature, deriving practice implications. Research limitations and possible future research were discussed. The aim was to contribute to addressing Fox’s (2005) hypothesis that UK EPs do not appreciate, or know how to respond to, socio-politically rooted suffering and so risk colluding with a non-emancipatory status-quo. In conclusion, the EP practitioner group viewed CCP ideas as highly relevant but varied in their responses to them such that the three core discourses derived in factorisation mapped onto the areas of mainstream psychology, mainstream community psychology, and critical community psychology.
Published: 2020

44. The determinants of the timing of retirement : a cross-country comparison

Author: Deng, Shulin
Subjects: H Social Sciences (General), HA Statistics, HB Economic Theory, HD Industries. Land use. Labor
Abstract: To ensure fiscal stability in the face of an ageing population, it is essential to stimulate labour market participation among older people across the world. Early retirement could increase the burden of supporting the older generation in the society, while delayed retirement could potentially improve the wealth accumulation for individuals. This thesis aims to examine how the observed retirement situation reflects health status and wealth position among people aged 45 to 80. The sample countries cover Austria, Germany, Sweden, Spain, Italy, France, Denmark, Switzerland, Belgium, and China, with comparative analysis carried out. The second chapter investigates early retirement in nine European countries using dynamic models to analyse longitudinal associations which were explored by linking health, wealth and working conditions of respondents in the previous wave with prospective labour market participation in the follow-up waves. The analysis is based on five waves of the Survey on Health, Ageing and Retirement in Europe (SHARE). The study comprises two parts. One part is based on the two-year dynamic model to predict premature retirement two years later. Another part presents the health and wealth effects observed four years previously on early retirement behaviour in the current survey year. We find that severe health problems in the past few years can lead to a decrease in the participation rate of individuals in the labour market in Europe. Moreover, there is a strong negative relationship between net non-housing wealth, housing mortgage and early retirement. High debt burdens are less likely to stimulate early retirement. Furthermore, people who retire earlier than the state pension age are more frequently from disadvantaged working environments. Chapter 3 makes a comparative analysis before and after the 2007-2008 Global Financial Crisis by employing the same five waves and nine countries in the SHARE dataset as in Chapter 2. 
The study examines the associations of health status, wealth position, and pension income with delayed retirement for a cohort of people aged 65 to 80. This study provides evidence that better mental or physical health broadly promotes growth in the labour supply of those aged 65 to 80 in Europe. Additionally, those individuals with higher housing values and non-housing values can remain in old-age employment. Evidence is provided for Sweden, Denmark and France. Moreover, people with high pension benefits are less likely to extend their working life. Self-employment presents a significant factor in the extension of labour market participation into older age. We next turn to a Chinese study in Chapter 4. The study conducts both longitudinal analysis and treatment-effects estimation. We pool all three waves of the China Health and Retirement Longitudinal Study (CHARLS) and carry out comparative analysis by region. We estimate treatment effects in the Propensity Score Matching (PSM) section. We observe a strong negative association between diseases, poor health and labour market participation. Stroke, memory problems and mental health problems are the top three chronic diseases that could push respondents out of the labour market at an early age. By contrast, stroke, heart problems and diabetes are the top three diseases that reduce working life. A high level of pension income received suggests a lower rate of labour participation. Pension contributors are less likely to retire before compulsory retirement age. Furthermore, rural residents show a higher retention rate in the labour market, compared to urban residents.
Published: 2019

45. Bayesian inference in the M-open world

Author: Jewson, Jack E.
Subjects: 519.5, HA Statistics, QA Mathematics
Abstract: This thesis examines Bayesian inference and its suitability for modern statistical applications. Motivated by the vast quantities of data currently available for analysis, we forgo the M-closed assumption that the model used for inference is correctly specified and place ourselves in the more realistic M-open world. Here, we assume that the model used for statistical inference is at best an approximation. In the M-open world Bayes' rule updating has been shown [Berk et al., 1966; Bissiri et al., 2016] to learn about the model parameters minimising the log-score, or equivalently the Kullback-Leibler divergence (KLD) to the data generating process (DGP). It is also known that minimising the log-score puts great emphasis on correctly capturing the tails of the sample distribution of the data. We observe that this emphasis is so great that the majority of the data can be ignored to sufficiently account for an outlier. This is purportedly desirable when inference is the goal of the analysis. However, in Chapter 2 we show that when informed decision making via the minimisation of expected losses is the goal of the statistical analysis, as it so often is, Bayes' rule inferences are less desirable. This motivates us to consider minimising alternative divergences to the KLD. Bayesian updating minimising alternative divergences to the KLD has briefly been considered in the literature. However, those methods are neither sufficiently well motivated nor properly justified as a principled updating of beliefs. We are able to use the foundations of general Bayesian inference (GBI) to produce belief updates minimising any statistical divergence. This allows us to consider the divergence as a subjective judgement and motivate several divergences from a decision making perspective. Chapter 3 extends the motivation for minimising divergences alternative to the KLD. 
Here, we consider the model to be one among an equivalence class of belief models all respecting the belief judgements the decision maker (DM) has been able to make. It is therefore desirable for inference to be stable across this equivalence class. This is a well studied problem with respect to the prior component of the Bayesian analysis, but we believe we are one of the first to consider extending these results to the likelihood model. We prove that, unlike Bayes' rule updating, inference aimed at minimising the total-variation divergence (TVD), the Hellinger divergence (HD), and the β-divergence (βD) is provably stable. Chapter 4 is inspired by the computation required to infer posteriors in modern Bayesian inference. We derive a generalised optimisation problem defining Bayesian inference. This is axiomatically motivated and contains Bayes' rule inference, GBI and variational inference (VI) as special cases. This generalised Bayesian inference problem is composed of three interpretable components: a loss function defining the limiting parameter of interest for the analysis; a prior regularising divergence describing how the posterior should quantify uncertainty; and a set of admissible posterior densities to optimise over. Chapters 2 and 3 examined changing the target parameter of inference to deal with model misspecification. Chapter 4 then shows that changing the prior regularising divergence can resolve VI's tendency to allow posteriors to over-concentrate; we call these methods generalised variational inference (GVI). We also show situations where methods failing to satisfy our axioms produce undesirable and non-transparent inference. We show that GVI is able to improve upon state-of-the-art performance for deep Gaussian processes and Bayesian neural networks. 
The final chapter considers the challenging and widely applicable problem of detecting regime changes in multi-dimensional on-line streaming data, Bayesian online changepoint detection (BOCPD). BOCPD must use simple computable models in order to run in real time. The current methodology allows model misspecifications and outliers associated with these simple models to cause the detection of spurious changepoints (CP). We robustify this analysis using the βD. We are able to prove results demonstrating that greater evidence is required in order to force the declaration of a CP when using the βD instead of the KLD. Additionally, we deploy a type of GVI algorithm to produce fast and accurate posterior inferences that are suitable for on-line application. Applying this robustified algorithm to data recording air pollution in London finds a changepoint around the introduction of the congestion charge but, unlike previous methods, does not detect any further regime changes.
Published: 2019
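
Note on record 45: the general Bayesian update and the β-divergence loss referred to in the abstract are commonly written as below. This is a sketch in one widely used parametrisation (the density power form); the symbols f, π, ℓ and the exact scaling of β are assumptions, and the thesis may use a different but equivalent convention.

```latex
% General Bayesian update: the log-likelihood is replaced by a loss \ell.
\pi(\theta \mid x_{1:n}) \;\propto\; \pi(\theta)\,
  \exp\!\Big\{ -\sum_{i=1}^{n} \ell(\theta, x_i) \Big\}

% Log-score: recovers Bayes' rule and targets the KLD-minimising parameter.
\ell_{\mathrm{KLD}}(\theta, x) = -\log f(x;\theta)

% Beta-divergence loss (assumed density power parametrisation, \beta > 0).
\ell_{\beta}(\theta, x) = -\frac{1}{\beta}\, f(x;\theta)^{\beta}
  + \frac{1}{1+\beta} \int f(z;\theta)^{1+\beta}\, dz
```

Under the log-score an observation with very small f(x; θ) contributes an unboundedly large loss, whereas under the β-divergence loss its first term tends to zero, so a single outlier has bounded influence. As β tends to 0 the β-divergence loss approaches the log-score up to terms that do not depend on θ.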

46. Statistical practice and reproducibility in behavioural science

Author: Lim, Kenneth Teck Kiat
Subjects: 330.01, BF Psychology, HA Statistics
Abstract: Psychology and economics are undergoing a 'reproducibility crisis', with researchers attempting to replicate more published studies to verify existing findings. This thesis investigates the role of statistical practice in the reproducibility crisis. A 100-item checklist of recommended statistical practices was developed based on guidelines by the American Psychological Association. The checklist was used to evaluate a sample of psychology and economics studies that were already independently replicated. On average, the evaluated studies adhered to 30% of recommended statistical practice. Incomplete reporting hampered meaningful evaluation of the association between adherence to the checklist items and replication success. Next, the thesis focusses on the sign effect, which is an established intertemporal choice anomaly. Verbal descriptions of the sign effect are formalised, a hypothesis testing framework is proposed and the concept of a discount rate is critically discussed. Then, the first systematic review and meta-analysis of the sign effect was attempted. Results suggested substantial heterogeneity within and between participants, which is not apparent when the convention of analysing aggregated data is used. There was a surprising amount of observations where no discounting occurred and where discount rates could not be estimated. Then, individual participants' responses to questions are modelled to estimate the extent to which the outcome sign and other factors influence choices. Results suggested that the later amount was chosen more often for gains than losses. There was substantial heterogeneity within and between studies. Good statistical practice is central to tackling the reproducibility crisis. Definitions need to be explicitly formalised, data need to be described sufficiently, assumptions need to be explored empirically, study designs need to be informative, and the different types of heterogeneity need to be documented and accounted for.
Published: 2019

47. Nonparametric clustering for spatio-temporal data

Author: Venkatasubramaniam, Ashwini Kolumam
Subjects: 519.5, HA Statistics
Abstract: Clustering algorithms attempt the identification of distinct subgroups within heterogeneous data and are commonly utilised as an exploratory tool. The definition of a cluster is dependent on the relevant dataset and associated constraints; clustering methods seek to determine homogeneous subgroups that each correspond to a distinct set of characteristics. This thesis focuses on the development of spatial clustering algorithms and the methods are motivated by the complexities posed by spatio-temporal data. The examples in this thesis primarily come from spatial structures described in the context of traffic modelling and are based on occupancy observations recorded over time for an urban road network. Levels of occupancy indicate the extent of traffic congestion and the goal is to identify distinct regions of traffic congestion in the urban road network. Spatial clustering for spatio-temporal data is an increasingly important research problem and the challenges posed by such research problems often demand the development of bespoke clustering methods. Many existing clustering algorithms, with a focus on accommodating the underlying spatial structure, do not generate clusters that adequately represent differences in the temporal pattern across the network. This thesis is primarily concerned with developing nonparametric clustering algorithms that seek to identify spatially contiguous clusters and retain underlying temporal patterns. Broadly, this thesis introduces two clustering algorithms that are capable of accommodating spatial and temporal dependencies that are inherent to the dataset. The first is a functional distributional clustering algorithm that is implemented within an agglomerative hierarchical clustering framework as a two-stage process. The method is based on a measure of distance that utilises estimated cumulative distribution functions over the data and this unique distance is both functional and distributional. 
This notion of distance utilises the differences in densities to identify distinct clusters in the graph, rather than raw recorded observations. However, distinct characteristics may not necessarily be identified and distinguishable by a densities-based distance measure, as defined within the agglomerative hierarchical clustering framework. In this thesis, we also introduce a formal Bayesian clustering approach that enables the researcher to determine spatially contiguous clusters in a data-driven manner. This framework varies from the set of assumptions introduced by the functional distributional clustering algorithm. This flexible Bayesian model employs a binary dependent Chinese restaurant process (binDCRP) to place a prior over the geographical constraints posed by a graph-based network. The binDCRP is a special case of the distance dependent Chinese restaurant process that was first introduced by Blei and Frazier (2011); the binDCRP is modified to account for data that poses spatial constraints. The binDCRP seeks to cluster data such that adjacent or neighbouring regions in a spatial structure are more likely to belong to the same cluster. The binDCRP introduces a large number of singletons within the spatial structure and we modify the binDCRP to enable the researcher to restrict the number of clusters in the graph. It is also reasonable to assume that individual junctions within a cluster are spatially correlated to adjacent junctions, due to the nature of traffic and the spread of congestion. In order to fully account for spatial correlation within a cluster structure, the model utilises a type of the conditional auto-regressive (CAR) model. The model also accounts for temporal dependencies using a first order auto-regressive (AR-1) model. 
In this mean-based flexible Bayesian model, the data is assumed to follow a Gaussian distribution and we utilise Kronecker product identities within the definition of the spatio-temporal precision matrix to improve the computational efficiency. The model utilises a Metropolis within Gibbs sampler to fully explore all possible partition structures within the network and infer the relevant parameters of the spatio-temporal precision matrix. The flexible Bayesian method is also applicable to map-based spatial structures and we describe the model in this context as well. The developed Bayesian model is applied to a simulated spatio-temporal dataset that is composed of three distinct known clusters. The differences in the clusters are reflected by distinct mean values over time associated with spatial regions. The nature of this mean-based comparison differs from the functional distributional clustering approach that seeks to identify differences across the distribution. We demonstrate the ability of the Bayesian model to restrict the number of clusters using a simulated data structure with distinctly defined clusters. The sampler is also able to explore potential cluster structures in an efficient manner and this is demonstrated using a simulated spatio-temporal data structure. The performance of this model is illustrated by an application to a dataset over an urban road network that presents traffic as a process varying continuously across space and time. We also apply this model to an areal unit dataset composed of property prices over a period of time for the Avon county in England.
Published: 2019
Full Text: View/download PDF

48. Essays on intrahousehold relationships and decision-making

Author: Sobrevilla, Alma
Subjects: 330.9, H Social Sciences (General), HA Statistics, HQ The family. Marriage. Woman
Abstract: This thesis aims to study three specific research questions in child health, women's empowerment and children's education from an intrahousehold perspective, using panel data from the Mexican Family Life Survey. The first essay (Chapter 2) aims to shed light on the problem of obesity in Mexico. The chapter studies the intergenerational transmission of obesity in children and adolescents, offering quantitative measures of the parent-child link in terms of the Body Mass Index (BMI). Starting with a simple Ordinary Least Squares approach, the analysis progresses to the use of fixed effect methodologies in order to isolate shared and non-shared genetic factors from the parent-child BMI relationship. Results suggest a strong link between the BMI of fathers and children, which is not only associated with genetic elements but also with time-variant factors that could be related to eating and exercising habits; this relationship is highly significant and stronger for children living in households with a high socioeconomic status. The mother-child link, on the other hand, seems to be slightly weaker and almost exclusively explained by time-invariant factors (such as genetics); however, this relationship tends to be stronger for children whose mothers are in paid employment. In the second essay (Chapter 3) this thesis explores the relationship between women's employment and education and their level of participation in seven different aspects of intrahousehold decision-making. Unlike previous research papers on the matter, this work considers three possible results for women's involvement in decision-making: i) exclusive decision-making, ii) shared decision-making with at least one other family member, or iii) non-participation.
Results show that having one additional year of education will increase the likelihood of a woman sharing decision-making power with at least some other family member, but will reduce the probability of her being the exclusive decision-maker. On the other hand, being in paid employment tends to increase women's likelihood of both sharing power and becoming exclusive decision-makers. The analysis then goes on to explore the role of social norms on women's behaviour and finds that having a higher level of education than the average in the community seems to decrease women's level of intrahousehold decision-making power, supporting the notion that women seem to compensate for their success outside the household with submissive attitudes at home. Finally, the third essay (Chapter 4) studies the association between children's cognitive ability and their time allocation to school, work and housework. The relationship between children's endowment and the amount of resources parents allocate to them has been widely studied in the past; however, most of the previous research on this matter has only considered monetary resources as a measure of parental investments. Alternatively, this work considers time allocation as a more basic form of parental investment. Using fixed effects and instrumental variables methodologies, the chapter analyses the relationship between children's IQ z-scores and a set of six variables indicating children's participation or enrolment in work, housework and school, as well as the number of hours dedicated to each activity. Results suggest that cognitive ability does not seem to have a significant effect on children's participation or time allocated to work; nevertheless, it does have a strong link with school enrolment, number of hours spent at school and participation in housework, some of these effects being significantly different for boys and girls.
Published: 2019
Full Text: View/download PDF

49. Analysis of spatially correlated functional data objects

Author: Alghamdi, Salihah Safar
Subjects: 910.285, HA Statistics
Abstract: Space-time data are of great interest in many fields of research, but they are inherently complex in nature, which leads to practical issues when formulating statistical models to analyse them. In classical analysis of space-time data the temporal variation is modelled using traditional time-series analysis. This thesis focuses on building a comprehensive framework for analysing space-time data, where the temporal component is considered to be a continuous function and modelled using functional data analytic tools. There are several approaches for analysing spatially correlated functional data, but most of them are designed for specific applications and there is no easy way of comparing these methods. In summary, the challenge in modelling space-time data using functional data analytic techniques is that there is no clear rule regarding which method is most appropriate for analysing a new dataset. Existing methods have been developed for specific applications without giving a clear indication for a practitioner regarding their appropriateness. This motivates us to propose a clear flow chart of the analysis of space-time data using functional data analysis methods and develop a framework under which different existing methods can be compared. In this research, we provide a clear comparison between two widely different methods of modelling spatial dependence, one using parametric and the other using non-parametric spatial dependence. These techniques were developed for datasets with different complexities. First, we had to generalise the methodologies and codes of both of these methods to analyse data with features they were not originally designed for. We then compared the performance of these two methods on two real life datasets, the enhanced vegetation index (EVI) data and the electroencephalography (EEG) data.
Further, we have generalised our framework to accommodate replicated data and used it to build classification tools that outperform all existing approaches. One major contribution of this thesis is the development of the methodological framework and computational tool for the analysis of spatially correlated functional data. We have also clearly demonstrated, theoretically and through simulations, that our approach outperforms existing methods. Finally, for the EEG data we have demonstrated that classification tools built on representations from our models can outperform classification tools using the raw data.
Published: 2019
Full Text: View/download PDF

50. Bayesian nonparametric inference in mechanistic models of complex biological systems

Author: Noè, Umberto
Subjects: 519.2, HA Statistics
Abstract: Parameter estimation in expensive computational models is a problem that commonly arises in science and engineering. With the increase in computational power, modellers started developing simulators of real life phenomena that are computationally intensive to evaluate. This, however, makes inference prohibitive due to the unit cost of a single function evaluation. This thesis focuses on computational models of biological and biomechanical processes such as the left-ventricular dynamics or the human pulmonary blood circulatory system. In the former model a single forward simulation is of the order of 11 minutes CPU time, while the latter takes approximately 23 seconds on our machines. Markov chain Monte Carlo methods or likelihood maximization using iterative algorithms would take days or weeks to provide a result. This makes them unsuitable for clinical decision support systems, where a decision must be taken in a reasonable time frame. I discuss how to accelerate the inference by using the concept of emulation, i.e. by replacing a computationally expensive function with a statistical approximation based on a finite set of expensive training runs. The emulation target could be either the output-domain, representing the standard approach in the emulation literature, or the loss-domain, which is an alternative and different perspective. Then, I demonstrate how this approach can be used to estimate the parameters of expensive simulators. First I apply loss-emulation to a nonstandard variant of the Lotka-Volterra model of prey-predator interactions, in order to assess if the approach is approximately unbiased. Next, I present a comprehensive comparison between output-emulation and loss-emulation on a computational model of left ventricular dynamics, with the goal of inferring the constitutive law relating the myocardial stretch to its strain. This is especially relevant for assessing cardiac function post myocardial infarction.
The results show how it is possible to estimate the stress-strain curve in just 15 minutes, compared to the one week required by the current best literature method. This means a reduction in the computational costs of 3 orders of magnitude. Next, I review Bayesian optimization (BO), an algorithm to optimize a computationally expensive function by adaptively improving the emulator. This method is especially useful in scenarios where the simulator is not considered to be a "stable release". For example, the simulator could still be undergoing further developments, bug fixing, and improvements. I develop a new framework based on BO to estimate the parameters of a partial differential equation (PDE) model of the human pulmonary blood circulation. The parameters, being related to the vessel structure and stiffness, represent important indicators of pulmonary hypertension risk, which need to be estimated as they can only be measured with invasive experiments. The results using simulated data show how it is possible to estimate a patient's vessel properties in a time frame suitable for clinical applications. I demonstrate a limitation of standard improvement-based acquisition functions for Bayesian optimization. The expected improvement (EI) policy recommends query points where the improvement is on average high. However, it does not account for the variance of the random variable Improvement. I define a new acquisition function, called ScaledEI, which recommends query points where the improvement on the incumbent minimum is expected to be high, with high confidence. This new BO algorithm is compared to acquisition functions from the literature on a large set of benchmark functions for global optimization, where it turns out to be a powerful default choice for Bayesian optimization.
ScaledEI is then compared to standard non-Bayesian optimization solvers, to confirm that the policy still leads to a reduction in the number of forward simulations required to reach a given tolerance level on the function value. Finally, the new algorithm is applied to the problem of estimating the PDE parameters of the pulmonary circulation model previously discussed.
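Illustrative note (not part of the thesis abstract): the point about expected improvement ignoring the variance of the improvement can be made concrete with a small sketch. It assumes a minimisation problem and a Gaussian predictive distribution with mean `mu` and standard deviation `sigma` at a candidate point, and uses the standard closed forms for the first two moments of the improvement I = max(f_min - Y, 0). The function name `improvement_moments` is hypothetical, and the sketch does not reproduce the thesis's definition of ScaledEI.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def improvement_moments(mu: float, sigma: float, f_min: float) -> tuple[float, float]:
    """Return (expected improvement, variance of improvement) for minimisation.

    Assumes the emulator's prediction at the query point is N(mu, sigma^2)
    and that f_min is the incumbent (best observed) minimum.
    """
    if sigma <= 0.0:
        # Degenerate prediction: the improvement is deterministic.
        return max(f_min - mu, 0.0), 0.0
    z = (f_min - mu) / sigma
    ei = sigma * (z * normal_cdf(z) + normal_pdf(z))
    second_moment = sigma ** 2 * ((z * z + 1.0) * normal_cdf(z) + z * normal_pdf(z))
    variance = max(second_moment - ei ** 2, 0.0)  # guard against round-off
    return ei, variance


# Compare two candidate points against the same incumbent minimum.
for mu, sigma in [(0.9, 0.1), (1.5, 1.0)]:
    ei, var = improvement_moments(mu, sigma, f_min=1.0)
    print(f"mu={mu}, sigma={sigma}: EI={ei:.4f}, sd(I)={math.sqrt(var):.4f}")
```

Printing both moments for each candidate shows how a point can have a larger expected improvement while that improvement is also far less certain, which is the limitation the abstract describes.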
Published: 2019
Full Text: View/download PDF
