Developing transferable real-time crash prediction models for highly imbalanced data
- Author
Man, Cheuk Ki
- Subjects
imbalanced data sets, Deep Neural Network (DNN), Generative Adversarial Networks (GANs), Transfer learning, artificial intelligence, Traffic engineering, Safety measures
- Abstract
The advent of Intelligent Transport Systems (ITS) has facilitated a shift towards proactive safety measures, in which a crash is anticipated and prevented before it occurs. Given the capability to identify pre-crash conditions, real-time crash prediction has been widely examined with the availability of disaggregated real-time traffic data. Despite this, real-time crash prediction models have yet to be deployed in practice due to ongoing concerns about their predictability, transferability, and explainability. Existing real-time crash prediction studies treat crash prediction as a classification task: turbulent traffic dynamics immediately before crashes are used to distinguish crash cases from the more stable traffic dynamics of normal conditions (i.e., non-crash cases). Matched case-control sampling is the most common methodology employed for real-time crash prediction models, in which crash cases are predicted against several non-crash cases sampled as controls. Owing to its simplicity, various statistical and machine learning models have applied this methodology to real-time crash prediction and attained satisfactory predictability. Yet matched case-control sampling under-utilises non-crash cases, and further limitations include low ecological validity and sampling bias. To overcome these shortcomings, studies have begun to use the full dataset for crash prediction. Using the full dataset supports a data-driven analysis, in line with what ITS advocates, and machine learning and deep learning models can leverage the power of Big Data to produce more accurate predictions. However, crashes are rare occurrences, so the compiled dataset for real-time crash prediction is naturally imbalanced, with crash cases in the minority. Predicting on an imbalanced dataset gives rise to an undesirable 'high accuracy, low sensitivity' outcome, in which non-crash cases are predicted correctly while few or no crash cases are identified. Multiple studies have attempted to tackle class imbalance through data sampling, most commonly the Synthetic Minority Oversampling Technique (SMOTE), which oversamples crash cases by generating synthetic crashes from their nearest neighbours. Despite its successful application, SMOTE has been criticised for overgeneralisation and a tendency to overfit. Deep generative artificial intelligence (AI) models, such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs), have since emerged and can generate high-quality synthetic data.

This thesis compiles a heavily imbalanced dataset consisting of 257 crash cases and 10 million non-crash cases along the M1 Motorway between Junction 1 and Junction 30 in the United Kingdom for 2017. The dataset is aggregated into 5-minute intervals from minute-level real-time traffic data, and 195 variables are calculated covering flow, speed, occupancy and headway. To ameliorate class imbalance, this thesis adopts the Wasserstein Generative Adversarial Network (WGAN) to generate high-quality synthetic data for oversampling crash cases. Its class-balancing performance is compared with other oversampling methods, such as SMOTE and Adaptive Synthetic sampling (ADASYN), and with class-balancing methods such as cost-sensitive learning and ensemble methods.
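As a concrete illustration of the class-balancing step, the following is a minimal sketch of WGAN-style oversampling of the crash minority class. It is not the thesis implementation: the network sizes, hyperparameters and the placeholder `X_crash` matrix of 195 scaled traffic features are illustrative assumptions, and PyTorch is used purely for brevity.

```python
# Minimal WGAN oversampling sketch (illustrative assumptions, not the thesis code).
import numpy as np
import torch
import torch.nn as nn

N_FEATURES, LATENT_DIM = 195, 32

generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, N_FEATURES), nn.Tanh(),
)
critic = nn.Sequential(  # the critic outputs an unbounded score, not a probability
    nn.Linear(N_FEATURES, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1),
)

opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

# Placeholder for the 257 real crash vectors, scaled to roughly [-1, 1].
X_crash = torch.tensor(np.random.uniform(-1, 1, (257, N_FEATURES)), dtype=torch.float32)

for step in range(2000):
    # Train the critic several times per generator update (standard WGAN schedule).
    for _ in range(5):
        real = X_crash[torch.randint(0, len(X_crash), (64,))]
        fake = generator(torch.randn(64, LATENT_DIM)).detach()
        loss_c = critic(fake).mean() - critic(real).mean()  # approximate negative Wasserstein distance
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():  # weight clipping enforces the Lipschitz constraint
            p.data.clamp_(-0.01, 0.01)
    # Train the generator to raise the critic score of synthetic crashes.
    loss_g = -critic(generator(torch.randn(64, LATENT_DIM))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Generate synthetic crash cases to be appended to the training set.
with torch.no_grad():
    synthetic_crashes = generator(torch.randn(10_000, LATENT_DIM)).numpy()
```

In this sketch the synthetic rows produced at the end would be appended to the training data to reach the desired crash to non-crash ratio (e.g., 1:10 or 1:2).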
A Deep Neural Network (DNN) is leveraged for its ability to learn features thoroughly in a non-linear manner. The model is trained on a sample of 500,000 non-crash cases due to computational limitations. The effects of the oversampling methods (i.e., WGAN, SMOTE, ADASYN) are compared through crash predictions performed under different crash to non-crash ratios, whereas the comparison with the other class-balancing methods is made on the balanced dataset. Class imbalance is alleviated once additional synthetic data are introduced into the training set at a ratio of 1:10 onwards. Among the class-balancing methods, WGAN outperforms the others with superior AUC values. The predictability of the DNN is also compared with Logistic Regression in a balanced data setting. The optimal model is a DNN trained on data oversampled by WGAN, achieving an area under the curve (AUC) of 0.86 and a sensitivity of 0.63 at a 7% false alarm rate under a ratio of 1:2.

Apart from data imbalance, model transferability and explainability are two other major limitations that prohibit the practical deployment of real-time crash prediction models. Transferability refers to a model's ability to generalise from one setting to another temporal, spatial or spatio-temporal setting. Previous studies demonstrated limited transferability when models were tested directly on other settings under the matched case-control sampling approach. In this thesis, five further heavily imbalanced datasets collected along different motorways (i.e., M1, M4 and M6) in 2017 and 2018 are compiled for the transferability assessment. Transfer learning is utilised, sharing the weights trained in the baseline model developed from the M1 2017 dataset and applying them to the five datasets to be transferred. The performance of transfer learning is compared with direct transfer and with standalone models developed from the transferred datasets. The transferability assessment reveals that direct transfer is not viable, whereas transfer learning improves model transferability, with AUC values ranging between 0.69 and 0.95. The best transferred model predicts 89% of crashes accurately at a 7% false alarm rate.

Model explainability is not a challenge when statistical models are employed, as they are inherently interpretable with parameter estimates provided; it becomes a significant concern when neural networks are employed, as these are "black-box" models whose decision functions are not human-interpretable. This thesis addresses model explainability by employing SHapley Additive exPlanations (SHAP) and Counterfactual Explanations (CE) to explain predicted crash cases from the best DNN model developed from the M1 2017 dataset. SHAP is used to explain why a crash is predicted as a crash, and CE is used to demonstrate how the predicted crash could be altered into a non-crash. Explanations from CE are compared with the SHAP summary plot, and consistent results are found: aggregated flow and the standard deviation of flow play an important part in crash prediction. To potentially reduce crash likelihood, the aggregated flow should be reduced and the standard deviation of flow increased. These suggestions are also in line with the literature. The application of SHAP and CE demystifies the "black-box" limitation of AI models so that traffic managers can trust and act upon the model predictions.
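The weight-sharing transfer-learning step described above can be sketched as follows. This is an assumption-laden illustration rather than the thesis code: the layer sizes, optimiser settings, freezing of the first hidden layer, and the placeholder arrays are all illustrative, and Keras is used only for concreteness.

```python
# Minimal transfer-learning sketch (illustrative assumptions, not the thesis code).
import numpy as np
from tensorflow import keras

N_FEATURES = 195

def build_dnn():
    return keras.Sequential([
        keras.Input(shape=(N_FEATURES,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),  # P(crash) for the 5-minute interval
    ])

# 1) Baseline model trained on the (oversampled) M1 2017 data. Arrays are placeholders.
X_m1, y_m1 = np.random.rand(1000, N_FEATURES), np.random.randint(0, 2, 1000)
baseline = build_dnn()
baseline.compile(optimizer="adam", loss="binary_crossentropy",
                 metrics=[keras.metrics.AUC(name="auc")])
baseline.fit(X_m1, y_m1, epochs=5, batch_size=256, verbose=0)

# 2) Transfer: copy the baseline weights, freeze the lower layers,
#    and fine-tune the remaining layers on the target dataset (e.g. M4 2018).
X_target, y_target = np.random.rand(500, N_FEATURES), np.random.randint(0, 2, 500)
transferred = build_dnn()
transferred.set_weights(baseline.get_weights())
for layer in transferred.layers[:-2]:
    layer.trainable = False
transferred.compile(optimizer=keras.optimizers.Adam(1e-4),
                    loss="binary_crossentropy", metrics=[keras.metrics.AUC(name="auc")])
transferred.fit(X_target, y_target, epochs=5, batch_size=256, verbose=0)

# The AUC on held-out target data would then be compared against "direct transfer"
# (the baseline evaluated as-is) and a standalone model trained from scratch.
print(transferred.evaluate(X_target, y_target, verbose=0))
```

Similarly, a minimal, self-contained sketch of the SHAP explanation step is given below. A logistic regression stands in for the DNN purely to keep the example short; KernelExplainer is model-agnostic, so the same calls work with any fitted classifier, and all names and data here are placeholders.

```python
# Minimal SHAP sketch (tooling assumption, not the thesis code).
import numpy as np
import shap
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 195))          # placeholder traffic features
y = rng.integers(0, 2, size=1000)         # placeholder crash / non-crash labels
clf = LogisticRegression(max_iter=1000).fit(X, y)

background = shap.sample(X, 100)          # small background set keeps KernelSHAP tractable
explainer = shap.KernelExplainer(lambda a: clf.predict_proba(a)[:, 1], background)
shap_values = explainer.shap_values(X[:5])  # explain a few intervals flagged as crashes

# The summary plot shows which variables (in the thesis, aggregated flow and its
# standard deviation) push predictions towards "crash".
shap.summary_plot(shap_values, X[:5])
```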
This PhD thesis makes methodological advancements that address the current limitations of real-time crash prediction models in handling class imbalance and improving model transferability and explainability. Hence, this study supports a shift from a purely predictive approach to a pragmatic one, yielding real-time crash prediction models that can be readily deployed in the future.
- Published
2022