1. A novel generative adversarial network for improving crash severity modeling with imbalanced data
- Author
Chen, Junlan, Pu, Ziyuan, Zheng, Nan, Wen, Xiao, Ding, Hongliang, and Guo, Xiucheng
- Abstract
Traffic crash data is often highly imbalanced, with a majority of non-fatal crashes and only a small number of fatal crashes. This imbalance poses a challenge for crash severity modelling, especially for classifying and interpreting fatal crashes with very limited samples. Data resampling techniques, such as under-sampling and over-sampling, are commonly used to rebalance the number of samples across the categories of a dataset. However, most traditional and deep learning-based resampling methods, e.g., the synthetic minority oversampling technique (SMOTE) and Generative Adversarial Networks (GAN), have difficulty handling both the continuous and the discrete risk factors in traffic crash datasets, since they are built upon smooth and continuous functions that are not applicable to discrete variables. Although some resampling methods can handle both continuous and discrete variables, they may suffer from mode collapse associated with sparse discrete risk factors, so that oversampling repetitive and similar samples fails to capture the diversity of the underlying data distribution. To address these issues, the current study proposes a traffic crash data generation method based on the Conditional Tabular GAN (CTGAN) to rebalance crash datasets and thereby improve crash severity classification and interpretation. Experiments are designed to evaluate the contribution of the synthetic data to crash severity classification, the distribution consistency between synthetic and benchmark datasets, and the parameter recovery (i.e., the accuracy of parameter estimation and probability prediction) of various resampling strategies. A 4-year real-world dataset collected in Washington State, U.S., and Monte Carlo simulations are used in these experiments. The results indicate that crash seve
- Published
- 2024