Rau, Hsiao-Hsien; Hsu, Chien-Yeh; Lin, Yu-An; Atique, Suleman; Fuad, Anis; Wei, Li-Ming; and Hsu, Ming-Huei
Background Diabetes mellitus is associated with an increased risk of liver cancer, and both diseases are among the most common and important causes of morbidity and mortality in Taiwan. Purpose To use data mining techniques to develop a model that predicts the development of liver cancer within 6 years of a diagnosis of type II diabetes. Methods Data were obtained from the National Health Insurance Research Database (NHIRD) of Taiwan, which covers approximately 22 million people. We selected patients who were newly diagnosed with type II diabetes between 2000 and 2003 and had no prior cancer diagnosis. We then used encrypted personal IDs to link these records with the cancer registry database and determine whether each patient was subsequently diagnosed with liver cancer. In total, we identified 2060 patients and assigned them to a case group (patients diagnosed with liver cancer after diabetes) or a control group (patients with diabetes but no liver cancer). Candidate risk factors were identified from a literature review and physicians' suggestions. A chi-square test was then conducted on each independent variable (potential risk factor) to compare patients with and without liver cancer, and the variables found to be significant were selected as model inputs. We subsequently trained and tested artificial neural network (ANN) and logistic regression (LR) prediction models. The dataset was randomly divided into two groups: a training group of 1442 cases (70% of the dataset), on which the prediction models were developed, and a test group of the remaining 618 cases (30%), which was used for model validation.
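The two procedural steps in the Methods, screening each candidate factor with a chi-square test and randomly splitting the data 70/30, can be sketched as follows. This is only an illustration and not the authors' code: it assumes an uncorrected Pearson chi-square test on a 2x2 table for a binary factor, and the function names `chi_square_2x2` and `split_70_30`, the example counts, and the seed are all hypothetical.

```python
import math
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom).

    Table layout (an assumption for this sketch):
        rows    = risk factor present / absent
        columns = liver cancer case / control
        [[a, b],
         [c, d]]
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def split_70_30(records, seed=0):
    """Shuffle a copy of the records and split it 70% / 30%."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


# Hypothetical counts for one binary factor (not taken from the study).
stat, p_value = chi_square_2x2(30, 10, 10, 30)
keep_factor = p_value < 0.05  # conventional threshold, assumed here

# Patient IDs stand in for the 2060 study records.
train, test = split_70_30(range(2060))
print(len(train), len(test))
```

With 2060 records the split sizes match the 1442 and 618 cases reported above; which records land in each group depends on the seed.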
Results The following 10 variables were used to develop the ANN and LR models: sex, age, alcoholic cirrhosis, nonalcoholic cirrhosis, alcoholic hepatitis, viral hepatitis, other types of chronic hepatitis, alcoholic fatty liver disease, other types of fatty liver disease, and hyperlipidemia. The ANN outperformed LR in terms of sensitivity (0.757), specificity (0.755), and area under the receiver operating characteristic curve (0.873). We then used this optimal prediction model to build a web-based application for liver cancer prediction, which can support physicians during consultations with diabetes patients. Conclusion In the original dataset (n = 2060), 33% of diabetes patients were diagnosed with liver cancer (n = 515). With 70% of the data used for training and the remaining 30% for testing, the sensitivity and specificity of our model were 0.757 and 0.755, respectively; that is, 75.7% of the diabetes patients who went on to receive a liver cancer diagnosis were correctly predicted to do so, and 75.5% of those who did not were correctly predicted to remain free of liver cancer. These results indicate that the model can serve as an effective predictor of liver cancer in diabetes patients. Physicians consulted about the model agreed that it can help them advise patients at risk of liver cancer and may help reduce future cancer treatment costs. [ABSTRACT FROM AUTHOR]
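For readers less familiar with the three reported metrics, the sketch below shows how sensitivity, specificity, and the area under the ROC curve are computed from test-set labels and model scores. It is a minimal illustration on made-up toy data, not the study's evaluation code; the function names, the 0.5 decision threshold, and every value in `y_true` and `scores` are assumptions.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = liver cancer)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: share of true cases flagged; specificity: share of controls cleared.
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy labels and model scores (hypothetical, not from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(sens, spec, auc)
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why an abstract can report an AUC noticeably higher than either of the other two figures.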