Improving the prediction of blood glucose concentration may improve the quality of life of people living with type 1 diabetes by enabling them to better manage their care. Given the anticipated benefits of such a prediction, numerous methods have been proposed. Rather than attempting to predict glucose concentration, a deep learning framework for prediction is proposed in which prediction is performed using a scale for hypo- and hyper-glycemia risk. Using the blood glucose risk score formula proposed by Kovatchev et al., models with different architectures were trained, including, a recurrent neural network (RNN), a gated recurrent unit (GRU), a long short-term memory (LSTM) network, and an encoder-like convolutional neural network (CNN). The models were trained using the OpenAPS Data Commons data set, comprising 139 individuals, each with tens of thousands of continuous glucose monitor (CGM) data points. The training set was composed of 7% of the data set, while the remaining was used for testing. Performance comparisons between the different architectures are presented and discussed. To evaluate these predictions, performance results are compared with the last measurement (LM) prediction, through a sample-and-hold approach continuing the last known measurement forward. The results obtained are competitive when compared to other deep learning methods. A root mean squared error (RMSE) of 16 mg/dL, 24 mg/dL, and 37 mg/dL were obtained for CNN prediction horizons of 15, 30, and 60 min, respectively. However, no significant improvements were found for the deep learning models compared to LM prediction. Performance was found to be highly dependent on architecture and the prediction horizon. Lastly, a metric to assess model performance by weighing each prediction point error with the corresponding blood glucose risk score is proposed. Two main conclusions are drawn. Firstly, going forward, there is a need to benchmark model performance using LM prediction to enable the comparison between results obtained from different data sets. Secondly, model-agnostic data-driven deep learning models may only be meaningful when combined with mechanistic physiological models; here, it is argued that neural ordinary differential equations may combine the best of both approaches. These findings are based on the OpenAPS Data Commons data set and are to be validated in other independent data sets. [ABSTRACT FROM AUTHOR]