Modelling of a health attribute for grocery data with Bradley-Terry model

Authors :: Öster, Sanna
Informaatioteknologian ja viestinnän tiedekunta - Faculty of Information Technology and Communication Sciences
Tampere University
Publication Year :: 2022
Abstract: A healthiness attribute for grocery data is needed for recommending healthier food choices to customers. Offering healthier choices encourages people to eat more healthily and may also put pressure on food suppliers. The method used is statistical modelling, which provides a solution in which all products are compared based on their healthiness. The target of the case study is Finland’s largest grocery chain, S Group, with a 46% market share (Päivittäistavarakauppa ry, 2021). The work is limited to muesli and cereal products, but the solution can also be applied to other food categories. The explanatory variables chosen for the modelling are the nutritional content and the ingredients of the products. Modelling the healthiness of grocery products is a complex and novel problem. There are many theories about food healthiness and about the relations between food and certain diseases. Healthiness is also subjective, since people have different attitudes and diets. As a result, no training data existed in advance, that is, no data labelling which products are healthy and which are not. The training data therefore had to be constructed first. A further main problem was putting the ingredient data into a usable form, since it was stored as a string variable. The training data was constructed with a survey in which respondents were asked to rank food products by healthiness. These rankings were used as winner-loser pairs for modelling. To get the string-format ingredient data into a usable form for modelling, the NLP tool FinBERT was tested. However, the final solution was a binary variable based on the occurrence of predefined ingredients. The actual modelling was done with the Bradley-Terry model, which is a special case of logistic regression analysis. It interprets binary data as winner-loser pairs and produces a probability for each food product pair: how likely the winner is to win over the loser.
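The two modelling steps described above, binary ingredient indicators and the Bradley-Terry win probability, can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the ingredient names, the function names `ingredient_flags` and `win_probability`, and the coefficient vector `beta` are all hypothetical, and the sketch assumes that each product's score is a linear function of its covariates, so that the win probability is the logistic function of the score difference.

```python
import math

# Hypothetical examples only; the abstract does not list the predefined ingredients.
FLAGGED_INGREDIENTS = ("sugar", "palm oil")


def ingredient_flags(ingredient_text):
    """Turn a raw ingredient string into one 0/1 indicator per flagged ingredient."""
    text = ingredient_text.lower()
    return [1.0 if name in text else 0.0 for name in FLAGGED_INGREDIENTS]


def win_probability(winner_x, loser_x, beta):
    """Bradley-Terry probability that the first product beats the second.

    Assumes a linear score beta . x per product, so the probability is the
    logistic function of the difference between the two scores.
    """
    diff = sum(b * (w - l) for b, w, l in zip(beta, winner_x, loser_x))
    return 1.0 / (1.0 + math.exp(-diff))
```

With assumed negative coefficients such as `beta = [-1.0, -0.5]`, a product containing neither flagged ingredient gets a win probability above 0.5 against a product containing both, and swapping the two products gives the complementary probability.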
To find a suitable model, all the variables were first tested separately to check how well each explains healthiness. The nutrient variables, especially energy and fat, gave the best results. Backward stepwise selection was used to discover interaction effects between the variables. The modelling resulted in three different models: one with nutrient variables, one with ingredient variables and one with both. Finally, predictions for all products were made with these three models. The models were evaluated on their predictions for the training data, that is, for the 29 products included in the survey. A prediction for a winner-loser pair was counted as correct if the predicted probability was greater than 0.5. The accuracy was calculated as the proportion of correct predictions. The accuracies ranged from 63.2% to 64.4% across the three models. The weaknesses of the solution are the lack of training data, the large amount of manual work, the overall lack of data and the laboriousness of finding and monitoring the models. The main strength is that the solution satisfies the use case defined at the beginning of the work. In addition, the work provides a way to construct training data for other subjective attributes that may need to be handled in the future.
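The accuracy measure described above can be written out as a short sketch. The function name `pairwise_accuracy` and the example probabilities are illustrative assumptions, not values from the thesis. Each input is taken to be the predicted probability that the surveyed winner beats the surveyed loser.

```python
def pairwise_accuracy(win_probs, threshold=0.5):
    """Share of winner-loser pairs whose predicted win probability exceeds the threshold."""
    if not win_probs:
        raise ValueError("at least one winner-loser pair is required")
    correct = sum(1 for p in win_probs if p > threshold)
    return correct / len(win_probs)


# Made-up probabilities for four pairs; exactly 0.5 does not count as correct.
example_accuracy = pairwise_accuracy([0.8, 0.6, 0.4, 0.5])
```

Because the rule is a strict "more than 0.5", a pair predicted at exactly 0.5 is counted as incorrect in this sketch.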