On clustering categories of categorical predictors in generalized linear models.

Authors :
Carrizosa, Emilio
Galvis Restrepo, Marcela
Romero Morales, Dolores
Source :
Expert Systems with Applications. Nov2021, Vol. 182, pN.PAG-N.PAG. 1p.
Publication Year :
2021

Abstract

• The paper proposes a method to cluster categorical features in Generalized Linear Models.
• The proposed approach uses a numerical method guided by the learning performance.
• The underlying structure of the categories and their relationship is identified using proximity graphs.
• Complexity is reduced and accuracy results are competitive against the benchmark one-hot encoding of categorical features.

We propose a method to reduce the complexity of Generalized Linear Models in the presence of categorical predictors. The traditional one-hot encoding, where each category is represented by a dummy variable, can be wasteful, difficult to interpret, and prone to overfitting, especially when dealing with high-cardinality categorical predictors. This paper addresses these challenges by finding a reduced representation of the categorical predictors by clustering their categories. This is done through a numerical method which aims to preserve (or even improve) accuracy, while reducing the number of coefficients to be estimated for the categorical predictors. Thanks to its design, we are able to derive a proximity measure between categories of a categorical predictor that can be easily visualized. We illustrate the performance of our approach in real-world classification and count-data datasets, where we see that clustering the categorical predictors reduces complexity substantially without harming accuracy. [ABSTRACT FROM AUTHOR]
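The reduction in coefficients described in the abstract can be illustrated with a minimal sketch. It is not the authors' numerical method: the grouping `cluster_of` is simply assumed here, whereas the paper derives the clustering from learning performance. The data, category labels, and function names are all hypothetical.

```python
def one_hot(values, levels):
    """Encode each value with one dummy column per category level."""
    return [[1 if v == lvl else 0 for lvl in levels] for v in values]


def clustered_one_hot(values, cluster_of, clusters):
    """Encode each value with one dummy column per cluster of categories."""
    return [[1 if cluster_of[v] == c else 0 for c in clusters] for v in values]


# Hypothetical categorical predictor with four categories.
values = ["A", "B", "C", "D", "A", "C"]
levels = sorted(set(values))

# Assumed grouping of categories into two clusters (illustration only).
cluster_of = {"A": "g1", "B": "g1", "C": "g2", "D": "g2"}
clusters = sorted(set(cluster_of.values()))

X_full = one_hot(values, levels)                             # 4 columns
X_reduced = clustered_one_hot(values, cluster_of, clusters)  # 2 columns

print(len(X_full[0]), len(X_reduced[0]))
```

With the assumed grouping, a model fitted on `X_reduced` would estimate two coefficients for this predictor instead of four, since categories in the same cluster share a single coefficient.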

Details

Language :
English
ISSN :
09574174
Volume :
182
Database :
Academic Search Index
Journal :
Expert Systems with Applications
Publication Type :
Academic Journal
Accession number :
152077015
Full Text :
https://doi.org/10.1016/j.eswa.2021.115245