Encoding methods for categorical data

A comparative analysis for linear models, decision trees, and support vector machines

Abstract

This paper presents a comprehensive evaluation and comparison of encoding methods for categorical data in the context of machine learning. The study focuses on five popular encoding techniques: one-hot, ordinal, target, catboost, and count encoders. These methods are evaluated using linear models, decision trees, and support vector machines (SVMs).
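As a rough illustration of what the five encoders compute, the sketch below implements simplified versions in plain Python. It is not the implementation evaluated in the study: the function names, the toy data, and the `prior_weight` parameter are assumptions made for this example, and the CatBoost-style function is only a minimal ordered target statistic in which each row is encoded from the targets of the rows before it.

```python
from collections import Counter, defaultdict

def one_hot(values):
    """One binary column per distinct category (columns in sorted order)."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]

def ordinal(values):
    """Map each category to an integer (sorted order is an arbitrary choice)."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]

def count(values):
    """Replace each category with the number of times it occurs."""
    freq = Counter(values)
    return [freq[v] for v in values]

def target_mean(values, targets):
    """Replace each category with the mean target over all rows in it."""
    sums, counts = defaultdict(float), Counter()
    for v, y in zip(values, targets):
        sums[v] += y
        counts[v] += 1
    return [sums[v] / counts[v] for v in values]

def catboost_style(values, targets, prior_weight=1.0):
    """Ordered target statistic: each row uses only the targets of earlier rows.

    prior_weight is a hypothetical smoothing parameter for this sketch.
    """
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), Counter()
    encoded = []
    for v, y in zip(values, targets):
        encoded.append((sums[v] + prior * prior_weight) / (counts[v] + prior_weight))
        sums[v] += y  # update running statistics only after encoding the row
        counts[v] += 1
    return encoded

# Toy data, invented for illustration only.
colors = ["red", "blue", "red", "green", "blue", "red"]
labels = [1, 0, 1, 0, 1, 0]

print(one_hot(colors)[0])              # columns are blue, green, red
print(ordinal(colors))
print(count(colors))
print(target_mean(colors, labels))
print(catboost_style(colors, labels))
```

The plain target encoder uses every row's label, including the row being encoded, whereas the ordered variant never looks at the current row's label. That difference is the usual motivation given for CatBoost-style encoding as a way to limit target leakage.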

The results demonstrate that one-hot encoding consistently achieves the highest accuracy across all evaluated machine learning algorithms. However, it also incurs a higher runtime, especially when feature cardinality is high. Catboost encoding emerges as a promising alternative, striking a balance between accuracy and runtime efficiency. The ordinal, target, and catboost encoders perform similarly, with small variations depending on the specific machine learning algorithm used.
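One plausible reason for the runtime gap at high cardinality is the width of the encoded data: one-hot encoding produces one column per distinct category, while the other four encoders produce a single numeric column. The sketch below counts dense cells for a synthetic feature. The row count and cardinalities are arbitrary assumptions, it does not reproduce the study's measurements, and sparse representations or model-specific effects would change the real cost.

```python
def one_hot_width(values):
    """Columns produced by one-hot encoding: one per distinct category."""
    return len(set(values))

n_rows = 1_000  # arbitrary size chosen for illustration
for k in (5, 50, 500):
    # Synthetic feature with exactly k distinct categories.
    feature = [f"cat_{i % k}" for i in range(n_rows)]
    cells_one_hot = n_rows * one_hot_width(feature)
    cells_single = n_rows  # ordinal, target, CatBoost-style, or count: one column
    print(f"cardinality={k}: one-hot cells={cells_one_hot}, single-column cells={cells_single}")
```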

Based on these findings, practitioners are advised to choose one-hot encoding when accuracy is the priority and computational resources are sufficient. Where runtime efficiency is critical, the catboost encoder offers competitive accuracy with shorter training time. The ordinal encoder can be a suitable alternative when feature cardinality is high.