Encoding methods for categorical data

A comparative analysis for linear models, decision trees, and support vector machines

Bachelor thesis (2023)

Authors

A. Udilă Electrical Engineering, Mathematics and Computer Science

Contributors

A. Ionescu Web Information Systems - (mentor)

A. Katsifodimos Web Information Systems - (mentor)

E. Isufi (graduation committee member)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:10b91b99-2685-4a45-b44e-48fbbf808ce2

Published Date

28-06-2023

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This paper presents a comprehensive evaluation and comparison of encoding methods for categorical data in the context of machine learning. The study focuses on five popular encoding techniques: one-hot, ordinal, target, catboost, and count encoders. These methods are evaluated using linear models, decision trees, and support vector machines (SVMs).
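To make the five techniques concrete, the following is a minimal sketch in plain Python, not the thesis's implementation. The toy column `colors`, the labels `target`, and all function names are hypothetical. The target encoder here is an unsmoothed per-category mean, and `catboost_style` assumes the commonly described ordered target statistic with a prior equal to the global target mean and a smoothing weight `a`; the encoders evaluated in the thesis may differ in smoothing and other details.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one categorical feature and a binary target.
colors = ["red", "blue", "red", "green", "blue", "red"]
target = [1, 0, 1, 0, 1, 0]


def one_hot(values):
    """One binary indicator column per distinct category (sorted order)."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values]


def ordinal(values):
    """Map each category to an integer index (here: sorted order)."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]


def count(values):
    """Replace each category with how often it occurs in the column."""
    freq = Counter(values)
    return [freq[v] for v in values]


def target_mean(values, y):
    """Replace each category with the mean target of its rows (no smoothing)."""
    sums, counts = defaultdict(float), Counter(values)
    for v, t in zip(values, y):
        sums[v] += t
    return [sums[v] / counts[v] for v in values]


def catboost_style(values, y, a=1.0):
    """Ordered target statistic: each row is encoded using only earlier rows.

    Assumed formula: (sum of earlier targets + prior * a) / (earlier count + a),
    where prior is the global target mean.
    """
    prior = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = []
    for v, t in zip(values, y):
        encoded.append((sums[v] + prior * a) / (counts[v] + a))
        sums[v] += t      # update statistics only after encoding the row
        counts[v] += 1
    return encoded


print(one_hot(colors)[0])
print(ordinal(colors))
print(count(colors))
print(target_mean(colors, target))
print(catboost_style(colors, target))
```

Note that only the one-hot encoder returns several columns per row; the other four return a single number per row, which is relevant to the runtime observations in the abstract.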

The results demonstrate that one-hot encoding consistently achieves the highest accuracy across all evaluated machine learning algorithms. However, it also incurs a higher runtime, especially when feature cardinality is high. Catboost encoding emerges as a promising alternative, striking a balance between accuracy and runtime efficiency. The ordinal, target, and catboost encoders perform similarly, with small variations depending on the specific machine learning algorithm used.
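One plausible contributor to the higher runtime of one-hot encoding at high cardinality is the number of columns it produces. The sketch below only illustrates that width difference and does not reproduce the thesis's runtime measurements; the function `encoded_width` and the example cardinalities are hypothetical.

```python
SINGLE_COLUMN_ENCODERS = {"ordinal", "target", "catboost", "count"}


def encoded_width(cardinality, encoder):
    """Number of numeric columns that one categorical feature turns into."""
    if encoder == "one-hot":
        return cardinality  # one indicator column per distinct category
    if encoder in SINGLE_COLUMN_ENCODERS:
        return 1            # a single numeric value per row
    raise ValueError(f"unknown encoder: {encoder}")


# Hypothetical cardinalities, chosen only to show how the width scales.
for k in (3, 100, 10_000):
    widths = {name: encoded_width(k, name) for name in ("one-hot", "ordinal", "catboost")}
    print(k, widths)
```

A model trained on the one-hot output has to process as many columns as there are categories, whereas the single-column encoders keep the feature width constant regardless of cardinality.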

Based on the findings, practitioners are advised to select one-hot encoding when accuracy is of utmost importance and computational resources are sufficient. For scenarios where runtime efficiency is critical, the catboost encoder offers competitive accuracy while minimizing training time. The ordinal encoder can be a suitable alternative when dealing with high feature cardinality.

Files

Encoding_Methods_for_Categoric... (pdf)

(pdf | 0.268 Mb)