Most tabular datasets contain categorical features. The simplest way to work with them is to encode them with a Label Encoder. It is simple, yet sometimes inaccurate.

In this post, I would like to show better approaches that can be used "out of the box" (thanks to the Category Encoders Python library). I'm going to start by describing the different strategies for encoding categorical variables. Then I will show you how those can be improved through Single and Double Validation. The final part of the post discusses the benchmark results (which can also be found in my GitHub repo, CategoricalEncodingBenchmark).

- There is no free lunch. You have to try multiple types of encoders to find the best one for your data;
- However, the most stable and accurate encoders are target-based encoders with Double Validation: Catboost Encoder, James-Stein Encoder, and Target Encoder;
- Calling *encoder.fit_transform()* on the whole train set is a road to nowhere: Single Validation turned out to be a much better option than the commonly used None Validation. If you want to achieve a stable high score, Double Validation is your choice, but bear in mind that it takes much more time to train;
- Regularization is a must for target-based encoders.

If you are looking for a better understanding of categorical encoding, I recommend you grab a pen and some paper and make your own calculations with the formulas I provide below. It won't take much time, but it is really helpful. In the formulas, I'm going to use the following parameters:

- *y* and *y+*: the total number of observations and the total number of positive observations (*y*=1);
- *xi*, *yi*: the *i*-th value of the category and the target;
- *n* and *n+*: the number of observations and the number of positive observations (*y*=1) for a given value of a categorical column;
- *a*: a regularization hyperparameter (selected by the user);
- *prior*: the average value of the target.

The example train dataset looks like this:

- *y* = 10, *y+* = 5;
- *xi* = "D", *yi* = 1 for the 9th line of the dataset (the last observation);
- For category *B*: *n* = 3, *n+* = 1;
- *prior* = *y+*/*y* = 5/10 = 0.5.
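Since the original example table is an image, here is a small pandas sketch with a hypothetical train set that reproduces the same numbers (the exact rows are my assumption):

```python
import pandas as pd

# A hypothetical train set matching the numbers above: 10 rows, 5 positives,
# category "B" appearing 3 times with 1 positive, and the last observation
# being x="D", y=1. (The original table was an image, so the exact rows
# here are an assumption.)
train = pd.DataFrame({
    "x": ["A", "A", "B", "B", "B", "C", "C", "C", "D", "D"],
    "y": [ 1,   0,   1,   0,   0,   1,   0,   0,   1,   1 ],
})

y_total = len(train)        # y  = 10
y_pos = train["y"].sum()    # y+ = 5
prior = y_pos / y_total     # prior = 0.5

b_rows = train[train["x"] == "B"]
n = len(b_rows)             # n  = 3 for category "B"
n_pos = b_rows["y"].sum()   # n+ = 1 for category "B"
```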

With this in mind, let's start with the simple encoders and gradually increase their complexity.

## Label Encoder (LE) or Ordinal Encoder (OE)

The most common way to deal with categories is to simply map each category to a number. With such a transformation, a model would treat the categories as ordered integers, which in most cases is wrong. This transformation should not be used "as is" for several types of models (Linear Models, KNN, Neural Nets, etc.). With gradient boosting it can be used only if the column type is specified as *"category"*:

`df["category_representation"] = df["category_representation"].astype("category")`

New categories in Label Encoder are replaced with "-1" or None. If you are working with tabular data and your model is gradient boosting (especially the LightGBM library), LE is the simplest and most memory-efficient way to work with categories (the category dtype in Python consumes much less memory than the object dtype).
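Here is a minimal pandas sketch of Label Encoding via the *category* dtype, including how an unseen category gets the code -1 at test time (the column and category names are made up):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red"]})

# Label/Ordinal encoding via pandas: each category is mapped to an
# integer code (alphabetical by default: blue=0, green=1, red=2).
df["color"] = df["color"].astype("category")
df["color_le"] = df["color"].cat.codes

# An unseen category ("purple") gets the code -1, mirroring how LE
# treats new values at prediction time.
test = pd.Series(["red", "purple"]).astype(
    pd.CategoricalDtype(categories=df["color"].cat.categories)
)
```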

## One-Hot-Encoder (OHE) (dummy encoding)

The One Hot Encoding is another simple way to work with categorical columns. It takes a categorical column that has been Label Encoded and then splits the column into multiple columns. The numbers are replaced by 1s and 0s depending on which column has what value.

OHE expands the size of your dataset, which makes it a memory-inefficient encoder. There are several strategies to overcome the memory problem with OHE, one of which is working with a sparse rather than dense data representation.

## Sum Encoder (Deviation Encoding or Effect Encoding)

Sum Encoder compares the mean of the dependent variable (target) for a given level of a categorical column to the overall mean of the target. Sum Encoding is very similar to OHE and both of them are commonly used in Linear Regression (LR) types of models.

However, the difference between them is the interpretation of the LR coefficients: in an OHE model, the intercept represents the mean of the baseline condition and the coefficients represent simple effects (the difference between a particular condition and the baseline), whereas in a Sum Encoder model the intercept represents the grand mean (across all conditions) and the coefficients can be interpreted directly as the main effects.
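To make the contrast concrete, here is sum (deviation) coding written out by hand for three levels (the library equivalent is *SumEncoder* in Category Encoders). Note that every contrast column sums to zero across levels, which is exactly why the intercept in a linear model becomes the grand mean:

```python
import pandas as pd

# Sum (deviation) coding for levels A, B, C: the last level is the
# reference and gets -1 in every contrast column.
contrasts = {
    "A": [1, 0],
    "B": [0, 1],
    "C": [-1, -1],
}

x = ["A", "B", "C", "B"]
encoded = pd.DataFrame([contrasts[v] for v in x], columns=["x_A", "x_B"])
```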

## Helmert Encoder

Helmert coding is a third commonly used type of categorical encoding for regression along with OHE and Sum Encoding. It compares each level of a categorical variable to the mean of the subsequent levels. Hence, the first contrast compares the mean of the dependent variable for “A” with the mean of all of the subsequent levels of categorical column (“B”, “C”, “D”), the second contrast compares the mean of the dependent variable for “B” with the mean of all of the subsequent levels (“C”, “D”), and the third contrast compares the mean of the dependent variable for “C” with the mean of all of the subsequent levels (in our case only one level — “D”).

This type of encoding can be useful in certain situations where levels of the categorical variable are ordered, say, from lowest to highest, or from smallest to largest.
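Here is one common parameterization of the Helmert contrast matrix for our four levels, written out by hand (Category Encoders' *HelmertEncoder* uses a slightly different scaling, so treat this as a sketch of the idea rather than the library's exact output):

```python
import pandas as pd

# Helmert contrasts for ordered levels A < B < C < D: each column
# compares one level to the mean of the levels after it.
helmert = pd.DataFrame(
    {
        "c1": [3/4, -1/4, -1/4, -1/4],  # A vs mean(B, C, D)
        "c2": [0,    2/3, -1/3, -1/3],  # B vs mean(C, D)
        "c3": [0,    0,    1/2, -1/2],  # C vs D
    },
    index=["A", "B", "C", "D"],
)

x = pd.Series(["B", "D", "A"])
encoded = helmert.loc[x].reset_index(drop=True)
```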

## Frequency Encoder

Frequency Encoding counts the number of a category's occurrences in the dataset. New categories in the test dataset are encoded with either "1" or with the category's count in the test dataset, which makes this encoder a little tricky: the encoding may differ for different sizes of the test batch. You should think about this beforehand and make the preprocessing of the train set as close to that of the test set as possible.

To avoid this problem, you might also consider using a variation of the Frequency Encoder: the Rolling Frequency Encoder (RFE). RFE counts the number of a category's occurrences during the last *dt* timesteps before a given observation (for example, *dt* = 24 hours).

Nevertheless, Frequency Encoding and RFE are especially efficient when your categorical column has a "long tail", i.e. a few frequent values while the remaining ones have only a few examples in the dataset. In such a case, Frequency Encoding would catch the similarity between rare categories.
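A minimal pandas sketch of Frequency Encoding, using the "encode unseen test categories with 1" convention mentioned above:

```python
import pandas as pd

train = pd.DataFrame({"x": ["A", "A", "A", "B", "B", "C"]})
test = pd.DataFrame({"x": ["A", "C", "D"]})

# Map each category to its count in the train set; the unseen test
# category "D" falls back to 1.
counts = train["x"].value_counts()
train["x_freq"] = train["x"].map(counts)
test["x_freq"] = test["x"].map(counts).fillna(1).astype(int)
```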

## Target Encoder (TE)

Target Encoding has probably become the most popular encoding type because of Kaggle competitions. It uses information about the target to encode categories, which makes it extremely powerful. With the parameters defined above, the smoothed encoding for a category value is calculated as *TE = (n+ + prior · a) / (n + a)*: rare categories (small *n*) are pulled toward the *prior*, while frequent categories stay close to their observed mean *n+*/*n*.
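A small sketch of smoothed target encoding under the additive-smoothing formula *(n+ + prior · a) / (n + a)*, built from the parameters defined earlier (the function name and the toy data are mine):

```python
import pandas as pd

def target_encode(train, col, target, a=1.0):
    """Smoothed target encoding: (n+ + prior * a) / (n + a) per category.

    `a` is the regularization strength: larger values pull rare
    categories harder toward the prior (the global target mean).
    """
    prior = train[target].mean()
    stats = train.groupby(col)[target].agg(["sum", "count"])
    encoding = (stats["sum"] + prior * a) / (stats["count"] + a)
    return train[col].map(encoding)

train = pd.DataFrame({
    "x": ["A", "A", "B", "B", "B", "C", "C", "C", "D", "D"],
    "y": [ 1,   0,   1,   0,   0,   1,   0,   0,   1,   1 ],
})
# For "B": n=3, n+=1, prior=0.5, a=1 -> (1 + 0.5) / (3 + 1) = 0.375
train["x_te"] = target_encode(train, "x", "y", a=1.0)
```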

## Pipeline

The whole dataset was split into train (the first 60% of the data) and test samples (the remaining 40%). The test part is unseen during training and is used only once, for final scoring (the scoring metric is *ROC AUC*). No encoding on the whole data, no pseudo-labeling, no TTA, etc. were applied during the training or predicting stages. I wanted the experiments to be as close to a production setting as possible.

After the train-test split, the training data was split into 5 folds with shuffle and stratification. 4 of them were used for fitting the encoder (in the case of None Validation, the encoder is fitted to the whole train dataset before splitting) and the LightGBM model (LightGBM is a gradient boosting library from Microsoft), and the remaining fold was used for early stopping. The process was repeated 5 times, so in the end we had 5 trained encoders and 5 LGB models. The LightGBM model parameters were as follows:

```
"metrics": "AUC",
"n_estimators": 5000,
"learning_rate": 0.02,
"random_state": 42,
"early_stopping_rounds": 100
```
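The fold loop described above can be sketched as follows. I use scikit-learn's OneHotEncoder as a stand-in encoder and stub out the LightGBM fit with a comment, so this is a schematic rather than the actual benchmark code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import OneHotEncoder

def run_folds(X, y, n_splits=5, seed=42):
    """Single Validation scheme: the encoder is re-fitted on the 4
    training folds of every split, never on the whole train set; the
    5th fold is held out for early stopping."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    encoders = []
    for train_idx, valid_idx in skf.split(X, y):
        enc = OneHotEncoder(handle_unknown="ignore")  # stand-in encoder
        X_tr = enc.fit_transform(X[train_idx])
        X_va = enc.transform(X[valid_idx])  # fold used for early stopping
        # model = lgb.LGBMClassifier(**params).fit(
        #     X_tr, y[train_idx], eval_set=[(X_va, y[valid_idx])])
        encoders.append(enc)
    return encoders

X = np.array([["A"], ["B"], ["A"], ["C"], ["B"],
              ["A"], ["C"], ["B"], ["A"], ["C"]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
encoders = run_folds(X, y)  # 5 fitted encoders, one per split
```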

During the predicting stage, the test data was processed by each of the encoders and a prediction was made by each of the models. The predictions were then ranked and summed (the *ROC AUC* metric doesn't care whether predictions are averaged or summed; only the order matters).
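The rank-and-sum step can be sketched with *scipy.stats.rankdata* (toy predictions, two models instead of five):

```python
import numpy as np
from scipy.stats import rankdata

# ROC AUC depends only on the ordering of the scores, so ranking each
# model's predictions before summing neutralizes differences in their
# calibration across folds.
preds = [
    np.array([0.9, 0.1, 0.5]),  # model 1
    np.array([0.8, 0.3, 0.4]),  # model 2
]
final = np.sum([rankdata(p) for p in preds], axis=0)
```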

This section contains the processed results of the experiments. If you’d like to see the raw scores for each dataset, please visit my GitHub repository — CategoricalEncodingBenchmark.

To determine the best encoder, I scaled the *ROC AUC* scores on each dataset (min-max scale) and then averaged the results across datasets for each encoder. The obtained result represents the average performance score of each encoder (higher is better). The encoders' performance scores for each type of validation are shown in Tables 2.1–2.3.

In order to determine the best validation strategy, I compared the top score on each dataset for each type of validation. The score improvements (top score for a dataset and average score for an encoder) are shown in Tables 2.4 and 2.5 below.