Most tabular datasets contain categorical features. The simplest way to work with these is to encode them with Label Encoder. It is simple, yet sometimes not accurate.
In this post, I would like to show better approaches that can be used “out of the box” (thanks to the Category Encoders Python library). I’m going to start by describing different strategies for encoding categorical variables. Then I will show how those can be improved through Single and Double Validation. The final part of the post is devoted to a discussion of the benchmark results (which can also be found in my GitHub repo, CategoricalEncodingBenchmark).
- There is no free lunch. You have to try multiple types of encoders to find the best one for your data;
- However, the most stable and accurate encoders are target-based encoders with Double Validation: Catboost Encoder, James-Stein Encoder, and Target Encoder;
- encoder.fit_transform() on the whole train is a road to nowhere: it turned out that Single Validation is a much better option than commonly used None Validation. If you want to achieve a stable high score, Double Validation is your choice, but bear in mind that it requires much more time to be trained;
- Regularization is a must for target-based encoders.
If you are looking for a better understanding of categorical encoding, I recommend you to grab a pen and some paper and make your own calculations with the formulas I provided below. It wouldn’t take much time, but it is really helpful. In the formulas, I’m going to use the following parameters:
- y and y+ — the total number of observations and the total number of positive observations (y=1);
- xi, yi — the i-th value of category and target;
- n and n+ — the number of observations and the number of positive observations (y=1) for a given value of a categorical column;
- a — a regularization hyperparameter (selected by a user), prior — an average value of the target.
The example train dataset looks like this:
- y=10, y+=5;
- xi=“D”, yi=1 for the 9th line of the dataset (the last observation);
- For category B: n=3, n+=1;
- prior = y+/ y= 5/10 = 0.5.
With this in mind, let’s start from simple ones while gradually increasing the encoder’s complexity.
Label Encoder (LE) or Ordinal Encoder (OE)
The most common way to deal with categories is to simply map each category to a number. After such a transformation, a model treats categories as ordered integers, which in most cases is wrong. This transformation should not be used “as is” for several types of models (Linear Models, KNN, Neural Nets, etc.). With gradient boosting, it should be used only if the type of the column is specified as “category”:
df["category_representation"] = df["category_representation"].astype("category")
New categories in Label Encoder are replaced with “-1” or None. If you are working with tabular data and your model is gradient boosting (especially the LightGBM library), LE is the simplest and most memory-efficient way to work with categories (the category dtype in pandas consumes much less memory than the object dtype).
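The mapping described above can be sketched in a few lines of plain Python. This is only an illustration, not the Category Encoders implementation: the function names are hypothetical, and the toy column is an assumed example with categories A to D.

```python
def fit_ordinal(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping


def transform_ordinal(values, mapping, unknown=-1):
    """Encode values; categories not seen during fit get the `unknown` code."""
    return [mapping.get(v, unknown) for v in values]


# Assumed toy column (hypothetical data, for illustration only).
train = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
mapping = fit_ordinal(train)

# "E" never appeared in train, so it is encoded as -1.
encoded_test = transform_ordinal(["B", "D", "E"], mapping)
```

Note that the integers carry no meaning beyond identity: "D" is not "three times" anything, which is why this encoding is unsafe for models that treat the column as a number.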
One-Hot-Encoder (OHE) (dummy encoding)
The One Hot Encoding is another simple way to work with categorical columns. It takes a categorical column that has been Label Encoded and then splits the column into multiple columns. The numbers are replaced by 1s and 0s depending on which column has what value.
OHE expands the size of your dataset, which makes it a memory-inefficient encoder. There are several strategies to overcome the memory problem with OHE, one of which is working with a sparse rather than dense data representation.
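To make the dense versus sparse difference concrete, here is a minimal sketch in plain Python. The function names are hypothetical and the "sparse" form is simply a list of (row, column) coordinates of the ones; real libraries use more elaborate sparse formats.

```python
def one_hot_dense(values, categories):
    """Return one row per value with a 1 in the column of its category."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        if v in index:  # unseen categories become an all-zero row
            row[index[v]] = 1
        rows.append(row)
    return rows


def one_hot_sparse(values, categories):
    """Store only the (row, column) coordinates of the ones."""
    index = {c: i for i, c in enumerate(categories)}
    return [(r, index[v]) for r, v in enumerate(values) if v in index]


categories = ["A", "B", "C", "D"]
values = ["A", "B", "D"]

dense = one_hot_dense(values, categories)    # 3 rows x 4 columns = 12 cells
sparse = one_hot_sparse(values, categories)  # only 3 coordinates
```

The dense form grows with rows times categories, while the coordinate form grows only with the number of rows, which is where the memory saving comes from.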
Sum Encoder (Deviation Encoding or Effect Encoding)
Sum Encoder compares the mean of the dependent variable (target) for a given level of a categorical column to the overall mean of the target. Sum Encoding is very similar to OHE and both of them are commonly used in Linear Regression (LR) types of models.
However, the difference between them is the interpretation of the LR coefficients. In an OHE model, the intercept represents the mean for the baseline condition and the coefficients represent simple effects (the difference between one particular condition and the baseline). In a Sum Encoder model, the intercept represents the grand mean (across all conditions) and the coefficients can be interpreted directly as the main effects.
Helmert coding is a third commonly used type of categorical encoding for regression along with OHE and Sum Encoding. It compares each level of a categorical variable to the mean of the subsequent levels. Hence, the first contrast compares the mean of the dependent variable for “A” with the mean of all of the subsequent levels of categorical column (“B”, “C”, “D”), the second contrast compares the mean of the dependent variable for “B” with the mean of all of the subsequent levels (“C”, “D”), and the third contrast compares the mean of the dependent variable for “C” with the mean of all of the subsequent levels (in our case only one level — “D”).
This type of encoding can be useful in certain situations where levels of the categorical variable are ordered, say, from lowest to highest, or from smallest to largest.
Frequency Encoding counts the number of a category’s occurrences in the dataset. New categories in the test dataset are encoded with either “1” or the count of the category in the test dataset, which makes this encoder a little bit tricky: the encoding might differ for test batches of different sizes. You should think about it beforehand and make the preprocessing of the train as close to the test as possible.
To avoid this problem, you might also consider using a variation of the Frequency Encoder: the Rolling Frequency Encoder (RFE). RFE counts the number of a category’s occurrences over the last dt timesteps before a given observation (for example, for dt = 24 hours).
Nevertheless, Frequency Encoding and RFE are especially efficient when your categorical column has “long tails”, i.e. several frequent values while the remaining ones have only a few examples in the dataset. In such a case, Frequency Encoding would catch the similarity between rare categories.
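A minimal sketch of plain Frequency Encoding (not the rolling variant), assuming unseen categories are encoded with 1. The function names and the toy column are hypothetical.

```python
from collections import Counter


def fit_frequency(values):
    """Count how many times each category occurs in the train column."""
    return Counter(values)


def transform_frequency(values, counts, unseen=1):
    """Replace each category with its train count; unseen categories get `unseen`."""
    return [counts.get(v, unseen) for v in values]


# Assumed toy column (hypothetical data, for illustration only).
train = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
counts = fit_frequency(train)

# "D" is rare in train and "E" is new: both end up with the same code.
encoded_test = transform_frequency(["A", "D", "E"], counts)
```

This is the "similarity between rare categories" effect: a rare category and a brand-new one receive the same small value, so the model can treat them alike.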
Target Encoder (TE)
Target Encoding has probably become the most popular encoding type because of Kaggle competitions. It takes information about the target to encode categories, which makes it extremely powerful. The encoded category values are calculated according to the following formulas:
Here, mdl is the min data (samples) in leaf, and a is the smoothing parameter, representing the power of regularization. Recommended values for mdl and a are in the range of 1 to 100. New category values and values that appear only once in the train dataset are replaced with the prior.
Target Encoder is a powerful tool, yet it has a huge disadvantage, target leakage: it uses information about the target. Because of the target leakage, the model overfits the training data, which results in unreliable validation and lower test scores. To reduce the effect of target leakage, we may increase regularization (it’s hard to tune those hyperparameters without reliable validation), add random noise to the representation of the category in the train dataset (some sort of augmentation), or use Double Validation.
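The formula itself did not survive in this copy of the post, so the sketch below is an assumption: it uses a sigmoid smoothing weight, s = 1 / (1 + exp(-(n - mdl) / a)), and blends the prior with the category mean as prior * (1 - s) + s * n+/n. The function name and default values are hypothetical; the numbers for category B (n=3, n+=1) and the prior of 0.5 come from the example dataset above.

```python
import math


def target_encode(n, n_pos, prior, mdl=1, a=1.0):
    """Smoothed target mean for one category (assumed sigmoid smoothing).

    n      -- observations of the category in train
    n_pos  -- positive observations of the category in train
    prior  -- global mean of the target
    mdl, a -- min samples in leaf and smoothing strength
    """
    if n <= 1:
        # New categories and single-appearance categories fall back to the prior.
        return prior
    s = 1.0 / (1.0 + math.exp(-(n - mdl) / a))
    return prior * (1.0 - s) + (n_pos / n) * s


prior = 5 / 10
enc_b = target_encode(n=3, n_pos=1, prior=prior)    # shrunk from 1/3 toward 0.5
enc_d = target_encode(n=1, n_pos=1, prior=prior)    # single appearance
enc_new = target_encode(n=0, n_pos=0, prior=prior)  # unseen category
```

Larger values of a flatten the sigmoid, so small categories are pulled harder toward the prior.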
M-Estimate Encoder is a simplified version of Target Encoder. It has only one hyperparameter, m, which represents the power of regularization. A higher value of m results in stronger shrinking. Recommended values for m are in the range of 1 to 100.
In different sources, you may find another formula of M-Estimator. Instead of y+ there is n in the denominator. I found that such representation has similar scores.
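Since the formula is not visible here, the following sketch reconstructs both variants from the description above and should be read as an assumption: one with y+ in the denominator and one with n. The function names are hypothetical; the inputs for category B (n=3, n+=1, y+=5, prior=0.5) come from the example dataset.

```python
def m_estimate_y_pos(n_pos, y_pos, prior, m=1.0):
    """Variant with the total number of positives (y+) in the denominator."""
    return (n_pos + prior * m) / (y_pos + m)


def m_estimate_n(n_pos, n, prior, m=1.0):
    """Variant found in other sources, with the category count (n) in the denominator."""
    return (n_pos + prior * m) / (n + m)


prior = 5 / 10
enc_b_y_pos = m_estimate_y_pos(n_pos=1, y_pos=5, prior=prior)  # (1 + 0.5) / (5 + 1)
enc_b_n = m_estimate_n(n_pos=1, n=3, prior=prior)              # (1 + 0.5) / (3 + 1)
```

In the n-based variant, a large m drives the result toward the prior, which makes the role of m as a shrinkage strength easy to see.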
Weight Of Evidence Encoder (WOE)
Weight Of Evidence is a commonly used target-based encoder in credit scoring. It is a measure of the “strength” of a grouping for separating good and bad risk (default). It is calculated from the basic odds ratio:
a = Distribution of Good Credit Outcomes
b = Distribution of Bad Credit Outcomes
WoE = ln(a / b)
However, if we use the formulas as is, it might lead to target leakage and overfitting. To avoid that, a regularization parameter a is introduced and WoE is calculated in the following way:
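The regularized formula is missing from this copy, so the sketch below assumes additive smoothing: a is added to each class count and 2a to each class total before taking the log ratio. Here a is the regularization parameter, not the "distribution of good outcomes" from the basic formula, and treating y=1 as the numerator class is also an assumption. The function name is hypothetical.

```python
import math


def woe(n, n_pos, total, total_pos, a=1.0):
    """Regularized Weight of Evidence for one category (assumed additive smoothing)."""
    # Share of all positives that fall into this category (smoothed).
    pos_share = (n_pos + a) / (total_pos + 2 * a)
    # Share of all negatives that fall into this category (smoothed).
    neg_share = (n - n_pos + a) / (total - total_pos + 2 * a)
    return math.log(pos_share / neg_share)


# Numbers from the example dataset: y=10, y+=5; category B has n=3, n+=1.
woe_b = woe(n=3, n_pos=1, total=10, total_pos=5)

# A category whose positive rate equals the global rate carries no evidence.
woe_balanced = woe(n=4, n_pos=2, total=10, total_pos=5)
```

Without the a terms, a category with zero positives or zero negatives would produce a log of zero or a division by zero, so the smoothing also keeps the value finite.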
James-Stein Encoder is a target-based encoder. It is inspired by the James–Stein estimator, a technique named after Charles Stein and Willard James, who simplified Stein’s original 1956 method for estimating the mean of Gaussian random vectors. Stein and James proved that a better estimator than the “perfect” (i.e. mean) estimator exists, which seems to be somewhat of a paradox. However, the James-Stein estimator outperforms the sample mean only when there are several unknown population means, not just one.
The idea behind James-Stein Encoder is simple. Estimation of the mean target for category k could be calculated according to the following formula:
The encoding aims to improve the estimate of the category’s mean target (the first term of the sum) by shrinking it towards a more central average (the second term of the sum). The only hyperparameter in the formula is B, the power of shrinking. It can be understood as the power of regularization: larger values of B give more weight to the global mean (underfit), while smaller values of B give more weight to the category mean (overfit).
One way to select B is to tune it like a hyperparameter via cross-validation, but Charles Stein came up with another solution to the problem:
Intuitively, the formula can be read as follows: if we cannot rely on the estimate of the category’s mean target (it has high variance), we should assign a bigger weight to the global mean.
Wait, but how can we trust the estimate of the variance if we cannot rely on the estimate of the mean? Well, we may either say that the variance is the same for all categories and equal to the global variance of y (which might be a good estimate if we don’t have too many unique categorical values; it is called pooled variance or the pooled model) or replace the variances with squared standard errors, which penalize small observation counts (the independent model).
Seems quite fair, but the James-Stein Estimator has a big disadvantage: it is defined only for normally distributed targets (which is never the case in a classification task). To work around that, we can either convert binary targets with a log-odds ratio, as is done in the WoE Encoder (this is the default because it is simple), or use a beta distribution.
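As the formulas are not visible here, this sketch assumes the usual shrinkage form: encoded value = (1 - B) * category mean + B * global mean, with Stein's choice B = Var[y_k] / (Var[y_k] + Var[y]). The function names are hypothetical, and the two variance values in the example are made up purely to show the mechanics; how the variances are estimated (pooled or independent) is left out.

```python
def stein_b(var_category, var_global):
    """Shrinkage weight: the noisier the category estimate, the closer B is to 1."""
    return var_category / (var_category + var_global)


def james_stein(cat_mean, global_mean, b):
    """Blend the category mean with the global mean using weight b."""
    return (1.0 - b) * cat_mean + b * global_mean


# Category B from the example dataset: mean target 1/3, global mean 0.5.
cat_mean, global_mean = 1 / 3, 5 / 10

# Hypothetical variance estimates, for illustration only.
b = stein_b(var_category=0.08, var_global=0.02)
encoded = james_stein(cat_mean, global_mean, b)

no_shrink = james_stein(cat_mean, global_mean, 0.0)    # pure category mean
full_shrink = james_stein(cat_mean, global_mean, 1.0)  # pure global mean
```

The two extreme cases show the underfit and overfit ends of the B scale described above.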
Leave-one-out Encoder (LOO)
Leave-one-out Encoding (LOO or LOOE) is another example of target-based encoders. The name of the method clearly speaks for itself: we calculate mean target of category k for observation j if observation j is removed from the dataset:
While encoding the test dataset, a category is replaced with the mean target of the category k in the train dataset:
One of the problems with LOO, just like with all other target-based encoders, is target leakage. But with LOO this problem gets really dramatic, because we may perfectly classify the training dataset by making a single split. The optimal threshold for category k can be calculated with the following formula:
Another problem with LOO is a shift between values in the train and the test samples. You could observe it from the picture above. Possible values for category “A” in the train sample are 0.67 and 0.33, while in the test one — 0.5. It is a result of the different number of counts in train and test datasets: for category “A” denominator is equal to n for test and n-1 for train dataset. Such a shift may gradually reduce the performance of tree-based models.
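Both problems can be reproduced with a short sketch. The toy dataset below is an assumption chosen to match the summary numbers used in this post (10 rows, 5 positives, category A with n=4 and n+=2); the threshold (n+ - 0.5) / (n - 1) is derived from the two possible train values rather than copied from the missing formula, and falling back to the prior for single-row categories is also an assumption.

```python
from collections import defaultdict


def fit_loo(xs, ys):
    """Collect per-category target sums and counts on the train data."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    return dict(sums), dict(counts)


def loo_train(xs, ys, sums, counts, prior):
    """Category mean with the current row's own target removed."""
    return [
        (sums[x] - y) / (counts[x] - 1) if counts[x] > 1 else prior
        for x, y in zip(xs, ys)
    ]


def loo_test(xs, sums, counts, prior):
    """Plain category mean from train; unseen categories get the prior."""
    return [sums[x] / counts[x] if x in counts else prior for x in xs]


# Assumed toy dataset (hypothetical rows consistent with the post's summary stats).
xs = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
ys = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
prior = sum(ys) / len(ys)

sums, counts = fit_loo(xs, ys)
train_enc = loo_train(xs, ys, sums, counts, prior)
test_enc = loo_test(["A"], sums, counts, prior)

# Leakage: inside category A, one threshold separates the classes perfectly.
threshold_a = (sums["A"] - 0.5) / (counts["A"] - 1)
leak = all((e < threshold_a) == (y == 1)
           for x, y, e in zip(xs, ys, train_enc) if x == "A")
```

For category A the train values are 1/3 (rows with y=1) and 2/3 (rows with y=0), while the test value is 0.5, which is exactly the train/test shift described above.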
Catboost is a recently created target-based categorical encoder. It is intended to overcome the target leakage problems inherent in LOO. In order to do that, the authors of Catboost introduced the idea of “time”: the order of observations in the dataset. The value of the target statistic for each example relies only on the observed history. To calculate the statistic for observation j in the train dataset, we may use only the observations collected before observation j, i.e. i<j:
To prevent overfitting, the process of target encoding for train dataset is repeated several times on shuffled versions of the dataset and results are averaged. Encoded values of the test data are calculated the same way as in LOO Encoder:
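A single-pass sketch of the ordered statistic for the train data, under stated assumptions: the running sum and count are smoothed with the prior as (sum + a * prior) / (count + a), and the averaging over several shuffled permutations is omitted for brevity. The function name and the toy rows are hypothetical.

```python
def ordered_target_statistic(xs, ys, prior, a=1.0):
    """Encode each row using only the rows that came before it."""
    sums, counts = {}, {}
    encoded = []
    for x, y in zip(xs, ys):
        s, n = sums.get(x, 0.0), counts.get(x, 0)
        # History only: the current row's target is not part of s or n yet.
        encoded.append((s + a * prior) / (n + a))
        # Update the history after the row has been encoded.
        sums[x] = s + y
        counts[x] = n + 1
    return encoded


# Assumed toy dataset (hypothetical rows, for illustration only).
xs = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
ys = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
prior = sum(ys) / len(ys)

train_enc = ordered_target_statistic(xs, ys, prior)
```

The first occurrence of every category is encoded with the prior, because its history is empty; this is also why the result depends on row order and why averaging over shuffles helps.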
Catboost “on the fly” Encoding is one of the core advantages of CatBoost — library for gradient boosting, which showed state of the art results on several tabular datasets when it was presented by Yandex.
Model validation is probably the most important aspect of Machine Learning. While working with data that contains categorical variables, we may want to use one of the three types of validation. None Validation is the simplest one, yet least accurate. Double Validation could show great scores, but it is as slow as a turtle. And Single Validation is kind of a cross between the first two methods.
This section is devoted to a detailed discussion of each validation type. For better understanding, I added block diagrams of the pipeline for each type of validation.
The whole dataset was split into the train (the first 60% of the data) and the test samples (the remaining 40% of the data). The test part is unseen during training and is used only once, for the final scoring (the scoring metric is ROC AUC). No encoding on the whole data, no pseudo labeling, no TTA, etc. were applied during the training or predicting stage. I wanted the experiments to be as close to production settings as possible.
After the train-test split, the training data was split into 5 folds with shuffling and stratification. Then 4 of them were used to fit the encoder (in the case of None Validation, the encoder is fitted to the whole train dataset before splitting) and the LightGBM model (LightGBM is a gradient boosting library from Microsoft), and the remaining fold was used for early stopping. The process was repeated 5 times, and in the end we had 5 trained encoders and 5 LGB models. The LightGBM model parameters were as follows:
During the predicting stage, the test data was processed with each of the encoders and a prediction was made with each of the models. Predictions were then ranked and summed (the ROC AUC metric doesn’t care whether predictions are averaged or summed; only the order matters).
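The rank-and-sum step can be sketched as follows. This is a simplified illustration rather than the benchmark code: the function names are hypothetical, ties are broken by position instead of being averaged, and the prediction values are made up.

```python
def rank(values):
    """Return the 0-based rank of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r
    return ranks


def rank_sum(preds_per_model):
    """Rank each model's predictions, then sum the ranks per test row."""
    ranked = [rank(p) for p in preds_per_model]
    return [sum(col) for col in zip(*ranked)]


# Hypothetical predictions of two models for three test rows.
preds = [
    [0.10, 0.90, 0.50],  # model 1: a wide score range
    [0.20, 0.30, 0.25],  # model 2: a narrow score range
]
blended = rank_sum(preds)
```

Ranking first puts models with different score scales on equal footing, so a model with a narrow output range is not drowned out by one with a wide range.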
This section contains the processed results of the experiments. If you’d like to see the raw scores for each dataset, please visit my GitHub repository — CategoricalEncodingBenchmark.
To determine the best encoder, I scaled the ROC AUC scores within each dataset (min-max scaling) and then averaged the results for each encoder. The obtained result represents the average performance score of each encoder (higher is better). The encoders’ performance scores for each type of validation are shown in Tables 2.1–2.3.
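The scoring procedure can be sketched like this. The dataset names, encoder names, and scores below are placeholders invented for the illustration; they are not benchmark results, and the function names are hypothetical.

```python
def min_max(scores):
    """Scale one dataset's scores so the worst encoder is 0 and the best is 1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all encoders tied on this dataset
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def average_performance(scores_by_dataset):
    """Average the scaled scores of each encoder across datasets."""
    scaled = [min_max(s) for s in scores_by_dataset.values()]
    encoders = scaled[0].keys()
    return {e: sum(s[e] for s in scaled) / len(scaled) for e in encoders}


# Placeholder ROC AUC values (made up, for illustration only).
scores_by_dataset = {
    "dataset_1": {"enc_a": 0.70, "enc_b": 0.80, "enc_c": 0.75},
    "dataset_2": {"enc_a": 0.60, "enc_b": 0.90, "enc_c": 0.90},
}
performance = average_performance(scores_by_dataset)
```

Scaling within each dataset first keeps an "easy" dataset with uniformly high ROC AUC from dominating the average.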
In order to determine the best validation strategy, I compared the top score of each dataset for each type of validation. The score improvements (in the top score for a dataset and in the average score for an encoder) are shown in Tables 2.4 and 2.5 below.