Benchmarking Categorical Encoders – Towards Data Science

Denis Vorotyntsev
Example train dataset on the left and test dataset on the right

Label Encoder (LE) or Ordinal Encoder (OE)

Category representation — Label Encoding
df["category_representation"] = df["category_representation"].astype("category")
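The line above casts the column to pandas' categorical dtype. As a minimal sketch of how integer labels could then be read off, the example below uses `.cat.codes`; the toy data and the source column name `category` are assumptions made for illustration only.

```python
import pandas as pd

# Toy data; the "category" column name is hypothetical.
df = pd.DataFrame({"category": ["A", "B", "C", "A", "B"]})

# Cast to the categorical dtype, then take the integer code of each level.
# Levels are sorted, so "A" -> 0, "B" -> 1, "C" -> 2.
as_category = df["category"].astype("category")
df["category_representation"] = as_category.cat.codes

label_codes = df["category_representation"].tolist()
```

The codes carry an arbitrary order (here alphabetical), which matters for models that treat the feature as numeric.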

One-Hot-Encoder (OHE) (dummy encoding)
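One-hot encoding replaces a categorical column with one binary indicator column per level. A minimal sketch with `pandas.get_dummies` follows; the toy data and column names are assumptions, not the benchmark's data.

```python
import pandas as pd

# Toy data; the column name is hypothetical.
df = pd.DataFrame({"category": ["A", "B", "C", "A"]})

# One 0/1 column per level, named "<prefix>_<level>".
one_hot = pd.get_dummies(df["category"], prefix="category", dtype=int)

ohe_columns = list(one_hot.columns)
ohe_rows = one_hot.values.tolist()
```

The number of new columns equals the number of distinct levels, so high-cardinality features can widen the dataset considerably.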

Sum Encoder (Deviation Encoding or Effect Encoding)

Category representation — Sum Encoding
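Sum (deviation) coding is commonly described as one-hot encoding with one level dropped, where rows of the dropped level are set to -1 in every remaining column. The sketch below follows that description; taking the last level in sorted order as the reference is an assumption, and library implementations may order columns differently or add an intercept column.

```python
import pandas as pd

# Toy data; the column name is hypothetical.
df = pd.DataFrame({"category": ["A", "B", "C", "A"]})

levels = sorted(df["category"].unique())
reference = levels[-1]  # assumed reference level: last in sorted order

# Indicator columns for every level except the reference one.
sum_coded = pd.get_dummies(df["category"], dtype=int)[levels[:-1]].copy()

# Rows of the reference level get -1 in every column.
sum_coded.loc[df["category"] == reference, :] = -1

sum_rows = sum_coded.values.tolist()
```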

Helmert Encoder

Frequency Encoder

Category representation — Frequency Encoding
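A minimal sketch of frequency encoding, assuming each category is replaced by its share of the training rows and that categories unseen in training map to 0. Both choices, along with the toy data, are assumptions for illustration.

```python
import pandas as pd

# Toy train/test data; column names are hypothetical.
train = pd.DataFrame({"category": ["A", "A", "B", "C", "A"]})
test = pd.DataFrame({"category": ["B", "D"]})

# Relative frequency of each level, computed on the training set only.
freq = train["category"].value_counts(normalize=True)

train["category_representation"] = train["category"].map(freq)
# "D" never appears in train, so it falls back to 0.0 (assumed default).
test["category_representation"] = test["category"].map(freq).fillna(0.0)

train_freq = train["category_representation"].tolist()
test_freq = test["category_representation"].tolist()
```

Levels with equal frequency (here "B" and "C") become indistinguishable after this encoding.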

Target Encoder (TE)

Category representation — Target Encoding
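In its plainest form, target encoding replaces each category with the mean of the target over the training rows in that category. The sketch below shows this unsmoothed variant; the fallback to the global mean for unseen categories and the toy data are assumptions, and library encoders typically add smoothing on top.

```python
import pandas as pd

# Toy binary-classification data; column names are hypothetical.
train = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "target":   [1,   0,   1,   1,   0],
})
test = pd.DataFrame({"category": ["A", "D"]})

prior = train["target"].mean()  # global target mean
means = train.groupby("category")["target"].mean()  # per-category mean

train["category_representation"] = train["category"].map(means)
# Unseen level "D" falls back to the global mean (assumed default).
test["category_representation"] = test["category"].map(means).fillna(prior)

train_te = train["category_representation"].tolist()
test_te = test["category_representation"].tolist()
```

Because each training row's own label contributes to its encoding, this variant leaks target information, which is why validation strategy matters for this family of encoders.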

M-Estimate Encoder

Category representation — M-Estimate Encoder
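The M-estimate encoder is usually written as a smoothed target mean, (n_plus + prior * m) / (n + m), where n_plus is the sum of the target in the category, n the category size, and prior the global target mean. The sketch below assumes that formula with m = 1; the value of m and the toy data are illustrative assumptions.

```python
import pandas as pd

# Toy binary-classification data; column names are hypothetical.
train = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C"],
    "target":   [1,   0,   1,   1,   0],
})

m = 1.0                          # smoothing strength (assumed value)
prior = train["target"].mean()   # global target mean

# Per-category target sum and row count.
stats = train.groupby("category")["target"].agg(["sum", "count"])

# Shrink each category mean toward the prior; larger m shrinks more.
m_estimate = (stats["sum"] + prior * m) / (stats["count"] + m)

train["category_representation"] = train["category"].map(m_estimate)
m_values = m_estimate.to_dict()
```

With m = 0 this reduces to the plain category mean; as m grows, every category approaches the global mean.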

Weight Of Evidence Encoder (WOE)

Category representation — Weight Of Evidence Encoder
a = Distribution of Good Credit Outcomes
b = Distribution of Bad Credit Outcomes
WoE = ln(a / b)
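The formula above can be sketched in pandas as follows, assuming `target == 1` marks a good outcome, that `a` is the share of all good outcomes falling in a category, and that `b` is the corresponding share of bad outcomes. The toy data is illustrative, and no smoothing is applied, so a category with zero good or zero bad rows would produce an infinite value.

```python
import numpy as np
import pandas as pd

# Toy credit data; column names are hypothetical, target 1 = good (assumed).
train = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "target":   [1,   1,   0,   1,   0,   0],
})

grouped = train.groupby("category")["target"]
good = grouped.sum()             # good outcomes per category
bad = grouped.count() - good     # bad outcomes per category

a = good / good.sum()            # distribution of good outcomes
b = bad / bad.sum()              # distribution of bad outcomes
woe = np.log(a / b)              # WoE = ln(a / b)

train["category_representation"] = train["category"].map(woe)
woe_values = woe.to_dict()
```

A positive value means the category holds a larger share of good outcomes than of bad ones; a negative value means the opposite.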

James-Stein Encoder

Category representation — James-Stein Encoder

Leave-one-out Encoder (LOO)

Category representation — Leave-one-out Encoding
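Leave-one-out encoding computes, for each training row, the target mean of its category with that row's own label excluded. A minimal sketch follows; falling back to the global mean for categories with a single row is an assumption, as is the toy data.

```python
import pandas as pd

# Toy binary-classification data; column names are hypothetical.
train = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B"],
    "target":   [1,   0,   1,   1,   0],
})

grouped = train.groupby("category")["target"]
cat_sum = grouped.transform("sum")      # category target sum, per row
cat_count = grouped.transform("count")  # category size, per row

# Exclude the current row's own target from the category mean.
loo = (cat_sum - train["target"]) / (cat_count - 1)

# Single-row categories yield 0/0 = NaN; use the global mean (assumed).
train["category_representation"] = loo.fillna(train["target"].mean())
loo_values = train["category_representation"].tolist()
```

Rows of the same category receive different values depending on their own label, which is the property that separates this encoder from plain target encoding.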

Catboost Encoder

Category representation — CatBoost Encoder
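The CatBoost-style encoder is often described as an ordered target statistic: each row is encoded using only the rows of the same category that precede it, smoothed toward a prior. The sketch below assumes the form (previous_sum + prior * a) / (previous_count + a) with a = 1 and takes the row order of the toy data as the "time" order; all of these are illustrative assumptions rather than the benchmark's exact setup.

```python
import pandas as pd

# Toy binary-classification data; column names are hypothetical.
train = pd.DataFrame({
    "category": ["A", "A", "B", "A", "B"],
    "target":   [1,   0,   1,   1,   0],
})

a = 1.0                          # smoothing weight (assumed value)
prior = train["target"].mean()   # global target mean

grouped = train.groupby("category")["target"]
# Target sum and row count over earlier rows of the same category only.
prev_sum = grouped.cumsum() - train["target"]
prev_count = grouped.cumcount()

train["category_representation"] = (prev_sum + prior * a) / (prev_count + a)
cb_values = train["category_representation"].tolist()
```

The first row of every category has no history and therefore receives the prior itself; shuffling the rows changes the encoding.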

None Validation

Pipeline

"metrics": "AUC", 
"n_estimators": 5000,
"learning_rate": 0.02,
"random_state": 42,
"early_stopping_rounds": 100