Target Encoding (Mean Encoding)

Target Encoding, also known as Mean Encoding, is a powerful technique that replaces each category with the mean of the target variable for that category. This method directly encodes the relationship between the categorical feature and the target, often leading to highly predictive features. However, it comes with a significant risk of data leakage and requires careful implementation to be effective.

Example: For a binary classification task (target = 0 or 1)

$\text{Encoded Value} = \frac{\text{Category Mean} \times \text{Category Count} + \text{Global Mean} \times \text{Smoothing Factor}}{\text{Category Count} + \text{Smoothing Factor}}$

$\text{Global Mean} = \frac{1 + 0 + 1 + 0 + 1 + 0 + 0 + 0}{8} = 0.375$
Assume a smoothing factor of $\lambda = 10$.
Category means:
- Of the 3 people whose favorite color is Blue, only 1 likes the movie. $\therefore \text{Mean of Blue} = \frac{1}{3} = 0.33$
- Of the 4 people whose favorite color is Green, 2 like the movie. $\therefore \text{Mean of Green} = \frac{2}{4} = 0.5$
- The 1 person whose favorite color is Red does not like the movie. $\therefore \text{Mean of Red} = \frac{0}{1} = 0$

Favorite	Height (m)	Likes Movie		Category Mean	Category Count	Favorite Encoded
Blue	1.77	1	→	0.33	3	$\frac{(\frac{1}{3} \times 3) + (0.375 \times 10)}{3 + 10} = 0.36538$
Red	1.32	0	→	0	1	$\frac{(0.00 \times 1) + (0.375 \times 10)}{1 + 10} = 0.34091$
Green	1.81	1	→	0.5	4	$\frac{(0.5 \times 4) + (0.375 \times 10)}{4 + 10} = 0.41071$
Blue	1.56	0	→	0.33	3	0.36538
Green	1.64	1	→	0.5	4	0.41071
Green	1.61	0	→	0.5	4	0.41071
Green	1.73	0	→	0.5	4	0.41071
Blue	1.73	0	→	0.33	3	0.36538
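
The arithmetic in the table can be reproduced with a short pandas sketch. The column name `LikesMovie` and the helper `smoothed_target_encode` are illustrative names, not part of any library. The sketch uses the unrounded category mean (1/3 for Blue), so Blue comes out at about 0.3654. It only reproduces the formula on the full dataset and does nothing to protect against leakage.

```python
import pandas as pd

# The 8-row Favorite / Likes Movie dataset from the table above.
df = pd.DataFrame({
    "Favorite": ["Blue", "Red", "Green", "Blue", "Green", "Green", "Green", "Blue"],
    "LikesMovie": [1, 0, 1, 0, 1, 0, 0, 0],
})

def smoothed_target_encode(frame, col, target, smoothing):
    """Blend each category mean with the global mean, weighted by category count."""
    global_mean = frame[target].mean()
    stats = frame.groupby(col)[target].agg(["mean", "count"])
    encoded = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
        stats["count"] + smoothing
    )
    # Replace every category instance with its smoothed value.
    return frame[col].map(encoded)

df["Favorite_encoded"] = smoothed_target_encode(df, "Favorite", "LikesMovie", smoothing=10)
print(df.round(5))
```

A larger smoothing factor pulls every category closer to the global mean of 0.375; a factor of 0 gives back the raw category means.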

How Target Encoding Works

Group by category: Group the dataset by the unique values of the categorical feature.
Calculate mean: For each category, calculate the mean of the target variable.
Map values: Replace each category instance with its corresponding target mean.

For Classification: The encoding is the probability of the positive class (e.g., mean(target)).
For Regression: The encoding is the average value of the target (e.g., mean(price)).
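
For the regression case, the three steps above might look like the following pandas sketch. The `houses` data, city names, and prices are made up for illustration. This is the naive version, computed on all rows at once, so it carries the leakage risk discussed in this document.

```python
import pandas as pd

# Hypothetical regression data: encode `city` by the mean `price` of that city.
houses = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen", "Bergen"],
    "price": [300.0, 500.0, 200.0, 250.0, 300.0],
})

# Steps 1-2: group by category and compute the target mean per category.
city_means = houses.groupby("city")["price"].mean()

# Step 3: replace each category instance with its target mean.
houses["city_encoded"] = houses["city"].map(city_means)
print(houses)
```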

Why Target Encoding Matters

Captures Target Relationship Directly

By construction, a target-encoded feature rises and falls with each category's average target value. This gives the model a very strong signal, especially for tree-based algorithms.

Handles High Cardinality

Like Count Encoding, it produces a single numerical column regardless of cardinality, making it highly memory-efficient for features with thousands of categories.

Creates a Powerful Predictive Feature

By encoding the historical outcome associated with each category, it often becomes one of the most important features in the model.

When to Use Target Encoding

High-cardinality categorical features
- User IDs, ZIP codes, product categories, etc.
- When One-Hot Encoding is not feasible.
Tree-based models
- Random Forests, XGBoost, LightGBM, and CatBoost excel with target-encoded features. CatBoost has a highly optimized built-in implementation.
When a strong correlation exists between the category and the target.
- E.g., certain cities have a consistently higher conversion rate.

When to Avoid Target Encoding

When interpretability is critical
- The encoded feature's meaning is tied to the target, which can be circular and hard to explain.
If you cannot implement it carefully
- A naive implementation will lead to severe overfitting due to data leakage.
When the relationship between feature and target is unstable
- If the target mean for a category changes drastically over time, the encoding will become stale.
Linear models (with caution)
- The direct encoding of the target can create a "too perfect" feature, leading to overfitting and potentially multicollinearity issues if not regularized.

The Critical Challenge:

Target Encoding's greatest strength — encoding the target directly into the feature — is also its greatest weakness. If done naively, the feature becomes a shortcut the model memorizes instead of a signal it learns from. Most of the failure modes below share the same root cause: the encoding is computed using target values the model should not "see" for those rows.

A running example: predicting customer churn

Throughout this section we use one consistent problem: predict whether a telecom customer will churn (target = 1) or stay (target = 0), using the customer's Plan as a feature. We have 10,000 customers spread across many plans. Every problem below is illustrated on this same dataset.

1. Target Leakage (Data Leakage)

This is the most severe issue. Because the encoding is derived from the target, computing it over rows that later act as validation/test data lets information about the answer "seep" into the feature. The model learns patterns that exist only because it already saw the labels — patterns that will not exist on truly unseen data.

In the churn example:

The plan "Legacy_Gold" appears only once in the data, and that single customer churned (target = 1), so its encoding is 1.0 — a value computed entirely from that customer's own outcome.
During cross-validation the model learns the rule "if Plan_encoded == 1.0 → predict churn" and scores near-perfect accuracy.
In production this rule is meaningless: it was built from one row's answer.

Impact: Overly optimistic offline scores, but poor generalization in production — the model looks brilliant offline and fails when deployed.

2. Bias & Overfitting on Rare Categories

Categories with few samples produce unreliable means, because a small group is easily dominated by individual data points, noise, or coincidence.

In the churn example:

The plan "Rural_Basic" appears only twice, and by chance both customers churned (target = 1), so its encoding becomes exactly 1.0.
The model now treats "Rural_Basic" as a 100% guaranteed churn signal, even though two data points tell us almost nothing about the true churn rate.
When new "Rural_Basic" customers arrive (many of whom stay), predictions are badly miscalibrated — the model memorized noise.

Impact: Overfitting on low-frequency categories and poor generalization.

Root issue: a raw mean gives equal trust to a category seen 2 times and one seen 5,000 times. Rare categories should be pulled toward the global average — this is exactly what smoothing does (see Solutions).
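
To see how much smoothing shrinks a rare category compared with a frequent one, here is a small arithmetic sketch using the smoothing formula from the start of this document. The global churn rate of 0.18, the smoothing factor of 10, and the 5,000-customer plan with a 0.30 churn rate are assumed values chosen for illustration.

```python
def smoothed_mean(category_mean, count, global_mean, smoothing):
    """Weighted blend of the category mean and the global mean."""
    return (category_mean * count + global_mean * smoothing) / (count + smoothing)

GLOBAL_CHURN = 0.18   # assumed overall churn rate
SMOOTHING = 10        # assumed smoothing factor

# "Rural_Basic": 2 customers, both churned (raw mean 1.0).
rare = smoothed_mean(1.0, count=2, global_mean=GLOBAL_CHURN, smoothing=SMOOTHING)

# A hypothetical large plan: 5,000 customers, raw churn rate 0.30.
common = smoothed_mean(0.30, count=5000, global_mean=GLOBAL_CHURN, smoothing=SMOOTHING)

print(f"Rural_Basic (n=2):   raw 1.00 -> smoothed {rare:.3f}")
print(f"large plan (n=5000): raw 0.30 -> smoothed {common:.3f}")
```

The rare plan is pulled most of the way toward the global rate, while the large plan barely moves.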

3. The "Unseen Category" Problem

In production the model will eventually meet categories it never saw during training.

In the churn example:

The company launches a new plan, "Fiber_Max", after the model was trained.
There is no historical target mean for "Fiber_Max", so standard target encoding has nothing to map it to.
The usual fallback is to impute the global target mean (the overall churn rate, e.g. 0.18). This is safe but strips the feature of predictive power for new customers — every new-plan customer is treated as perfectly average.

Impact: New categories silently lose all signal until enough history accumulates.
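
The global-mean fallback can be sketched in pandas as follows. The tiny `train` frame and the plan names other than "Fiber_Max" are hypothetical; the point is only that `map` leaves unseen plans as missing values, which are then filled with the training-set global mean.

```python
import pandas as pd

# Hypothetical training data for the churn example.
train = pd.DataFrame({
    "Plan": ["Basic", "Basic", "Premium", "Premium", "Premium"],
    "Churn": [1, 0, 0, 0, 1],
})
new_customers = pd.DataFrame({"Plan": ["Basic", "Fiber_Max"]})

global_mean = train["Churn"].mean()                  # fallback for unseen plans
plan_means = train.groupby("Plan")["Churn"].mean()   # learned on training data only

# map() yields NaN for plans absent from plan_means; fill those with the global mean.
new_customers["Plan_encoded"] = new_customers["Plan"].map(plan_means).fillna(global_mean)
print(new_customers)
```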

4. Over-Reliance on a Target-Correlated Feature

By construction, the encoded feature is strongly correlated with the target. Models — especially high-capacity ones — may lean heavily on it and under-weight other useful inputs.

In the churn example:

If Plan_encoded closely tracks churn on the training data, the model may effectively predict churn from the plan alone and ignore signals like tenure, monthly charges, or support tickets.
The moment plan-level churn rates shift (a promotion, a price change), the model becomes fragile because it over-relied on one feature.

Impact: Initial accuracy gains, but a brittle model that degrades when the encoded relationship drifts.

5. Reduced Interpretability

Target encoding replaces a human-readable category with a target-derived number, making the feature harder to explain.

In the churn example:

A stakeholder can understand "the plan is Rural_Basic", but not "Plan_encoded = 0.83" — the value only makes sense relative to the target it was built from.

Impact: Harder debugging and harder communication of why the model made a prediction.

Solutions to Prevent Data Leakage

Cross-Validation Based Encoding

This is the most robust method. For each fold in a K-fold cross-validation scheme, the target encoding for the validation part is calculated using only the data from the other K-1 folds.

Process:

Split the training data into K folds.
For each fold i:
a. Use the other K-1 folds to calculate the target means for each category.
b. Apply these means to encode the categorical feature in fold i.
Concatenate the encoded folds to get a complete, leak-free encoding for the entire training set.
For the test set, use the target means calculated from the entire training set.

Advantages and Limitations

Advantages:

✅ Highly predictive: Directly captures the target-feature relationship.
✅ Memory efficient: Creates only one new feature.
✅ Handles high cardinality with ease.
✅ Works well with tree-based models.

Limitations:

⚠️ High risk of overfitting if not implemented correctly.
⚠️ Prone to data leakage.
⚠️ Less interpretable than other methods.
⚠️ Sensitive to rare categories and outliers (mitigated by smoothing).
⚠️ Requires careful validation and implementation (e.g., CV-based).

Python Implementation
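
What follows is a minimal sketch of cross-validation based encoding, not a production implementation. The function name `kfold_target_encode`, its parameters, and the contiguous fold assignment (4 folds of 2 rows) are assumptions made for illustration; in practice the folds would usually be shuffled. It is demonstrated on the 8-row Favorite / Likes Movie dataset used in this document.

```python
import numpy as np
import pandas as pd

def kfold_target_encode(train, col, target, fold_ids, smoothing=0.0):
    """Out-of-fold target encoding: each row is encoded from the other folds only."""
    fold_ids = np.asarray(fold_ids)
    encoded = pd.Series(np.nan, index=train.index, dtype=float)
    for fold in np.unique(fold_ids):
        held_out = fold_ids == fold
        fit_rows = train[~held_out]
        fold_mean = fit_rows[target].mean()  # global mean of the other folds
        stats = fit_rows.groupby(col)[target].agg(["sum", "count"])
        means = (stats["sum"] + smoothing * fold_mean) / (stats["count"] + smoothing)
        # Categories missing from the other folds fall back to their global mean.
        values = train.loc[held_out, col].map(means).fillna(fold_mean)
        encoded.loc[held_out] = values.to_numpy()
    return encoded

df = pd.DataFrame({
    "Favorite": ["Blue", "Red", "Green", "Blue", "Green", "Green", "Green", "Blue"],
    "LikesMovie": [1, 0, 1, 0, 1, 0, 0, 0],
})
fold_ids = np.arange(len(df)) // 2   # assumed split: rows (1,2), (3,4), (5,6), (7,8)
df["Favorite_oof"] = kfold_target_encode(df, "Favorite", "LikesMovie", fold_ids)
print(df.round(2))
```

Passing the fold assignment in explicitly keeps the function deterministic and lets the caller decide between random, stratified, or time-ordered folds.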

Best Practices Summary

Avoid naive target encoding. Computed on the same rows the model trains on, it leaks the target and tends to overfit, especially on rare categories.
Always use a robust method:
- Cross-validation based encoding is the gold standard.
- Smoothing is a simpler but effective alternative.
Compute encodings on the training set only. Apply the learned mappings to the test set.
Handle unseen categories in the test set by filling with the global target mean from the training set.
For time-series data, be careful. Use an expanding window or rolling window to calculate means to avoid leaking future information.
Use a dedicated library like category_encoders for a reliable and tested implementation.

Target Encoding vs Other Encoders

Criterion	Target Encoding	One-Hot Encoding	Count Encoding
Suitable cardinality	High	Low	High
Dimensionality	Low (1 column)	High (K columns)	Low (1 column)
Leakage Risk	Very High	None	Low
Predictive Power	Very High	Moderate	Moderate
Interpretability	Low	High	Moderate
Implementation	Complex	Simple	Simple

K-Fold Target Encoding

K-Fold Target Encoding is a robust method to implement target encoding while avoiding data leakage. It ensures that the encoding for each fold is computed using only the training data from the other folds, thus preventing the model from "seeing" the target values of the validation fold during training. This method is particularly useful in scenarios with high-cardinality categorical features and when using tree-based models.

The Core Idea

Recall the Data Leakage problem: If a row's encoding is computed from a dataset that includes that row's own target, the answer leaks into the feature. K-Fold encoding fixes this with a simple rule:

The golden rule

A row is never used to compute its own encoding. Each row is encoded using target statistics learned from other rows only.

To achieve this, we split the training data into $K$ folds. To encode the rows in one fold (the "hold-out" fold), we compute category means using only the other $K - 1$ folds. We repeat this for every fold, so each row gets an "out-of-fold" encoding that never saw its own label.

Step-by-Step Process

Split the training data into $K$ folds (e.g., $K = 5$ ).
For each fold $i$ (the hold-out fold):
- Compute each category's target mean using only the other $K - 1$ folds.
- Apply those means to encode the rows in fold $i$ .
Concatenate the encoded folds → a complete, leak-free encoding for the entire training set.
For the test set, compute category means from the entire training set (the test rows never contributed to any mean, so there is no leakage).
Unseen categories in a fold or in the test set → fall back to the global target mean.

Worked Example

We want to encode the Favorite color to predict Likes Movie (target = 1 means the person liked the movie). We have 8 rows and use $K = 4$ folds (2 rows per fold).

Row	Favorite	Height (m)	Likes Movie	Fold	Encoded From Rows	Out-of-Fold Global Mean	Category Count	Category Mean	Favorite Encoded (smoothing = 0)
1	Blue	1.77	1	1	3,4,5,6,7,8	$\frac{2}{6} = 0.33$	2	Blue (4,8) → $\frac{0}{2} = 0.00$	$0.00$
2	Red	1.32	0	1	3,4,5,6,7,8	$\frac{2}{6} = 0.33$	0	Red → none	$0.33$ (global fallback)
3	Green	1.81	1	2	1,2,5,6,7,8	$\frac{2}{6} = 0.33$	3	Green (5,6,7) → $\frac{1}{3} = 0.33$	$0.33$
4	Blue	1.56	0	2	1,2,5,6,7,8	$\frac{2}{6} = 0.33$	2	Blue (1,8) → $\frac{1}{2} = 0.50$	$0.50$
5	Green	1.64	1	3	1,2,3,4,7,8	$\frac{2}{6} = 0.33$	2	Green (3,7) → $\frac{1}{2} = 0.50$	$0.50$
6	Green	1.61	0	3	1,2,3,4,7,8	$\frac{2}{6} = 0.33$	2	Green (3,7) → $\frac{1}{2} = 0.50$	$0.50$
7	Green	1.73	0	4	1,2,3,4,5,6	$\frac{3}{6} = 0.5$	3	Green (3,5,6) → $\frac{2}{3} = 0.67$	$0.67$
8	Blue	1.73	0	4	1,2,3,4,5,6	$\frac{3}{6} = 0.5$	2	Blue (1,4) → $\frac{1}{2} = 0.50$	$0.50$

Encoding the test set:

For the test set we no longer need folds — each category is encoded with its mean over the entire training set (the test rows never contributed to any mean, so there is no leakage). Suppose the test set has 3 rows: one Green, one Blue, one Red.

Test Row	Favorite	Training Rows Used	Category Mean $\frac{S}{n}$	Encoded Value
T1	Green	3, 5, 6, 7	$\frac{1 + 1 + 0 + 0}{4} = \frac{2}{4}$	$0.50$
T2	Blue	1, 4, 8	$\frac{1 + 0 + 0}{3} = \frac{1}{3}$	$0.33$
T3	Red	2	$\frac{0}{1}$	$0.00$

Unseen categories

Any test category not present in the training set (e.g., Yellow) has no mean to look up and falls back to the global target mean $\frac{3}{8} = 0.375$ .

Why It Works

No self-leakage: because a row's encoding is always computed from other rows, the model can never "read" its own label through the feature.
Honest validation scores: offline metrics now reflect true generalization, not memorized answers.
Still predictive: the encoding still captures the real category–target relationship (Green ≈ 0.50, Blue ≈ 0.33), just without the leak.

Common pitfalls

Encode inside the CV loop, not before it. If you K-Fold-encode the full dataset once and then run cross-validation on top, you re-introduce leakage across the outer folds.
Combine with smoothing for rare categories — K-Fold prevents self-leakage but a category appearing only a few times per fold can still yield noisy means.
For time-series data, use an expanding/rolling window instead of random K-Fold, so you never encode using future rows.

Ordered Target Encoding

Ordered Target Encoding is the leakage-free strategy popularized by CatBoost. Instead of splitting the data into folds, it imposes an artificial ordering (a random permutation) on the rows and encodes each row using only the rows that come before it in that order. Because a row can never see its own target — nor any "future" target — the encoding is leak-free by construction, yet it still uses every available prior observation.

The Core Idea

Recall the Data Leakage problem: a row must never contribute to its own encoding. K-Fold solves this by hiding a whole fold; Leave-One-Out (LOO) by hiding a single row. Ordered encoding solves it with an artificial notion of time:

The golden rule (ordered version)

A row is encoded using only the rows that appear before it in a random ordering. Row $i$ sees rows $1 \dots i - 1$ of its category, never row $i$ itself and never rows $i + 1 \dots n$ .

This mimics an online / streaming setting: pretend the data arrived one row at a time, and encode each row using only the history available at the moment it "arrived."

Mathematical Formula

For row $i$ (in the chosen order) belonging to category $C$ , let the prior rows of the same category be $\{j \in C : j < i\}$ . The encoding is:

$\hat{x}_{i} = \frac{\left(\sum_{j \in C, j < i} y_{j}\right) + a \cdot p}{n_{C, < i} + a}$

Where:

$\sum_{j \in C, j < i} y_{j}$ — sum of targets of same-category rows seen before row $i$ .
$n_{C, < i}$ — count of same-category rows seen before row $i$ .
$p$ — a prior (usually the global target mean), used when little or no history exists.
$a$ — a smoothing/weight parameter controlling how strongly to trust the prior for early rows.

The very first row of a category has no history ( $n_{C, < i} = 0$ ), so it falls back to the prior $p$ .

Step-by-Step Process

Shuffle the training data into a random order (a permutation). CatBoost actually uses several permutations and averages, to reduce variance.
Walk through the rows in order. For each row $i$ :
- Gather the same-category rows already seen ( $j < i$ ).
- Encode row $i$ with their running mean, blended with the prior $p$ via the formula above.
- Then "reveal" row $i$ 's target so it becomes history for later rows of that category.
For the test set, encode with the full-training-set category mean (test targets are unknown, so there is no leakage).
Unseen categories (or a row with no prior history) → fall back to the global target mean $p$ .

Worked Example

We reuse the Favorite color dataset (target = Likes Movie), taking the rows in their given order (rows 1–8). Prior $p = \frac{3}{8} = 0.375$ (the global mean); for clarity we show the unsmoothed running mean ( $a = 0$ ) and fall back to $p$ only when there is no prior history.

Order	Favorite	Target $(y)$	Same-color rows before it	Running Mean of Priors	Ordered Encoding
1	Blue	1	— (none)	no history	$0.375$ (prior)
2	Red	0	— (none)	no history	$0.375$ (prior)
3	Green	1	— (none)	no history	$0.375$ (prior)
4	Blue	0	Row 1 → $\{1\}$	$\frac{1}{1}$	$1.00$
5	Green	1	Row 3 → $\{1\}$	$\frac{1}{1}$	$1.00$
6	Green	0	Rows 3,5 → $\{1, 1\}$	$\frac{2}{2}$	$1.00$
7	Green	0	Rows 3,5,6 → $\{1, 1, 0\}$	$\frac{2}{3}$	$0.67$
8	Blue	0	Rows 1,4 → $\{1, 0\}$	$\frac{1}{2}$	$0.50$

Notice how the encoding for a category stabilizes as more history accumulates (Green: $0.375 \to 1.00 \to 1.00 \to 0.67$ ). Early rows lean on the prior; later rows reflect the true category rate. A different random order would give different early values — which is exactly why CatBoost averages several permutations.

Encoding the test set: identical to standard target encoding — map each category to its full training-set mean (Green $= \frac{2}{4} = 0.50$ , Blue $= \frac{1}{3} = 0.33$ , Red $= \frac{0}{1} = 0.00$ ); unseen categories → prior $0.375$ .
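
The streaming view of ordered encoding translates directly into a single pass with running sums. This is a sketch of the formula in this section using one fixed order (the given row order), not CatBoost's internal implementation; the function name `ordered_target_encode` is an illustrative choice.

```python
from collections import defaultdict

def ordered_target_encode(categories, targets, prior, a=0.0):
    """Encode each row from same-category rows seen earlier in the given order."""
    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running count of rows per category
    encoded = []
    for cat, y in zip(categories, targets):
        n = counts[cat]
        if n + a == 0:
            encoded.append(prior)  # no history and no smoothing weight
        else:
            encoded.append((sums[cat] + a * prior) / (n + a))
        # Reveal this row's target only after it has been encoded.
        sums[cat] += y
        counts[cat] += 1
    return encoded

favorite = ["Blue", "Red", "Green", "Blue", "Green", "Green", "Green", "Blue"]
likes_movie = [1, 0, 1, 0, 1, 0, 0, 0]
prior = sum(likes_movie) / len(likes_movie)   # global mean, 3/8

encoded = ordered_target_encode(favorite, likes_movie, prior, a=0.0)
print([round(v, 3) for v in encoded])
```

With `a > 0` the first row of each category still receives exactly the prior, and later rows move toward the running category mean as history accumulates.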

Ordered vs K-Fold vs LOO

Aspect	K-Fold	Leave-One-Out	Ordered
What is hidden	A whole fold	A single row	All "future" rows
History used per row	Other $K - 1$ folds	All other same-category rows	Only prior same-category rows
Leakage protection	Strong	Only the row itself is excluded; leak-prone	Strong, streaming-safe
Variance	Low	High	Reduced by averaging permutations
Best known for	General pipelines	Kaggle-style quick features	CatBoost built-in encoding

Why Use It (and When Not To)

✅ Leak-free by design: no row ever sees its own or any future target.
✅ Uses all prior data: unlike K-Fold, it does not "waste" a hold-out fold — every earlier row contributes.
✅ Great for gradient boosting: it is the native scheme in CatBoost and pairs naturally with ordered boosting.
⚠️ Order-dependent: a single permutation gives noisy encodings for early rows; always average multiple permutations.
⚠️ Cold start for early rows: the first few rows of each category rely heavily on the prior, so tune the smoothing $a$ carefully.

Practical tip

You rarely implement ordered encoding by hand — CatBoost does it internally (with multiple random permutations and built-in smoothing). Reach for it when you are already using CatBoost, or when your problem has a natural time order and you want a streaming-safe, leak-free encoding.

Leave One Out Target Encoding

Leave-One-Out (LOO) Target Encoding is the extreme case of K-Fold encoding where $K$ equals the number of rows — each row is its own "fold." To encode a given row, we compute the category mean over every other row that shares the same category, excluding the current row itself. By removing the current observation from its own average, it directly mitigates data leakage and reduces the overfitting that plagues standard target encoding.

The Core Idea

Recall the Data Leakage problem: a row must never contribute to its own encoding. LOO takes this rule to its finest granularity:

The golden rule (leave-one-out version)

A row is never used to compute its own encoding. To encode row $i$ , drop only row $i$ and average the target of all other rows in the same category.

Mathematical Formula

For a row $i$ belonging to category $C$ , the encoded value ${\hat{x}}_{i}$ is:

$\hat{x}_{i} = \frac{\left(\sum_{j \in C} y_{j}\right) - y_{i}}{n_{C} - 1}$

Where:

$\sum_{j \in C} y_{j}$ — sum of the target values for all rows in category $C$ .
$y_{i}$ — the target value of the current row $i$ (subtracted out).
$n_{C}$ — the total count of rows belonging to category $C$ .

Step-by-Step Process

For each category $C$ , compute the sum of targets $S = \sum_{j \in C} y_{j}$ and the count $n = n_{C}$ across the training set.
For every training row $i$ in that category, encode it as $\frac{S - y_{i}}{n - 1}$ (leave its own target out).
For the test set, no leave-out is needed (test targets are unknown), so encode with the ordinary full-category mean $\frac{S}{n}$ from the training set.
Unseen categories (or a training category with $n = 1$ ) → fall back to the global target mean.

Worked Example

We reuse the Favorite color dataset (target = Likes Movie). Focus on the Green color, which has 4 training rows (rows 3, 5, 6, 7):

For Green: sum $\sum_{j \in C} y_{j} = 1 + 1 + 0 + 0 = 2$ , count $n = 4$ .

Applying ${\hat{x}}_{i} = \frac{\sum_{j \in C} y_{j} - y_{i}}{n - 1} = \frac{2 - y_{i}}{3}$ :

Row	Favorite	Target $(y)$	Note	LOO Encoding ${\hat{x}}_{i} = \frac{2 - y_{i}}{3}$	Standard Target Encoding
3	Green	1	Excludes its own target value of 1.	$\frac{2 - 1}{4 - 1} = \frac{1}{3} = 0.33$	$0.50$
5	Green	1	Excludes its own target value of 1.	$\frac{2 - 1}{4 - 1} = \frac{1}{3} = 0.33$	$0.50$
6	Green	0	Excludes its own target value of 0.	$\frac{2 - 0}{4 - 1} = \frac{2}{3} = 0.67$	$0.50$
7	Green	0	Excludes its own target value of 0.	$\frac{2 - 0}{4 - 1} = \frac{2}{3} = 0.67$	$0.50$

Notice that two Green rows can receive different encodings depending on their own target: rows that liked the movie get a lower Green score (because their own 1 was removed), and rows that did not get a higher one. Each row's encoding therefore no longer contains its own label directly, although the direction of the shift still depends on that label.

Encoding the test set: the encoder behaves exactly like standard target encoding, so test rows map to the full, unadjusted mean of their category in the training set. A Green test row uses $S / n = 2 / 4 = 0.50$ (nothing to leave out, since the test target is unknown).
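
The sum/count formulation makes LOO a few vectorized lines in pandas. This sketch runs on the full 8-row dataset, so it also shows the Blue rows and the single Red row, which has no other row to average and falls back to the global mean. The column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Favorite": ["Blue", "Red", "Green", "Blue", "Green", "Green", "Green", "Blue"],
    "LikesMovie": [1, 0, 1, 0, 1, 0, 0, 0],
})
global_mean = df["LikesMovie"].mean()

grouped = df.groupby("Favorite")["LikesMovie"]
cat_sum = grouped.transform("sum")      # S: sum of targets in the row's category
cat_count = grouped.transform("count")  # n: number of rows in the row's category

# (S - y_i) / (n - 1) for each training row.
loo = (cat_sum - df["LikesMovie"]) / (cat_count - 1)

# Categories with a single row have nothing left to average: use the global mean.
df["Favorite_loo"] = loo.where(cat_count > 1, global_mean)
print(df.round(2))
```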

Why Use It (and When Not To)

✅ Maximum data usage: every row except one contributes to each encoding, so almost no data is "wasted."
✅ Fast: the sum/count formula avoids re-fitting per fold.
⚠️ Prone to overfitting: because only one row is removed, the encoding still carries a faint trace of structure that flexible models (e.g., gradient boosting) can latch onto. In practice LOO is almost always paired with additive noise or smoothing to blur this trace.
⚠️ Unstable for rare categories: with $n = 2$ , LOO reduces to "the other single row's target," which is pure noise; categories with $n = 1$ have no other row at all and must fall back to the global mean.

Practical tip

LOO is popular in Kaggle-style pipelines for its simplicity and speed, but it is more leakage-prone than K-Fold. If you use it, add a small amount of Gaussian noise to the training encodings (e.g., multiply each value by $1 + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^{2})$ ) and combine it with smoothing to keep rare-category estimates stable.
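
One way to apply the multiplicative noise is sketched below with NumPy. The helper name `add_multiplicative_noise`, the value `sigma=0.05`, and the fixed seed are assumptions for illustration; a suitable `sigma` depends on the data and is typically tuned. The noise is applied to training encodings only, never to test encodings.

```python
import numpy as np

def add_multiplicative_noise(encodings, sigma, seed=0):
    """Multiply each training encoding by (1 + eps), with eps drawn from Normal(0, sigma)."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma, size=len(encodings))
    return np.asarray(encodings, dtype=float) * (1.0 + noise)

# LOO encodings of the four Green rows from the worked example.
loo_green = np.array([1 / 3, 1 / 3, 2 / 3, 2 / 3])

noisy = add_multiplicative_noise(loo_green, sigma=0.05)
print(noisy.round(3))
```

Because the noise breaks the exact two-valued pattern within a category, the model can no longer read a row's label back from its encoding with certainty.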