Label Encoding

Label Encoding is a straightforward technique for converting categorical variables into numerical format by assigning each unique category a unique integer value. Unlike One-Hot Encoding, it represents all categories within a single column, making it memory-efficient but potentially problematic for certain algorithms.

Example: $size \in \{Small, Medium, Large\}$ → encoded as $\{0, 1, 2\}$ in a single column.

How Label Encoding Works

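Label Encoding assigns an integer to each unique category. The order depends on the tool: scikit-learn's encoders sort the categories (alphabetical for strings) by default, while pandas.factorize numbers them in order of first appearance. The example below uses order of appearance: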

Original		Encoded
Small	→	0
Medium	→	1
Large	→	2
Small	→	0
Large	→	2

The mapping is learned from the training data and applied consistently to new data. This creates a single numerical column instead of multiple binary columns.
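
The fit-then-apply idea can be sketched in plain Python. This is a minimal illustration, not a library implementation: the helper names fit_label_mapping and apply_label_mapping are hypothetical, and the mapping is built in order of first appearance to match the table above.

```python
def fit_label_mapping(values):
    """Learn a category -> integer mapping in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def apply_label_mapping(values, mapping):
    """Encode values with a previously learned mapping."""
    return [mapping[value] for value in values]


# Learn the mapping once, from the training data only.
train_sizes = ["Small", "Medium", "Large", "Small", "Large"]
mapping = fit_label_mapping(train_sizes)
encoded_train = apply_label_mapping(train_sizes, mapping)

# New data reuses the same mapping instead of learning a new one.
new_sizes = ["Large", "Small"]
encoded_new = apply_label_mapping(new_sizes, mapping)

print(mapping)        # {'Small': 0, 'Medium': 1, 'Large': 2}
print(encoded_train)  # [0, 1, 2, 0, 2]
print(encoded_new)    # [2, 0]
```

Note that this sketch raises a KeyError for a category that was not present when the mapping was learned.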

Why Label Encoding Matters

Memory Efficiency

Label Encoding transforms categorical variables into a single numerical column, regardless of the number of categories. For high-cardinality features (hundreds or thousands of unique values), this is dramatically more efficient than One-Hot Encoding.

Space comparison:

One-Hot Encoding: 10 categories → 10 columns
Label Encoding: 10 categories → 1 column
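
The column counts above can be reproduced with a small sketch. The category names (cat_0 to cat_9) and the row count of 1000 are invented for illustration, and the one-hot matrix is stored densely; sparse storage would narrow the gap considerably.

```python
# Hypothetical feature with 10 categories spread over 1000 rows.
categories = [f"cat_{i}" for i in range(10)]
rows = [categories[i % 10] for i in range(1000)]

index = {category: i for i, category in enumerate(categories)}

# Label encoding: one integer per row.
label_encoded = [[index[value]] for value in rows]

# Dense one-hot encoding: one 0/1 indicator per category per row.
one_hot_encoded = [
    [1 if index[value] == j else 0 for j in range(len(categories))]
    for value in rows
]

label_cols = len(label_encoded[0])
one_hot_cols = len(one_hot_encoded[0])

print(label_cols, one_hot_cols)                  # 1 10
print(len(rows) * label_cols, len(rows) * one_hot_cols)  # 1000 10000
```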

Algorithm Compatibility

Many machine learning algorithms require numerical input. Label Encoding provides this conversion with minimal overhead. It is particularly useful for tree-based models, which split on thresholds rather than treating the integer as a magnitude, and can therefore isolate individual categories with repeated splits.

Simplicity

Label Encoding is conceptually simple and computationally fast, requiring minimal preprocessing overhead.

When to Use Label Encoding

Ideal scenarios:

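Tree-based models: Decision Trees, Random Forests, XGBoost, LightGBM, CatBoost
- These models split on thresholds, so they use the integer codes only to partition the data and do not assume a linear relationship between code and target
- Repeated splits can isolate any category, although an arbitrary ordering may require deeper trees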
Ordinal categorical variables: When categories have a natural, meaningful order
- Education level: "High School" < "Bachelor's" < "Master's" < "PhD"
- Satisfaction rating: "Very Unsatisfied" < "Unsatisfied" < "Neutral" < "Satisfied" < "Very Satisfied"
- Size: "Small" < "Medium" < "Large" < "Extra Large"
High-cardinality features: When One-Hot Encoding would create too many dimensions
- Country codes (200+ categories)
- ZIP codes
- Product IDs
- User IDs (with tree models)
Target variable encoding: For classification tasks, encoding the target variable (y)
- This is the intended use of scikit-learn's LabelEncoder, which is designed for the target (y) rather than for input features
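
For the target-variable case, a minimal sketch with scikit-learn's LabelEncoder might look like this. The spam/ham labels are made-up example data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels for a classification task.
y_train = ["spam", "ham", "spam", "ham", "spam"]

target_encoder = LabelEncoder()
y_train_encoded = target_encoder.fit_transform(y_train)

# LabelEncoder sorts the classes, so "ham" becomes 0 and "spam" becomes 1.
print(target_encoder.classes_)  # ['ham' 'spam']
print(y_train_encoded)          # [1 0 1 0 1]

# Predictions can be mapped back to the original labels.
decoded = target_encoder.inverse_transform([0, 1])
print(decoded)                  # ['ham' 'spam']
```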

When to Avoid Label Encoding

Not recommended for:

Nominal categories with linear models: Linear Regression, Logistic Regression, Linear SVM
- The model will incorrectly interpret the integers as having meaningful order and magnitude
- Example: Encoding "Red"=0, "Blue"=1, "Green"=2 makes the model treat "Green" as twice "Blue", and "Blue" as lying halfway between "Red" and "Green"
Distance-based algorithms (without care): k-NN, K-means clustering
- The arbitrary integer distances don't reflect true categorical relationships
- Example: "Red"=0 and "Blue"=1 would be "closer" than "Red"=0 and "Green"=2, which is meaningless
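Neural Networks (generally):
- Better to use One-Hot Encoding or embeddings (an embedding layer uses the integer only as a lookup index, not as a magnitude)
- Feeding the raw integer codes in as a numeric input can introduce false ordinal relationships in the learned representations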
Nominal categories without natural order:
- City names, department names, product categories
- Use One-Hot Encoding or other methods instead
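
The distance problem described above can be checked directly. The sketch below reuses the "Red"/"Blue"/"Green" codes from the example; the helper names label_distance and one_hot_distance are hypothetical.

```python
import math

label_codes = {"Red": 0, "Blue": 1, "Green": 2}
one_hot_codes = {"Red": (1, 0, 0), "Blue": (0, 1, 0), "Green": (0, 0, 1)}


def label_distance(a, b):
    """Distance between two colors under label encoding."""
    return abs(label_codes[a] - label_codes[b])


def one_hot_distance(a, b):
    """Euclidean distance between two colors under one-hot encoding."""
    return math.dist(one_hot_codes[a], one_hot_codes[b])


# Label encoding: "Green" appears twice as far from "Red" as "Blue" does.
print(label_distance("Red", "Blue"))   # 1
print(label_distance("Red", "Green"))  # 2

# One-hot encoding: every pair of distinct colors is equally far apart.
print(math.isclose(one_hot_distance("Red", "Blue"),
                   one_hot_distance("Red", "Green")))  # True
```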

Advantages and Limitations

Advantages:

Extremely memory efficient: One column regardless of cardinality
Fast and simple to implement
Works well with tree-based models on categorical data
Ideal for ordinal data where order matters
Handles high cardinality gracefully
No dummy variable trap issues
Can accommodate unseen categories when configured to do so (assign a reserved integer or default value); scikit-learn raises an error by default

Limitations:

Introduces artificial ordering for nominal categories
Incompatible with linear models for nominal data (will learn spurious relationships)
Arbitrary integer assignment doesn't reflect semantic similarity
Can mislead distance-based algorithms
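Order dependency: The integers depend on the encoder's ordering rule (alphabetical vs order-of-appearance) and on the data it was fit on, so refitting with a different tool or dataset may produce a different encoding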
Magnitude issues: The algorithm might think category 10 is "10 times more" than category 1
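
The order dependency is easy to demonstrate: the same values receive different integers under the two common ordering rules. This is a plain-Python sketch with made-up data.

```python
colors = ["Red", "Blue", "Green", "Blue"]

# Alphabetical ordering: sort the unique categories, then number them.
alphabetical = {color: i for i, color in enumerate(sorted(set(colors)))}

# Order-of-appearance: dict.fromkeys keeps the first occurrence of each value.
appearance = {color: i for i, color in enumerate(dict.fromkeys(colors))}

print(alphabetical)  # {'Blue': 0, 'Green': 1, 'Red': 2}
print(appearance)    # {'Red': 0, 'Blue': 1, 'Green': 2}
```

Because the two mappings disagree, a model trained on one encoding cannot be fed data encoded with the other.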

Critical Considerations

1. Ordinal vs Nominal Categories

Ordinal categories (natural order exists):
✅ Label Encoding is appropriate

Example: "Low" (0) < "Medium" (1) < "High" (2)

Nominal categories (no natural order):
⚠️ Label Encoding can be problematic with many algorithms

Example: "Red", "Blue", "Green" have no inherent order

2. Train/Test Consistency

Always fit the encoder on training data and apply the same mapping to test/production data:

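# Correct approach
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
encoder.fit(X_train)                          # Learn the mapping from training data only
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)    # Uses same mapping

# Wrong approach - causes data leakage
encoder.fit(X_test)  # Never fit on test data!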

3. Handling Unseen Categories

When test data contains categories not seen during training:

Options:

Assign a special "unknown" value (e.g., -1)
Use the most frequent category
Raise an error (default behavior in scikit-learn)
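
The three options can be sketched in plain Python as follows. The function name encode and the strategy labels are hypothetical, chosen only for this illustration; they are not scikit-learn parameters.

```python
from collections import Counter

train = ["Small", "Large", "Small", "Medium", "Small"]
mapping = {"Small": 0, "Medium": 1, "Large": 2}

# Most frequent training category, used as a fallback in option 2.
most_frequent = Counter(train).most_common(1)[0][0]


def encode(value, strategy="unknown"):
    """Encode one value, applying the chosen strategy to unseen categories."""
    if value in mapping:
        return mapping[value]
    if strategy == "unknown":
        return -1                      # Option 1: reserved "unknown" code
    if strategy == "most_frequent":
        return mapping[most_frequent]  # Option 2: fall back to the mode
    raise ValueError(f"Unseen category: {value!r}")  # Option 3: fail loudly


print(encode("Medium"))                   # 1
print(encode("XLarge"))                   # -1
print(encode("XLarge", "most_frequent"))  # 0
```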

4. Model Selection Impact

The same data encoded with Label Encoding can produce dramatically different results across algorithms:

Algorithm Type	Label Encoding Performance
Tree-based	✅ Excellent
Linear models (nominal data)	❌ Poor
Linear models (ordinal data)	✅ Good
Distance-based	⚠️ Requires careful consideration
Neural networks	⚠️ Better alternatives exist

5. Custom Ordering for Ordinal Data

For ordinal categories, you should specify the order explicitly rather than relying on alphabetical ordering:

# Specify correct order for ordinal data
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large', 'XLarge']])

Python Implementation
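
A minimal end-to-end sketch with scikit-learn's OrdinalEncoder, combining the points above: explicit category order, fitting on training data only, and a reserved code for unseen categories. The data is invented for illustration, and the handle_unknown='use_encoded_value' option assumes scikit-learn 0.24 or later.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column feature; "XLarge" never appears in training.
X_train = np.array([["Small"], ["Large"], ["Medium"], ["Small"]], dtype=object)
X_test = np.array([["Medium"], ["XLarge"]], dtype=object)

encoder = OrdinalEncoder(
    categories=[["Small", "Medium", "Large"]],  # explicit ordinal order
    handle_unknown="use_encoded_value",         # do not raise on unseen values
    unknown_value=-1,                           # code reserved for unseen values
)

# Fit on training data only, then reuse the mapping for test data.
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)

print(X_train_encoded.ravel())  # [0. 2. 1. 0.]
print(X_test_encoded.ravel())   # [ 1. -1.]
```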

Best Practices Summary

✅ DO Use Label Encoding When:

Working with tree-based models (Random Forest, XGBoost, LightGBM)
Features are ordinal (have natural ordering)
Dealing with high-cardinality categorical features
Encoding the target variable in classification tasks
Memory and computational efficiency are critical

❌ DON'T Use Label Encoding When:

Features are nominal AND you're using linear models
You need to preserve categorical distances in distance-based algorithms
Categories have no meaningful order but you need interpretable feature importance
Working with neural networks (use embeddings or OHE instead)

🔑 Key Guidelines:

Always specify the order for ordinal features using OrdinalEncoder(categories=[[...]])
Use pipelines to ensure consistent encoding between train/test
Configure unknown handling for production: handle_unknown='use_encoded_value' together with an unknown_value such as -1
Document your encoding logic — the integer assignments should be reproducible
Consider the algorithm — what works for trees may fail for linear models
Test both approaches — compare Label vs One-Hot encoding with cross-validation
Use OrdinalEncoder over LabelEncoder for features (LabelEncoder is mainly for targets)
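
The pipeline and cross-validation guidelines can be combined in one sketch. The data below is synthetic and deliberately constructed so that the target is not monotonic in the alphabetical city codes; the city names, sample size, and model settings are all assumptions made for illustration, and the scores on real data may look quite different.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic nominal feature: the target is 1 for two of the four cities.
rng = np.random.default_rng(0)
cities = rng.choice(["Paris", "Lima", "Oslo", "Cairo"], size=200)
X = cities.reshape(-1, 1)
y = np.isin(cities, ["Lima", "Paris"]).astype(int)


def ordinal():
    """Integer encoder with a reserved code for categories unseen in a fold."""
    return OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)


# Each pipeline refits its encoder inside every training fold,
# so the validation fold never influences the mapping.
candidates = {
    "ordinal + logistic regression": make_pipeline(ordinal(), LogisticRegression()),
    "one-hot + logistic regression": make_pipeline(
        OneHotEncoder(handle_unknown="ignore"), LogisticRegression()
    ),
    "ordinal + random forest": make_pipeline(
        ordinal(), RandomForestClassifier(n_estimators=50, random_state=0)
    ),
}

scores = {
    name: cross_val_score(pipeline, X, y, cv=5).mean()
    for name, pipeline in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```

Wrapping the encoder and the model in one pipeline keeps the encoding consistent between training and evaluation without any manual bookkeeping.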

Label Encoding vs One-Hot Encoding: Decision Guide

Criterion	Label Encoding	One-Hot Encoding
Cardinality	High (50+)	Low-Medium (<15)
Feature Type	Ordinal or Tree-models	Nominal
Algorithm	Tree-based	Linear models
Memory Usage	Very low	High
Interpretability	Ordinal: Good; Nominal: Poor	Excellent
Risk	False ordering	Dimensionality explosion

Common Mistakes to Avoid

Using LabelEncoder for nominal features with linear models

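# ❌ Wrong: Linear model + nominal feature
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
X['city_encoded'] = encoder.fit_transform(X['city'])
LogisticRegression().fit(X[['city_encoded']], y)  # Will learn spurious relationships!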

Not specifying order for ordinal features

# ❌ Wrong: Alphabetical order for ordinal feature
encoder = OrdinalEncoder()
encoder.fit_transform([['High'], ['Low'], ['Medium']])  # High=0, Low=1, Medium=2: wrong order!

# ✅ Correct: Specify proper order
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])

Fitting encoder on test data

# ❌ Wrong: Data leakage
encoder.fit(X_test)

# ✅ Correct: Fit on train, transform on test
encoder.fit(X_train)
X_test_encoded = encoder.transform(X_test)

Ignoring unseen categories

# ❌ Wrong: Default (handle_unknown='error') raises ValueError on unseen categories at transform time
encoder = OrdinalEncoder()

# ✅ Correct: Handle unknown values
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)