Label Encoding
Label Encoding is a straightforward technique for converting categorical variables into numerical format by assigning each unique category a unique integer value. Unlike One-Hot Encoding, it represents all categories within a single column, making it memory-efficient but potentially problematic for certain algorithms.
How Label Encoding Works
Label Encoding assigns an integer to each unique category, typically in alphabetical order or in order of first appearance. The table below uses order of first appearance:
| Original | Encoded |
|---|---|
| Small | 0 |
| Medium | 1 |
| Large | 2 |
| Small | 0 |
| Large | 2 |
The mapping is learned from the training data and applied consistently to new data. This creates a single numerical column instead of multiple binary columns.
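The fit-then-apply idea can be sketched in plain Python. This is a minimal illustration that assumes order-of-first-appearance numbering; the helper names `fit_label_mapping` and `transform_labels` are hypothetical and not part of any library.

```python
def fit_label_mapping(values):
    """Build a category -> integer mapping in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def transform_labels(values, mapping):
    """Apply a previously learned mapping; unknown categories raise KeyError."""
    return [mapping[value] for value in values]


train_sizes = ["Small", "Medium", "Large", "Small", "Large"]
size_mapping = fit_label_mapping(train_sizes)
encoded_train = transform_labels(train_sizes, size_mapping)

# The mapping learned from the training data is reused for new data
new_sizes = ["Large", "Small"]
encoded_new = transform_labels(new_sizes, size_mapping)

print(size_mapping)   # {'Small': 0, 'Medium': 1, 'Large': 2}
print(encoded_train)  # [0, 1, 2, 0, 2]
print(encoded_new)    # [2, 0]
```

Keeping the mapping as a separate object is what makes the encoding consistent: new data never changes the integers assigned during training.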
Why Label Encoding Matters
Memory Efficiency
Label Encoding transforms categorical variables into a single numerical column, regardless of the number of categories. For high-cardinality features (hundreds or thousands of unique values), this is dramatically more efficient than One-Hot Encoding.
Space comparison:
- One-Hot Encoding: 10 categories → 10 columns
- Label Encoding: 10 categories → 1 column
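The column counts above can be checked with a short sketch, assuming scikit-learn and NumPy are installed. The feature values (`cat_0` to `cat_9`) are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical feature: 10 distinct categories, each repeated 3 times
categories = [f"cat_{i}" for i in range(10)]
X = np.array(categories * 3, dtype=object).reshape(-1, 1)

one_hot = OneHotEncoder().fit_transform(X)   # sparse matrix by default
ordinal = OrdinalEncoder().fit_transform(X)  # dense float array

print(one_hot.shape)  # (30, 10)
print(ordinal.shape)  # (30, 1)
```

Note that scikit-learn's OneHotEncoder returns a sparse matrix by default, which softens the memory cost. The number of columns still grows with the number of categories.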
Algorithm Compatibility
Many machine learning algorithms require numerical input. Label Encoding provides this conversion with minimal overhead. It is particularly useful for tree-based models, which split on thresholds and can isolate individual categories through successive splits instead of treating the integers as magnitudes.
Simplicity
Label Encoding is conceptually simple and computationally fast, requiring minimal preprocessing overhead.
When to Use Label Encoding
Ideal scenarios:
- Tree-based models: Decision Trees, Random Forests, XGBoost, LightGBM, CatBoost
  - These models split on thresholds rather than fitting a coefficient to the integer value, so an arbitrary ordering does far less harm
  - They can isolate individual categories through successive splits without misinterpreting relationships
- Ordinal categorical variables: When categories have a natural, meaningful order
  - Education level: "High School" < "Bachelor's" < "Master's" < "PhD"
  - Satisfaction rating: "Very Unsatisfied" < "Unsatisfied" < "Neutral" < "Satisfied" < "Very Satisfied"
  - Size: "Small" < "Medium" < "Large" < "Extra Large"
- High-cardinality features: When One-Hot Encoding would create too many dimensions
  - Country codes (200+ categories)
  - ZIP codes
  - Product IDs
  - User IDs (with tree models)
- Target variable encoding: For classification tasks, encoding the target variable (y)
  - This is the use case that scikit-learn's LabelEncoder is designed for
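For the target-variable case, a minimal sketch with scikit-learn's LabelEncoder might look as follows. The class names are hypothetical; LabelEncoder numbers classes in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

y_train = ["cat", "dog", "bird", "dog", "cat"]  # hypothetical class labels

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_train)

print(label_encoder.classes_.tolist())  # ['bird', 'cat', 'dog'] (sorted)
print(y_encoded.tolist())               # [1, 2, 0, 2, 1]

# Map predicted integers back to the original class names
decoded = label_encoder.inverse_transform([0, 2])
print(decoded.tolist())                 # ['bird', 'dog']
```

Because the integers only identify classes, their order carries no meaning here, which is why this use is safe with any classifier.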
When to Avoid Label Encoding
Not recommended for:
- Nominal categories with linear models: Linear Regression, Logistic Regression, Linear SVM
  - The model will incorrectly interpret the integers as having meaningful order and magnitude
  - Example: Encoding "Red"=0, "Blue"=1, "Green"=2 makes the model treat "Green" as twice "Blue" and forces the effect of "Blue" to lie exactly halfway between "Red" and "Green"
- Distance-based algorithms (without care): k-NN, K-means clustering
  - The arbitrary integer distances don't reflect true categorical relationships
  - Example: "Red"=0 and "Blue"=1 would be "closer" than "Red"=0 and "Green"=2, which is meaningless
- Neural Networks (generally):
  - Better to use One-Hot Encoding or embeddings
  - Label encoding can introduce false ordinal relationships in the learned representations
- Nominal categories without natural order:
  - City names, department names, product categories
  - Use One-Hot Encoding or other methods instead
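The distance problem can be made concrete with a few lines of plain Python. This sketch reuses the "Red", "Blue", "Green" codes from the examples above and compares them with hand-built one-hot vectors; the `euclidean` helper is defined here for illustration.

```python
import math

colors = ["Red", "Blue", "Green"]

# Arbitrary label encoding, as in the examples above
label_codes = {"Red": 0, "Blue": 1, "Green": 2}

# One-hot vectors: one position per category
one_hot_codes = {
    color: [1.0 if i == j else 0.0 for j in range(len(colors))]
    for i, color in enumerate(colors)
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# With label encoding, the distances depend on the arbitrary integers
label_red_blue = abs(label_codes["Red"] - label_codes["Blue"])
label_red_green = abs(label_codes["Red"] - label_codes["Green"])

# With one-hot encoding, every pair of distinct categories is equally far apart
one_hot_red_blue = euclidean(one_hot_codes["Red"], one_hot_codes["Blue"])
one_hot_red_green = euclidean(one_hot_codes["Red"], one_hot_codes["Green"])

print(label_red_blue, label_red_green)                    # 1 2
print(math.isclose(one_hot_red_blue, one_hot_red_green))  # True
```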
Advantages and Limitations
Advantages:
- Extremely memory efficient: One column regardless of cardinality
- Fast and simple to implement
- Works well with tree-based models on categorical data
- Ideal for ordinal data where order matters
- Handles high cardinality gracefully
- No dummy variable trap issues
- Can encode unseen categories when configured to do so (assign a reserved integer or default value)
Limitations:
- Introduces artificial ordering for nominal categories
- Poorly suited to linear models on nominal data (the model will learn spurious relationships)
- Arbitrary integer assignment doesn't reflect semantic similarity
- Can mislead distance-based algorithms
- Order dependency: The integer given to a category depends on the convention used (alphabetical vs order-of-appearance), so different tools or different training data may produce different encodings
- Magnitude issues: The algorithm might think category 10 is "10 times more" than category 1
Critical Considerations
1. Ordinal vs Nominal Categories
Ordinal categories (natural order exists):
✅ Label Encoding is appropriate
- Example: "Low" (0) < "Medium" (1) < "High" (2)
Nominal categories (no natural order):
⚠️ Label Encoding can be problematic with many algorithms
- Example: "Red", "Blue", "Green" have no inherent order
2. Train/Test Consistency
Always fit the encoder on training data and apply the same mapping to test/production data:
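# Correct approach: learn the mapping from the training data only
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
encoder.fit(X_train)
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)  # Uses same mapping
# Wrong approach - causes data leakage
encoder.fit(X_test)  # Never fit on test data!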
3. Handling Unseen Categories
When test data contains categories not seen during training:
Options:
- Assign a special "unknown" value (e.g., -1)
- Use the most frequent category
- Raise an error (default behavior in scikit-learn)
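The first and third options can be sketched with scikit-learn's OrdinalEncoder, assuming a version that supports `handle_unknown="use_encoded_value"` (0.24 or later). The colour values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

X_train = np.array([["Red"], ["Blue"], ["Green"]], dtype=object)
X_new = np.array([["Blue"], ["Purple"]], dtype=object)  # "Purple" is unseen

# Default: unknown categories raise ValueError at transform time
strict_encoder = OrdinalEncoder().fit(X_train)
try:
    strict_encoder.transform(X_new)
    raised = False
except ValueError:
    raised = True

# Option: map unknown categories to a reserved value
tolerant_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
).fit(X_train)
encoded_new = tolerant_encoder.transform(X_new)

print(raised)                # True
print(encoded_new.tolist())  # [[0.0], [-1.0]]
```

The reserved value must not collide with a real code, which is why -1 is a common choice: the fitted codes start at 0.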
4. Model Selection Impact
The same data encoded with Label Encoding can produce dramatically different results across algorithms:
| Algorithm Type | Label Encoding Performance |
|---|---|
| Tree-based | ✅ Excellent |
| Linear models (nominal data) | ❌ Poor |
| Linear models (ordinal data) | ✅ Good |
| Distance-based | ⚠️ Requires careful consideration |
| Neural networks | ⚠️ Better alternatives exist |
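A small synthetic experiment illustrates the first two rows of the table. It is a sketch under strong assumptions: the data is made up so that only the middle code belongs to the positive class, and accuracy is measured on the training data purely to show what each model can represent.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic nominal feature: only category "B" belongs to class 1
X_raw = np.array([["A"], ["B"], ["C"]] * 30, dtype=object)
y = np.array([0, 1, 0] * 30)

X_encoded = OrdinalEncoder().fit_transform(X_raw)  # A=0, B=1, C=2

# A tree can isolate code 1 with two threshold splits
tree = DecisionTreeClassifier(random_state=0).fit(X_encoded, y)
tree_acc = tree.score(X_encoded, y)

# A linear model can only draw one threshold along the integer axis
linear = LogisticRegression().fit(X_encoded, y)
linear_acc = linear.score(X_encoded, y)

print(f"tree:   {tree_acc:.2f}")
print(f"linear: {linear_acc:.2f}")
```

The tree separates the classes perfectly, while the linear model cannot, because no single cut along 0, 1, 2 isolates the middle value.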
5. Custom Ordering for Ordinal Data
For ordinal categories, you should specify the order explicitly rather than relying on alphabetical ordering:
# Specify correct order for ordinal data
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large', 'XLarge']])
Python Implementation
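What follows is a minimal end-to-end sketch, not a definitive implementation. It assumes scikit-learn and NumPy are available; the toy data, the column layout (size in column 0, city in column 1), and the step names are hypothetical. It combines explicit ordering for the ordinal feature, a reserved value for unseen categories, and a tree-based model inside a single pipeline.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: column 0 = size (ordinal), column 1 = city (nominal)
X_train = np.array(
    [
        ["Small", "Paris"],
        ["Medium", "Berlin"],
        ["Large", "Madrid"],
        ["XLarge", "Paris"],
        ["Small", "Berlin"],
        ["Large", "Madrid"],
    ],
    dtype=object,
)
y_train = np.array([0, 0, 1, 1, 0, 1])

preprocess = ColumnTransformer(
    transformers=[
        # Ordinal feature: state the order explicitly
        (
            "size",
            OrdinalEncoder(categories=[["Small", "Medium", "Large", "XLarge"]]),
            [0],
        ),
        # Nominal feature for a tree model: reserve -1 for unseen categories
        (
            "city",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            [1],
        ),
    ]
)

model = Pipeline(
    steps=[
        ("encode", preprocess),
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)
model.fit(X_train, y_train)

# "Rome" was not seen during training and is encoded as -1
X_new = np.array([["Medium", "Rome"], ["XLarge", "Paris"]], dtype=object)
encoded_new = model.named_steps["encode"].transform(X_new)
predictions = model.predict(X_new)

print(encoded_new.tolist())  # [[1.0, -1.0], [3.0, 2.0]]
print(predictions.shape)     # (2,)
```

Wrapping the encoders in the pipeline means the mapping is fitted once on the training data and reused unchanged at prediction time.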
Best Practices Summary
✅ DO Use Label Encoding When:
- Working with tree-based models (Random Forest, XGBoost, LightGBM)
- Features are ordinal (have natural ordering)
- Dealing with high-cardinality categorical features
- Encoding the target variable in classification tasks
- Memory and computational efficiency are critical
❌ DON'T Use Label Encoding When:
- Features are nominal AND you're using linear models
- You need to preserve categorical distances in distance-based algorithms
- Categories have no meaningful order but you need interpretable feature importance
- Working with neural networks (use embeddings or OHE instead)
🔑 Key Guidelines:
- Always specify the order for ordinal features using OrdinalEncoder(categories=[[...]])
- Use pipelines to ensure consistent encoding between train/test
- Configure unknown handling for production: handle_unknown='use_encoded_value' together with an unknown_value such as -1
- Document your encoding logic: the integer assignments should be reproducible
- Consider the algorithm: what works for trees may fail for linear models
- Test both approaches: compare Label vs One-Hot encoding with cross-validation
- Use OrdinalEncoder over LabelEncoder for features (LabelEncoder is mainly for targets)
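The "test both approaches" guideline could be applied along the following lines. This is a sketch on synthetic data, assuming scikit-learn and NumPy; the city names and the alternating target are invented so that the class does not vary monotonically with the integer code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic nominal feature: the target alternates across the sorted categories
cities = ["Austin", "Boston", "Chicago", "Denver"]
X = np.array([[city] for city in cities] * 25, dtype=object)
y = np.array([1, 0, 1, 0] * 25)

# Each pipeline refits its encoder inside every cross-validation fold
ordinal_model = make_pipeline(OrdinalEncoder(), LogisticRegression())
one_hot_model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression()
)

ordinal_scores = cross_val_score(ordinal_model, X, y, cv=5)
one_hot_scores = cross_val_score(one_hot_model, X, y, cv=5)

print(f"ordinal + linear: {ordinal_scores.mean():.2f}")
print(f"one-hot + linear: {one_hot_scores.mean():.2f}")
```

On real data the gap may be smaller or reversed, especially with tree-based models, so the comparison is worth running rather than assuming.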
Label Encoding vs One-Hot Encoding: Decision Guide
| Criterion | Label Encoding | One-Hot Encoding |
|---|---|---|
| Cardinality | High (50+) | Low-Medium (<15) |
| Feature Type | Ordinal, or nominal with tree-based models | Nominal |
| Algorithm | Tree-based | Linear models |
| Memory Usage | Very low | High |
| Interpretability | Ordinal: Good; Nominal: Poor | Excellent |
| Risk | False ordering | Dimensionality explosion |
Common Mistakes to Avoid
- Using LabelEncoder for nominal features with linear models
    # ❌ Wrong: Linear model + nominal feature
    encoder = LabelEncoder()
    X['city_encoded'] = encoder.fit_transform(X['city'])
    LogisticRegression().fit(X[['city_encoded']], y)  # Will learn spurious relationships!
- Not specifying order for ordinal features
    # ❌ Wrong: Alphabetical order for ordinal feature
    encoder = OrdinalEncoder()
    encoder.fit_transform([['High'], ['Low'], ['Medium']])  # High=0, Low=1, Medium=2: wrong order!
    # ✅ Correct: Specify proper order
    encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
- Fitting encoder on test data
    # ❌ Wrong: Data leakage
    encoder.fit(X_test)
    # ✅ Correct: Fit on train, transform on test
    encoder.fit(X_train)
    X_test_encoded = encoder.transform(X_test)
- Ignoring unseen categories
    # ❌ Wrong: Default behavior raises error
    encoder = OrdinalEncoder()
    # ✅ Correct: Handle unknown values
    encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)