Multicollinearity for Categorical Data
I. Categorical variables must be encoded first
VIF is defined on a numeric design matrix, so categorical variables need to be turned into dummy/indicator columns.
Preferred approach: one-hot encode with a reference category (drop one level).
X_enc = pd.get_dummies(X_raw, columns=["neighborhood", "style"], drop_first=True)
vif_table = compute_vif(X_enc)
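The `compute_vif` helper used above isn't defined in this post. A minimal sketch using only numpy/pandas (an assumption — the original may have used `statsmodels.stats.outliers_influence.variance_inflation_factor` instead) could look like:

```python
# Minimal sketch of a compute_vif helper (hypothetical; not shown in the post).
# VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing column j on an
# intercept plus all the other columns.
import numpy as np
import pandas as pd

def compute_vif(X: pd.DataFrame) -> pd.DataFrame:
    """Return one VIF per column of a numeric design matrix."""
    Xf = X.astype(float)
    vifs = []
    for col in Xf.columns:
        y = Xf[col].values
        others = Xf.drop(columns=col).values
        Z = np.column_stack([np.ones(len(Xf)), others])  # intercept + other cols
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(float("inf") if r2 >= 1 else 1.0 / (1.0 - r2))
    return pd.DataFrame({"feature": Xf.columns, "VIF": vifs})
```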
★ Do NOT include all dummy columns + intercept (dummy variable trap)
If a categorical variable has m levels:
- One-hot encoding creates m dummy columns.
- If you also include an intercept (a constant 1 column), you end up with an exact linear relationship among the columns → perfect multicollinearity.
Let's understand how.
1) What does the intercept do?
In a linear regression, the intercept (often written β₀) is the predicted outcome when every predictor equals 0.
Technically, including an intercept means we include a column that is always 1:
x₀ = 1 for every row.
So the model looks like:
y = β₀·x₀ + β₁x₁ + … + βₖxₖ + ε
That always-1 column is what will clash with the dummy columns below.
★ Including an intercept is usually recommended because:
- Most real relationships don't pass through zero. If you force the line/plane through the origin (no intercept), you're assuming y = 0 when all predictors are 0. That's often not true and can bias slope estimates.
- It captures the overall mean / baseline level. With an intercept, the model can shift up/down to fit the data better.
- Better interpretation in dummy-coded models. With one-hot encoding and a dropped reference level, the intercept becomes the expected outcome for the reference group (baseline category). That's very interpretable.
2) One-hot encoding a categorical variable with m levels
Suppose you have a categorical feature Color with 3 categories:
- Red
- Green
- Blue
You one-hot encode it into 3 dummy columns:
Each row has exactly one “1” and the rest “0”.
Example data (5 rows):
| Color | D_Red | D_Green | D_Blue |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 0 | 1 |
| Green | 0 | 1 | 0 |
| Red | 1 | 0 | 0 |
| Blue | 0 | 0 | 1 |
Now notice something that is always true for every row:
D_Red + D_Green + D_Blue = 1
because each row belongs to exactly one category.
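This identity is easy to verify directly (a small demo using the same `pd.get_dummies` call as above, without dropping a level):

```python
# Demo: one-hot dummies for a single categorical always sum to 1 per row.
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red", "Blue"]})
dummies = pd.get_dummies(colors["Color"])   # keeps ALL three levels
row_sums = dummies.sum(axis=1).tolist()
print(row_sums)  # [1, 1, 1, 1, 1]
```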
3) Why “all dummies + intercept” creates perfect multicollinearity
If you also include an intercept column (call it x₀, a column of all 1s), then for every row:
x₀ = 1
But we just showed:
D_Red + D_Green + D_Blue = 1
So that means:
x₀ = D_Red + D_Green + D_Blue
This is an exact linear relationship between columns.
That’s the definition of perfect multicollinearity: one column can be written as an exact combination of others.
Consequence:
The regression cannot uniquely estimate coefficients, because multiple coefficient combinations produce the same predictions.
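You can see this rank deficiency numerically: stacking an intercept next to all three dummies gives a design matrix with fewer independent columns than actual columns, so X'X is singular (a small numpy check, using the same toy Color data):

```python
# All dummies + intercept: the design matrix is rank-deficient, so X'X is singular.
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red", "Blue"])
D = pd.get_dummies(colors).astype(float).values   # 5 x 3 dummy matrix
X = np.column_stack([np.ones(5), D])              # prepend the intercept column

rank = np.linalg.matrix_rank(X)
det = np.linalg.det(X.T @ X)
print(rank)  # 3, even though X has 4 columns
print(det)   # ~0: X'X cannot be inverted
```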
4) Intuition: “You’re trying to estimate one thing twice”
The intercept already provides a “baseline level”. If you keep all dummy variables too, you’re over-describing the same baseline.
A categorical variable with 3 levels only has 2 degrees of freedom once you already have an intercept. (Because if you know two dummies, the third is determined.)
5) The fix
➛ Alternative 1: Drop one dummy (reference category)
Drop one category dummy, and treat it as the reference.
For example, drop D_Blue and treat Blue as the reference.
Now your columns are D_Red and D_Green.
- If both dummies are 0, that means the row is Blue (the reference).
Model:
y = β₀ + β₁·D_Red + β₂·D_Green + ε
Interpretation:
- β₀: expected y for Blue (the reference)
- β₁: how much Red differs from Blue
- β₂: how much Green differs from Blue
No redundancy anymore.
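A quick check of that interpretation, on made-up y values (the numbers are illustrative only — group means of 5, 8, and 11 for Blue, Green, and Red):

```python
# drop_first=True drops the first level alphabetically ("Blue"), making it the
# reference. Toy y values chosen so the group means are Blue=5, Green=8, Red=11.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue", "Green"],
    "y":     [10.0,  4.0,   7.0,     12.0,  6.0,    9.0],
})
dummies = pd.get_dummies(df["Color"], drop_first=True).astype(float)  # Green, Red
X = np.column_stack([np.ones(len(df)), dummies.values])  # intercept + 2 dummies
beta, *_ = np.linalg.lstsq(X, df["y"].values, rcond=None)

print(beta)  # [5. 3. 6.] -> intercept = Blue mean; Green = +3; Red = +6 vs Blue
```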
➛ Alternative 2: Keep all dummies, remove intercept
If you remove the intercept, then you can keep all m dummies, because you no longer have that “1” column.
Interpretation:
- each dummy's coefficient becomes the mean/level for that category (in a simple model)
This is valid, but most standard workflows keep an intercept and drop one dummy.
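The same toy data (made-up y values with group means 5, 8, 11) shows the no-intercept version: each coefficient lands exactly on its category's mean.

```python
# Alternative 2: keep all dummies, drop the intercept -> coefficients are group means.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue", "Green"],
    "y":     [10.0,  4.0,   7.0,     12.0,  6.0,    9.0],
})
D = pd.get_dummies(df["Color"]).astype(float)  # Blue, Green, Red (all kept)
beta, *_ = np.linalg.lstsq(D.values, df["y"].values, rcond=None)

means = {c: round(float(b), 2) for c, b in zip(D.columns, beta)}
print(means)  # {'Blue': 5.0, 'Green': 8.0, 'Red': 11.0}
```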
★ How does this relate to VIF?
VIF for a column is based on regressing it on the other columns. If there is perfect multicollinearity, then for at least one column:
- R² of that auxiliary regression = 1
- tolerance = 1 − R² = 0
- VIF = 1 / (1 − R²) → ∞
So you'll see VIF blow up (or the computation fail with a singular-matrix error).
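To see this concretely, regress one dummy on an intercept plus the other two dummies (a self-contained sketch with the toy Color data): the fit is exact, so R² is numerically 1 and the VIF explodes.

```python
# Perfect multicollinearity in action: D_Blue = 1 - D_Green - D_Red, so the
# auxiliary regression fits exactly -> R² = 1, tolerance -> 0, VIF -> infinity.
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red", "Blue"])
D = pd.get_dummies(colors).astype(float).values        # columns: Blue, Green, Red
y = D[:, 0]                                            # the Blue dummy
X = np.column_stack([np.ones(len(colors)), D[:, 1:]])  # intercept + Green + Red

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
tolerance = 1 - r2
vif = float("inf") if tolerance <= 0 else 1.0 / tolerance
print(r2, vif)  # r2 is (numerically) 1; vif is astronomically large or inf
```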