Multicollinearity for Categorical Data

I. Categorical variables must be encoded first

VIF is defined on a numeric design matrix, so categorical variables need to be turned into dummy/indicator columns.
Preferred approach: one-hot encode with a reference category (drop one level).

X_enc = pd.get_dummies(X_raw, columns=["neighborhood", "style"], drop_first=True)
vif_table = compute_vif(X_enc)
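The helper compute_vif above is not defined in this note; a minimal sketch of what it could look like, using only numpy and pandas (the function name and signature are assumptions here; statsmodels' `variance_inflation_factor` is a ready-made alternative):

```python
import numpy as np
import pandas as pd

def compute_vif(df):
    """VIF for each column of a numeric DataFrame: regress the column on
    all the others (plus an intercept) and return 1 / (1 - R^2)."""
    X = df.to_numpy(dtype=float)
    n, k = X.shape
    vifs = {}
    for j in range(k):
        y = X[:, j]
        # Auxiliary regression: column j on an intercept + all other columns
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs[df.columns[j]] = np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Illustration on made-up data: "c" nearly duplicates "a", so both get huge VIFs
rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
df = pd.DataFrame({"a": a, "b": b, "c": a + 0.01 * rng.normal(size=200)})
print(compute_vif(df))
```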

★ Do NOT include all dummy columns + intercept (dummy variable trap)

If a categorical variable has m levels, include only m − 1 dummy columns when the model has an intercept.

Let's understand why.

1) What does the intercept do?

In a linear regression, the intercept (β0) is the model’s baseline prediction when all predictors are zero.
Technically, the intercept means we include a column that is always 1:

Intercept column = [1, 1, 1, …, 1]

So the model looks like:

y = β0 + β1x1 + β2x2 + …

That β0 is the intercept.

★ Including an intercept is usually recommended because:

  1. Most real relationships don’t pass through zero
    If you force the line/plane through the origin (no intercept), you’re assuming ŷ = 0 when all x = 0. That’s often not true and can bias the slope estimates.

  2. It captures the overall mean / baseline level
    With an intercept, the model can shift up/down to fit the data better.

  3. Better interpretation in dummy-coded models
    With one-hot encoding and a dropped reference level, the intercept becomes the expected outcome for the reference group (baseline category). That’s very interpretable.
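Point 1 can be seen numerically. A small sketch with made-up data where the true relationship is y = 5 + 2x (all numbers below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(1, 10, size=100)
y = 5.0 + 2.0 * x + rng.normal(scale=0.1, size=100)  # true line: y = 5 + 2x

# With an intercept column, the fit recovers both parameters
b_with, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)

# Forcing the fit through the origin: the slope absorbs the missing
# baseline of 5 and comes out biased upward
b_without, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

print("with intercept:", b_with.round(2), " without:", b_without.round(2))
```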

2) One-hot encoding a categorical variable with m levels

Suppose you have a categorical feature Color with 3 categories: Red, Green, Blue.

You one-hot encode it into 3 dummy columns:

Each row has exactly one “1” and the rest “0”.

Example data (5 rows):

Color   DRed   DGreen   DBlue
Red     1      0        0
Blue    0      0        1
Green   0      1        0
Red     1      0        0
Blue    0      0        1

Now notice something always true for every row:

DRed + DGreen + DBlue = 1

Because each row belongs to exactly one category.
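This identity is easy to check with pandas (toy data matching the table above):

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red", "Blue"], name="Color")
dummies = pd.get_dummies(colors)  # keeps ALL three dummy columns

# Each row is one-hot, so the dummies always sum to exactly 1
print(dummies.astype(int))
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```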

3) Why “all dummies + intercept” creates perfect multicollinearity

If you also include an intercept column (call it C), then C=1 for every row.
But we just showed:

DRed + DGreen + DBlue = 1

So that means:

C = DRed + DGreen + DBlue

This is an exact linear relationship between columns.

That’s the definition of perfect multicollinearity: one column can be written as an exact combination of others.

Consequence:

The regression cannot uniquely estimate coefficients, because multiple coefficient combinations produce the same predictions.
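That rank deficiency is easy to demonstrate with numpy (using the same toy Color data):

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red", "Blue"])
D = pd.get_dummies(colors).to_numpy(dtype=float)  # all 3 dummy columns
X = np.column_stack([np.ones(len(D)), D])         # intercept + all dummies

# 4 columns, but rank only 3: the intercept equals the sum of the dummies,
# so X'X is singular and OLS has no unique solution
print(X.shape[1], np.linalg.matrix_rank(X))  # 4 3
```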

4) Intuition: “You’re trying to estimate one thing twice”

The intercept already provides a “baseline level”. If you keep all dummy variables too, you’re over-describing the same baseline.
A categorical variable with 3 levels only has 2 degrees of freedom once you already have an intercept. (Because if you know two dummies, the third is determined.)

5) The fix

➛ Alternate 1: Drop one dummy (reference category)

Drop one category dummy, and treat it as the reference.
For example, drop DBlue:

Now your columns are DRed and DGreen, plus the intercept.

Model:

y = β0 + βRed·DRed + βGreen·DGreen + ε

Interpretation:

  β0 = expected outcome for the reference group (Blue).
  βRed = expected difference between Red and the reference (Blue).
  βGreen = expected difference between Green and the reference (Blue).
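A quick numerical check with made-up outcome values (the y numbers below are hypothetical, chosen so the group means are easy to see):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue", "Green"],
    "y":     [10.0,  4.0,   7.0,     12.0,  6.0,    9.0],
})
# drop_first drops "Blue" (alphabetically first), making it the reference
D = pd.get_dummies(df["Color"], drop_first=True)
X = np.column_stack([np.ones(len(df)), D.to_numpy(dtype=float)])
beta, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)

# Intercept = Blue's mean; each slope = that group's mean minus Blue's
print(dict(zip(["intercept", *D.columns], beta.round(6))))
```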

➛ Alternative 2: Keep all dummies, remove intercept

If you remove the intercept, then you can keep all m dummies, because you no longer have that “1” column.

y = βRed·DRed + βGreen·DGreen + βBlue·DBlue + ε

Interpretation:

  Each coefficient is the expected outcome for its own group: βRed is Red’s mean, βGreen is Green’s, βBlue is Blue’s. There is no baseline group, so comparisons between groups are differences of coefficients.
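A numerical check with made-up outcome values (hypothetical y values chosen so the group means are obvious):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue", "Green"],
    "y":     [10.0,  4.0,   7.0,     12.0,  6.0,    9.0],
})
D = pd.get_dummies(df["Color"])  # keep all three dummies
beta, *_ = np.linalg.lstsq(D.to_numpy(dtype=float), df["y"].to_numpy(), rcond=None)

# With no intercept, each coefficient is simply its group's mean outcome:
# Blue = 5, Green = 8, Red = 11
print(dict(zip(D.columns, beta.round(6))))
```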

★ How does this relate to VIF?

VIF for a column is based on regressing it on the others. If there is perfect multicollinearity, then for at least one column: