Multicollinearity for Categorical Data

I. Categorical variables must be encoded first

VIF is defined on a numeric design matrix, so categorical variables need to be turned into dummy/indicator columns.
Preferred approach: one-hot encode with a reference category (drop one level).

X_enc = pd.get_dummies(X_raw, columns=["neighborhood", "style"], drop_first=True)
vif_table = compute_vif(X_enc)
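The helper compute_vif above is not defined in this note; a minimal sketch of what it could look like, using only numpy and pandas (the function name and signature are assumptions here; statsmodels' `variance_inflation_factor` is a ready-made alternative):

```python
import numpy as np
import pandas as pd

def compute_vif(df):
    """VIF for each column of a numeric DataFrame: regress the column on
    all the others (plus an intercept) and return 1 / (1 - R^2)."""
    X = df.to_numpy(dtype=float)
    n, k = X.shape
    vifs = {}
    for j in range(k):
        y = X[:, j]
        # Auxiliary regression: column j on an intercept + all other columns
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs[df.columns[j]] = np.inf if np.isclose(r2, 1.0) else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")

# Illustration on made-up data: "c" nearly duplicates "a", so both get huge VIFs
rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
df = pd.DataFrame({"a": a, "b": b, "c": a + 0.01 * rng.normal(size=200)})
print(compute_vif(df))
```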

★ Do NOT include all dummy columns + intercept (dummy variable trap)

If a categorical variable has m levels, include only m − 1 dummy columns when the model has an intercept.

Let's understand why.

1) What does the intercept do?

In a linear regression, the intercept (β0) is the model’s baseline prediction when all predictors are zero.
Technically, the intercept means we include a column that is always 1:

Intercept column = [1, 1, 1, …, 1]

So the model looks like:

y = β0 + β1x1 + β2x2 + …

That β0 is the intercept.

★ Including an intercept is usually recommended because:

  1. Most real relationships don’t pass through zero
    If you force the line/plane through the origin (no intercept), you’re assuming ŷ = 0 when all x = 0. That’s often not true and can bias the slope estimates.

  2. It captures the overall mean / baseline level
    With an intercept, the model can shift up/down to fit the data better.

  3. Better interpretation in dummy-coded models
    With one-hot encoding and a dropped reference level, the intercept becomes the expected outcome for the reference group (baseline category). That’s very interpretable.
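Point 1 can be seen numerically. A small sketch with made-up data where the true relationship is y = 5 + 2x (all numbers below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(1, 10, size=100)
y = 5.0 + 2.0 * x + rng.normal(scale=0.1, size=100)  # true line: y = 5 + 2x

# With an intercept column, the fit recovers both parameters
b_with, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)

# Forcing the fit through the origin: the slope absorbs the missing
# baseline of 5 and comes out biased upward
b_without, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

print("with intercept:", b_with.round(2), " without:", b_without.round(2))
```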

2) One-hot encoding a categorical variable with m levels

Suppose you have a categorical feature Color with 3 categories: Red, Green, Blue.

You one-hot encode it into 3 dummy columns:

Each row has exactly one “1” and the rest “0”.

Example data (5 rows):

Color   DRed   DGreen   DBlue
Red     1      0        0
Blue    0      0        1
Green   0      1        0
Red     1      0        0
Blue    0      0        1

Now notice something always true for every row:

DRed + DGreen + DBlue = 1

Because each row belongs to exactly one category.
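This identity is easy to check with pandas (toy data matching the table above):

```python
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red", "Blue"], name="Color")
dummies = pd.get_dummies(colors)  # keeps ALL three dummy columns

# Each row is one-hot, so the dummies always sum to exactly 1
print(dummies.astype(int))
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```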

3) Why “all dummies + intercept” creates perfect multicollinearity

If you also include an intercept column (call it C), then C=1 for every row.
But we just showed:

DRed + DGreen + DBlue = 1

So that means:

C = DRed + DGreen + DBlue

This is an exact linear relationship between columns.

That’s the definition of perfect multicollinearity: one column can be written as an exact combination of others.

Consequence:

The regression cannot uniquely estimate coefficients, because multiple coefficient combinations produce the same predictions.
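That rank deficiency is easy to demonstrate with numpy (using the same toy Color data):

```python
import numpy as np
import pandas as pd

colors = pd.Series(["Red", "Blue", "Green", "Red", "Blue"])
D = pd.get_dummies(colors).to_numpy(dtype=float)  # all 3 dummy columns
X = np.column_stack([np.ones(len(D)), D])         # intercept + all dummies

# 4 columns, but rank only 3: the intercept equals the sum of the dummies,
# so X'X is singular and OLS has no unique solution
print(X.shape[1], np.linalg.matrix_rank(X))  # 4 3
```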

4) Intuition: “You’re trying to estimate one thing twice”

The intercept already provides a “baseline level”. If you keep all dummy variables too, you’re over-describing the same baseline.
A categorical variable with 3 levels only has 2 degrees of freedom once you already have an intercept. (Because if you know two dummies, the third is determined.)

5) The fix

➛ Alternate 1: Drop one dummy (reference category)

Drop one category dummy, and treat it as the reference.
For example, drop DBlue:

Now your columns are DRed and DGreen, plus the intercept.

Model:

y = β0 + βRed·DRed + βGreen·DGreen + ε

Interpretation:

  β0 = expected outcome for the reference group (Blue).
  βRed = expected difference between Red and the reference (Blue).
  βGreen = expected difference between Green and the reference (Blue).
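A quick numerical check with made-up outcome values (the y numbers below are hypothetical, chosen so the group means are easy to see):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue", "Green"],
    "y":     [10.0,  4.0,   7.0,     12.0,  6.0,    9.0],
})
# drop_first drops "Blue" (alphabetically first), making it the reference
D = pd.get_dummies(df["Color"], drop_first=True)
X = np.column_stack([np.ones(len(df)), D.to_numpy(dtype=float)])
beta, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)

# Intercept = Blue's mean; each slope = that group's mean minus Blue's
print(dict(zip(["intercept", *D.columns], beta.round(6))))
```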

➛ Alternative 2: Keep all dummies, remove intercept

If you remove the intercept, then you can keep all m dummies, because you no longer have that “1” column.

y = βRed·DRed + βGreen·DGreen + βBlue·DBlue + ε

Interpretation:

  Each coefficient is the expected outcome for its own group: βRed is Red’s mean, βGreen is Green’s, βBlue is Blue’s. There is no baseline group, so comparisons between groups are differences of coefficients.
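A numerical check with made-up outcome values (hypothetical y values chosen so the group means are obvious):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Red", "Blue", "Green"],
    "y":     [10.0,  4.0,   7.0,     12.0,  6.0,    9.0],
})
D = pd.get_dummies(df["Color"])  # keep all three dummies
beta, *_ = np.linalg.lstsq(D.to_numpy(dtype=float), df["y"].to_numpy(), rcond=None)

# With no intercept, each coefficient is simply its group's mean outcome:
# Blue = 5, Green = 8, Red = 11
print(dict(zip(D.columns, beta.round(6))))
```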

★ How does this relate to VIF?

VIF for a column is based on regressing it on the others. If there is perfect multicollinearity, then for at least one column: