🛡 Multicollinearity
I. What is multicollinearity?
Multicollinearity happens when two or more independent variables (predictors) in a regression model contain overlapping information, meaning one predictor can be predicted well from the others.
In other words, multicollinearity means that two or more independent variables are highly correlated with each other; as a result, the effect of each individual variable cannot be clearly separated.
- Example: In a house-price model, square footage and number of rooms tend to move together. They’re not the same thing, but they often carry similar information about “house size.”
Important nuance
- Multicollinearity is not about predictors being correlated with the target.
- It’s about predictors being correlated with each other.
II. Why is multicollinearity a problem?
Consider a multiple linear regression:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

If $x_1$ and $x_2$ are highly correlated, the model cannot tell:
- how much of the effect belongs to $x_1$ (i.e., $\beta_1$), and
- how much of the effect belongs to $x_2$ (i.e., $\beta_2$).
★ What you’ll observe in practice
Multicollinearity mainly causes unstable coefficient estimates:
- Coefficients become “wobbly”: small data changes can flip the sign or drastically change values.
- Standard errors increase: predictors look statistically insignificant even if they matter.
- Confidence intervals get wide.
- Interpretation becomes unreliable: you can’t confidently say “holding other variables constant, this feature changes y by …” because “holding other variables constant” is unrealistic when predictors move together.
★ Does it hurt prediction?
Often less than it hurts interpretation.
- If your goal is prediction, multicollinearity can be acceptable (though it may still cause numerical instability in some models).
- If your goal is inference/interpretation (which variables matter and how), multicollinearity is a serious issue.
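To see the "wobbly coefficients" effect concretely, here is a small synthetic sketch (the variable names, noise levels, and seed are all invented for illustration): two nearly collinear predictors are refit on bootstrap resamples, and the individual coefficients swing while their sum stays pinned down.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# x2 is almost a copy of x1, so the two predictors are nearly collinear
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 3 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

# Refit on bootstrap resamples and watch the individual coefficients swing
coefs = []
for _ in range(20):
    idx = rng.integers(0, n, size=n)
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
coefs = np.array(coefs)

print("std of beta_1 across resamples:", coefs[:, 0].std().round(2))
print("mean of beta_1 + beta_2:", (coefs[:, 0] + coefs[:, 1]).mean().round(2))
# beta_1 and beta_2 individually swing wildly, but their sum stays near 3:
# the data pins down the combined effect, not the split between the two.
```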
III. How do we detect multicollinearity?
★ Quick first check: correlation matrix (pairwise)
- High correlation (like > 0.8 or 0.9) between two predictors is a warning sign.
- But pairwise correlation can miss cases where one variable is predicted by a combination of several others.
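A small synthetic sketch of that blind spot (made-up variables): `x3` is almost exactly `x1 + x2`, yet no single pairwise correlation crosses a 0.8 red-flag threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# x3 is a near-exact combination of x1 and x2 (plus tiny noise)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

corr = df.corr()
print(corr.round(2))
# x3's pairwise correlation with x1 and with x2 is only about 0.7 each,
# even though x3 is almost perfectly determined by x1 and x2 together.
```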
That's why we use Tolerance and VIF.
🛡 Tolerance and Variance Inflation Factor (VIF)
I. The key idea behind Tolerance and VIF
For each predictor $x_j$:
- Regress $x_j$ on all the other predictors.
  - Example: treat $x_1$ as the temporary target, and predict it using $x_2, x_3, \ldots, x_p$.
- Compute $R_j^2$ from that regression.
  - This tells us how well the other predictors explain $x_j$.

If $R_j^2$ is high, $x_j$ is largely redundant given the other predictors.
II. What is "Tolerance"?
For predictor $x_j$:

$$\text{Tolerance}_j = 1 - R_j^2$$

Interpretation
- Tolerance is the fraction of $x_j$'s variance that is not explained by the other predictors.
- Low tolerance means "this variable is mostly predictable from the others."
Rules of thumb
- Tolerance < 0.10 ➛ strong multicollinearity concern
- Tolerance < 0.20 ➛ moderate concern
III. What is Variance Inflation Factor (VIF)?
VIF is the reciprocal of tolerance:

$$\text{VIF}_j = \frac{1}{\text{Tolerance}_j} = \frac{1}{1 - R_j^2}$$

★ Interpretation (very important)
VIF tells you how much the variance of the coefficient estimate $\hat{\beta}_j$ is inflated by multicollinearity.
- VIF = 1: no multicollinearity.
- 1 < VIF < 5: moderate multicollinearity.
  - Generally acceptable in models.
  - Variance of that coefficient is inflated by a factor of VIF (standard error increases by $\sqrt{\text{VIF}}$).
- 5 ≤ VIF < 10: moderate multicollinearity that may be problematic.
- VIF ≥ 10: serious multicollinearity; reduces model reliability and suggests coefficients are poorly estimated.
Rules of thumb
- VIF > 5 → investigate
- VIF > 10 → serious multicollinearity warning
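Since VIF is just 1 / (1 − R²), the rules of thumb map directly onto R² values; a quick sanity check:

```python
# Tolerance and VIF computed directly from R^2, per their definitions
def tolerance(r2):
    return 1.0 - r2

def vif(r2):
    return 1.0 / (1.0 - r2)

for r2 in (0.0, 0.5, 0.8, 0.9):
    print(f"R^2={r2:.1f}  tolerance={tolerance(r2):.2f}  VIF={vif(r2):.1f}")
# R^2 = 0.8 gives VIF = 5 ("investigate"); R^2 = 0.9 gives VIF = 10 ("serious")
```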
What to do about multicollinearity (practical fixes)
★ Option 1: Remove one of the redundant predictors
- Keep the one that is:
- more meaningful for the problem domain
- more reliable / easier to measure
- more directly related to the target (conceptually)
- Don’t just drop based on P-Value alone (multicollinearity can make p-values misleading).
★ Option 2: Combine predictors
- If variables measure the same concept, combine them:
- average / sum
- create an index/score
- use ratios (carefully)
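A sketch of the combine-into-an-index idea, reusing the hypothetical square-footage / room-count example from earlier (all numbers invented): standardize the correlated measures, then average them into a single "size" score.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
# Two highly correlated "size" measures (synthetic)
sqft = rng.normal(1500, 300, size=n)
rooms = sqft / 250 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"sqft": sqft, "rooms": rooms})

# Standardize each measure, then average into one index
z = (df - df.mean()) / df.std()
df["size_index"] = z.mean(axis=1)

print(df.corr().round(2))
# The model can now use size_index instead of two redundant predictors.
```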
★ Option 3: Do NOT include all dummy columns + intercept
- If a categorical variable has $k$ levels:
  - one-hot encoding creates $k$ dummy columns, but with an intercept those dummies sum to 1 → perfect multicollinearity (the "dummy variable trap")
- So you must do one of:
  - drop one dummy level (`drop_first=True`) and keep the intercept (most common)
  - OR keep all dummy levels but remove the intercept (less common)
- Refer to "Multicollinearity for Categorical Data" for details on why.
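A quick demonstration with pandas `get_dummies` (toy data): with all $k$ dummies, every row sums to 1, exactly duplicating the intercept column.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red", "green"]})

# All k dummy columns: rows sum to 1, so with an intercept this is
# perfect multicollinearity (the dummy variable trap)
full = pd.get_dummies(df["color"], dtype=float)
print(full.sum(axis=1).tolist())  # every row sums to 1.0

# drop_first=True keeps k-1 dummies, avoiding the trap when an intercept is used
reduced = pd.get_dummies(df["color"], drop_first=True, dtype=float)
print(reduced.columns.tolist())
```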
★ Option 4: Use dimensionality reduction
- PCA: converts correlated predictors into uncorrelated components.
- Good for prediction; harder for interpretation.
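A minimal PCA sketch with scikit-learn, on made-up correlated data: the resulting components are uncorrelated, and almost all variance lands in the first one.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # highly correlated with x1
X = np.column_stack([x1, x2])

pca = PCA()
Z = pca.fit_transform(X)

# Components are uncorrelated by construction (off-diagonal ~ 0)
print(np.corrcoef(Z[:, 0], Z[:, 1]).round(3))
# Nearly all the variance is captured by the first component
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```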
★ Option 5: Use regularization
- Ridge regression is specifically helpful with multicollinearity (stabilizes coefficients).
- Lasso may drop one of the correlated variables (but can be unstable when predictors are highly correlated).
- Elastic Net often works well when predictors are correlated.
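A sketch comparing OLS and Ridge coefficient stability on near-collinear synthetic data (all values invented; `alpha=1.0` is an arbitrary choice, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + 2 * x2 + rng.normal(scale=0.5, size=n)

def coef_spread(model, n_boot=30):
    # Std of each coefficient across bootstrap refits
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.std(coefs, axis=0)

ols_spread = coef_spread(LinearRegression())
ridge_spread = coef_spread(Ridge(alpha=1.0))
print("OLS   coef std:", ols_spread.round(3))
print("Ridge coef std:", ridge_spread.round(3))
# Ridge's penalty stabilizes the split between the correlated coefficients
```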
★ Option 6: Collect more data (sometimes)
- More data can reduce coefficient variance, but it does not remove the underlying redundancy.
Python code examples
1. Compute VIF in Python (statsmodels)
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(X: pd.DataFrame, add_intercept=True):
    X_ = X.copy()
    # statsmodels expects numeric dtype
    X_ = X_.apply(pd.to_numeric)
    if add_intercept:
        X_ = sm.add_constant(X_, has_constant='add')
    vif = pd.DataFrame({
        "feature": X_.columns,
        "VIF": [variance_inflation_factor(X_.values, i)
                for i in range(X_.shape[1])]
    })
    # Usually we don't interpret the intercept's VIF
    return vif.sort_values("VIF", ascending=False)
```
- VIF requires no missing values in `X`.
- It assumes predictors are numeric (categorical variables must be encoded first).
- Don't obsess over the constant's VIF; interpret VIF for the real predictors.
Example 1: Using compute_vif on the "California housing dataset (sklearn)"
```python
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Compute VIF for all features
vif_table = compute_vif(df)
print(vif_table)
```
Output

```text
      feature           VIF
0       const  17082.623698
7    Latitude      9.297624
8   Longitude      8.962263
3    AveRooms      8.342786
4   AveBedrms      6.994995
1      MedInc      2.501295
2    HouseAge      1.241254
5  Population      1.138125
6    AveOccup      1.008324
```
2. VIF with scikit-learn (manual approach)
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def compute_vif_sklearn(X: pd.DataFrame):
    X = X.astype(float)
    cols = X.columns
    vifs = []
    for col in cols:
        # Regress this predictor on all the others
        y_j = X[col].values
        X_others = X.drop(columns=[col]).values
        model = LinearRegression()
        model.fit(X_others, y_j)
        r2 = model.score(X_others, y_j)
        vif = 1.0 / (1.0 - r2) if r2 < 1 else np.inf
        vifs.append(vif)
    return pd.DataFrame({"feature": cols, "VIF": vifs}).sort_values("VIF", ascending=False)
```
Output (same example as Example 1)

```text
      feature       VIF
6    Latitude  9.297624
7   Longitude  8.962263
2    AveRooms  8.342786
3   AveBedrms  6.994995
0      MedInc  2.501295
1    HouseAge  1.241254
4  Population  1.138125
5    AveOccup  1.008324
```
Tips
I found these tips online and am adding them here for reference.
1. Interpreting VIF with one-hot encoded features
- Dummies from the same categorical variable are naturally related.
- VIF for individual dummy columns can be inflated, especially with:
- many levels
- rare categories
- strong relationship with other predictors
- Practical guidance
- If VIF is high mainly among the dummy columns of one categorical feature, that’s not always a “bug”—it can reflect that categories are imbalanced or strongly associated with other predictors.
- If VIF is high between a dummy and some numeric variable, that can indicate redundancy (e.g., a category almost perfectly implies a numeric range).
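A sketch of that dummy-vs-numeric redundancy case (hypothetical data: an income variable that almost determines a categorical "tier"; all names and thresholds are invented):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 300
# income almost perfectly implies which tier a row falls into
income = rng.normal(50, 15, size=n)
tier = pd.cut(income, bins=[-np.inf, 40, 60, np.inf], labels=["low", "mid", "high"])

df = pd.DataFrame({"income": income})
df = pd.concat([df, pd.get_dummies(tier, drop_first=True, dtype=float)], axis=1)

def vif_of(col, X):
    # VIF via the regress-on-the-others definition
    others = X.drop(columns=[col])
    r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
    return 1.0 / (1.0 - r2) if r2 < 1 else np.inf

for col in df.columns:
    print(col, round(vif_of(col, df), 2))
# The dummies' VIFs are elevated because the category is largely implied by income
```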