🛡 Multicollinearity

I. What is multicollinearity?

Multicollinearity happens when two or more independent variables (predictors) in a regression model contain overlapping information, meaning one predictor can be predicted well from the others.

In other words, multicollinearity means that two or more independent variables are highly correlated with each other; as a result, the effects of the individual variables cannot be clearly separated.

Important nuance

  • Multicollinearity is about correlation among the predictors themselves; a predictor being correlated with the target y is desirable, not a problem.
  • It can exist even when no single pairwise correlation is high, e.g. when one predictor is approximately a linear combination of several others. That is why pairwise checks alone are not enough.

II. Why is multicollinearity a problem?

Consider a multiple linear regression:

y = β0 + β1·x1 + β2·x2 + ... + βk·xk + ϵ

If x1 and x2 are very similar (highly correlated), the model struggles to decide how to split the effect between them: many different combinations of β1 and β2 fit the data almost equally well.

★ What you’ll observe in practice

Multicollinearity mainly causes unstable coefficient estimates:

  • Large standard errors on the affected coefficients
  • Coefficients that change drastically (or flip sign) when a predictor is added or removed, or when the data changes slightly
  • A model with high overall R² whose individual predictors all look statistically insignificant

★ Does it hurt prediction?

Often much less than it hurts interpretation: the correlated predictors jointly carry the same information, so predictions can stay accurate even when the individual coefficients are unreliable.
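A toy simulation (made-up data, not from the original) illustrates this: two nearly identical predictors get wildly different individual coefficients across bootstrap resamples, while their sum — and hence the predictions — stays stable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly identical to x1
y = 3 * x1 + rng.normal(size=n)       # true combined effect is 3

coefs = []
for seed in range(5):
    # bootstrap resample the rows and refit
    idx = np.random.default_rng(seed).choice(n, size=n, replace=True)
    m = LinearRegression().fit(np.column_stack([x1[idx], x2[idx]]), y[idx])
    coefs.append(m.coef_)

# Individual coefficients swing wildly across resamples,
# but their sum stays near the true combined effect of 3.
for c in coefs:
    print(f"b1={c[0]:+.2f}  b2={c[1]:+.2f}  b1+b2={c[0]+c[1]:+.2f}")
```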

III. How do we detect multicollinearity?

★ Quick first check: correlation matrix (pairwise)
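As a minimal sketch (using a small hypothetical DataFrame), pandas' `corr()` gives the pairwise correlation matrix in one call, and a boolean mask flags suspicious pairs:

```python
import pandas as pd

# Hypothetical predictors; replace with your own data.
df = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8],   # roughly 2 * x1 -> highly correlated
    "x3": [5.0, 1.0, 4.0, 2.0, 3.0],   # unrelated
})

corr = df.corr()          # pairwise Pearson correlations
print(corr.round(2))

# Flag pairs with |r| above a threshold (0.8 is a common cutoff)
high = (corr.abs() > 0.8) & (corr.abs() < 1.0)
print(high)
```

Remember the nuance above: a clean pairwise matrix does not rule out multicollinearity involving three or more predictors.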

🛡 Tolerance and Variance Inflation Factor (VIF)

I. The key idea behind Tolerance and VIF

For each predictor xj:

  1. Regress xj on all the other predictors.
    • Example: treat x1 as the temporary target, and predict it using x2, x3, ..., xk
  2. Compute Rj² from that regression.
    • This tells us: how well the other predictors explain xj

If Rj² is high (close to 1), then xj is largely redundant: the other predictors already contain most of its information.

II. What is "Tolerance"?

Tolerance_j = 1 − Rj²

Interpretation

Tolerance is the share of xj's variance that is NOT explained by the other predictors: close to 1 means xj is mostly independent of them; close to 0 means it is largely redundant.

Rules of thumb

  • Tolerance > 0.2: usually fine
  • Tolerance < 0.1: serious multicollinearity (equivalent to VIF > 10)

III. What is Variance Inflation Factor (VIF)?

VIF_j = 1 / (1 − Rj²) = 1 / Tolerance_j

★ Interpretation (very important)

VIF tells you how much the variance of βj (the coefficient estimate) is inflated because of multicollinearity. For example, Rj² = 0.90 gives VIF = 1 / (1 − 0.90) = 10: the variance of βj is 10 times what it would be if xj were uncorrelated with the other predictors.

Rules of thumb

  • VIF = 1: no collinearity at all
  • VIF > 5: worth investigating
  • VIF > 10: serious multicollinearity; consider one of the fixes below


What to do about multicollinearity (practical fixes)

★ Option 1: Remove one of the redundant predictors

★ Option 2: Combine predictors
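A minimal sketch of combining predictors, using made-up `rooms`/`bedrooms` columns: instead of keeping two correlated counts, derive a single feature (here a ratio) that carries their joint information.

```python
import pandas as pd

# Hypothetical correlated predictors
df = pd.DataFrame({
    "rooms":    [5, 6, 7, 4, 8],
    "bedrooms": [2, 3, 3, 2, 4],
})

# One simple combination: a ratio that captures their joint information
df["bedrooms_per_room"] = df["bedrooms"] / df["rooms"]
combined = df.drop(columns=["rooms", "bedrooms"])
print(combined)
```

Other common combinations are sums, averages of standardized features, or domain-specific indices.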

★ Option 3: Do NOT include all dummy columns + intercept
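For the dummy-column case, `pandas.get_dummies(..., drop_first=True)` drops one category so the remaining dummies plus the intercept are no longer perfectly collinear (illustrated with a made-up `city` column):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NY", "SF", "LA", "NY", "SF"]})

# drop_first=True drops one dummy column, avoiding the "dummy variable trap":
# with an intercept, the full set of dummies is perfectly collinear
# (they always sum to 1, which duplicates the constant column).
dummies = pd.get_dummies(df["city"], drop_first=True)
print(dummies)
```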

★ Option 4: Use dimensionality reduction
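A quick sketch with synthetic data: PCA replaces correlated columns with orthogonal components, so the transformed features are uncorrelated by construction (at the cost of interpretability).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
# two nearly collinear columns
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=100)])

# PCA components are orthogonal by construction, so the
# transformed features have (numerically) zero correlation.
Z = PCA(n_components=2).fit_transform(X)
corr = np.corrcoef(Z.T)
print(corr.round(6))
```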

★ Option 5: Use regularization
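A sketch of the regularization route with synthetic collinear data: ridge regression (L2 penalty) keeps both predictors but shrinks and stabilizes their coefficients instead of letting them blow up in opposite directions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + 1 * x2 + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)           # L2 penalty shrinks coefficients

# Ridge spreads the shared effect across the correlated predictors
# instead of letting one coefficient blow up and the other cancel it.
print("OLS:  ", ols.coef_)
print("Ridge:", ridge.coef_)
```

Lasso (L1) goes further and can drop one of the redundant predictors entirely.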

★ Option 6: Collect more data (sometimes)


Python code examples

1. Compute VIF in Python (statsmodels)

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(X: pd.DataFrame, add_intercept=True):
    X_ = X.copy()

    # statsmodels expects numeric dtype
    X_ = X_.apply(pd.to_numeric)

    if add_intercept:
        X_ = sm.add_constant(X_, has_constant='add')

    vif = pd.DataFrame({
        "feature": X_.columns,
        "VIF": [variance_inflation_factor(X_.values, i)
                for i in range(X_.shape[1])]
    })

    # Usually we don't interpret the intercept's VIF
    return vif.sort_values("VIF", ascending=False)

👉 Notes

  • VIF requires no missing values in X.
  • It assumes predictors are numeric (categorical variables must be encoded first).
  • Don’t obsess over the constant’s VIF; interpret VIF for real predictors.

Example 1: Using compute_vif on the California housing dataset (sklearn)

from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Compute VIF for all features
vif_table = compute_vif(df)
print(vif_table)

Output

   feature            VIF
0  const      17082.623698
7  Latitude       9.297624
8  Longitude      8.962263
3  AveRooms       8.342786
4  AveBedrms      6.994995
1  MedInc         2.501295
2  HouseAge       1.241254
5  Population     1.138125
6  AveOccup       1.008324

2. VIF with scikit-learn (manual approach)

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def compute_vif_sklearn(X: pd.DataFrame):
    X = X.astype(float)
    cols = X.columns
    vifs = []

    for j, col in enumerate(cols):
        y_j = X[col].values
        X_others = X.drop(columns=[col]).values

        model = LinearRegression()
        model.fit(X_others, y_j)
        r2 = model.score(X_others, y_j)

        vif = 1.0 / (1.0 - r2) if r2 < 1 else np.inf
        vifs.append(vif)

    return pd.DataFrame({"feature": cols, "VIF": vifs}).sort_values("VIF", ascending=False)

Output (same example as 1, via compute_vif_sklearn(df)); note there is no const row here, so the indices differ from Example 1

   feature          VIF
6  Latitude    9.297624
7  Longitude   8.962263
2  AveRooms    8.342786
3  AveBedrms   6.994995
0  MedInc      2.501295
1  HouseAge    1.241254
4  Population  1.138125
5  AveOccup    1.008324

Tips

I found the tips below online; I'm adding them here for reference.

1. Interpreting VIF with one-hot encoded features