🛡 Multicollinearity
I. What is multicollinearity?
Multicollinearity happens when two or more independent variables (predictors) in a regression model contain overlapping information, meaning one predictor can be predicted well from the others.
In other words, multicollinearity means that two or more independent variables are highly correlated with each other; as a result, the effect of each individual variable cannot be clearly separated.
- Example: In a house-price model, square footage and number of rooms tend to move together. They’re not the same thing, but they often carry similar information about “house size.”
Important nuance
- Multicollinearity is not about predictors being correlated with the target.
- It’s about predictors being correlated with each other.
II. Why is multicollinearity a problem?
Consider a multiple linear regression:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

If $x_1$ and $x_2$ are highly correlated, the model cannot tell:
- how much of the effect belongs to $x_1$ (i.e., $\beta_1$), and
- how much of the effect belongs to $x_2$ (i.e., $\beta_2$).
★ What you’ll observe in practice
Multicollinearity mainly causes unstable coefficient estimates:
- Coefficients become “wobbly”: small data changes can flip the sign or drastically change values.
- Standard errors increase: predictors look statistically insignificant even if they matter.
- Confidence intervals get wide.
- Interpretation becomes unreliable: you can’t confidently say “holding other variables constant, this feature changes y by …” because “holding other variables constant” is unrealistic when predictors move together.
★ Does it hurt prediction?
Often less than it hurts interpretation.
- If your goal is prediction, multicollinearity can be acceptable (though it may still cause numerical instability in some models).
- If your goal is inference/interpretation (which variables matter and how), multicollinearity is a serious issue.
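To see the "wobbly coefficients" effect concretely, here is a small synthetic sketch (the variable names, noise levels, and seed are all invented for illustration): two nearly collinear predictors are refit on bootstrap resamples, and the individual coefficients swing while their sum stays pinned down.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# x2 is almost a copy of x1, so the two predictors are nearly collinear
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 3 * x1 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

# Refit on bootstrap resamples and watch the individual coefficients swing
coefs = []
for _ in range(20):
    idx = rng.integers(0, n, size=n)
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
coefs = np.array(coefs)

print("std of beta_1 across resamples:", coefs[:, 0].std().round(2))
print("mean of beta_1 + beta_2:", (coefs[:, 0] + coefs[:, 1]).mean().round(2))
# beta_1 and beta_2 individually swing wildly, but their sum stays near 3:
# the data pins down the combined effect, not the split between the two.
```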
III. How do we detect multicollinearity?
★ Quick first check: correlation matrix (pairwise)
- High correlation (like > 0.8 or 0.9) between two predictors is a warning sign.
- But pairwise correlation can miss cases where one variable is predicted by a combination of several others.
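A small synthetic sketch of that blind spot (made-up variables): `x3` is almost exactly `x1 + x2`, yet no single pairwise correlation crosses a 0.8 red-flag threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# x3 is a near-exact combination of x1 and x2 (plus tiny noise)
x3 = x1 + x2 + rng.normal(scale=0.1, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

corr = df.corr()
print(corr.round(2))
# x3's pairwise correlation with x1 and with x2 is only about 0.7 each,
# even though x3 is almost perfectly determined by x1 and x2 together.
```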
That's why we use Tolerance and VIF.
🛡 Tolerance and Variance Inflation Factor (VIF)
I. The key idea behind Tolerance and VIF
For each predictor $x_j$:
- Regress $x_j$ on all the other predictors.
  - Example: treat $x_1$ as the temporary target, and predict it using $x_2, x_3, \ldots, x_p$.
- Compute $R_j^2$ from that regression.
  - This tells us how well the other predictors explain $x_j$.

If $R_j^2$ is high, $x_j$ is largely redundant given the other predictors.
II. What is "Tolerance"?
For predictor $x_j$:

$$\text{Tolerance}_j = 1 - R_j^2$$

Interpretation
- Tolerance is the fraction of $x_j$'s variance that is not explained by the other predictors.
- Low tolerance means "this variable is mostly predictable from the others."
Rules of thumb
- Tolerance < 0.10 ➛ strong multicollinearity concern
- Tolerance < 0.20 ➛ moderate concern
III. What is Variance Inflation Factor (VIF)?
VIF is the reciprocal of tolerance:

$$\text{VIF}_j = \frac{1}{\text{Tolerance}_j} = \frac{1}{1 - R_j^2}$$

★ Interpretation (very important)
VIF tells you how much the variance of the coefficient estimate $\hat{\beta}_j$ is inflated by multicollinearity.
- VIF = 1: no multicollinearity.
- 1 < VIF < 5: moderate multicollinearity.
  - Generally acceptable in models.
  - Variance of that coefficient is inflated by a factor of VIF (standard error increases by $\sqrt{\text{VIF}}$).
- 5 ≤ VIF < 10: moderate multicollinearity that may be problematic.
- VIF ≥ 10: serious multicollinearity; reduces model reliability and suggests coefficients are poorly estimated.
Rules of thumb
- VIF > 5 → investigate
- VIF > 10 → serious multicollinearity warning
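Since VIF is just 1 / (1 − R²), the rules of thumb map directly onto R² values; a quick sanity check:

```python
# Tolerance and VIF computed directly from R^2, per their definitions
def tolerance(r2):
    return 1.0 - r2

def vif(r2):
    return 1.0 / (1.0 - r2)

for r2 in (0.0, 0.5, 0.8, 0.9):
    print(f"R^2={r2:.1f}  tolerance={tolerance(r2):.2f}  VIF={vif(r2):.1f}")
# R^2 = 0.8 gives VIF = 5 ("investigate"); R^2 = 0.9 gives VIF = 10 ("serious")
```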
What to do about multicollinearity (practical fixes)
★ Option 1: Remove one of the redundant predictors
- Keep the one that is:
- more meaningful for the problem domain
- more reliable / easier to measure
- more directly related to the target (conceptually)
- Don’t just drop based on P-Value alone (multicollinearity can make p-values misleading).
★ Option 2: Combine predictors
- If variables measure the same concept, combine them:
- average / sum
- create an index/score
- use ratios (carefully)
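A sketch of the combine-into-an-index idea, reusing the hypothetical square-footage / room-count example from earlier (all numbers invented): standardize the correlated measures, then average them into a single "size" score.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
# Two highly correlated "size" measures (synthetic)
sqft = rng.normal(1500, 300, size=n)
rooms = sqft / 250 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"sqft": sqft, "rooms": rooms})

# Standardize each measure, then average into one index
z = (df - df.mean()) / df.std()
df["size_index"] = z.mean(axis=1)

print(df.corr().round(2))
# The model can now use size_index instead of two redundant predictors.
```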
★ Option 3: Do NOT include all dummy columns + intercept
- If a categorical variable has $k$ levels:
  - one-hot encoding creates $k$ dummy columns, but with an intercept those dummies sum to 1 → perfect multicollinearity (the "dummy variable trap")
- So you must do one of:
  - drop one dummy level (`drop_first=True`) and keep the intercept (most common)
  - OR keep all dummy levels but remove the intercept (less common)
- Refer to "Multicollinearity for Categorical Data" for details on why.
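A quick demonstration with pandas `get_dummies` (toy data): with all $k$ dummies, every row sums to 1, exactly duplicating the intercept column.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red", "green"]})

# All k dummy columns: rows sum to 1, so with an intercept this is
# perfect multicollinearity (the dummy variable trap)
full = pd.get_dummies(df["color"], dtype=float)
print(full.sum(axis=1).tolist())  # every row sums to 1.0

# drop_first=True keeps k-1 dummies, avoiding the trap when an intercept is used
reduced = pd.get_dummies(df["color"], drop_first=True, dtype=float)
print(reduced.columns.tolist())
```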
★ Option 4: Use dimensionality reduction
- PCA: converts correlated predictors into uncorrelated components.
- Good for prediction; harder for interpretation.
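A minimal PCA sketch with scikit-learn, on made-up correlated data: the resulting components are uncorrelated, and almost all variance lands in the first one.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # highly correlated with x1
X = np.column_stack([x1, x2])

pca = PCA()
Z = pca.fit_transform(X)

# Components are uncorrelated by construction (off-diagonal ~ 0)
print(np.corrcoef(Z[:, 0], Z[:, 1]).round(3))
# Nearly all the variance is captured by the first component
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```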
★ Option 5: Use regularization
- Ridge regression is specifically helpful with multicollinearity (stabilizes coefficients).
- Lasso may drop one of the correlated variables (but can be unstable when predictors are highly correlated).
- Elastic Net often works well when predictors are correlated.
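A sketch comparing OLS and Ridge coefficient stability on near-collinear synthetic data (all values invented; `alpha=1.0` is an arbitrary choice, not a recommendation):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 + 2 * x2 + rng.normal(scale=0.5, size=n)

def coef_spread(model, n_boot=30):
    # Std of each coefficient across bootstrap refits
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    return np.std(coefs, axis=0)

ols_spread = coef_spread(LinearRegression())
ridge_spread = coef_spread(Ridge(alpha=1.0))
print("OLS   coef std:", ols_spread.round(3))
print("Ridge coef std:", ridge_spread.round(3))
# Ridge's penalty stabilizes the split between the correlated coefficients
```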
★ Option 6: Collect more data (sometimes)
- More data can reduce coefficient variance, but it does not remove the underlying redundancy.
Python code examples
1. Compute VIF in Python (statsmodels)
```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def compute_vif(X: pd.DataFrame, add_intercept=True):
    X_ = X.copy()
    # statsmodels expects numeric dtype
    X_ = X_.apply(pd.to_numeric)
    if add_intercept:
        X_ = sm.add_constant(X_, has_constant='add')
    vif = pd.DataFrame({
        "feature": X_.columns,
        "VIF": [variance_inflation_factor(X_.values, i)
                for i in range(X_.shape[1])]
    })
    # Usually we don't interpret the intercept's VIF
    return vif.sort_values("VIF", ascending=False)
```
- VIF requires no missing values in `X`.
- It assumes predictors are numeric (categorical variables must be encoded first).
- Don't obsess over the constant's VIF; interpret VIF for the real predictors.
Example 1: Using compute_vif on the "California housing dataset (sklearn)"
```python
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load dataset
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Compute VIF for all features
vif_table = compute_vif(df)
print(vif_table)
```
Output

```text
      feature           VIF
0       const  17082.623698
7    Latitude      9.297624
8   Longitude      8.962263
3    AveRooms      8.342786
4   AveBedrms      6.994995
1      MedInc      2.501295
2    HouseAge      1.241254
5  Population      1.138125
6    AveOccup      1.008324
```
2. VIF with scikit-learn (manual approach)
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def compute_vif_sklearn(X: pd.DataFrame):
    X = X.astype(float)
    cols = X.columns
    vifs = []
    for col in cols:
        # Regress this predictor on all the others
        y_j = X[col].values
        X_others = X.drop(columns=[col]).values
        model = LinearRegression()
        model.fit(X_others, y_j)
        r2 = model.score(X_others, y_j)
        vif = 1.0 / (1.0 - r2) if r2 < 1 else np.inf
        vifs.append(vif)
    return pd.DataFrame({"feature": cols, "VIF": vifs}).sort_values("VIF", ascending=False)
```
Output (same example as Example 1)

```text
      feature       VIF
6    Latitude  9.297624
7   Longitude  8.962263
2    AveRooms  8.342786
3   AveBedrms  6.994995
0      MedInc  2.501295
1    HouseAge  1.241254
4  Population  1.138125
5    AveOccup  1.008324
```
Tips
I found these tips online and am adding them here for reference.
1. Interpreting VIF with one-hot encoded features
- Dummies from the same categorical variable are naturally related.
- VIF for individual dummy columns can be inflated, especially with:
- many levels
- rare categories
- strong relationship with other predictors
- Practical guidance
- If VIF is high mainly among the dummy columns of one categorical feature, that’s not always a “bug”—it can reflect that categories are imbalanced or strongly associated with other predictors.
- If VIF is high between a dummy and some numeric variable, that can indicate redundancy (e.g., a category almost perfectly implies a numeric range).
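A sketch of that dummy-vs-numeric redundancy case (hypothetical data: an income variable that almost determines a categorical "tier"; all names and thresholds are invented):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n = 300
# income almost perfectly implies which tier a row falls into
income = rng.normal(50, 15, size=n)
tier = pd.cut(income, bins=[-np.inf, 40, 60, np.inf], labels=["low", "mid", "high"])

df = pd.DataFrame({"income": income})
df = pd.concat([df, pd.get_dummies(tier, drop_first=True, dtype=float)], axis=1)

def vif_of(col, X):
    # VIF via the regress-on-the-others definition
    others = X.drop(columns=[col])
    r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
    return 1.0 / (1.0 - r2) if r2 < 1 else np.inf

for col in df.columns:
    print(col, round(vif_of(col, df), 2))
# The dummies' VIFs are elevated because the category is largely implied by income
```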