Variance Threshold
Cutting the Noise: How to Use Variance Threshold to Shine a Spotlight on Your Data’s Strongest Features
Imagine you’re at a lively party, surrounded by hundreds of people, but only a handful of them are actually engaged in meaningful conversations. The others are either nodding silently, barely saying a word, or repeating the same old phrases over and over again. Wouldn’t it make sense to focus your attention on those who have something valuable to contribute?
One simple but powerful tool to filter out these “quiet” features is Variance Threshold. Think of it as your conversation detector — it spots the features that are essentially silent and helps you shift the spotlight to the ones that truly matter.
In this article, I’ll break down exactly what Variance Threshold is, how it works, and when to use it. Along the way, we’ll explore code examples, real-world scenarios, and some pitfalls to avoid while using this technique.
### Formula
Variance measures how far a feature's values spread around their mean. For a feature $x$ with $n$ samples and mean $\bar{x}$, the sample variance is:

$$\mathrm{Var}(x) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

A feature whose values barely move around the mean has a variance near zero — and that is exactly what Variance Threshold filters out.
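This definition is easy to sanity-check in code (the numbers below are made up). Note that pandas' `.var()` defaults to the *sample* variance (`ddof=1`), while NumPy's `np.var` defaults to the *population* variance (`ddof=0`) — the latter is what scikit-learn's `VarianceThreshold` computes internally, so thresholds tuned against `X.var()` can differ slightly:

```python ln=false
import numpy as np
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample variance by hand: squared deviations from the mean, divided by n - 1
manual = ((x - x.mean()) ** 2).sum() / (len(x) - 1)

print(manual)     # 4.571... (= 32 / 7)
print(x.var())    # pandas default: ddof=1 (sample variance)
print(np.var(x))  # NumPy default: ddof=0 (population variance) -> 4.0
```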
### ★ Code Example (without normalization)
```python ln=false
import pandas as pd
import numpy as np

# Load the dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

# Convert to a DataFrame; keep the target separately
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target

X.sample()
```
First, let's look at the default variance of each feature, without any preprocessing or normalization:
```python ln=false
# Variance of each feature in the raw dataset
default_variance = X.var()

# Collect into a DataFrame, sorted ascending
df_coef = pd.DataFrame({"coef": np.round(default_variance, 2)}).sort_values("coef")
df_coef
```
```python ln=false
from sklearn.feature_selection import VarianceThreshold

# Remove features with variance < 5.0
selector = VarianceThreshold(threshold=5.0)
selected_features = selector.fit_transform(X)

# List all selected features
df_coef.loc[X.columns[selector.get_support()]].sort_values(by="coef")
```
<img src="Learning/Stats/Pictures/vt_3.png" width="8%">
### ★ The Critical Need for Normalization ⚖️
As we can see, the variances of the features differ enormously. If one feature has values between 0 and 1 and another between 1,000 and 5,000, their variances will be on completely different scales — simply because variance depends on each feature's range. Applying Variance Threshold without first normalizing your data is therefore a common and critical mistake.
Variance is **scale-dependent**. This means the value of the variance is directly influenced by the units and range of the feature.
**Analogy:** Imagine you have two features:
- **Age:** Ranging from 20 to 60 (Variance might be around 150).
- **Annual Salary:** Ranging from $50,000 to $120,000 (Variance will be in the hundreds of millions).
If you apply a variance threshold without normalizing, the algorithm will conclude that **Age** has almost zero variance compared to **Salary** and will likely discard it. This conclusion is misleading; it's based on the feature's scale, not its actual information content.
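A minimal sketch of this effect, with synthetic ages and salaries in roughly the ranges above (the threshold of 1,000 is arbitrary — any value meaningful for the salary scale wipes out the age feature):

```python ln=false
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "age": rng.uniform(20, 60, size=500),              # variance ~ 133
    "salary": rng.uniform(50_000, 120_000, size=500),  # variance ~ 4e8
})

selector = VarianceThreshold(threshold=1_000.0)
selector.fit(X_demo)
print(X_demo.columns[selector.get_support()].tolist())  # only 'salary' survives
```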
#### ★ Pros of Normalization
- Applying VarianceThreshold on raw data might remove only low-scale features, which may not be ideal.
- Normalization ensures fair variance comparisons.
- Some methods (e.g., PCA, L1-based selection) work better when features are normalized.
- Normalization before VarianceThreshold can ensure smoother integration with other techniques.
### ★ Code Example (with normalization)
```python ln=false
from sklearn.preprocessing import MinMaxScaler
# Normalize
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# get variance of normalized dataset
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
default_variance = X_scaled.var()
# To Dataframe
df_coef = pd.DataFrame({"coef":np.round(default_variance, 4)}).sort_values("coef")
# display
df_coef
```
As we can see, with `MinMaxScaler` every feature now lies in the 0–1 range, so the variances are directly comparable. With the scale differences removed, Variance Threshold will pick a very different set of features from the one we saw above.

```python ln=false
# Remove features with variance < 0.01
selector = VarianceThreshold(threshold=0.01)
X_selected_scaled = selector.fit_transform(X_scaled)

# List all selected features
df_coef.loc[X.columns[selector.get_support()]].sort_values(by="coef")
```
Surprise! After normalization, a very different set of features turns out to have high variance and be deemed relevant.
### 🏆 Strategic Advantages
- Fast and Simple: It is computationally inexpensive since it only involves calculating the variance of each feature.
- Removing Redundant Features: Features with low variance (e.g., nearly constant values across all samples) typically don’t contribute significantly to model performance, as they don’t differentiate between examples.
- Preprocessing for Dimensionality Reduction: Helps reduce the dimensionality of your dataset to speed up training and improve model interpretability.
- Prevents Overfitting: Eliminating constant or near-constant features reduces noise and redundancy.
### ⚠️ Constraints
- Does Not Consider Target Variable: Unlike mutual information or correlation-based methods, it does not check whether the feature is important for prediction.
- May Remove Useful Low-Variance Features: Some features might have low variance but still be useful for distinguishing between classes (e.g., binary features like “Has Disease: Yes/No”).
- Not Effective for Normalized Data: If data is standardized (e.g., mean = 0, variance = 1), variance thresholding may not be meaningful.
- Doesn’t Handle Multicollinearity: Variance threshold doesn’t detect or remove features that are highly correlated or redundant in relation to other features.
- Numerical Only: VarianceThreshold works only on numerical data directly, as variance is a numerical concept. It cannot be directly applied to categorical features unless they are encoded numerically (e.g., one-hot encoding).
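The "useful low-variance feature" constraint is easy to reproduce. In the sketch below (synthetic data with made-up column names), a rare binary flag has a variance of roughly 0.0099 and is discarded at a threshold of 0.05 — even though such a flag could be highly predictive:

```python ln=false
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

n = 1000
fraud_flag = np.zeros(n)
fraud_flag[:10] = 1.0  # rare event: only 1% positives, variance ~ 0.01 * 0.99

X_demo = pd.DataFrame({
    "fraud_flag": fraud_flag,
    "amount": np.random.default_rng(1).normal(100.0, 25.0, n),  # variance ~ 625
})

selector = VarianceThreshold(threshold=0.05)
selector.fit(X_demo)
print(X_demo.columns[selector.get_support()].tolist())  # the flag is gone
```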
### 🔺 Caution
- Threshold Selection: Selecting an appropriate variance threshold requires domain knowledge; too high a threshold may remove informative features, while too low may retain irrelevant ones.
- Encoded Variables: If categorical variables are one-hot encoded, low variance can remove important categories appearing infrequently, potentially losing critical information.
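The one-hot pitfall, sketched with a made-up three-level category in which "premium" appears in only 2% of rows. A dummy column for a category with frequency $p$ has variance $p(1-p)$, so rare categories always land near zero:

```python ln=false
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

plan = pd.Series(["basic"] * 60 + ["standard"] * 38 + ["premium"] * 2, name="plan")
X_plan = pd.get_dummies(plan).astype(float)

# Variance of the 'premium' dummy: 0.02 * 0.98 ~ 0.0196, below the threshold
selector = VarianceThreshold(threshold=0.05)
selector.fit(X_plan)
print(X_plan.columns[selector.get_support()].tolist())  # 'premium' is dropped
```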
### When Is Variance Threshold the Best Choice?
- Eliminating Constant or Near-Constant Features — Removes features with minimal variation that contribute little information.
- Preprocessing High-Dimensional Data — In datasets with many features (e.g., text, genetics, image data), quickly discarding uninformative features can enhance model efficiency.
- Reducing Feature Space Before Expensive Selection Methods — Helps trim down the number of features before using more complex techniques like Recursive Feature Elimination (RFE) or tree-based feature importance.
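The last pattern can be sketched as a scikit-learn `Pipeline` — the estimator and parameter choices below are illustrative, not prescriptive:

```python ln=false
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_demo, y_demo = make_classification(n_samples=200, n_features=30,
                                     n_informative=5, random_state=0)

pipe = Pipeline([
    # Step 1: cheap filter removes (near-)constant features first
    ("vt", VarianceThreshold(threshold=0.0)),
    # Step 2: the more expensive wrapper method runs on what is left
    ("rfe", RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_demo, y_demo)
print(pipe.score(X_demo, y_demo))
```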
### When to Avoid Variance Threshold?
- If features are on different scales, their variances are not comparable; normalize first.
- If the dataset contains categorical features with low variance but high predictive power (e.g., fraud detection, where 99% of cases are “No Fraud” and 1% are “Fraud”).
- When a method that considers the relationship between features and the target variable is needed (instead, use techniques like mutual information or tree-based feature importance).
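For that last case, a target-aware filter such as mutual information is a drop-in alternative. A minimal sketch on synthetic data:

```python ln=false
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, random_state=0)

# Keep the k features with the highest mutual information with the target
selector = SelectKBest(mutual_info_classif, k=3)
X_selected = selector.fit_transform(X_demo, y_demo)
print(X_selected.shape)  # (300, 3)
```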
### Common Questions

**What is the difference between Variance Threshold and Dispersion Ratio?**

Both are filter methods that score features without looking at the target. Variance Threshold uses each feature's absolute variance, while the Dispersion Ratio is commonly defined as the ratio of a feature's arithmetic mean to its geometric mean. It is worth computing both on the same dataset to see how differently they rank features.

**Is Variance Threshold used for feature selection or feature elimination?**

Both, depending on how you look at it:

- ✅ **Feature Selection:** If you use `VarianceThreshold()` to retain only the high-variance features, it behaves as a feature selection method. Example: keeping only numerical features with significant variance.
- ✅ **Feature Elimination:** If you use `VarianceThreshold()` to remove low-variance features, it behaves as a feature elimination method. Example: removing a column with the same value (zero variance) across all rows.

```python ln=false
from sklearn.feature_selection import VarianceThreshold

# Removes features with variance < 0.01
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)
```