Linear vs. Non-Linear Relationships
I. Linear Features
- In a linear dataset, if feature X increases, the output Y increases/decreases at a constant rate.
➛ A feature has a linear relationship if a constant change in the input results in a proportional, constant change in the output.
- Mathematical Form: Y = mX + b, where m is the constant rate of change (slope) and b is the intercept.
- Example: Predict house price based on square footage. As square footage increases, the price generally increases at a steady rate.
II. Non-Linear Features
- In a non-linear dataset, as X increases, Y increases/decreases at a variable rate.
➛ A feature is non-linear if its relationship with the target changes at different levels of the input. These patterns are often more complex and "curved." In other words, a dataset is non-linear if the relationship between input (X) and output (Y) does not follow a straight line.
- Example: Predict crop yield based on rainfall. Too little rain is bad, a medium amount is perfect, and too much rain causes flooding (an inverted U-shaped, roughly parabolic relationship).
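The difference between a constant and a variable rate of change can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the slope, intercept, and inverted-U formula are made-up illustrative values, not taken from any real dataset.

```python
import numpy as np

# Evenly spaced inputs, so every step in X has the same size (1.0).
x = np.arange(0.0, 6.0, 1.0)

# Hypothetical linear target: slope 3, intercept 2.
y_linear = 3.0 * x + 2.0
# Hypothetical non-linear target: an inverted U that peaks at x = 2.5.
y_curved = -(x - 2.5) ** 2 + 10.0

# np.diff gives the change in Y for each unit step in X.
linear_steps = np.diff(y_linear)
curved_steps = np.diff(y_curved)

print("linear steps:", linear_steps)  # the same value at every step
print("curved steps:", curved_steps)  # the value changes from step to step
```

For the linear target every step adds the same amount, while for the curved target the step size shrinks, reaches zero at the peak, and then turns negative.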
How to Identify Them
Identifying these relationships is a critical step in Feature Selection, as some algorithms (like Linear Regression) struggle with non-linear data.
A. Visual Inspection (The "Eye Test")
| Plot | Linear | Non-Linear |
|---|---|---|
| Scatter | Straight trend | Curve |
| Box | Steady increase | Irregular |
| Violin | Smooth shift | Uneven shift |
| Pair plot | Diagonal cloud | Curved cloud |
| Residual | Random | Pattern |
| Heatmap | High correlation | Might miss relationship |
| LOESS | Straight smooth | Curved smooth |
1. Scatter Plot
- Linear: Straight
- Non-Linear: Curve
- No relationship: Chaos
2. Box Plot
- Linear: Steady step pattern
- Non-Linear: Jumping pattern
3. Violin Plot
- Linear: Smooth shift
- Non-Linear: Irregular shift
4. Pair Plot
- Linear: Diagonal band
- Non-Linear: Curved band
5. Residual Plot
- Linear: Residuals are randomly scattered around zero
- Non-Linear: Residuals show a systematic pattern or shape (e.g., a curve)
6. LOESS
- Linear: Straight smooth line
- Non-linear: Curved smooth line
7. Heatmap
- Linear: Strong correlations (near +1 or -1)
- Non-Linear: Correlation near 0 only rules out a linear relationship; a non-linear one may still exist, so a heatmap can miss it.
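8. Histogram Plot (shows a single variable's distribution, so it is only an indirect hint)
- Linear: Bell shape
- Non-Linear: Skewed / multi-peak
9. KDE Plot (also a single-variable view, so only an indirect hint)
- Linear: Single smooth bell
- Non-Linear: Multiple peaks or skew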
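The residual check from the list above can also be done numerically, as a stand-in for looking at the plot. This is a rough sketch, assuming NumPy and synthetic quadratic data; the three-band split is an illustrative choice, not a standard test.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200)
# Hypothetical curved data: quadratic signal plus small noise.
y = x ** 2 + rng.normal(scale=0.5, size=x.size)

# Fit a straight line and compute residuals (observed minus fitted).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# x is sorted, so splitting the residuals in three gives left / middle / right bands.
bands = np.array_split(residuals, 3)
band_means = [float(band.mean()) for band in bands]
print(band_means)
```

If the relationship were linear, all three band means would sit close to zero. Here the outer bands come out positive and the middle band negative, which is the curved residual pattern that signals non-linearity.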
B. Statistical Metrics
You can use mathematical scores to quantify the type of relationship:
1. Pearson Correlation (r)
Calculate the Pearson correlation coefficient (r) between each numerical feature and the target variable. It measures the strength of a linear relationship.
- A high absolute value (r is near 1 or -1) indicates a strong linear relationship.
- A value of r near 0 indicates a weak or no linear relationship.
- Limitations: Only captures linear relationships.
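The limitation is easy to see on synthetic data. The sketch below assumes NumPy and uses two made-up targets: one linear with noise, and one symmetric parabola that depends entirely on the feature yet has almost no linear correlation with it.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-5.0, 5.0, 101)

# Hypothetical linear target with a little noise.
y_linear = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
# Hypothetical symmetric parabola: fully determined by x, but not linearly.
y_parabola = x ** 2

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r for the pair.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_parabola = np.corrcoef(x, y_parabola)[0, 1]

print(f"r (linear):   {r_linear:.3f}")
print(f"r (parabola): {r_parabola:.3f}")
```

The linear target gives r close to 1, while the parabola gives r close to 0 even though the target is a deterministic function of the feature.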
2. Spearman’s Rank Correlation (Monotonic Relationships)
If Pearson’s r is low, also compute Spearman’s correlation (ρ), which detects monotonic (but not necessarily linear) relationships.
- If Pearson’s r is low but Spearman’s ρ is high, the relationship is likely monotonic but non-linear.
- Limitations: Spearman’s ρ can still be near 0 for non-monotonic shapes such as a U-curve.
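To compare the two coefficients, Spearman's correlation can be computed as Pearson's correlation on ranks. The following is a minimal sketch, assuming NumPy and data without tied values; the helper names `pearson` and `spearman` are hypothetical, and the exponential target is made up for illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Spearman correlation: Pearson computed on ranks.

    Assumes no tied values; with ties, average ranks would be needed.
    """
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return pearson(rank_a, rank_b)

x = np.linspace(0.0, 10.0, 50)
y = np.exp(x)  # hypothetical monotonic but strongly curved target

pearson_r = pearson(x, y)
spearman_rho = spearman(x, y)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```

Because the target always rises with the feature, the ranks agree perfectly and Spearman's value is essentially 1, while Pearson's value is clearly lower. In practice, a library routine such as `scipy.stats.spearmanr` is the usual choice, since it also handles ties.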
3. Mutual Information
- Calculate the mutual information between each feature and the target variable.
- Mutual information measures the statistical dependence between variables, capturing both linear and non-linear relationships.
- A higher mutual information value indicates a stronger relationship.
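As a rough illustration, mutual information can be estimated from a 2-D histogram. This sketch assumes NumPy and synthetic data; the function name `binned_mutual_information` and the bin count are hypothetical choices, and the binned estimate is cruder than the nearest-neighbor estimators found in libraries such as scikit-learn (`mutual_info_regression`).

```python
import numpy as np

def binned_mutual_information(x, y, bins=10):
    """Rough mutual information estimate (in nats) from a 2-D histogram."""
    joint_counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint_counts / joint_counts.sum()   # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of y, shape (1, bins)
    outer = p_x @ p_y                          # product of marginals, shape (bins, bins)
    nonzero = p_xy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / outer[nonzero])))

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=5000)
y_dependent = x ** 2 + rng.normal(scale=0.3, size=x.size)  # non-linear dependence
y_independent = rng.normal(size=x.size)                     # unrelated to x

mi_dependent = binned_mutual_information(x, y_dependent)
mi_independent = binned_mutual_information(x, y_independent)
pearson_dependent = float(np.corrcoef(x, y_dependent)[0, 1])

print(f"MI (dependent):   {mi_dependent:.3f}")
print(f"MI (independent): {mi_independent:.3f}")
print(f"Pearson r (dependent): {pearson_dependent:.3f}")
```

The quadratic target scores a clearly higher mutual information than the unrelated one, even though its Pearson correlation with the feature is close to 0.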
4. Polynomial Regression (for Regression)
- Fit polynomial regression models of different degrees to the data.
- Compare the performance of linear regression (degree 1) to polynomial regression (degree > 1).
- If polynomial regression performs significantly better, it suggests a non-linear relationship.
- Use the R² score for the comparison, ideally on held-out data, because R² on the training data never decreases as the degree grows.
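The comparison described above can be sketched with NumPy's polynomial fitting. The data here is synthetic, the `r2_score` helper is written out by hand for self-containment, and the 200/100 train/test split is an arbitrary illustrative choice.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(1)
x = rng.uniform(-3.0, 3.0, size=300)
# Hypothetical curved data: quadratic plus linear term plus noise.
y = 0.5 * x ** 2 - x + rng.normal(scale=0.5, size=x.size)

# Hold out the last 100 points so the comparison is not made on training data.
x_train, x_test = x[:200], x[200:]
y_train, y_test = y[:200], y[200:]

r2_by_degree = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    r2_by_degree[degree] = r2_score(y_test, np.polyval(coeffs, x_test))

for degree, r2 in r2_by_degree.items():
    print(f"degree {degree}: test R² = {r2:.3f}")
```

On this data the jump from degree 1 to degree 2 is large, while degree 3 adds almost nothing, which points to a quadratic rather than a linear relationship.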
- Interpretation:
- If R² (Polynomial) >> R² (Linear), the data is non-linear.
- If both R² values are similar, the data is linear.
- If