Linear vs. Non-Linear Relationships

I. Linear Features

In a linear dataset, as feature $X$ increases, feature $Y$ increases or decreases at a constant rate. ➛ A feature has a linear relationship with the target if a fixed change in the input always produces the same change in the output.
Mathematical Form: $y = mx + b$, where $m$ is the constant rate of change (slope) and $b$ is the intercept.
Example: Predict house price based on square footage. As square footage increases, the price generally increases at a steady rate.
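
The "constant rate" idea can be checked numerically. The following is a minimal sketch, assuming made-up house data generated from $y = mx + b$ with $m = 150$ and $b = 50{,}000$ (both values are illustrative, not from any real dataset). It confirms that the rate of change between consecutive points is the same everywhere and recovers $m$ and $b$ with a degree-1 fit.

```python
import numpy as np

# Hypothetical data: price = 150 * sqft + 50_000 (illustrative values only)
sqft = np.array([800.0, 1000.0, 1200.0, 1400.0, 1600.0])
price = 150.0 * sqft + 50_000.0

# Rate of change between consecutive points: delta(price) / delta(sqft)
rates = np.diff(price) / np.diff(sqft)

# In a linear relationship every one of these rates equals the slope m
is_constant_rate = np.allclose(rates, rates[0])

# Recover m (slope) and b (intercept) with a degree-1 least-squares fit
m, b = np.polyfit(sqft, price, deg=1)

print("rates:", rates)
print("constant rate:", is_constant_rate)
print("m:", round(m, 3), "b:", round(b, 3))
```

Real data will contain noise, so the rates will only be approximately equal rather than identical.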

II. Non-Linear Features

In a non-linear dataset, $Y$ increases or decreases at a variable rate. ➛ A feature is non-linear if its relationship with the target changes at different levels of the input. These patterns are often more complex and "curved." In other words, a dataset is non-linear if the relationship between input ($X$) and output ($Y$) does not follow a straight line.
Example: Predict crop yield based on rainfall. Too little rain is bad, a medium amount is perfect, and too much rain causes flooding (an inverted U-shaped, roughly parabolic relationship).
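
The rainfall example can be sketched with a toy parabola. The numbers below are assumptions chosen only so that yield peaks at a medium amount of rain. Unlike the linear case, the rate of change is different at every level of the input and even flips sign.

```python
import numpy as np

# Hypothetical rainfall levels and a toy yield curve that peaks at rainfall = 50
rainfall = np.array([0.0, 20.0, 40.0, 60.0, 80.0, 100.0])
crop_yield = 100.0 - (rainfall - 50.0) ** 2 / 25.0

# Rate of change between consecutive points
rates = np.diff(crop_yield) / np.diff(rainfall)

# The rate is not constant: positive at low rainfall, negative at high rainfall
is_constant_rate = np.allclose(rates, rates[0])
rises_then_falls = rates[0] > 0 and rates[-1] < 0

print("rates:", rates)
print("constant rate:", is_constant_rate)
print("rises then falls:", rises_then_falls)
```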

How to Identify Them

Identifying these relationships is a critical step in Feature Selection, as some algorithms (like Linear Regression) struggle with non-linear data.

A. Visual Inspection (The "Eye Test")

Plot	Linear	Non-Linear
Scatter	Straight trend	Curve
Box	Steady increase	Irregular
Violin	Smooth shift	Uneven shift
Pair plot	Diagonal cloud	Curved cloud
Residual	Random	Pattern
Heatmap	High correlation	Might miss relationship
LOESS	Straight smooth	Curved smooth

1. Scatter Plot

Linear: Straight
Non-Linear: Curve
No relationship: Random cloud with no visible trend
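
A minimal sketch of the three scatter patterns, assuming synthetic data and matplotlib for plotting. The data-generating formulas and variable names are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)

# Synthetic targets (assumed shapes, for illustration)
y_linear = 2.0 * x + rng.normal(0, 1, 200)         # straight trend
y_curved = (x - 5.0) ** 2 + rng.normal(0, 1, 200)  # curve
y_none = rng.normal(0, 1, 200)                     # no relationship

titles = ["Linear", "Non-linear", "No relationship"]
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, y, title in zip(axes, [y_linear, y_curved, y_none], titles):
    ax.scatter(x, y, s=10)
    ax.set_title(title)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
fig.tight_layout()
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` to display the figure.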

2. Box Plot

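Linear: Steady step pattern (box medians shift by a similar amount from one group of $X$ to the next)
Non-Linear: Jumping pattern (box medians shift unevenly or change direction)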
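
The step pattern a box plot shows can also be approximated numerically. This is a rough sketch, assuming synthetic data and a hypothetical helper `binned_medians` that cuts $X$ into equal-width bins and takes the median of $Y$ in each bin (the same medians a box plot per bin would draw).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 400)
y_linear = 2.0 * x + rng.normal(0, 1, 400)
y_curved = (x - 5.0) ** 2 + rng.normal(0, 1, 400)

def binned_medians(x, y, n_bins=5):
    """Median of y inside each equal-width bin of x (hypothetical helper)."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Interior edges only, so bin indices run from 0 to n_bins - 1
    idx = np.digitize(x, edges[1:-1])
    return np.array([np.median(y[idx == k]) for k in range(n_bins)])

med_linear = binned_medians(x, y_linear)
med_curved = binned_medians(x, y_curved)

# Steps between neighbouring medians
print("linear steps:", np.round(np.diff(med_linear), 2))
print("curved steps:", np.round(np.diff(med_curved), 2))
```

For the linear target the steps should all have the same sign and similar size; for the curved target they change size and direction.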

3. Violin Plot

Linear: Smooth shift
Non-Linear: Irregular shift

4. Pair Plot

Linear: Diagonal band
Non-Linear: Curved band

5. Residual Plot

Linear: Residuals are randomly scattered around zero
Non-Linear: Residuals from a linear fit show a clear pattern or shape (for example, a curve)
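
A minimal sketch of the residual check, assuming synthetic data. Instead of plotting, it fits a straight line and averages the residuals over the low, middle, and high thirds of $X$: near-zero averages suggest random scatter, while large averages of alternating sign suggest a curve the line has missed. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 150)  # already sorted
y_linear = 3.0 * x + 2.0 + rng.normal(0, 1, x.size)
y_curved = (x - 5.0) ** 2 + rng.normal(0, 1, x.size)

def linear_residuals(x, y):
    """Residuals of a degree-1 least-squares fit (hypothetical helper)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

def thirds_mean(residuals):
    # x is sorted, so splitting by position splits into low / mid / high x
    return [part.mean() for part in np.array_split(residuals, 3)]

lin_thirds = thirds_mean(linear_residuals(x, y_linear))
cur_thirds = thirds_mean(linear_residuals(x, y_curved))

print("linear data:", np.round(lin_thirds, 2))
print("curved data:", np.round(cur_thirds, 2))
```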

6. LOESS

Linear: Straight smooth line
Non-Linear: Curved smooth line
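
One way to sketch the LOESS check without a plot, assuming the `lowess` smoother from statsmodels and synthetic data: compute the smooth curve, fit a straight line through it, and measure how far the smooth bends away from that line. The helper `bend_of_smooth` and the `frac=0.5` setting are illustrative choices, not a standard test.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y_linear = 2.0 * x + rng.normal(0, 1, 200)
y_curved = (x - 5.0) ** 2 + rng.normal(0, 1, 200)

def bend_of_smooth(x, y, frac=0.5):
    """Largest gap between the LOWESS curve and a straight line through it."""
    smooth = lowess(y, x, frac=frac)  # columns: sorted x, smoothed y
    xs, ys = smooth[:, 0], smooth[:, 1]
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return np.max(np.abs(ys - (slope * xs + intercept)))

bend_linear = bend_of_smooth(x, y_linear)
bend_curved = bend_of_smooth(x, y_curved)

print("bend (linear data):", round(bend_linear, 2))
print("bend (curved data):", round(bend_curved, 2))
```

A small bend relative to the spread of $Y$ points to a straight smooth; a large bend points to a curved one.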

7. Heatmap

Linear: Strong correlations (near +1 or -1)
Non-Linear: A correlation near 0 does not rule out a relationship. It may be non-linear, so a correlation heatmap can miss it.

8. Histogram Plot

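Linear: Bell shape (most informative when plotted for the residuals of a linear fit)
Non-Linear: Skewed / multi-peak (a hint only, since a histogram shows one variable's distribution rather than its relationship with the target)

9. KDE Plot

Linear: Single smooth bell (most informative when plotted for the residuals of a linear fit)
Non-Linear: Multiple peaks or skew (a hint only, for the same reason as the histogram)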

B. Statistical Metrics

You can use mathematical scores to quantify the type of relationship:

1. Pearson Correlation ($r$)

Calculate the Pearson correlation coefficient between each numerical feature and the target variable. It measures the strength of a linear relationship.

A value of $r$ near 1 or -1 indicates a strong linear relationship.
A value of $r$ near 0 indicates a weak linear relationship or none at all.
Limitations: Pearson's $r$ only captures linear relationships, so a strong non-linear relationship can still produce an $r$ near 0.
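
A short sketch of this limitation, assuming synthetic data: a perfectly linear target gives $r = 1$, while a perfectly determined but symmetric quadratic target gives $r \approx 0$.

```python
import numpy as np

x = np.linspace(-5, 5, 101)
y_linear = 2.0 * x + 1.0   # exact straight line
y_quadratic = x ** 2       # fully determined by x, but not linear

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is r between the two inputs
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print("r (linear):", round(r_linear, 3))
print("r (quadratic):", round(r_quadratic, 3))
```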

2. Spearman’s Rank Correlation (Monotonic Relationships)

If Pearson’s $r$ is low, also compute Spearman’s correlation, which detects monotonic (but not necessarily linear) relationships.

If Pearson’s $r$ is low but Spearman’s $\rho$ is high, the relationship is likely monotonic but non-linear. A non-monotonic relationship (such as a U-shape) can score low on both.
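
A minimal sketch comparing the two coefficients, assuming SciPy is available and using an exponential curve as a made-up example of a monotonic but non-linear relationship.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.linspace(0, 10, 101)
y = np.exp(x)  # always increasing, but far from a straight line

# Both functions return (statistic, p-value); only the statistic is used here
r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)

print("Pearson r:", round(r, 3))
print("Spearman rho:", round(rho, 3))
```

Because the ranks of $x$ and $y$ agree perfectly, Spearman's $\rho$ is 1, while Pearson's $r$ is noticeably lower.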

3. Mutual Information

Calculate the mutual information between each feature and the target variable.
Mutual information measures the statistical dependence between variables, capturing both linear and non-linear relationships.
A higher mutual information value indicates a stronger relationship.
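
A rough sketch, assuming scikit-learn's `mutual_info_regression` and synthetic data with one informative feature (quadratic in the target) and one pure-noise feature. The feature names are hypothetical. Pearson's $r$ is printed alongside to show what mutual information picks up that correlation does not.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
x_informative = rng.uniform(-5, 5, n)  # target depends on this (non-linearly)
x_noise = rng.uniform(-5, 5, n)        # target does not depend on this
y = x_informative ** 2 + rng.normal(0, 0.5, n)

# Features must be passed as a 2-D array: one column per feature
X = np.column_stack([x_informative, x_noise])
mi = mutual_info_regression(X, y, random_state=0)

r_informative = np.corrcoef(x_informative, y)[0, 1]

print("MI (informative):", round(mi[0], 3))
print("MI (noise):", round(mi[1], 3))
print("Pearson r (informative):", round(r_informative, 3))
```

Mutual information scores are estimates and have no fixed upper bound, so they are best compared across features rather than read as absolute values.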

4. Polynomial Regression (for Regression)

Fit polynomial regression models of different degrees to the data and compare linear regression (degree 1) with polynomial regression (degree > 1) using the R² score.
A higher-degree polynomial never scores a lower R² on the data it was fitted to, so make the comparison on held-out data.
Interpretation:
- If R² (Polynomial) >> R² (Linear), the relationship is likely non-linear.
- If both R² values are similar, a linear model is adequate.
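
The comparison can be sketched with NumPy alone. This is a minimal example, assuming synthetic quadratic data, a simple 200/100 hold-out split, and hypothetical helpers `r2_score` and `holdout_r2`. The scores are computed on the held-out part so that the extra flexibility of the polynomial is not rewarded automatically.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = (x - 5.0) ** 2 + rng.normal(0, 1, 300)

# Simple hold-out split (the data is already in random order)
x_train, x_test = x[:200], x[200:]
y_train, y_test = y[:200], y[200:]

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def holdout_r2(degree):
    """Fit a polynomial of the given degree on train, score it on test."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    return r2_score(y_test, np.polyval(coeffs, x_test))

r2_linear = holdout_r2(1)
r2_poly = holdout_r2(2)

print("R2 (linear):", round(r2_linear, 3))
print("R2 (polynomial, degree 2):", round(r2_poly, 3))
```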