R-Squared (The "Accuracy Score" of your Model)

What is R-Squared?

R-Squared (R²), also called the Coefficient of Determination, is a single number, typically between 0 and 1, that tells you: "What percentage of the variation in my outcome (Y) is explained by my predictor(s) (X)?"

🎯 Part 1: The Setup - What Are We Trying to Measure?

Imagine you are trying to predict something, like how well you'll do on a test based on how many hours you studied. R-Squared is the grade we give to our prediction "rule" to see how much of the story it actually tells.

You collect data from 50 students and fit a regression line:

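$$\hat{y} = mx + b$$

Where:

  • $\hat{y}$ is the predicted test score
  • $x$ is the number of hours studied
  • $m$ is the slope and $b$ is the intercept of the line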

The Central Question: How well does this line predict the actual scores?

This is exactly what R-squared measures.
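
As a minimal sketch of this setup, the snippet below fits the line by ordinary least squares and lists how far each prediction lands from the actual score. The five data points are hypothetical (the 50-student dataset is not shown), and the closed-form slope and intercept formulas are the standard ones for a single predictor.

```python
# Hypothetical data: hours studied (x) and test scores (y) for five students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 61.0, 70.0, 74.0]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least-squares slope m and intercept b for y_hat = m*x + b.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / sum(
    (x - mean_x) ** 2 for x in hours
)
b = mean_y - m * mean_x

predicted = [m * x + b for x in hours]
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

print(f"y_hat = {m:.2f} * x + {b:.2f}")
for x, y, y_hat, e in zip(hours, scores, predicted, residuals):
    print(f"hours={x:.0f}  actual={y:.1f}  predicted={y_hat:.1f}  residual={e:+.1f}")
```

The residual column is the raw material for everything that follows: R-squared summarizes those gaps in a single number.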


🧩 Part 2: The Intuition - Two Competing Models

Think of it as a competition between two models:

★ Model 1: "The Lazy Guesser" (The Mean)

★ Model 2: "The Smart Predictor" (The Regression Line)

What does R-squared ask?

"How much did we reduce our errors by being smart instead of lazy?"


🧱 Part 3: The Building Blocks

★ Block 1: Understand mean score

Imagine ignoring study hours completely. The best guess for any student’s score would be:

$$\mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i$$

where $y_i$ is the score of the $i$-th of the $n$ students.

This is called the mean score $\mu_y$.

★ Block 2: The "Baseline" Model ➛ SST (Total Sum of Squares)

This measures total variability if you ignored X completely and just guessed $\mu_y$ for everyone.

Concept: Imagine you had no model at all; you would simply guess the average value $\mu_y$ for every prediction. SST represents the total error of those "average" guesses.

$$SST = \sum_{i=1}^{n} (y_i - \mu_y)^2$$

where $y_i$ is the actual value of the $i$-th observation.

What it represents:

★ Block 3: The "Smart" Model ➛ SSR (Regression Sum of Squares)

This measures how much of the variability your regression line successfully explains.

Concept: SSR represents how much "better" your regression line is at predicting the data compared to just guessing the average.

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \mu_y)^2$$

where $\hat{y}_i$ is the value the model predicts for the $i$-th observation.

What it represents:

★ Block 4: The "Leftover Mistakes" ➛ SSE (Sum of Squared Errors)

This measures the variability that your model failed to explain (the residuals).

Concept: SSE is built from the "residuals", the gaps between each actual value and its prediction. It represents the noise or factors that your features failed to capture.

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

What it represents:

★ The Golden Equation

For a least-squares regression line with an intercept, these three pieces always follow this relationship:

$$SST = SSR + SSE$$

In words: $$\boxed{\Large \text{Total Variation} = \text{Explained Variation} + \text{Unexplained Variation}}$$
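
To make the decomposition concrete, here is a small sketch that computes all three sums of squares for a least-squares line and checks that they add up. The data are hypothetical; the identity is only expected to hold for a least-squares fit that includes an intercept.

```python
# Hypothetical data: hours studied (x) and test scores (y) for five students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 61.0, 70.0, 74.0]

n = len(scores)
mean_x = sum(hours) / n
mean_y = sum(scores) / n          # the "lazy guesser" prediction

# Least-squares line y_hat = m*x + b (the "smart predictor").
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / sum(
    (x - mean_x) ** 2 for x in hours
)
b = mean_y - m * mean_x
predicted = [m * x + b for x in hours]

sst = sum((y - mean_y) ** 2 for y in scores)                         # total variation
ssr = sum((y_hat - mean_y) ** 2 for y_hat in predicted)              # explained variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(scores, predicted))   # unexplained variation

print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")   # SST = 320.0, SSR = 313.6, SSE = 6.4
print(f"SSR + SSE = {ssr + sse:.1f}")                         # SSR + SSE = 320.0
```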


🧮 Part 4: The R² Formula

Now we can define R-squared in two equivalent ways:

★ Method 1: The "Success Ratio"

$$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SSR}{SST}$$

Interpretation: "What fraction of the total mystery did we solve?"

★ Method 2: The "Mistake Reduction Ratio"

$$R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}} = 1 - \frac{SSE}{SST}$$

Interpretation: "If we start with 100% mystery, how much is left over?"
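
The two methods give the same number whenever the relationship SST = SSR + SSE holds. A short derivation, substituting SSR = SST − SSE into Method 1:

$$R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$$

If the decomposition does not hold (for example, a model fitted without an intercept), the two ratios can disagree, so it is worth knowing which one a given tool reports.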


📊📉 Part 5: Visual Summary Table

[Figure: r-sq-1.webp]

| Component | Formula | What it tells you |
|---|---|---|
| SST | $\sum (y_i - \mu_y)^2$ | Total error if you just guessed the average. |
| SSR | $\sum (\hat{y}_i - \mu_y)^2$ | How much error you "fixed" by using the model. |
| SSE | $\sum (y_i - \hat{y}_i)^2$ | The error that remains after the model. |

🔢 Part 6: Worked Example

★ Given Data:

★ Calculation:

$$R^2 = 1 - \frac{50}{200} = 1 - 0.25 = 0.75$$
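
The same arithmetic as a tiny helper, reading the calculation above as SSE = 50 and SST = 200 (an assumption, since the given data are not listed here). The function name `r_squared_from_sums` is made up for this sketch.

```python
def r_squared_from_sums(sse, sst):
    """R^2 = 1 - SSE/SST. SST must be positive, i.e. the outcome must vary."""
    if sst <= 0:
        raise ValueError("SST must be positive.")
    return 1.0 - sse / sst

# Assumed reading of the worked example: SSE = 50, SST = 200.
r2 = r_squared_from_sums(sse=50.0, sst=200.0)
print(f"R^2 = {r2:.2f}")   # R^2 = 0.75
```

In words: under that reading, the model explains 75% of the variation in the outcome and leaves 25% unexplained.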

🎓 Part 7: How to Interpret R² Values

For a least-squares model with an intercept, evaluated on the data it was fitted to, the value of R² is always between 0 and 1.

| R² Value | Meaning | Example Scenario |
|---|---|---|
| 0.90 - 1.00 | Excellent fit | Predicting height from arm length (biology) |
| 0.70 - 0.89 | Strong fit | Predicting grades from study hours |
| 0.40 - 0.69 | Moderate fit | Predicting happiness from income |
| 0.20 - 0.39 | Weak fit | Predicting stock prices from last month |
| 0.00 - 0.19 | Very weak | Predicting test scores from shoe size |

When is Low R² Acceptable?


🔗 Part 8: The Connection to Correlation (r)

For simple linear regression (one X, one Y):

$$R^2 = r^2$$

Where r is the Pearson correlation coefficient between X and Y.

To recover r from R²:

$$r = \pm\sqrt{R^2}$$

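The sign rule: r takes the same sign as the slope of the regression line, so a rising line gives a positive r and a falling line gives a negative r.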
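
A quick numerical check of this connection, as a sketch on hypothetical data: it computes r and R² separately for an upward and a downward trend, then recovers r from R² by taking the sign of the fitted slope. The helper names `pearson_r` and `slope_and_r2` are invented for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def slope_and_r2(xs, ys):
    """Least-squares slope and R^2 = 1 - SSE/SST for a one-predictor line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - m * mean_x
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return m, 1.0 - sse / sst

hours = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [52.0, 58.0, 61.0, 70.0, 74.0]   # hypothetical scores, upward trend
falling = list(reversed(rising))          # same values, downward trend

results = {}
for label, ys in (("rising", rising), ("falling", falling)):
    m, r2 = slope_and_r2(hours, ys)
    r = pearson_r(hours, ys)
    recovered = math.copysign(math.sqrt(r2), m)   # r = ±sqrt(R^2), sign taken from the slope
    results[label] = (m, r2, r, recovered)
    print(f"{label}: slope={m:+.2f}  R^2={r2:.4f}  r={r:+.4f}  recovered r={recovered:+.4f}")
```

Both datasets share the same R², which is why R² alone cannot tell you the direction of the relationship.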


⚠️ Part 9: Important Limitations

What R² Does NOT Tell You

  1. It doesn't prove causation: High R² doesn't mean X causes Y
  2. It doesn't detect bias: A biased model can have high R²
  3. It doesn't validate assumptions: Check residual plots for patterns
  4. It rewards complexity: Adding variables never decreases R² (see Adjusted R² below)


🎯 Part 10: The Problem with R² ➛ Enter "Adjusted R²"

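⚠️ The Flaw in R²

Problem: R² will always increase (or stay the same) when you add more features, even if those features are pure random noise.

Why? R² = 1 − SSE/SST rewards any reduction in SSE, no matter how tiny, and a least-squares fit with an extra feature can never have a larger SSE than the fit without it: at worst, it sets the new coefficient to zero and reproduces the old fit.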
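
The effect is easy to reproduce. The sketch below, on synthetic data with a fixed random seed, fits one useful predictor and then keeps appending columns of pure noise, recording R² each time. It uses NumPy's `np.linalg.lstsq` for the fit; the helper `r_squared` and all data are made up for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added here)."""
    design = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)   # least-squares coefficients
    sse = np.sum((y - design @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

rng = np.random.default_rng(0)                 # fixed seed so the run is repeatable
n = 50
x_real = rng.uniform(0, 10, size=n)            # one useful predictor
y = 3.0 * x_real + rng.normal(0, 2.0, size=n)  # outcome depends only on x_real
noise = rng.normal(size=(n, 5))                # five pure-noise features

r2_values = []
for extra in range(6):                         # add 0, 1, ..., 5 noise columns
    X = np.column_stack([x_real, noise[:, :extra]])
    r2_values.append(r_squared(X, y))

print([round(v, 4) for v in r2_values])        # a non-decreasing sequence
```

The increases are usually tiny, but they never go the other way on the training data, which is exactly the flaw described above.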

Example of the Problem
You have a model predicting house prices with:

Adding "favorite color" increased R² by 0.001, but it's obviously meaningless!

✅ The Solution: Adjusted R²

Adjusted R² penalizes you for adding features that don't pull their weight.

🤔 Why Is Adjusted R² Better for Your Model?

In the case of multiple linear regression:

A. It Fights Overfitting

Overfitting happens when your model learns the "noise" in your data rather than the actual "signal." By using Adjusted R², you are alerted when your model is getting too complex without adding real value.

B. It Guides Feature Selection

When you are deciding which features to keep:

  1. Add a feature.
  2. Adjusted R² charges a "penalty" for every new feature you add.
  3. If Adjusted R² increases, the feature is adding value.
  4. If Adjusted R² decreases, the feature is likely noise and should be removed.
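
The steps above can be sketched as a simple keep-or-drop check. Everything here is illustrative: the data are synthetic, `fit_scores` and `keep_feature` are hypothetical helper names, and k is assumed to count every estimated coefficient including the intercept.

```python
import numpy as np

def fit_scores(X, y):
    """Return (R^2, adjusted R^2) for a least-squares fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    k = design.shape[1]                  # assumption: k counts the intercept too
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = np.sum((y - design @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / sst
    adj = 1.0 - (sse / (n - k)) / (sst / (n - 1))
    return r2, adj

def keep_feature(adj_before, adj_after):
    """Keep the new feature only if adjusted R^2 went up."""
    return adj_after > adj_before

rng = np.random.default_rng(1)           # fixed seed for repeatability
n = 40
useful = rng.normal(size=n)
second = rng.normal(size=n)
y = 2.0 * useful + 1.5 * second + rng.normal(0, 1.0, size=n)
candidate_noise = rng.normal(size=n)     # unrelated to y by construction

base = fit_scores(np.column_stack([useful]), y)
with_second = fit_scores(np.column_stack([useful, second]), y)
with_noise = fit_scores(np.column_stack([useful, second, candidate_noise]), y)

print("add 'second':", with_second, "keep?", keep_feature(base[1], with_second[1]))
print("add noise:   ", with_noise, "keep?", keep_feature(with_second[1], with_noise[1]))
```

A pure-noise column will still raise adjusted R² some of the time by chance, so this check works best as a screening step alongside domain knowledge or validation on held-out data.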

C. It Accounts for Sample Size (n)

The formula for Adjusted R² includes the number of data points (n). This is crucial because it’s much easier to "fake" a high R² with a small dataset (e.g., 5 points and 4 features) than with a large one. Adjusted R² corrects for this bias.
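
The "5 points and 4 features" case can be reproduced directly. In this sketch both the features and the outcome are random numbers with no relationship at all, yet a least-squares fit with an intercept has five coefficients for five points and passes through every point. The seed and variable names are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(42)                # fixed seed for repeatability
n_points, n_features = 5, 4
X = rng.normal(size=(n_points, n_features))    # pure-noise features
y = rng.normal(size=n_points)                  # outcome unrelated to X

design = np.column_stack([np.ones(n_points), X])    # 5 coefficients for 5 points
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
sse = np.sum((y - design @ coef) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - sse / sst

print(f"R^2 = {r2:.6f}")   # essentially 1, despite there being no real signal
```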

📝 The Formula:

$$\text{Adjusted } R^2 = 1 - \frac{SSE/(n-k)}{SST/(n-1)}$$

Where:

📝 Alternative Form:

$$\text{Adjusted } R^2 = 1 - \left[(1 - R^2)\,\frac{n-1}{n-k}\right]$$
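
Both forms can be checked against each other in a few lines. This is a sketch under stated assumptions: n is the number of observations, k is taken to be the number of estimated coefficients including the intercept, and the inputs (SSE = 50, SST = 200, n = 30, k = 4) are hypothetical. The function names are invented for the example.

```python
def adjusted_r2_from_sums(sse, sst, n, k):
    """Adjusted R^2 from the sums of squares: 1 - [SSE/(n-k)] / [SST/(n-1)]."""
    return 1.0 - (sse / (n - k)) / (sst / (n - 1))

def adjusted_r2_from_r2(r2, n, k):
    """Alternative form: 1 - (1 - R^2) * (n-1)/(n-k)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)

# Hypothetical inputs; k is assumed to include the intercept.
sse, sst, n, k = 50.0, 200.0, 30, 4
r2 = 1.0 - sse / sst                              # 0.75
adj_a = adjusted_r2_from_sums(sse, sst, n, k)
adj_b = adjusted_r2_from_r2(r2, n, k)

print(f"R^2 = {r2:.4f}")                          # R^2 = 0.7500
print(f"Adjusted R^2 = {adj_a:.4f} (from sums), {adj_b:.4f} (from R^2)")   # 0.7212 both ways
```

Because (n − 1)/(n − k) is greater than 1 whenever k > 1, adjusted R² sits below R², and the difference shrinks as n grows relative to k.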

📈📉 The "Gap" Test

As a rule of thumb in my lab, I always look at the gap between the two:

★ Visual Comparison:

| Scenario | R² | Adjusted R² | Verdict |
|---|---|---|---|
| 3 features, all relevant | 0.85 | 0.84 | ✅ Good model |
| 10 features, 3 relevant | 0.87 | 0.72 | ⚠️ Overfitting! |
| 20 features, 2 relevant | 0.90 | 0.55 | 🚫 Terrible! Too complex |

📚 Part 11: Quick Summary

The One-Sentence Summary

  • R-Squared tells you what percentage of the variation in Y is predictable from X.
  • Adjusted R-Squared tells you if adding more features is actually helping or just making your model needlessly complex.

Key Formulas at a Glance

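| Concept | Formula | Alternate Formula |
|---|---|---|
| R-Squared | $1 - \frac{SSE}{SST}$ | $\frac{SSR}{SST}$ |
| Adjusted R-Squared | $1 - \frac{SSE/(n-k)}{SST/(n-1)}$ | $1 - (1 - R^2)\frac{n-1}{n-k}$ |
| Correlation (for simple regression) | $r = \pm\sqrt{R^2}$ | |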

Final Takeaway

  • Use R² to see if your model explains a meaningful portion of the variation
  • Use Adjusted R² when comparing models with different numbers of features
  • Always visualize residuals to check if your model assumptions are valid
  • Remember: A high R² doesn't automatically mean a good model. Context matters!