R-Squared (The "Accuracy Score" of your Model)

What is R-Squared?

R-Squared (R²), also called the Coefficient of Determination, is a single number between 0 and 1 that tells you: "What percentage of the variation in my outcome (Y) is explained by my predictor(s) (X)?"

🎯 Part 1: The Setup - What Are We Trying to Measure?

Imagine you are trying to predict something, like how well you'll do on a test based on how many hours you studied. R-Squared is the grade we give to our prediction "rule" to see how much of the story it actually tells.

You collect data from 50 students and fit a regression line:

$$\hat{y} = mx + b$$

Where:

  • $\hat{y}$ is the predicted score,
  • $m$ is the slope of the line,
  • $b$ is the intercept.

The Central Question: How well does this line predict the actual scores?

This is exactly what R-squared measures.


🧩 Part 2: The Intuition - Two Competing Models

Think of it as a competition between two models:

★ Model 1: "The Lazy Guesser" (The Mean). It ignores X entirely and predicts the average score for everyone.

★ Model 2: "The Smart Predictor" (The Regression Line). It uses X to tailor a prediction for each data point.

What does R-squared ask?

"How much did we reduce our errors by being smart instead of lazy?"


🧱 Part 3: The Building Blocks

★ Block 1: Understand the mean score

Imagine ignoring study hours completely. The best guess for any student's score would be:

$$\mu_y = \frac{\sum_{i=1}^{n} y_i}{n} \quad \text{where } y_i \text{ is the score of the } i\text{th of } n \text{ students.}$$

This is called the mean score $\mu_y$.

★ Block 2: The "Baseline" Model ➛ SST (Total Sum of Squares)

This measures the total variability if you ignored X completely and just guessed $\mu_y$ for everyone.

Concept: Imagine you had no model at all; you would simply guess the average value $\mu_y$ for every prediction. SST represents the total error of those "average" guesses.

$$SST = \sum_{i=1}^{n} (y_i - \mu_y)^2 \quad \text{where } y_i \text{ is the actual value of the } i\text{th observation.}$$

What it represents: the total variation in Y that any model could hope to explain.

★ Block 3: The "Smart" Model ➛ SSR (Regression Sum of Squares)

This measures how much of the variability your regression line successfully explained.

Concept: SSR represents how much "better" your regression line is at predicting the data compared to just guessing the average.

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \mu_y)^2$$

What it represents: the portion of the total variation that the model accounts for.

★ Block 4: The "Leftover Mistakes" ➛ SSE (Sum of Squared Errors)

This measures the variability that your model failed to explain (the residuals).

Concept: These are the "residuals." SSE represents the noise or factors that your features failed to capture.

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

What it represents: the error that remains even after using the model.

★ The Golden Equation

These three pieces always follow this relationship:

$$SST = SSR + SSE$$

In words: $$\boxed{\Large \text{Total Variation} = \text{Explained Variation} + \text{Unexplained Variation}}$$
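The Golden Equation is easy to verify numerically. Below is a minimal sketch in plain Python, using assumed toy data (five students' study hours and scores) and a least-squares line fitted by hand:

```python
# Assumed toy data: hours studied vs. test score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 68, 71, 79]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n  # the "lazy guesser" mu_y

# Least-squares slope m and intercept b for y_hat = m*x + b
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) \
    / sum((x - mean_x) ** 2 for x in hours)
b = mean_y - m * mean_x
y_hat = [m * x + b for x in hours]

sst = sum((y - mean_y) ** 2 for y in scores)              # total variation
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)             # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(scores, y_hat))  # leftover error

print(sst, ssr, sse)     # 430.0 422.5 7.5
print(sst == ssr + sse)  # True: the Golden Equation holds
```

With this particular data the sums happen to be exact in floating point; with messier data the two sides may differ by floating-point noise, so compare with a small tolerance in practice.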


🧮 Part 4: The R² Formula

Now we can define R-squared in two equivalent ways:

★ Method 1: The "Success Ratio"

$$R^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SSR}{SST}$$

Interpretation: "What fraction of the total mystery did we solve?"

★ Method 2: The "Mistake Reduction Ratio"

$$R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}} = 1 - \frac{SSE}{SST}$$

Interpretation: "Start with 100% mystery and subtract the fraction still unsolved; what remains is R²."
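Because $SST = SSR + SSE$, the two methods always give the same number. A small sketch, treating the SST/SSR/SSE values as assumed inputs from some fitted model:

```python
# Assumed decomposition from a fitted model (satisfies SST = SSR + SSE).
sst, ssr, sse = 430.0, 422.5, 7.5

r2_success = ssr / sst        # Method 1: fraction of variation explained
r2_reduction = 1 - sse / sst  # Method 2: one minus the unexplained fraction

print(round(r2_success, 4))   # 0.9826
print(round(r2_reduction, 4)) # 0.9826 -- same answer either way
```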


📊📉 Part 5: Visual Summary Table


| Component | Formula | What it tells you |
|---|---|---|
| SST | $\sum (y_i - \mu_y)^2$ | Total error if you just guessed the average. |
| SSR | $\sum (\hat{y}_i - \mu_y)^2$ | How much error you "fixed" by using the model. |
| SSE | $\sum (y_i - \hat{y}_i)^2$ | The error that remains after the model. |

🔢 Part 6: Worked Example

★ Given Data:

  • Total Sum of Squares: $SST = 200$
  • Sum of Squared Errors: $SSE = 50$

★ Calculation:

$$R^2 = 1 - \frac{50}{200} = 1 - 0.25 = 0.75$$
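The arithmetic is a one-liner to check:

```python
sse, sst = 50, 200  # the given totals
r2 = 1 - sse / sst
print(r2)  # 0.75 -- the model explains 75% of the variation
```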

🎓 Part 7: How to Interpret R² Values

For ordinary least-squares regression with an intercept, R² always falls between 0 and 1.

| R² Value | Meaning | Example Scenario |
|---|---|---|
| 0.90 - 1.00 | Excellent fit | Predicting height from arm length (biology) |
| 0.70 - 0.89 | Strong fit | Predicting grades from study hours |
| 0.40 - 0.69 | Moderate fit | Predicting happiness from income |
| 0.20 - 0.39 | Weak fit | Predicting stock prices from last month |
| 0.00 - 0.19 | Very weak | Predicting test scores from shoe size |

When is Low R² Acceptable?

In fields where outcomes are inherently noisy (human behavior, finance, medicine), even a low R² can represent a genuinely useful signal, as long as the model beats the baseline and its coefficients are meaningful.


🔗 Part 8: The Connection to Correlation (r)

For simple linear regression (one X, one Y):

$$R^2 = r^2$$

Where r is the Pearson correlation coefficient between X and Y.

To recover r from R²:

$$r = \pm\sqrt{R^2}$$

The sign rule: r takes the sign of the regression slope; a positive slope gives a positive r, a negative slope a negative r.
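This identity is easy to confirm. A sketch in plain Python with assumed toy data: Pearson's r is computed from the covariance and standard deviations, R² from the regression fit, and the two agree.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]       # e.g. study hours (assumed data)
y = [52, 60, 68, 71, 79]  # e.g. test scores (assumed data)
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Pearson correlation r
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / sqrt(sxx * syy)

# R^2 from the least-squares line y_hat = m*x + b
m = sxy / sxx
b = my - m * mx
sse = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
r2 = 1 - sse / syy

print(round(r ** 2, 6), round(r2, 6))  # the two match
```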


โš ๏ธ Part 9: Important Limitations

What Rยฒ Does NOT Tell You

  1. It doesn't prove causation: High R2 doesn't mean X causes Y
  2. It doesn't detect bias: A biased model can have high R2
  3. It doesn't validate assumptions: Check residual plots for patterns
  4. It rewards complexity: Adding variables always increases R2 (see Adjusted R2 below)


🎯 Part 10: The Problem with R² ➛ Enter "Adjusted R²"

⚠️ The Flaw in R²

Problem: R² will always increase (or stay the same) when you add more features, even if those features are pure random noise.

Why? The mathematical definition of R² rewards any reduction in SSE, no matter how tiny.

Example of the Problem
You have a model predicting house prices with sensible features (say, square footage and location).

Adding "favorite color" increased R² by 0.001, but it's obviously meaningless!

✅ The Solution: Adjusted R²

Adjusted R² penalizes you for adding features that don't pull their weight.

🤔 Why Is Adjusted R² Better for Your Model?

➛ In the case of multiple linear regression:

A. It Fights Overfitting

Overfitting happens when your model learns the "noise" in your data rather than the actual "signal." Adjusted R² alerts you when your model is getting too complex without adding real value.

B. It Guides Feature Selection

When you are deciding which features to keep:

  1. Add a feature; Adjusted R² charges a "penalty" for it.
  2. If Adjusted R² increases, the feature is adding value.
  3. If Adjusted R² decreases, the feature is likely noise and should be removed.

C. It Accounts for Sample Size (n)

The formula for Adjusted R² includes the number of data points (n). This is crucial because it's much easier to "fake" a high R² with a small dataset (e.g., 5 points and 4 features) than with a large one. Adjusted R² corrects for this bias.

๐Ÿ“ The Formula:

Adjustedย R2=1โˆ’SSE/(nโˆ’k)SST/(nโˆ’1)

Where:

๐Ÿ“ Alternative Form:

Adjustedย R2=1โˆ’[(1โˆ’R2)โ‹…nโˆ’1nโˆ’k]
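The two forms can be checked against each other. A sketch with assumed values (the SSE/SST from a two-parameter fit on five points):

```python
sse, sst = 7.5, 430.0  # assumed from a fitted model
n, k = 5, 2            # 5 observations; slope + intercept = 2 parameters

r2 = 1 - sse / sst
adj_direct = 1 - (sse / (n - k)) / (sst / (n - 1))   # main formula
adj_alt = 1 - (1 - r2) * (n - 1) / (n - k)           # alternative form

print(round(adj_direct, 4), round(adj_alt, 4))
# Both agree, and both sit slightly below the plain R^2.
```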

📈📉 The "Gap" Test

As a rule of thumb in my lab, I always look at the gap between the two: when Adjusted R² falls well below R², the model is carrying features that don't earn their keep.

★ Visual Comparison:

| Scenario | R² | Adjusted R² | Verdict |
|---|---|---|---|
| 3 features, all relevant | 0.85 | 0.84 | ✅ Good model |
| 10 features, 3 relevant | 0.87 | 0.72 | ⚠️ Overfitting! |
| 20 features, 2 relevant | 0.90 | 0.55 | 🚫 Terrible! Too complex |
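This gap pattern can be illustrated numerically. A hypothetical sketch: a noise feature shaves SSE only slightly, so R² ticks up while Adjusted R² drops (all numbers below are made up for illustration):

```python
def r2_pair(sse, sst, n, k):
    """Return (R^2, Adjusted R^2) for a model with k parameters on n points."""
    r2 = 1 - sse / sst
    adj = 1 - (1 - r2) * (n - 1) / (n - k)
    return r2, adj

sst, n = 400.0, 20  # hypothetical totals for a 20-point dataset

# Before: 3 parameters, SSE = 60. After a junk feature: SSE barely drops.
r2_before, adj_before = r2_pair(60.0, sst, n, 3)
r2_after, adj_after = r2_pair(59.5, sst, n, 4)

print(r2_after > r2_before)    # True: R^2 always creeps up
print(adj_after < adj_before)  # True: Adjusted R^2 punishes the junk feature
```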

📚 Part 11: Quick Summary

The One-Sentence Summary

  • R-Squared tells you what percentage of the variation in Y is predictable from X.
  • Adjusted R-Squared tells you if adding more features is actually helping or just making your model needlessly complex.

Key Formulas at a Glance

| Concept | Formula | Alternate Formula |
|---|---|---|
| R-Squared | $1 - \frac{SSE}{SST}$ | $\frac{SSR}{SST}$ |
| Adjusted R-Squared | $1 - \frac{SSE/(n-k)}{SST/(n-1)}$ | $1 - (1 - R^2) \cdot \frac{n-1}{n-k}$ |
| Correlation (simple regression) | $r = \pm\sqrt{R^2}$ | |

Final Takeaway

  • Use R² to see if your model explains a meaningful portion of the variation
  • Use Adjusted R² when comparing models with different numbers of features
  • Always visualize residuals to check if your model assumptions are valid
  • Remember: a high R² doesn't automatically mean a good model; context matters!