📊 The Statistical Toolbox: Residuals, Correlation, and Determination

🏗️ The Big Three: What are they?

| Concept | The "In Plain English" Definition | Where is it used? |
|---|---|---|
| Residual ($r_i$) | The "Error" for a single point (Actual - Predicted). | To see how "off" we were for one specific person/item. |
| Correlation ($r$) | The "Direction & Strength" of the relationship. | To see if two things (like height and weight) move together. |
| Determination ($R^2$) | The "Success Score" of the whole model. | To tell us what % of the variation we successfully explained. |

🔗 The Connection: How they fit together

Think of it like a puzzle:

  1. Correlation (r) is the raw relationship between X and Y.
  2. Residuals are the pieces that don't fit the pattern.
  3. Coefficient of Determination ($R^2$) is the final grade for the whole puzzle.

🚦 Usage Summary


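Two Ways to Calculate $R^2$: Which one and When?

There are two formulas for $R^2$. Here is the breakdown:

1. The "Shortcut" Method (using Correlation)

$$R^2 = r^2$$

Use this when you already have $r$. It applies only to a least-squares straight line with a single input variable $x$.

2. The "Residual" Method (using Sum of Squares)

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Use this when you have the predictions and their residuals. It works for any model, not just a straight line.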
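
As a rough sketch of when the two formulas agree, the Python below computes $R^2$ both ways for the four points used in the worked example later in these notes. The function names and the deliberately poor line $\hat{y} = x$ are assumptions made only for this illustration.

```python
from math import sqrt

def r_squared_shortcut(xs, ys):
    """Square of the Pearson correlation between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)
    return r ** 2

def r_squared_residual(ys, preds):
    """1 - SS_res / SS_tot for any list of predictions."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

best_fit = [2.5 * x - 2 for x in xs]  # least-squares line for these points
poor_fit = [1.0 * x for x in xs]      # hypothetical line y = x (not least-squares)

shortcut = r_squared_shortcut(xs, ys)
residual_best = r_squared_residual(ys, best_fit)
residual_poor = r_squared_residual(ys, poor_fit)

print(round(shortcut, 3), round(residual_best, 3), round(residual_poor, 3))
# prints: 0.893 0.893 0.286
```

The two methods match for the least-squares line, but only the residual method reflects how badly the other line fits.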

📉 Deep Dive: Residuals & Variation

1. Residual ($r_i$)

$$\text{Residual} = \text{Actual} - \text{Predicted}$$
$$r_i = y_i - \hat{y}_i$$

The Residual Check

Diagnostic Tool: Scientists plot residuals on a "Residual Plot" to see if there are patterns.

  • If they form a "U" shape, a straight line was the wrong choice.
  • If the residuals look like a random cloud, the linear model is appropriate.

If you see a pattern in your residuals (like a curve or a "fan" shape), stop! Your $R^2$ might be high, but your model is fundamentally wrong for the data.
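
To make the "U" shape concrete, here is a minimal sketch that fits a straight line to points generated from a curve and lists the residuals. The data ($y = x^2$ for $x = 0, \dots, 4$) and the helper `fit_line` are illustrative assumptions, not part of the example in these notes.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    m = sxy / sxx
    return m, my - m * mx

xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]  # curved data: 0, 1, 4, 9, 16

m, b = fit_line(xs, ys)
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]

mean_y = sum(ys) / len(ys)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot

print(residuals)     # [2.0, -1.0, -2.0, -1.0, 2.0]
print(round(r2, 3))  # 0.92
```

The residuals start positive, dip negative, and turn positive again, which is the "U" pattern, even though $R^2$ is about 0.92.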

2. Residual Sum of Squares ($SS_{res}$)

$$SS_{res} = \sum r_i^2 = \sum (y_i - \hat{y}_i)^2$$

3. Variance of Residuals ($V_r$)

$$V_r = \sigma_r^2 = \frac{SS_{res}}{n}$$

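4. Standard Deviation of Residuals ($\sigma_r$)

$$\sigma_r = \sqrt{V_r} = \sqrt{\frac{SS_{res}}{n}}$$

5. Root Mean Square Error ($SS_{line}$)

Used to measure accuracy in the same units as your data. With the divide-by-$n$ variance above, it equals the standard deviation of the residuals:

$$SS_{line} = \sigma_r$$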

6. Total Variation ($SS_{tot}$)

$$SS_{tot} = \sum (y_i - \mu_y)^2$$
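
The quantities defined in this section can be sketched as one small helper. The function name `residual_summary` and its dictionary keys are assumptions for illustration. The input values are the actual and predicted $y$ values from the worked example that follows, and the divide-by-$n$ (population) forms are used to match the formulas above.

```python
from math import sqrt

def residual_summary(actual, predicted):
    """Return SS_res, V_r, sigma_r (RMSE) and SS_tot, dividing by n."""
    n = len(actual)
    mean_y = sum(actual) / n
    residuals = [y - p for y, p in zip(actual, predicted)]
    ss_res = sum(r ** 2 for r in residuals)          # unexplained variation
    v_r = ss_res / n                                 # variance of residuals
    sigma_r = sqrt(v_r)                              # std. dev. of residuals
    ss_tot = sum((y - mean_y) ** 2 for y in actual)  # total variation
    return {"ss_res": ss_res, "v_r": v_r, "sigma_r": sigma_r, "ss_tot": ss_tot}

summary = residual_summary([1, 2, 3, 6], [0.5, 3.0, 3.0, 5.5])
print(summary)
```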

🛠️ Working Example (Corrected)

Data points: (1,1), (2,2), (2,3), (3,6)

1: Summary Stats

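i. Means ($\mu_x, \mu_y$)

$$\mu_x = \frac{1+2+2+3}{4} = 2.0$$
$$\mu_y = \frac{1+2+3+6}{4} = 3.0$$

ii. Standard Deviations ($\sigma_x, \sigma_y$)

This measures the "spread" of our data around the mean. We'll use the population formula for this high school example: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$.

$$\sigma_x = \sqrt{\frac{(1-2)^2 + (2-2)^2 + (2-2)^2 + (3-2)^2}{4}} = \sqrt{\frac{1+0+0+1}{4}} = \sqrt{0.5} \approx 0.707$$
$$\sigma_y = \sqrt{\frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (6-3)^2}{4}} = \sqrt{\frac{4+1+0+9}{4}} = \sqrt{3.5} \approx 1.871$$
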
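
These summary statistics can be checked with Python's standard `statistics` module. This is a small sketch: `pstdev` is the population standard deviation (divide by $n$), which matches the formula chosen above, and the variable names are illustrative.

```python
from statistics import mean, pstdev

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

mu_x, mu_y = mean(xs), mean(ys)
sigma_x, sigma_y = pstdev(xs), pstdev(ys)  # population form: divide by n

print(mu_x, mu_y)
print(round(sigma_x, 3), round(sigma_y, 3))  # 0.707 1.871
```

Using `stdev` instead would divide by $n - 1$ and give slightly larger values.
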
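iii. Correlation Coefficient ($r$)

The correlation $r$ tells us the strength and direction of the linear relationship. Using the formula

$$r = \frac{\sum (x - \mu_x)(y - \mu_y)}{n \cdot \sigma_x \cdot \sigma_y}$$

which is the average of the products of the z-scores:

$$r = \frac{1}{n}\left[\left(\frac{1-2}{0.707}\right)\left(\frac{1-3}{1.871}\right) + \left(\frac{2-2}{0.707}\right)\left(\frac{2-3}{1.871}\right) + \left(\frac{2-2}{0.707}\right)\left(\frac{3-3}{1.871}\right) + \left(\frac{3-2}{0.707}\right)\left(\frac{6-3}{1.871}\right)\right]$$
$$r = \frac{1.512 + 0 + 0 + 2.268}{4}$$
$$r \approx 0.945$$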
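
The z-score calculation can be sketched in a few lines of Python. The function name `correlation` is an assumption for illustration, and it returns the individual z-score products alongside $r$ so each term can be checked by hand.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson r as the average product of population z-scores."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    sigma_x = sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    products = [
        ((x - mu_x) / sigma_x) * ((y - mu_y) / sigma_y)
        for x, y in zip(xs, ys)
    ]
    return sum(products) / n, products

r, products = correlation([1, 2, 2, 3], [1, 2, 3, 6])
print([round(p, 3) for p in products])
print(round(r, 3))
```

The first and last products come out near 1.512 and 2.268, and their average over the four points gives $r \approx 0.945$.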

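2: The Equation of the Regression Line ($\hat{y} = mx + b$)

For the least-squares line, the slope is $m = r \cdot \frac{\sigma_y}{\sigma_x}$ and the intercept is $b = \mu_y - m\mu_x$:

$$m = 0.945 \cdot \frac{1.871}{0.707} \approx 2.5$$
$$b = 3.0 - 2.5(2.0) = -2.0$$
$$\hat{y} = 2.5x - 2$$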
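
The predicted values used in the table that follows (0.5, 3.0, 3.0, 5.5) are consistent with the least-squares line through the four data points. A minimal sketch, assuming the usual formulas $m = \frac{\sum (x - \mu_x)(y - \mu_y)}{\sum (x - \mu_x)^2}$ (equivalent to $r \cdot \sigma_y / \sigma_x$) and $b = \mu_y - m\mu_x$, with `least_squares_line` as an illustrative name:

```python
def least_squares_line(xs, ys):
    """Slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mu_x) ** 2 for x in xs)
    m = s_xy / s_xx
    b = mu_y - m * mu_x
    return m, b

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

m, b = least_squares_line(xs, ys)
predictions = [m * x + b for x in xs]

print(m, b)         # 2.5 -2.0
print(predictions)  # [0.5, 3.0, 3.0, 5.5]
```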

3: Residual Breakdown Table

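| x | y (Actual) | $\hat{y}$ (Predicted) | Residual ($y - \hat{y}$) | Squared Residual | Total Variation $(y - \mu_y)^2$ |
|---|---|---|---|---|---|
| 1 | 1 | 0.5 | 0.5 | 0.25 | 4 |
| 2 | 2 | 3.0 | -1.0 | 1.00 | 1 |
| 2 | 3 | 3.0 | 0.0 | 0.00 | 0 |
| 3 | 6 | 5.5 | 0.5 | 0.25 | 9 |
| Sum | | | 0 | $SS_{res} = 1.5$ | $SS_{tot} = 14$ |

i. Residual/Unexplained Variation ($SS_{res}$)

$$SS_{res} = \sum (y - \hat{y})^2 = (0.5)^2 + (-1.0)^2 + (0.0)^2 + (0.5)^2 = 1.5$$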

4: Total Variation ($SS_{tot}$)

It is the starting baseline for all error calculations. It represents 100% of the movement we are trying to explain.

$$SS_{tot} = \sum (y - \mu_y)^2 = 4 + 1 + 0 + 9 = 14.0$$

5: FINAL Calculation of Coefficient of Determination

$$R^2 = r^2 = 0.945^2 \approx 0.893$$
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{1.5}{14} \approx 0.893$$

The Final Analysis

  1. How much variation is described by x?
  • 89.3% of the total variation in y is explained by the variation in x (the regression line).
  2. How much variation is NOT described by x?
  • $1 - R^2 = 1 - 0.893 = 0.107$, or 10.7%.

Interpretation: Only 10.7% of the variation is due to "noise" or other factors not included in our model.


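6: Additional Interpretations

$$V_r = \sigma_r^2 = \frac{SS_{res}}{n} = \frac{1.5}{4} = 0.375$$
$$\sigma_r = \sqrt{V_r} = \sqrt{0.375} \approx 0.612$$
$$SS_{line} = \sigma_r \approx 0.612$$

In other words, a typical prediction is off by roughly 0.61 units of y.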

Observations for "Correlation Coefficient" (r)

  • $r = 1$: perfect positive correlation
  • $r = 0$: no correlation
  • $r = 0.5$: weak positive correlation
  • $r = -0.5$: weak negative correlation
  • $r = -1$: perfect negative correlation
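
The extreme cases can be reproduced with a short sketch. The three small data sets and the name `pearson_r` are made up purely for illustration: points on a rising line, points on a falling line, and points with no linear trend.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4]
r_rising = pearson_r(xs, [3, 5, 7, 9])   # every point on y = 2x + 1
r_falling = pearson_r(xs, [4, 3, 2, 1])  # every point on y = 5 - x
r_flat = pearson_r(xs, [1, 2, 2, 1])     # rises then falls: no linear trend

print(round(r_rising, 3), round(r_falling, 3), round(r_flat, 3))
```

The three values come out at 1, -1, and 0 respectively. Note that the last data set still has a clear (curved) pattern, so $r = 0$ only rules out a straight-line relationship.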