📊 The Statistical Toolbox: Residuals, Correlation, and Determination

🏗️ The Big Three: What are they?

Concept	The "In Plain English" Definition	Where is it used?
Residual ( $r_{i}$ )	The "Error" for a single point (Actual - Predicted).	To see how far off we were for one specific person/item.
Correlation ( $r$ )	The "Direction & Strength" of the relationship.	To see if two things (like height and weight) move together.
Determination ( $R^{2}$ )	The "Success Score" of the whole model.	To tell us what % of the variation we successfully explained.

🔗 The Connection: How they fit together

Think of it like a puzzle:

Correlation ( $r$ ) is the raw relationship between $X$ and $Y$ .
Residuals are the pieces that don't fit the pattern.
Coefficient of Determination ( $R^{2}$ ) is the final grade for the whole puzzle.

🚦 Usage Summary

Use $r$ to see if there is a relationship at all.
Use $R^{2}$ to see how useful your model is for the big picture.
- $R^{2}$ represents the proportion of the total variation in the dependent variable ( $y$ ) that is explained by the independent variable ( $x$ ).
- In other words, $R^{2}$ is the "Explained" part of the variation. We find it by subtracting the "Unexplained" ratio from 1.
Use Residual Patterns to see if your model is biased or broken.
Use Standard Deviation of Residuals ( $σ_{r}$ ) to tell someone: "My prediction is probably off by ± [this much] in real units."

Two Ways to Calculate $R^{2}$ : Which one and When?

There are two formulas for $R^{2}$ . Here is the breakdown:

1. The "Shortcut" Method (using Correlation)

R^{2} = (r)^{2}

When to use: When you already have the correlation coefficient ( $r$ ).
Meaning: The proportion of variance explained is exactly the square of the correlation coefficient.
Limitation: This shortcut only holds for least-squares Linear Regression with one independent variable.

2. The "Residual" Method (using Sum of Squares)

R^{2} = 1 - \frac{S S_{r e s}}{S S_{t o t}}

When to use: When you have the actual data points and the regression line. This is the most powerful way because it works even for complex models (Multiple Regression).
Meaning: It calculates $1 - \text{Unexplained Variation} / \text{Total Variation}$ .
Analogy: If you have 100 gallons of "mystery" (Total Variation) and 15 gallons are still "mysterious" (Residuals) after your explanation, then you explained $1 - 15 / 100 = 85\%$ .
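
Why do the two methods agree for a simple regression line? A short derivation sketch follows. The shorthand $S_{xx}$ , $S_{yy}$ , $S_{xy}$ is introduced here for convenience and is not used elsewhere in these notes; the derivation assumes the line is the least-squares line with slope $m$ passing through $(μ_{x}, μ_{y})$ .

```latex
\begin{aligned}
S_{xx} &= \sum (x_{i} - μ_{x})^{2}, \qquad S_{yy} = \sum (y_{i} - μ_{y})^{2} = SS_{tot}, \qquad S_{xy} = \sum (x_{i} - μ_{x})(y_{i} - μ_{y}) \\
\hat{y}_{i} &= μ_{y} + m (x_{i} - μ_{x}), \qquad m = \frac{S_{xy}}{S_{xx}} \\
SS_{res} &= \sum (y_{i} - \hat{y}_{i})^{2} = S_{yy} - 2 m S_{xy} + m^{2} S_{xx} = S_{yy} - \frac{S_{xy}^{2}}{S_{xx}} \\
1 - \frac{SS_{res}}{SS_{tot}} &= \frac{S_{xy}^{2}}{S_{xx} \, S_{yy}} = r^{2}
\end{aligned}
```

The last step uses $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$ , which is the same as dividing $S_{xy}$ by $n \cdot σ_{x} \cdot σ_{y}$ .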

📉 Deep Dive: Residuals & Variation

1. Residual ( $r_{i}$ )

Meaning: A residual is a measure of how well a line fits an individual data point.
A residual is the vertical distance between an actual data point and the regression line.
Connection: If you sum all the residuals of a least-squares regression line, they equal zero.
If you square them and add them up, you get the Residual Sum of Squares ( $S S_{r e s}$ ), which represents the "unexplained" portion of your model.

\begin{aligned} \text{Residual} & = \text{Actual} - \text{Predicted} \\ r_{i} & = y_{i} - {\hat{y}}_{i} \end{aligned}

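The Goal: In a perfect model, every individual residual is 0. For any least-squares line, the residuals still sum to 0, even when the fit is poor, so a zero sum alone does not prove a good fit.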
The residual for the point (4,3) is [-2]
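
As a minimal sketch of the definition, the residual is just a subtraction. The function name `residual` is made up for illustration, and the predicted value 5 is inferred from the stated residual (3 - 5 = -2); it is not given in these notes.

```python
def residual(actual: float, predicted: float) -> float:
    """Residual = actual - predicted (vertical distance to the line)."""
    return actual - predicted


# Point (4, 3): the actual y is 3. A residual of -2 implies the line
# predicts 5 at x = 4 (inferred value, not stated in the notes).
r_i = residual(actual=3, predicted=5)
print(r_i)  # -2
```

A negative residual means the point sits below the line; a positive one means it sits above.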

The Residual Check

Diagnostic Tool: Scientists plot residuals on a "Residual Plot" to see if there are patterns.

If they form a "U" shape, a straight line was the wrong choice.
If the residuals look like a random cloud, the linear model is appropriate.
If you see a pattern in your residuals (like a curve or a "fan" shape), stop! Your $R^{2}$ might be high, but your model is fundamentally wrong for the data.
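
The warning above can be sketched numerically. The data below are hypothetical (points on the curve y = x²), chosen only to show a high $R^{2}$ alongside a clear "U" pattern in the residuals; `fit_line` is an illustrative helper, not something defined in these notes.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


# Hypothetical curved data: y = x^2, deliberately not linear.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

m, b = fit_line(xs, ys)
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]

mean_y = sum(ys) / len(ys)
ss_res = sum(e ** 2 for e in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(residuals)            # [2.0, -1.0, -2.0, -1.0, 2.0]  -> "U" shape
print(round(r_squared, 3))  # 0.92
```

The residuals go positive, negative, positive as x increases, which is the "U" shape described above, even though about 92% of the variation is "explained."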

2. Residual Sum of Squares ( $S S_{r e s}$ )

Meaning: This is the Unexplained Variation. This is the variation the line couldn't explain—the distance between the points and the line.
Connection: Lower $S S_{r e s}$ means a higher $R^{2}$ (a better fit!).

S S_{r e s} = \sum (r_{i})^{2} = \sum (y_{i} - {\hat{y}}_{i})^{2}

3. Variance of Residuals ( $V_{r}$ )

Meaning: The average squared error. It tells us the "spread" of our mistakes.

V_{r} = σ_{r}^{2} = \frac{S S_{r e s}}{n}

4. Standard Deviation of Residuals ( $σ_{r}$ )

Meaning: Often called the Standard Error of the Estimate. (Many textbooks divide by $n - 2$ instead of $n$ when computing that quantity; these notes divide by $n$ .)

Why it's crucial: This is the "Typical Error." If $σ_{r} = 2.5$ , your predictions are typically off by about 2.5 units.
Use: While $R^{2}$ gives you a proportion (relative), $σ_{r}$ gives you an error in the actual units (absolute), like inches or dollars.

σ_{r} = \sqrt{V_{r}}

5. Root Mean Square Error ( $S S_{l i n e}$ )

Used to measure Accuracy in the same units as your data.

Meaning: It is the standard deviation of the residuals. It tells you, on average, how far off your predictions are from the actual values.
Connection:
- $R M S E$ is the square root of the average squared residual.
- While $R^{2}$ tells you the share of variation explained (e.g., 89%), RMSE tells you the magnitude of the error (e.g., "Our price prediction is typically off by about 500 dollars").
- In these notes it is the same quantity as the standard deviation of the residuals:

S S_{l i n e} = σ_{r}

6. Total Variation ( $S S_{t o t a l}$ )

Meaning: The starting baseline for all error calculations. It represents 100% of the variation we are trying to explain: the total "chaos" or spread in your $y$ values around their average ( $μ_{y}$ ), measured as the sum of squared differences from $μ_{y}$ .

S S_{t o t a l} = \sum (y - μ_{y})^{2}
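
One way to read "baseline": $S S_{t o t a l}$ is the residual sum of squares you would get from a model that ignores $x$ and always predicts the mean. The sketch below illustrates this with the $y$ values from the working example in these notes; the helper names `ss_total` and `ss_residual` are made up for illustration.

```python
def ss_total(ys):
    """Sum of squared differences from the mean of y."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def ss_residual(ys, predictions):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - p) ** 2 for y, p in zip(ys, predictions))


ys = [1, 2, 3, 6]  # y-values from the working example
mean_y = sum(ys) / len(ys)

# Baseline "model": ignore x and always predict the mean of y.
baseline = [mean_y] * len(ys)

print(ss_total(ys))                                   # 14.0
print(ss_residual(ys, baseline))                      # 14.0
print(1 - ss_residual(ys, baseline) / ss_total(ys))   # 0.0
```

So $R^{2} = 0$ means "no better than guessing the mean," and any regression line is graded against that baseline.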

🛠️ Working Example (Corrected)

Data points: (1,1), (2,2), (2,3), (3,6)

1: Summary Stats

i. Means ( $μ_{x}, μ_{y}$ )

\begin{aligned} μ_{x} & = \frac{1 + 2 + 2 + 3}{4} = 2.0 \\ μ_{y} & = \frac{1 + 2 + 3 + 6}{4} = 3.0 \end{aligned}

ii. Standard Deviations ( $σ_{x}, σ_{y}$ )

This measures the "spread" of our data around the mean. We'll use the population formula for this high school example: $σ = \sqrt{\frac{\sum (x - μ)^{2}}{n}}$ .

\begin{aligned} σ_{x} & = \sqrt{\frac{(1 - 2)^{2} + (2 - 2)^{2} + (2 - 2)^{2} + (3 - 2)^{2}}{4}} & = \sqrt{\frac{1 + 0 + 0 + 1}{4}} & = \sqrt{0.5} \approx 0.707 \\ σ_{y} & = \sqrt{\frac{(1 - 3)^{2} + (2 - 3)^{2} + (3 - 3)^{2} + (6 - 3)^{2}}{4}} & = \sqrt{\frac{4 + 1 + 0 + 9}{4}} & = \sqrt{3.5} \approx 1.871 \end{aligned}
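
The summary statistics above can be checked with Python's standard `statistics` module. This is a small sketch; the variable names are illustrative.

```python
import statistics

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]

mean_x = statistics.fmean(xs)  # 2.0
mean_y = statistics.fmean(ys)  # 3.0

# pstdev divides by n (population formula), matching the notes;
# statistics.stdev would divide by n - 1 instead.
sigma_x = statistics.pstdev(xs)
sigma_y = statistics.pstdev(ys)

print(mean_x, mean_y)                        # 2.0 3.0
print(round(sigma_x, 3), round(sigma_y, 3))  # 0.707 1.871
```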

iii. Correlation Coefficient ( $r$ )

The correlation $r$ tells us the strength and direction of the linear relationship. Using the formula:

\begin{aligned} r & = \frac{\sum (x - μ_{x}) (y - μ_{y})}{n \cdot σ_{x} \cdot σ_{y}} \\ r & = \frac{1}{4} \left[ \left(\frac{1 - 2}{0.707}\right) \left(\frac{1 - 3}{1.871}\right) + \left(\frac{2 - 2}{0.707}\right) \left(\frac{2 - 3}{1.871}\right) + \left(\frac{2 - 2}{0.707}\right) \left(\frac{3 - 3}{1.871}\right) + \left(\frac{3 - 2}{0.707}\right) \left(\frac{6 - 3}{1.871}\right) \right] \\ r & = \frac{1.512 + 0 + 0 + 2.268}{4} \\ r & \approx 0.945 \end{aligned}

Interpretation: This is a very strong positive correlation!
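
The same calculation can be sketched in code as "the average product of z-scores," using the population standard deviations from the notes. Variable names such as `z_products` are illustrative.

```python
import math

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

# One product of z-scores per data point; r is their average.
z_products = [
    ((x - mean_x) / sigma_x) * ((y - mean_y) / sigma_y)
    for x, y in zip(xs, ys)
]
r = sum(z_products) / n

print(round(z_products[0], 3))   # 1.512  (first point)
print(round(z_products[-1], 3))  # 2.268  (last point)
print(round(r, 3))               # 0.945
```

The two middle points sit at $x = μ_{x}$ , so their z-score products are zero and they contribute nothing to $r$ .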

2: The Equation of the Regression Line ( $\hat{y} = m x + b$ )

Slope ( $m$ ): $r \cdot \frac{σ_{y}}{σ_{x}} = 0.945 \cdot \frac{1.871}{0.707} \approx 2.5$
Intercept ( $b$ ): $μ_{y} - m (μ_{x}) = 3.0 - 2.5 (2.0) = - 2.0$
Line: $\hat{y} = 2.5 x - 2.0$
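
As a cross-check, the slope can be computed two ways: with the formula used above, and directly from the sums of squares (the names `s_xx`, `s_yy`, `s_xy` are shorthand introduced here, not notation from these notes). Both routes should give the same line.

```python
import math

xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

r = s_xy / math.sqrt(s_xx * s_yy)
sigma_x, sigma_y = math.sqrt(s_xx / n), math.sqrt(s_yy / n)

slope = r * sigma_y / sigma_x   # formula used in the notes
slope_direct = s_xy / s_xx      # algebraically the same quantity
intercept = mean_y - slope_direct * mean_x

print(round(slope, 6), slope_direct, intercept)  # 2.5 2.5 -2.0
```

The direct form avoids rounding $r$ , $σ_{x}$ and $σ_{y}$ first, which is why it lands on exactly 2.5.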

3: Residual Breakdown Table

$x$	$y$ (Actual)	$\hat{y}$ (Predicted)	Residual ( $y - \hat{y}$ )	Squared Residual	Squared Deviation from Mean $(y - μ_{y})^{2}$
1	1	0.5	0.5	0.25	4
2	2	3.0	-1.0	1.00	1
2	3	3.0	0.0	0.00	0
3	6	5.5	0.5	0.25	9
Sum			0	$S S_{r e s} = 1.5$	$S S_{t o t} = 14$
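
The table can be regenerated with a short loop. This sketch assumes the line $\hat{y} = 2.5 x - 2.0$ from step 2; the tuple layout in `rows` is just one convenient choice.

```python
points = [(1, 1), (2, 2), (2, 3), (3, 6)]
slope, intercept = 2.5, -2.0  # line from step 2
mean_y = sum(y for _, y in points) / len(points)

rows = []
for x, y in points:
    predicted = slope * x + intercept
    residual = y - predicted
    # (x, y, predicted, residual, squared residual, squared deviation from mean)
    rows.append((x, y, predicted, residual, residual ** 2, (y - mean_y) ** 2))

for row in rows:
    print(*row, sep="\t")

sum_residuals = sum(row[3] for row in rows)
ss_res = sum(row[4] for row in rows)
ss_tot = sum(row[5] for row in rows)
print(sum_residuals, ss_res, ss_tot)  # 0.0 1.5 14.0
```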

i. Residual/Unexplained Variation ( $S S_{r e s}$ )

\begin{aligned} S S_{r e s} & = \sum (y - \hat{y})^{2} = (0.5)^{2} + (- 1.0)^{2} + (0.0)^{2} + (0.5)^{2} \\ S S_{r e s} & = 1.5 \end{aligned}

4: Total Variation ( $S S_{t o t a l}$ )

Total variation is the baseline: 100% of the spread in the $y$ values around their average ( $μ_{y} = 3.0$ ) that the model is trying to explain. It is the sum of squared differences from the mean.

S S_{t o t a l} = \sum (y - μ_{y})^{2} = 4 + 1 + 0 + 9 = 14.0

5: FINAL Calculation of Coefficient of Determination

Method 1 ( $r^{2}$ ):

\begin{aligned} R^{2} & = r^{2} = (0.945)^{2} \\ R^{2} & \approx 0.893 \end{aligned}

Method 2 (Residuals):

\begin{array}{r} R^{2} = 1 - \frac{S S_{r e s}}{S S_{t o t a l}} = 1 - \frac{1.5}{14} \approx 0.893 \end{array}
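
Both methods can be run side by side on the example data to confirm they agree. This is a sketch using only the standard library; names like `r2_from_r` and `r2_from_residuals` are illustrative.

```python
import math

points = [(1, 1), (2, 2), (2, 3), (3, 6)]
xs, ys = zip(*points)
n = len(points)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)  # this is SS_total
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)

# Method 1: square the correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)
r2_from_r = r ** 2

# Method 2: 1 - SS_res / SS_total, using the least-squares line.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
r2_from_residuals = 1 - ss_res / s_yy

print(round(r2_from_r, 4), round(r2_from_residuals, 4))  # 0.8929 0.8929
```

Without intermediate rounding, both routes give 25/28, about 0.8929.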

The Final Analysis

How much variation is described by $x$ ?

89.3% of the total variation in $y$ is explained by the variation in $x$ (the regression line).

How much variation is NOT described by $x$ ?

$1 - R^{2} = 1 - 0.893 = 0.107 = 10.7\%$ .

Interpretation: Only 10.7% of the variation is due to "noise" or other factors not included in our model.

6: Additional Interpretations

Variance of Residual ( $V_{r}$ ): Average of squared residuals:

\begin{array}{r} V_{r} = σ_{r}^{2} = \frac{S S_{r e s}}{n} = \frac{1.5}{4} = 0.375 \end{array}

Standard Deviation of Residual ( $σ_{r}$ )

\begin{array}{r} σ_{r} = \sqrt{V_{r}} = \sqrt{0.375} \approx 0.612 \end{array}

Root Mean Square Error ( $S S_{l i n e}$ )

\begin{array}{r} S S_{l i n e} \approx 0.612 \end{array}
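
These last three quantities follow directly from the squared residuals. The sketch below divides by $n$ , as these notes do; the predicted values come from $\hat{y} = 2.5 x - 2.0$ .

```python
import math

actual = [1, 2, 3, 6]
predicted = [0.5, 3.0, 3.0, 5.5]  # from y-hat = 2.5x - 2.0

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# Variance of residuals: average squared error (dividing by n).
variance_of_residuals = sum(squared_errors) / len(squared_errors)

# Standard deviation of residuals, which the notes also call RMSE.
rmse = math.sqrt(variance_of_residuals)

print(variance_of_residuals)  # 0.375
print(round(rmse, 3))         # 0.612
```

In words: a typical prediction from this line misses the actual $y$ by roughly 0.6 units.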

Observations for "Correlation Coefficient" $(r)$


[r=1], which is perfect positive correlation	[r=0], which is no correlation	[r=0.5], which is weak positive correlation	[r=-0.5], which is weak negative correlation	[r=-1], which is perfect negative correlation
