P-Value

I. What is a Hypothesis?

A statistical hypothesis is a statement, assumption, or claim about a population that we test through statistical analysis.
Examples:

II. The Null Hypothesis (H0)

In any experiment, we start with the default assumption that nothing interesting is happening. This is the Null Hypothesis (H0).

The Null Hypothesis (H₀) is the default assumption.
It represents:

III. Alternative Hypothesis (Ha)

We also have the Alternative Hypothesis (Ha), which is the opposite claim: it states that the Null Hypothesis (H0) is incorrect.

The Alternative Hypothesis is what we are trying to find evidence for.
It represents:

There are three types:

  1. Two-tailed: $H_a: \mu \neq \mu_0$
  2. Right-tailed: $H_a: \mu > \mu_0$
  3. Left-tailed: $H_a: \mu < \mu_0$
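
The choice of tail decides which part of the distribution counts as "extreme". The sketch below is only an illustration: it assumes a test statistic that follows a standard normal distribution under H0, and both the helper name `p_value_from_z` and the value `z = 1.96` are hypothetical.

```python
from statistics import NormalDist


def p_value_from_z(z: float, tail: str) -> float:
    """Return the p-value for a z statistic under a standard normal null."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "left":   # Ha: mu < mu0
        return std_normal.cdf(z)
    if tail == "right":  # Ha: mu > mu0
        return 1.0 - std_normal.cdf(z)
    if tail == "two":    # Ha: mu != mu0
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown tail: {tail!r}")


z = 1.96  # hypothetical test statistic, not taken from the text
for tail in ("two", "right", "left"):
    print(tail, round(p_value_from_z(z, tail), 4))
```

The same statistic gives three different p-values, which is why the alternative hypothesis should be fixed before looking at the data.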

IV. How do we perform Hypothesis Testing?

Let's understand the three connected concepts below.

★ Level of Confidence

The confidence level represents how certain we are that our sample data reflects reality despite natural variations.

Confidence Level helps answer: How confident are we in our claim?

★ The "Alpha" (α) Threshold ➛ "Significance Level"

In Advanced Stats, we usually set a "line in the sand" called the significance level (α), typically 0.05.
The significance level (α) is the probability of rejecting the Null Hypothesis when it is actually true. It sets the threshold for statistical significance.
We compare the p-value to the significance level (α).

Here: 0.0059 < 0.05, so we reject H₀.
Conclusion: There is strong evidence the coin is biased.

Formula: Significance Level (α) = 1 - Confidence Level
Example: If the Confidence Level = 95%, then α = 0.05 (5%).
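
As a rough sketch, the formula can be evaluated for a few common confidence levels. The helper name `alpha_from_confidence` is hypothetical, and the two-tailed cut-off on a standard normal statistic is added only to show how α turns into a threshold; it assumes a Z-based test.

```python
from statistics import NormalDist


def alpha_from_confidence(confidence_level: float) -> float:
    """Significance level implied by a confidence level given as a fraction."""
    if not 0.0 < confidence_level < 1.0:
        raise ValueError("confidence_level must be strictly between 0 and 1")
    return 1.0 - confidence_level


for confidence_level in (0.90, 0.95, 0.99):
    alpha = alpha_from_confidence(confidence_level)
    # Two-tailed cut-off: |Z| beyond this value gives a p-value below alpha.
    z_cutoff = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    print(f"confidence={confidence_level:.2f}  alpha={alpha:.2f}  z_cutoff={z_cutoff:.3f}")
```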

★ What Is the P-Value?

The P-value is the probability of getting results at least as extreme as the ones we observed, assuming that the Null Hypothesis is actually true.

$$\text{p-value} = P(\text{data at least as extreme as observed} \mid H_0 \text{ true})$$

Important: It is a probability calculated under the assumption that H₀ is true.
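
To make the definition concrete, here is a minimal sketch for a coin-flip test where H0 is "the coin is fair". The counts (16 heads in 20 flips) are an assumption chosen for illustration and are not stated in the text; they happen to give a right-tailed p-value of about 0.0059, the figure quoted in the significance-level section. The function name `binomial_right_tail` is hypothetical.

```python
from math import comb


def binomial_right_tail(n_flips: int, n_heads: int, p_heads: float = 0.5) -> float:
    """P(X >= n_heads) for X ~ Binomial(n_flips, p_heads), i.e. assuming H0."""
    return sum(
        comb(n_flips, k) * p_heads**k * (1 - p_heads) ** (n_flips - k)
        for k in range(n_heads, n_flips + 1)
    )


# Assumed data: 16 heads out of 20 flips, tested against a fair coin.
p_value = binomial_right_tail(n_flips=20, n_heads=16)
print(round(p_value, 4))  # 0.0059
```

Every term in the sum is computed with the fair-coin probability 0.5, which is exactly what "assuming H₀ is true" means.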

★ Decision Rule / Interpreting the Number

The P-value is a number between 0 and 1. It tells you how "weird" your data is under the assumption that nothing is happening.

Example: If we obtain a P-value of 0.03 and our significance level α = 0.05, then 0.03 < 0.05, so we reject the null hypothesis.
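
The decision rule can be sketched as a small helper. The function name `decide` is hypothetical, and treating the boundary case p = α as a rejection is a convention assumed here, not something the text specifies.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare a p-value with the significance level and report the decision."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    # Assumed convention: reject when the p-value is at or below alpha.
    return "reject H0" if p_value <= alpha else "fail to reject H0"


print(decide(0.03))  # reject H0
print(decide(0.20))  # fail to reject H0
```

Note the wording of the second outcome: a large p-value means the data are compatible with H0, not that H0 has been proven.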

V. Diagrammatic representation

<img src="Learning/Stats/Pictures/hypothesis_1.png" height="400" width="700">

VI. What the P-Value Does NOT Mean

The p-value "is NOT":

The Bigger Picture

The p-value measures how incompatible the data is with pure noise.

VII. Hypothesis Testing Example

Problem Statement:

A university claims that the average starting salary of its graduates is at least $70,000 per year. A job market analyst believes the true average starting salary is less than $70,000. To test this claim, the analyst collects a random sample of 200 graduates and records their starting salaries.

After analyzing the data, the results are:

  • Sample mean (x̄) = $68,500
  • Standard deviation (σ) = $8,000
  • Sample size (n) = 200

Step 1: Define Hypothesis

$H_0: \mu = 70{,}000$, $H_a: \mu < 70{,}000$

Step 2: Calculate Z-Score (Test Statistic)

Since the sample size is large ($n \geq 30$), we use a Z-test.

The formula for the Z-score is:

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Substituting the given values:

$$Z = \frac{68500 - 70000}{8000 / \sqrt{200}} \approx -2.65$$

Step 3: Find P-Value

Using a Z-table or calculator, the p-value for Z = -2.65 is:

$$P(Z < -2.65) \approx 0.004$$

Step 4: Compare P-Value with Alpha (α)

Step 5: Conclusion

There is strong statistical evidence that the university’s graduates earn less than $70,000 on average. The analyst rejects the Null Hypothesis (H0) and concludes that the true average salary is likely lower than $70,000.
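
The whole example can be reproduced in a few lines. This is a sketch using the values that appear in the Step 2 substitution (68500, 70000, 8000, 200); the significance level α = 0.05 is an assumption based on the typical value mentioned earlier, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 68_500        # x-bar, from the Step 2 substitution
hypothesized_mean = 70_000  # mu under H0
std_dev = 8_000             # sigma, from the Step 2 substitution
n = 200                     # sample size
alpha = 0.05                # assumed significance level

# Step 2: test statistic
standard_error = std_dev / sqrt(n)
z = (sample_mean - hypothesized_mean) / standard_error

# Step 3: left-tailed p-value, because Ha is mu < 70,000
p_value = NormalDist().cdf(z)

# Steps 4 and 5: compare with alpha and decide
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"Z = {z:.2f}, p-value = {p_value:.3f}, decision: {decision}")
```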

Final Summary


Relation between R² and p-value

  • The R² value tells you how much of the variation in the data is explained by your model, so an R² of 0.1 means that your model explains 10% of the variation. A higher R² means the model accounts for more of the variation, although it does not by itself show that the model is appropriate. The p-value, by contrast, comes from the F-test, whose null hypothesis is that the intercept-only model fits the data as well as your model. If the p-value is less than the significance level (usually 0.05), your model fits the data significantly better than the intercept-only model.

There are 4 scenarios:

1. Low R² and low p-value (p-value <= 0.05)
  • Your model doesn’t explain much of the variation in the data, but it is significant (better than not having a model).
2. Low R² and high p-value (p-value > 0.05)
  • Your model doesn’t explain much of the variation in the data, and it is not significant (worst scenario).
3. High R² and low p-value
  • Your model explains a lot of the variation in the data and is significant (best scenario).
4. High R² and high p-value
  • Your model explains a lot of the variation in the sample but is not significant, so the apparent fit could be due to chance. This typically happens with very few observations, and the model should not be trusted.
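
Why scenarios 1 and 4 can occur becomes clearer once the F statistic is written in terms of R² and the sample size. The sketch below uses the standard identity F = (R²/k) / ((1 - R²)/(n - k - 1)) for a linear regression with k predictors and an intercept; the R² values and sample sizes are illustrative assumptions, and `f_statistic` is a hypothetical helper name.

```python
from math import atan, pi, sqrt


def f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Overall F statistic of a linear regression, written in terms of R^2."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1.0 - r_squared) / df_resid)


# Scenario 1: low R^2 but 1000 observations gives a very large F.
f_low_r2 = f_statistic(r_squared=0.10, n_obs=1000, n_predictors=1)

# Scenario 4: high R^2 but only 3 observations gives a small F.
f_high_r2 = f_statistic(r_squared=0.90, n_obs=3, n_predictors=1)

# With 1 predictor and 1 residual degree of freedom, sqrt(F) follows a
# t distribution with 1 df (a standard Cauchy), so the p-value has a closed form.
p_high_r2 = 1.0 - (2.0 / pi) * atan(sqrt(f_high_r2))

print(f"low R^2, n=1000: F = {f_low_r2:.1f}")
print(f"high R^2, n=3:   F = {f_high_r2:.1f}, p-value = {p_high_r2:.3f}")
```

In the first case F is about 111, far beyond the 5% cut-off of roughly 3.85 for these degrees of freedom, so the p-value is tiny even though R² is only 0.1. In the second case R² is 0.9 yet the p-value is about 0.2, because three observations carry very little evidence.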

Tutorial Videos