Box Plot (Box-and-Whisker Plot)

I. Purpose

Display the distribution and identify outliers in data through quartiles. Excellent for comparing distributions across categories.

II. Analysis Type

Univariate (single variable) or Bivariate (variable across categories)

III. Understanding Box Plot Components

1. Central Tendency

Median (line inside the box): the 50th percentile of the data
Position indicates where data is centered

2. Spread (IQR)

Interquartile Range (box height): $IQR = Q3 - Q1$
Contains middle 50% of data
Larger box = more variability

3. Outliers

Lower fence = $Q1 - 1.5 \times IQR$ (outliers below this)
Upper fence = $Q3 + 1.5 \times IQR$ (outliers above this)
Individual points beyond the whiskers are outliers.
Mathematically defined as values $> Q3 + 1.5 \times IQR$ or $< Q1 - 1.5 \times IQR$
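
The fence rule above can be sketched in a few lines of Python. This is a minimal illustration rather than any library's method: the sample values are made up, the helper name `iqr_fences` is hypothetical, and quartiles come from the standard library's `statistics.quantiles` with `method="inclusive"` (linear interpolation). Other quartile conventions give slightly different fences.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (q1, q3, lower_fence, upper_fence) using the k * IQR rule."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, q1 - k * iqr, q3 + k * iqr


# Made-up sample with one obviously extreme value
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]

q1, q3, lower, upper = iqr_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print(f"Q1={q1}, Q3={q3}, IQR={q3 - q1}")
print(f"fences=({lower}, {upper}), outliers={outliers}")
```

For this sample the fences are -2.5 and 15.5, so only the value 30 is flagged.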

4. Skewness

Right-skewed: Right (upper) side of the box-and-whisker plot is longer ➛ Median closer to Q1
Left-skewed: Left (lower) side of the box-and-whisker plot is longer ➛ Median closer to Q3
Symmetric distribution: Median is approximately midway between Q1 and Q3.
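
One way to put a number on the median's position inside the box is the quartile (Bowley) skewness coefficient, $(Q3 + Q1 - 2 \times \text{median}) / (Q3 - Q1)$. It is not part of the notes above, so treat this as an illustrative sketch: the helper name and the sample data are made up. Values near 0 mean the median sits mid-box, positive values mean it sits closer to Q1 (right skew), and negative values mean it sits closer to Q3 (left skew).

```python
from statistics import quantiles


def quartile_skewness(values):
    """Bowley skewness: position of the median relative to Q1 and Q3."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    if q3 == q1:
        # The coefficient is undefined when IQR is 0; this sketch returns 0.0
        return 0.0
    return (q3 + q1 - 2 * med) / (q3 - q1)


# Made-up samples
right_skewed = [1, 2, 2, 3, 3, 4, 6, 9, 15]
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
left_skewed = [-x for x in right_skewed]

for name, sample in [("right", right_skewed),
                     ("symmetric", symmetric),
                     ("left", left_skewed)]:
    print(name, quartile_skewness(sample))
```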

5. Whiskers

The lines coming out from each box extend to the most extreme values that still lie inside the fences. If there are no outliers, they reach the minimum and maximum values of each set.
Together with the box, the whiskers show how big a range the non-outlying data covers.
Larger ranges indicate wider distribution, that is, more scattered data.
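
Under the 1.5 × IQR convention from the Outliers section, a whisker ends at the most extreme observation that still lies inside the fence, not at the fence itself. A minimal sketch with made-up data and a hypothetical helper name, using `statistics.quantiles` with `method="inclusive"` for the quartiles:

```python
from statistics import quantiles


def whisker_ends(values, k=1.5):
    """Return the (lower, upper) whisker ends under the k * IQR rule."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Keep only the observations that are not flagged as outliers
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return min(inside), max(inside)


# Made-up sample; 30 lies beyond the upper fence (15.5)
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]
low, high = whisker_ends(data)
print(low, high)  # 2 10
```

The upper whisker stops at 10, and 30 would be drawn as an individual point.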

6. Boxes

Short boxes mean their data points consistently hover around the center values.
Taller boxes imply more variable data.

IV. What to Look For?

When reading a box plot, start by comparing the groups on the points below:

Compare medians across categories
Compare spread and outlier patterns

1. Common Patterns and Their Meanings

Pattern	Visual Cue	Interpretation	Action
Tall box	Large IQR	High variability	Check for subgroups
Short box	Small IQR	Low variability	Data is consistent
Median at top	Near Q3	Left-skewed	Consider square/power transform (or reflect, then log)
Median at bottom	Near Q1	Right-skewed	Consider log/sqrt transform
Many outliers	Lots of points	Heavy tails or contamination	Investigate data quality
No whiskers	Only box visible	All data within IQR	Very tight distribution
Asymmetric whiskers	One long, one short	Skewed distribution	Assess need for transformation
No box	Only whiskers or a single line	Q1 = Q3 (at least half the data at a single value)	Check for constant or near-constant feature
Stacked boxes	Multiple boxes per group	Subgroups or batch effects	Investigate batch/group effect

2. Linearity Identification

i. Linearity (The "Staircase" Effect)

In a linear relationship, the independent variable has a constant effect on the target. On a box plot, this looks like a well-constructed staircase.

Visible Separation: If you draw a horizontal line through the median of Category A, it should not cross through the "box" of Category B. This clear air between the boxes shows the feature has strong predictive power.
Ordered Progression: If your categories have a natural order (e.g., "Small, Medium, Large"), the medians should climb (or descend) steadily. If "Medium" is suddenly lower than "Small," the linear assumption is broken.
Constant "Step" Size: The distance between the median of Group 1 and Group 2 should be roughly the same as the distance between Group 2 and Group 3. This indicates an additive effect, which is what a linear model assumes when the categories are evenly spaced.

ii. Non-Linear (The "Jumping" or "Threshold" Effect)

Non-linearity occurs when the "rules" of the relationship change depending on which group you are looking at.

The "U-Turn" or "Hump": If the medians go up for the first three categories and then suddenly drop for the fourth, you have a parabolic (non-linear) relationship. A straight line would fail here because it can only go in one direction.
Uneven Separation (The Threshold): You might see that Groups 1, 2, and 3 look nearly identical (boxes are overlapping), but Group 4 suddenly "jumps" way higher. This suggests a threshold effect—the feature only matters once it hits a certain level.
Variable Spread: In linear data, the "height" of the boxes (IQR) usually stays similar. If the boxes get significantly taller or shorter as you move across categories, you are looking at heteroscedasticity (non-constant variance), which often accompanies a non-linear relationship.

Clue	If Linear...	If Non-Linear...
Median Path	A straight line can connect all median dots.	You need a "curve" or "zigzag" to connect the medians.
Overlap	Minimal overlap between adjacent categories.	Heavy overlap in some spots, huge gaps in others.
Box Height	Consistent height (stable variance).	Boxes "explode" in size or "shrink" unexpectedly.
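
The staircase check can be roughed out numerically by comparing the steps between consecutive group medians. The sketch below is an illustration under stated assumptions: the group labels, the values, and the helper names `median_steps` and `looks_linear` are made up, and the 25% tolerance for "roughly equal" steps is an arbitrary choice, not a statistical test.

```python
from statistics import median


def median_steps(groups):
    """groups: list of (label, values) in natural order.
    Returns the differences between consecutive group medians."""
    medians = [median(values) for _label, values in groups]
    return [b - a for a, b in zip(medians, medians[1:])]


def looks_linear(steps, tolerance=0.25):
    """True if all steps go the same way and each step is within
    `tolerance` (as a fraction) of the mean step."""
    if not steps:
        return False
    mean_step = sum(steps) / len(steps)
    if mean_step == 0:
        return False
    same_direction = all(s * mean_step > 0 for s in steps)
    similar_size = all(abs(s - mean_step) <= tolerance * abs(mean_step)
                       for s in steps)
    return same_direction and similar_size


# Made-up ordered categories
staircase = [("Small", [9, 10, 11]), ("Medium", [19, 20, 22]), ("Large", [29, 31, 33])]
threshold = [("Small", [9, 10, 11]), ("Medium", [10, 11, 12]), ("Large", [29, 31, 33])]

print(median_steps(staircase), looks_linear(median_steps(staircase)))
print(median_steps(threshold), looks_linear(median_steps(threshold)))
```

The first set climbs in near-equal steps, while the second shows the threshold pattern: a small step followed by a large jump.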

3. Gaussian Distribution

To identify a Normal (Gaussian) distribution using a box plot, look for symmetry and a balanced spread of data across the quartiles. In a normal distribution, the mean and median are approximately equal, and the data is distributed predictably around that center. A box plot can only suggest normality: other symmetric distributions (e.g., uniform) look similar, so confirm with a histogram or Q-Q plot.

Visual Clues for a Normal Distribution

Median Centering: The median line (the horizontal bar inside the box) sits close to the middle of the box. The distance from the bottom of the box ( $Q1$ ) to the median is roughly equal to the distance from the median to the top of the box ( $Q3$ ).
Symmetric Whiskers: The "whiskers" extending from the top and bottom of the box are roughly the same length.
Minimal Outliers: In a true normal distribution, outliers (points beyond the whiskers) should be rare and balanced. You shouldn't see a cluster of dots on just one side.
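
As a rough benchmark for "rare", the share of a normal distribution that falls beyond the 1.5 × IQR fences can be computed directly. The sketch uses the standard library's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Quartiles of the standard normal distribution
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Fences under the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass outside the fences (both tails)
outlier_share = std_normal.cdf(lower_fence) + (1 - std_normal.cdf(upper_fence))

print(f"fences at about ±{upper_fence:.2f} standard deviations")
print(f"expected share of points flagged as outliers: {outlier_share:.2%}")
```

The fences land near ±2.7 standard deviations, so roughly 0.7% of normally distributed points are flagged, split evenly between the two sides. A much larger share, or dots on one side only, argues against normality.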

4. Detecting Significant Group Differences

Quick visual rule: If the medians are clearly apart and the boxes don't overlap, the groups likely differ. This is only a heuristic, since significance also depends on sample size. For formal confirmation:

2 groups, normal data: Use independent t-test
2 groups, non-normal: Use Mann-Whitney U test
3+ groups, normal: Use ANOVA
3+ groups, non-normal: Use Kruskal-Wallis test
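
A minimal sketch of the four tests listed above, assuming SciPy is installed. The three samples are made up for illustration, and whether the data counts as "normal" is left to the reader (it is not checked here).

```python
from scipy import stats

# Made-up measurements for three groups
group_a = [12.1, 13.4, 11.8, 14.2, 12.9, 13.1, 12.5, 13.8]
group_b = [15.2, 16.1, 14.8, 15.9, 16.4, 15.5, 14.9, 16.0]
group_c = [18.3, 17.9, 19.1, 18.6, 17.5, 18.8, 19.4, 18.1]

# 2 groups, normal data: independent t-test
t_res = stats.ttest_ind(group_a, group_b)
# 2 groups, non-normal data: Mann-Whitney U test
u_res = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
# 3+ groups, normal data: one-way ANOVA
f_res = stats.f_oneway(group_a, group_b, group_c)
# 3+ groups, non-normal data: Kruskal-Wallis test
h_res = stats.kruskal(group_a, group_b, group_c)

for name, res in [("t-test", t_res), ("Mann-Whitney U", u_res),
                  ("ANOVA", f_res), ("Kruskal-Wallis", h_res)]:
    print(f"{name}: p = {res.pvalue:.4g}")
```

A small p-value says the groups differ somewhere; for three or more groups it does not say which pairs differ.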

📌 Box plots are ideal for non-normal data since they use medians and quartiles (robust to outliers).

V. When to Use Box Plots

Comparing distributions across multiple groups
Identifying outliers in data
Quick visual assessment of center, spread, and skewness
Non-parametric analysis (no normality assumption)
Presenting summary statistics visually

VI. When to Avoid Box Plots

Sample size < 10 (too few points for meaningful quartiles)
Need to see actual distribution shape (use histogram/KDE instead)
Bimodal or multimodal data (box plots hide multiple peaks)
Precise comparisons needed (use statistical tests with tables)
Audience unfamiliar with box plots (use bar charts with error bars)

VII. Advantages of Box Plots

Summarize distribution, center, spread, and outliers in one compact graphic
Robust to outliers and non-normal data (uses medians and quartiles)
Excellent for comparing distributions across multiple groups
Quickly highlights skewness, symmetry, and group differences
Can overlay raw data (stripplot) for more detail

VIII. Disadvantages

Can hide multimodality (multiple peaks) or fine structure in data
No indication of sample size (unless annotated)
With extreme outliers, the axis stretches to fit them, so the box is visually compressed and the IQR becomes hard to read
Can be misinterpreted as bar plots by non-experts
Not ideal for very small samples (<10)
May not show actual distribution shape (use histogram/KDE for that)

✍️ IX. Code Example

# Box plot with a strip plot overlay
import seaborn as sns
import matplotlib.pyplot as plt

# Load sample dataset
tips = sns.load_dataset('tips')

# Draw the box plot, then overlay the raw observations on the same axes
fig, ax = plt.subplots(figsize=(6, 6))
sns.boxplot(x='day', y='total_bill', data=tips, ax=ax)
sns.stripplot(x='day', y='total_bill', data=tips, color='black', size=3, alpha=0.5, ax=ax)
ax.set_title("Box Plot with Strip Plot Overlay (Total Bill by Day)")
plt.show()

⚡ X. Pro Tips While Plotting

1. Median vs Mean Comparison

Display both median and mean to understand skewness impact:

sns.boxplot(x='category', y='value', data=df, showmeans=True,
            meanprops={"marker": "^",
                       "markerfacecolor": "red",
                       "markeredgecolor": "red",
                       "markersize": 8})

If mean > median: Right-skewed distribution (outliers pull mean up)
If mean < median: Left-skewed distribution (outliers pull mean down)
If mean ≈ median: Symmetric distribution

2. Horizontal Box Plots for Many Categories

When comparing many categories with long names, use horizontal orientation:

sns.boxplot(y='category', x='value', data=df, orient='h')

📌 This prevents label overlap and improves readability, especially with 5+ categories.

3. Show Number of Observations

Annotate each box with its number of observations so readers can judge how much data supports it.

# Calculate the median (to position each label) and number of obs per group.
# groupby sorts groups by key; pass the same order to sns.boxplot(order=...)
# if the plot would otherwise order the groups differently.
medians = df.groupby('group')['value'].median().values
nobs = df.groupby('group').size().values
labels = [f"n: {n}" for n in nobs]

# Add the labels to the plot, just above each median line.
# ax is the Axes the box plot was drawn on; 0.4 is an offset in data units.
for pos, (med, label) in enumerate(zip(medians, labels)):
    ax.text(pos, med + 0.4, label, horizontalalignment='center',
            size='medium', color='w', weight='semibold')

4. Add Jitter

By adding a stripplot, you can show all observations along with some representation of the underlying distribution.

sns.stripplot(x='species', y='sepal_length', data=df_iris, color="orange", jitter=0.2, size=2.5, ax=ax)

With jitter, the mean marker, and the number of observations added, a single box plot shows the raw data alongside its summary statistics.

📚 XI. Documentation & External References

Official Documentation: