Box Plot (Box-and-Whisker Plot)

I. Purpose

Display the distribution of a variable and identify outliers using quartiles. Box plots are well suited to comparing distributions across categories.

II. Analysis Type

Univariate (single variable) or Bivariate (variable across categories)

III. Understanding Box Plot Components

ML_AI/_feature_engineering/images/box-2.png

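1. Central Tendency: the line inside the box marks the median (Q2).
2. Spread (IQR): the box height is the interquartile range, IQR = Q3 - Q1.
3. Outliers: points drawn individually beyond the whiskers, by default more than 1.5 × IQR away from the box.
4. Skewness: an off-center median or whiskers of unequal length indicate a skewed distribution.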

box_skew.webp

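5. The whiskers: extend from the box to the most extreme data points that are not outliers (by default, within 1.5 × IQR of Q1 and Q3).
6. Boxes: span Q1 to Q3 and contain the middle 50% of the data.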
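
The components listed above can be computed directly from the data. The following is a minimal sketch, assuming Tukey's 1.5 × IQR rule for the whiskers (the default in matplotlib and seaborn) and NumPy's default linear interpolation for percentiles. The function name `box_plot_stats` and the sample values are hypothetical, chosen only for illustration.

```python
import numpy as np

def box_plot_stats(values, whis=1.5):
    """Return the numbers a box plot draws, using the whis * IQR rule."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    # Fences: points outside them are drawn individually as outliers
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme points that are still inside the fences
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": x[(x < lower_fence) | (x > upper_fence)],
    }

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # illustrative data with one extreme value
stats = box_plot_stats(sample)
print(stats)
```

Note that the whiskers end at actual data points, not at the fences themselves, which is why two box plots with the same quartiles can have different whisker lengths.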

IV. What to Look For?

When reading a box plot, work through the following checks to draw conclusions.

1. Common Patterns and Their Meanings

| Pattern | Visual Cue | Interpretation | Action |
|---|---|---|---|
| Tall box | Large IQR | High variability | Check for subgroups |
| Short box | Small IQR | Low variability | Data is consistent |
| Median at top | Near Q3 | Left-skewed | Consider a square/power transform, or reflect the data and then apply log |
| Median at bottom | Near Q1 | Right-skewed | Consider log/sqrt transform |
| Many outliers | Lots of points beyond the whiskers | Heavy tails or contamination | Investigate data quality |
| No whiskers | Only box visible | All non-outlier data lies between Q1 and Q3 | Very tight distribution; check for repeated values |
| Asymmetric whiskers | One long, one short | Skewed distribution | Assess need for transformation |
| No box | Only whiskers | Q1 = Q3 (IQR is zero): at least half the values are identical | Check for a constant or near-constant feature |
| Stacked boxes | Multiple boxes per group | Subgroups or batch effects | Investigate batch/group effect |
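
The "median at top / median at bottom" cues can be turned into a number. One option, shown here as a sketch rather than a required step, is the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2 × median) / IQR. It is positive when the median sits near Q1 (right skew), negative when it sits near Q3 (left skew), and close to zero for a centered median. The function name and sample data are hypothetical.

```python
import numpy as np

def quartile_skewness(values):
    """Bowley skewness: where the median sits inside the box, scaled to [-1, 1]."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    if iqr == 0:
        # "No box" case from the table: the coefficient is undefined
        return float("nan")
    return (q3 + q1 - 2 * median) / iqr

right_skewed = [1, 1, 2, 2, 3, 4, 6, 9, 15, 30]  # long upper tail
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]

skew_right = quartile_skewness(right_skewed)
skew_sym = quartile_skewness(symmetric)
print(skew_right, skew_sym)
```

Because it uses only quartiles, this coefficient describes the box itself and ignores the whiskers and outliers, so it complements rather than replaces a look at the full plot.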

2. Linearity Identification

i. Linearity (The "Staircase" Effect)

In a linear relationship, each step up in the independent variable shifts the target by a constant amount. On a box plot with ordered categories, this looks like a well-constructed staircase: the medians rise (or fall) in roughly equal steps.

ii. Non-Linear (The "Jumping" or "Threshold" Effect)

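Non-linearity occurs when the "rules" of the relationship change depending on which group you are looking at: the step between medians is large for some adjacent groups and small, zero, or reversed for others.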

| Clue | If Linear... | If Non-Linear... |
|---|---|---|
| Median Path | A straight line can connect all median dots. | You need a "curve" or "zigzag" to connect the medians. |
| Overlap | Minimal overlap between adjacent categories. | Heavy overlap in some spots, huge gaps in others. |
| Box Height | Consistent height (stable variance). | Boxes "explode" in size or "shrink" unexpectedly. |
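
The "Median Path" clue can be checked numerically by listing the group medians in category order and looking at the step between neighbours. The sketch below assumes the categories have a meaningful order that you supply yourself. The column names `grade` and `price`, the helper `median_steps`, and the data are all hypothetical.

```python
import pandas as pd

def median_steps(df, group_col, value_col, order):
    """Return group medians in the given order and the step between neighbours."""
    medians = df.groupby(group_col)[value_col].median().reindex(order)
    steps = medians.diff().dropna()
    return medians, steps

# Illustrative data: the median rises by the same amount at every step
df = pd.DataFrame({
    "grade": ["low"] * 3 + ["mid"] * 3 + ["high"] * 3,
    "price": [9, 10, 11, 19, 20, 21, 29, 30, 31],
})

medians, steps = median_steps(df, "grade", "price", order=["low", "mid", "high"])
print(medians)
print(steps)
```

Roughly equal steps suggest the staircase pattern. Steps that change size or sign suggest a threshold or curved relationship. How much variation in step size to tolerate is a judgment call that depends on the spread within each box.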

3. Gaussian Distribution

A box plot cannot prove that data is normal, but it can show whether the data is consistent with a Normal (Gaussian) distribution: look for symmetry around the median and a predictable balance of data across the quartiles. In a normal distribution, the mean and median are approximately equal, and the data is spread evenly on both sides of that center.

Visual Clues for a Normal Distribution

  1. Median Centering: The median line (the horizontal bar inside the box) sits close to the middle of the box. The distance from the bottom of the box (Q1) to the median is roughly equal to the distance from the median to the top of the box (Q3).
  2. Symmetric Whiskers: The "whiskers" extending from the top and bottom of the box are roughly the same length.
  3. Minimal Outliers: In a true normal distribution, outliers (points beyond the whiskers) should be rare (well under 1% of points with the usual 1.5 × IQR whiskers) and balanced. You shouldn't see a cluster of dots on just one side.
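
How rare outliers should be under a normal distribution can be worked out from the quartiles of the standard normal. The sketch below assumes the usual 1.5 × IQR whisker rule and uses only the Python standard library. It gives roughly 0.7% of points beyond the whiskers, split evenly between the two sides.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Quartiles of the standard normal distribution
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Fences under the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass outside the fences, i.e. the expected share of outliers
outlier_fraction = std_normal.cdf(lower_fence) + (1 - std_normal.cdf(upper_fence))
print(f"IQR = {iqr:.3f}, fences = ({lower_fence:.3f}, {upper_fence:.3f})")
print(f"Expected outlier share: {outlier_fraction:.2%}")
```

As a rule of thumb, a sample of 1,000 normal values would show about seven outlier points. Many more than that, or points on one side only, hints at heavy tails or skew.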

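4. Detecting Significant Group Differences

Quick visual rule: If the boxes of two groups do not overlap, or one group's median lies outside the other group's box, the groups likely differ. This is only a heuristic, because a box plot does not show sample size. For formal confirmation, use a statistical test such as Mann-Whitney U (two groups) or Kruskal-Wallis (three or more groups).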
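
One way to carry out the formal confirmation is with rank-based tests, which match the median-and-quartile view a box plot gives. The sketch below uses `scipy.stats.mannwhitneyu` for two groups and `scipy.stats.kruskal` for three or more. The choice of tests is an assumption, not something this note prescribes, and the groups are synthetic data generated only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic groups (illustrative only): B is shifted well above A and C
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=14.0, scale=2.0, size=50)
group_c = rng.normal(loc=10.5, scale=2.0, size=50)

# Two groups: Mann-Whitney U test
u_stat, p_two = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

# Three or more groups: Kruskal-Wallis H test
h_stat, p_all = stats.kruskal(group_a, group_b, group_c)

print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_two:.3g}")
print(f"Kruskal-Wallis: H={h_stat:.1f}, p={p_all:.3g}")
```

A small p-value from Kruskal-Wallis says only that at least one group differs. Pairwise follow-up tests with a multiple-comparison correction are needed to find which one.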

📌 Box plots are ideal for non-normal data since they use medians and quartiles (robust to outliers).

V. When to Use Box Plots

VI. When to Avoid Box Plots

VII. Advantages of Box Plots

VIII. Disadvantages

✍️ IX. Code Example

# Box plot of one variable across categories
import seaborn as sns
import matplotlib.pyplot as plt

# Load sample dataset
tips = sns.load_dataset('tips')

# Draw one box per day
fig, ax = plt.subplots(figsize=(6, 6))
sns.boxplot(x='day', y='total_bill', data=tips, ax=ax)
ax.set_title("Box Plot of Total Bill by Day")
plt.show()

ML_AI/_feature_engineering/images/box-1.png

⚡ X. Pro Tips While Plotting

1. Median vs Mean Comparison

Display both median and mean to understand skewness impact:

sns.boxplot(x='category', y='value', data=df, showmeans=True,
            meanprops={"marker": "^",
                       "markerfacecolor": "red",
                       "markeredgecolor": "red",
                       "markersize": 8})

If mean > median: Right-skewed distribution (outliers pull mean up)
If mean < median: Left-skewed distribution (outliers pull mean down)
If mean ≈ median: Symmetric distribution

2. Horizontal Box Plots for Many Categories

When comparing many categories with long names, use horizontal orientation:

sns.boxplot(y='category', x='value', data=df, orient='h')

📌 This prevents label overlap and improves readability, especially with 5+ categories.

3. Show Number of Observations

A box plot hides how many observations sit behind each box, so annotate each box with its sample size.

# Per-group median (label position) and size (label text).
# groupby sorts the groups; pass order=stats.index to sns.boxplot
# so that the boxes are drawn in the same order.
stats = df.groupby('group')['value'].agg(['median', 'size'])

# Add the labels to the Axes returned by sns.boxplot (here `ax`).
# Box i is drawn at x position i; the 0.4 offset is in data units.
for tick, (median, n) in enumerate(zip(stats['median'], stats['size'])):
    ax.text(tick, median + 0.4, f"n: {n}", horizontalalignment='center',
            size='medium', color='w', weight='semibold')

4. Add Jitter

By adding a stripplot, you can show all observations along with some representation of the underlying distribution.

sns.stripplot(x='species', y='sepal_length', data=df_iris, color="orange", jitter=0.2, size=2.5, ax=ax)

With jitter, the mean, and the number of observations added, the box plot looks like this:
ML_AI/_feature_engineering/images/box-4.png
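
The tips above can be combined into one script. The following is a sketch under stated assumptions, not the code that produced the image above: the data is synthetic, the column names `group` and `value` and the output file name are hypothetical, and the `Agg` backend is used so the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (illustrative only): three groups of different size and shape
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], [30, 45, 60]),
    "value": np.concatenate([
        rng.normal(10, 2, 30),
        rng.normal(12, 3, 45),
        rng.exponential(4, 60) + 8,  # right-skewed group
    ]),
})

# Per-group median and size; reuse the same order for every layer
stats = df.groupby("group")["value"].agg(["median", "size"])
order = list(stats.index)

fig, ax = plt.subplots(figsize=(6, 6))

# Box plot with the mean shown as a red triangle
sns.boxplot(data=df, x="group", y="value", order=order, showmeans=True,
            meanprops={"marker": "^", "markerfacecolor": "red",
                       "markeredgecolor": "red", "markersize": 8}, ax=ax)

# Jittered raw observations on top of the boxes
sns.stripplot(data=df, x="group", y="value", order=order,
              color="orange", jitter=0.2, size=2.5, ax=ax)

# Sample size just above each median (box i sits at x position i)
for tick, (median, n) in enumerate(zip(stats["median"], stats["size"])):
    ax.text(tick, median + 0.4, f"n: {n}", horizontalalignment="center",
            size="medium", color="w", weight="semibold")

ax.set_title("Box plot with jitter, means, and group sizes")
fig.savefig("box_with_jitter.png", dpi=150)
```

Passing the same `order` to both seaborn calls and to the label loop keeps the three layers aligned, which is the main thing that goes wrong when these snippets are combined by hand.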

📚 XI. Documentation & External References

Official Documentation: