Bar Plot / Count Plot

I. Purpose

Display counts or aggregated values for categorical variables. Bar and count plots are essential for comparing frequencies, proportions, or summary statistics across groups, and for visualizing class balance in classification problems.

II. Analysis Type

Univariate (count plot: single categorical variable) or Bivariate (bar plot: categorical + numeric, or grouped bar plot)

III. What to Look For

1. Frequency Distribution

2. Class Imbalance

3. Comparisons Across Categories

4. Proportions and Percentages

5. Grouped Comparisons (Hue Parameter)

6. Ordering and Patterns

7. Missing or Rare Categories
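Several of the checks listed above (frequency distribution, proportions, rare categories) can be read off numerically before plotting. The following is a minimal sketch on hypothetical toy data; the column name `category` and the 5% rarity threshold are illustrative assumptions, not recommendations from this guide.

```python
import pandas as pd

# Hypothetical toy data; replace with your own categorical column.
df = pd.DataFrame({"category": ["a"] * 60 + ["b"] * 35 + ["c"] * 4 + ["d"] * 1})

# Absolute counts, most frequent first (the heights a count plot would draw).
counts = df["category"].value_counts()

# The same information expressed as a share of all rows.
proportions = df["category"].value_counts(normalize=True)

# Flag categories below an illustrative threshold as "rare".
RARE_THRESHOLD = 0.05  # assumption: tune for your data
rare_categories = proportions[proportions < RARE_THRESHOLD].index.tolist()

# Missing values are excluded from value_counts by default, so count them separately.
n_missing = int(df["category"].isna().sum())

summary = pd.DataFrame({"count": counts, "percent": (proportions * 100).round(1)})
print(summary)
print("Rare categories:", rare_categories)
print("Missing values:", n_missing)
```

Looking at the table alongside the plot helps when short bars are hard to tell apart visually.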

IV. Interpretation Guide

1. Reading Bar Heights

2. Common Patterns and Their Meanings

| Pattern | Visual Cue | Interpretation | Action |
|---|---|---|---|
| Class imbalance | One bar much taller than others | Severe imbalance in target/classes | Consider resampling or class weights |
| Balanced classes | Bars of similar height | No major imbalance | No action needed |
| Group separation | Bars for different hues well separated | Strong group effect | Investigate group differences |
| Overlapping bars | Bars for different hues overlap | Weak or no group effect | May not be significant |
| Short bars | Very low bars | Rare categories or missing data | Consider combining or removing |
| Zero-height bars | Bar missing for a category | No data for that category | Check data completeness |
| Stacked bars | Bars divided into colored segments | Multiple subgroups within each category | Use for composition analysis |
| Wide error bars | Large uncertainty | High variability in group | Collect more data or use robust stats |
| Narrow error bars | Small uncertainty | Low variability, reliable estimate | Confident in group differences |
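The "Stacked bars" pattern in the table above is used for composition analysis. One way to produce such a chart is to cross-tabulate two categorical columns and plot the result with pandas. This is a sketch on hypothetical data; the column names `category` and `subgroup` are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: a category column and a subgroup column.
df = pd.DataFrame({
    "category": ["x", "x", "x", "y", "y", "z", "z", "z", "z"],
    "subgroup": ["p", "q", "q", "p", "p", "p", "q", "q", "q"],
})

# Rows are categories, columns are subgroups.
# normalize="index" rescales each row so its shares sum to 1.
composition = pd.crosstab(df["category"], df["subgroup"], normalize="index")

# Each bar is split into one colored segment per subgroup.
ax = composition.plot(kind="bar", stacked=True)
ax.set_ylabel("Share within category")
ax.set_title("Composition of each category by subgroup")
plt.tight_layout()
plt.show()
```

Dropping `normalize="index"` stacks raw counts instead of shares, which keeps the total bar height meaningful.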

3. Visual Cues for Imbalance and Group Differences

4. Spotting Outliers and Rare Categories

5. Statistical Significance

V. When to Use Bar/Count Plots

VI. When to Avoid

VII. Disadvantages

VIII. Code Example

import seaborn as sns
import matplotlib.pyplot as plt

# Load the built-in "titanic" dataset from Seaborn
titanic = sns.load_dataset("titanic")

# Count plot - frequency of each category
sns.countplot(x='class', data=titanic)
plt.title("Frequency of Class")
plt.show()

# Bar plot - average value by category
sns.barplot(x='class', y='fare', data=titanic, estimator='mean', errorbar='ci')
plt.title("Average Fare by Class (with 95% CI)")
plt.ylabel("Average Fare")
plt.show()

# Grouped bar plot - comparing across two dimensions
sns.countplot(data=titanic, x='class', hue='survived')
plt.title("Survival Count by Passenger Class")
plt.xlabel("Passenger Class")
plt.ylabel("Count")
plt.legend(title='Survived', labels=['No', 'Yes'])
plt.show()

ML_AI/_feature_engineering/images/bar-1.png
ML_AI/_feature_engineering/images/bar-2.png
ML_AI/_feature_engineering/images/bar-3.png
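The `errorbar='ci'` setting in the bar plot example above asks seaborn for a bootstrap confidence interval around each bar's estimate. To make that idea concrete, here is a minimal percentile-bootstrap sketch in NumPy. It illustrates the technique only; the helper name `bootstrap_ci_mean` and the sample fares are hypothetical, and seaborn's internal implementation may differ in detail.

```python
import numpy as np

def bootstrap_ci_mean(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_means = samples.mean(axis=1)
    # Keep the central `level` percent of the bootstrap means.
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return low, high

# Hypothetical fares for a single group.
fares = [7.25, 8.05, 13.0, 26.0, 30.5, 71.28, 83.1, 151.55]
low, high = bootstrap_ci_mean(fares)
print(f"mean = {np.mean(fares):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

A small or highly variable group yields a wide interval, which is what a long error bar on the plot is signalling.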

IX. Common Pitfalls

  1. Not ordering bars: Use order=df['col'].value_counts().index for descending order
  2. Too many categories: Limit to 10-15 bars; combine rare categories into "Other"
  3. Ignoring error bars: Always check variability, not just means
  4. Wrong estimator: mean vs sum vs median - choose based on data distribution
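The first two pitfalls can be addressed together: lump infrequent categories into "Other", then order the bars by count. The sketch below uses hypothetical data; the column name `col`, the derived column `col_lumped`, and the choice of keeping the top 3 categories are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with a tail of rare categories.
df = pd.DataFrame({"col": ["a"] * 50 + ["b"] * 30 + ["c"] * 15 + ["d"] * 3 + ["e"] * 2})

TOP_N = 3  # assumption: keep the 3 most frequent categories

# Categories to keep, most frequent first.
top_categories = df["col"].value_counts().nlargest(TOP_N).index

# Keep values in the top N; replace everything else with the label "Other".
df["col_lumped"] = df["col"].where(df["col"].isin(top_categories), "Other")

# Descending order for plotting, e.g.
# sns.countplot(x="col_lumped", data=df, order=order)
order = df["col_lumped"].value_counts().index.tolist()
print(df["col_lumped"].value_counts())
```

Writing the result to a new column keeps the original labels available in case the rare categories need a separate look later.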

X. Pro Tips

1. Add Percentage Labels

For count plots with imbalanced categories, add percentage labels on top of bars:

ax = sns.countplot(x='category', data=df, order=df['category'].value_counts().index)
total = len(df)
for p in ax.patches:
    percentage = f'{100 * p.get_height() / total:.1f}%'
    ax.annotate(percentage, 
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=10, fontweight='bold')
plt.title("Category Distribution with Percentages")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

2. Horizontal Bars for Long Names

When category names are long, use horizontal bars to improve readability:

sns.countplot(y='category', data=df, order=df['category'].value_counts().index)

Passing the column to y instead of x draws horizontal bars, so long labels sit along the y-axis and do not overlap.

3. Custom Aggregation Functions

Use custom estimators for specific calculations:

# Median instead of mean (robust to outliers)
sns.barplot(x='category', y='value', data=df, estimator='median', errorbar=('ci', 95))

# Custom function: 90th percentile
import numpy as np
sns.barplot(x='category', y='value', data=df, estimator=lambda x: np.percentile(x, 90))

When to Use Bar vs Count Plots

Use Count Plot when:

  • Counting occurrences of categorical values
  • Checking class balance in classification
  • Exploring categorical feature distributions
  • Only ONE categorical variable

Use Bar Plot when:

  • Aggregating a numeric variable by category (mean, sum, median)
  • Comparing averages across groups
  • TWO variables: one categorical (x), one numeric (y)
  • Need error bars to show uncertainty
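The distinction above comes down to what sets the bar height: a count plot counts rows per category, while a bar plot aggregates a numeric column per category. The pandas sketch below computes both by hand on hypothetical data (the column names `category` and `value` are placeholders), and it also shows how the choice of estimator changes the result when a group contains an outlier.

```python
import pandas as pd

# Hypothetical toy data; group "a" contains one large outlier (30.0).
df = pd.DataFrame({
    "category": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 30.0, 4.0, 5.0, 6.0],
})

# Count plot heights: number of rows in each category.
count_heights = df["category"].value_counts().sort_index()

# Bar plot heights: a summary of the numeric column per category.
bar_heights = df.groupby("category")["value"].agg(["mean", "median", "sum"])

print(count_heights)
print(bar_heights)
```

In group "a" the mean is pulled far above the median by a single value, which is the situation where a median estimator is often the safer choice.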

XI. Documentation & External References

Understanding Class Imbalance

Calculate imbalance ratio:

# Check imbalance
class_counts = df['target'].value_counts()
imbalance_ratio = class_counts.max() / class_counts.min()

print(f"Imbalance Ratio: {imbalance_ratio:.2f}:1")
if imbalance_ratio > 3:
    print("⚠️ Significant imbalance detected - consider resampling")
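When resampling is not desirable, class weights are a common alternative. The sketch below computes inverse-frequency weights, n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for `class_weight='balanced'`. The data and the column name `target` are hypothetical, and whether the weights help depends on the model and metric.

```python
import pandas as pd

# Hypothetical binary target with a 9:1 imbalance.
df = pd.DataFrame({"target": [0] * 90 + [1] * 10})

class_counts = df["target"].value_counts()
n_samples = len(df)
n_classes = class_counts.size

# Inverse-frequency weights: rarer classes receive larger weights.
class_weights = (n_samples / (n_classes * class_counts)).to_dict()
print(class_weights)
```

A dictionary of this form can usually be passed to estimators that accept a per-class weight mapping.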

Imbalance Severity: