Decision Tree: Which Plot to Use?
---
config:
  theme: base
  themeVariables:
    primaryColor: "#6366f1"
    primaryTextColor: "#fff"
    primaryBorderColor: "#4f46e5"
    lineColor: "#64748b"
    secondaryColor: "#8b5cf6"
    tertiaryColor: "#06b6d4"
    noteTextColor: "#1e293b"
    noteBkgColor: "#e0e7ff"
    edgeLabelBackground: "#f8fafc"
---
flowchart TD
Start([What are you analyzing?]):::startStyle
Start --> Univariate[Single Variable<br/>Univariate]:::univariateStyle
Start --> Bivariate[Two Variables<br/>Bivariate]:::bivariateStyle
Start --> Multivariate[Multiple Variables<br/>Multivariate]:::multivariateStyle
Start --> Validation[Model Validation]:::validationStyle
%% Univariate Branch
Univariate --> Continuous1[Continuous<br/>Data]:::dataTypeStyle
Univariate --> Categorical1[Categorical<br/>Data]:::dataTypeStyle
Continuous1 --> UniPlots1[📊 Histogram<br/>📈 KDE Plot<br/>📦 Box Plot<br/>📉 Q-Q Plot]:::plotStyle
Categorical1 --> UniPlots2[📊 Bar Plot<br/>🔢 Count Plot]:::plotStyle
%% Bivariate Branch
Bivariate --> BothCont[Both<br/>Continuous]:::dataTypeStyle
Bivariate --> Mixed[One Continuous<br/>One Categorical]:::dataTypeStyle
Bivariate --> TimeSeries[Time Series]:::dataTypeStyle
BothCont --> BiPlots1[📌 Scatter Plot<br/>📈 LOWESS Plot<br/>🎯 Joint Plot]:::plotStyle
Mixed --> BiPlots2[📦 Box Plot<br/>🎻 Violin Plot<br/>🦟 Strip/Swarm Plot]:::plotStyle
TimeSeries --> BiPlots3[📈 Line Plot]:::plotStyle
%% Multivariate Branch
Multivariate --> Pairwise[All Pairwise<br/>Relationships]:::dataTypeStyle
Multivariate --> Corr[Correlations]:::dataTypeStyle
Multivariate --> HighDim[High<br/>Dimensional<br/>Patterns]:::dataTypeStyle
Multivariate --> MultipleDistributions[Multiple<br/>Distribution]:::dataTypeStyle
Pairwise --> MultiPlots1[🎯 Pair Plot]:::plotStyle
Corr --> MultiPlots2[🔥 Heatmap]:::plotStyle
HighDim --> MultiPlots3[📈 Andrews Curves]:::plotStyle
MultipleDistributions --> MultiPlots4[Multiple<br/>📈 KDE Plots,<br/>📦 Box Plots]:::plotStyle
%% Model Validation Branch
Validation --> Regression[Regression<br/>Diagnostics]:::dataTypeStyle
Validation --> Normality[Normality<br/>Testing]:::dataTypeStyle
Regression --> ValPlots1[📉 Residual<br/>Plot]:::plotStyle
Normality --> ValPlots2[📉 Q-Q Plot<br/>📊 Histogram<br/>with KDE]:::plotStyle
%% Style Definitions
classDef startStyle fill:#6366f1,stroke:#4f46e5,stroke-width:3px,color:#fff,font-weight:bold,font-size:16px
classDef univariateStyle fill:#8b5cf6,stroke:#7c3aed,stroke-width:2px,color:#fff,font-weight:bold
classDef bivariateStyle fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#fff,font-weight:bold
classDef multivariateStyle fill:#10b981,stroke:#059669,stroke-width:2px,color:#fff,font-weight:bold
classDef validationStyle fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#fff,font-weight:bold
classDef dataTypeStyle fill:#e0e7ff,stroke:#6366f1,stroke-width:2px,color:#1e293b,font-weight:bold
classDef plotStyle fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#1e293b,font-size:13px
Color Legend:
- 🟣 Purple - Univariate Analysis
- 🔵 Cyan - Bivariate Analysis
- 🟢 Green - Multivariate Analysis
- 🟠 Orange - Model Validation
- 📋 Light Blue - Data Type Categories
- 📊 Yellow - Recommended Plot Types
Summary Table: Plot Selection Guide
| Plot Type | Analysis Type | Best Use Case | Key Insight |
|---|---|---|---|
| 📊 Bar Plot, 🔢 Count Plot | Univariate/Bivariate | Categorical frequencies | Class imbalance, category comparisons |
| 📦 Box Plot | Univariate/Bivariate | Distribution summary, outliers | Median, quartiles, outliers across groups |
| 🔥 Heatmap | Multivariate | Correlation matrix | Multicollinearity, feature relationships |
| 📊 Histogram Plot | Univariate | Distribution shape, normality | Frequency distribution, skewness, outliers |
| 🎯 Joint Plot | Bivariate | Relationship + marginals | Correlation with univariate distributions |
| 📈 KDE Plot | Univariate | Smooth distribution | Distribution shape, group comparisons |
| 📈 Line Plot | Bivariate | Time series, trends | Temporal patterns, seasonality, trends |
| 📈 LOWESS Plot | Bivariate | Non-linear trend detection | True relationship without assuming linearity |
| 🎯 Pair Plot | Multivariate | All pairwise relationships | Correlations, patterns across all variables |
| 📉 Q-Q Plot | Univariate | Normality testing | Distribution comparison to normal |
| 📉 Residual Plot | Bivariate | Regression diagnostics | Model fit, assumptions validation |
| 📌 Scatter Plot | Bivariate | Relationship between two continuous vars | Correlation, linearity, outliers |
| 🎻 Violin Plot | Univariate/Bivariate | Detailed distribution shape | Density, multimodality, group comparisons |
| 📈 Andrews Curves | Multivariate | High-dimensional patterns | Cluster separation, class distinction |
| 🦟 Strip/Swarm Plot | Bivariate | Individual points by category | Actual data points, distribution by group |
Quick Reference: Choosing the Right Plot
I. For Understanding Single Variables (Univariate)
- Distribution shape: Histogram, KDE Plot, Violin Plot
- Normality testing: Q-Q Plot, Histogram with KDE
- Summary statistics: Box Plot
- Categorical frequencies: Bar Plot, Count Plot
II. For Understanding Relationships (Bivariate)
- Two continuous variables: Scatter Plot, LOWESS Plot, Joint Plot
- Continuous vs categorical: Box Plot, Violin Plot, Strip Plot
- Time series: Line Plot
- Model diagnostics: Residual Plot
III. For Understanding Multiple Variables (Multivariate)
- All pairwise relationships: Pair Plot
- Correlations: Heatmap
- High-dimensional patterns: Andrews Curves
- Multiple distributions: Multiple KDE Plots, Multiple Box Plots
IV. For Feature Engineering Decisions
- Detecting non-linearity: Scatter Plot with LOWESS, Residual Plot
- Identifying outliers: Box Plot, Scatter Plot, Strip Plot
- Checking normality: Q-Q Plot, Histogram
- Finding correlations: Heatmap, Pair Plot
- Comparing groups: Box Plot, Violin Plot, KDE Plot
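The univariate, bivariate, and multivariate choices above can be encoded as a small lookup table. This is only a sketch: `PLOT_GUIDE`, `suggest_plots`, and the key names are hypothetical and not part of any plotting library.

```python
# Hypothetical lookup mirroring the quick reference; the keys are
# illustrative names chosen for this sketch.
PLOT_GUIDE = {
    ("univariate", "continuous"): ["Histogram", "KDE Plot", "Box Plot", "Q-Q Plot"],
    ("univariate", "categorical"): ["Bar Plot", "Count Plot"],
    ("bivariate", "both_continuous"): ["Scatter Plot", "LOWESS Plot", "Joint Plot"],
    ("bivariate", "continuous_vs_categorical"): ["Box Plot", "Violin Plot", "Strip/Swarm Plot"],
    ("bivariate", "time_series"): ["Line Plot"],
    ("multivariate", "pairwise"): ["Pair Plot"],
    ("multivariate", "correlations"): ["Heatmap"],
    ("multivariate", "high_dimensional"): ["Andrews Curves"],
    ("multivariate", "multiple_distributions"): ["KDE Plots", "Box Plots"],
}


def suggest_plots(analysis: str, data_kind: str) -> list[str]:
    """Return candidate plots, or an empty list for an unknown combination."""
    key = (analysis.strip().lower(), data_kind.strip().lower())
    return list(PLOT_GUIDE.get(key, []))


print(suggest_plots("Bivariate", "time_series"))  # ['Line Plot']
```

Returning an empty list for unknown combinations keeps the helper safe to call from exploratory code without extra error handling.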
Common Customization Patterns (Matplotlib & Seaborn)
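import matplotlib.pyplot as plt
import seaborn as sns

# --- Seaborn Styling (set before creating the figure so it applies to new axes) ---
sns.set_style("whitegrid") # Options: whitegrid, darkgrid, white, dark, ticks
# Alternate: plt.style.use('seaborn-v0_8-darkgrid')
sns.set_context("notebook") # Options: paper, notebook, talk, poster
sns.set_palette("Set2") # Options: Set1, Set2, husl, colorblind, etc.
# One-call alternative for larger plots; set_theme also resets the palette unless palette= is passed:
# sns.set_theme(style="whitegrid", context="talk", palette="husl")
# --- Figure Setup (use one; each call creates a new figure) ---
# plt.figure(figsize=(12, 6)) # pyplot state-machine style
fig, ax = plt.subplots(figsize=(12, 6)) # object-oriented style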
# --- Titles and Labels ---
plt.title("Professional Title", fontsize=18, fontweight='bold', pad=20)
plt.xlabel("X Axis Label", fontsize=14)
plt.ylabel("Y Axis Label", fontsize=14)
# --- Grid and Axes ---
plt.grid(True, alpha=0.3, linestyle='--')
plt.xlim(0, 100)
plt.ylim(0, 50)
# --- Legends (needs at least one plotted artist with label=...) ---
plt.legend(
title='Legend Title', loc='best', frameon=True,
shadow=True, fontsize=11, title_fontsize=12
)
# --- Ticks and Layout ---
plt.xticks(rotation=45, ha='right')
plt.tight_layout() # Prevents label cutoff
# --- Save Figure (the output/ directory must already exist; save before plt.show()) ---
plt.savefig('output/figure.png', dpi=300, bbox_inches='tight',
facecolor='white', edgecolor='none')
plt.show()
Best Practices
- Use `figsize` to ensure readability in presentations and papers.
- Choose a color palette that is colorblind-friendly for accessibility.
- Always add clear titles, axis labels, and legends.
- Use `tight_layout()` to avoid label cutoff.
- Rotate x-axis labels for long category names.
- Save figures with high DPI for publication-quality output.
- Use `sns.set_theme()` for consistent styling across all plots.
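A minimal sketch of these practices applied together might look like the following. The data is synthetic, the category names are made up for illustration, and the figure is written to a temporary directory rather than a fixed path.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Consistent, colorblind-friendly styling, set before any figure is created
sns.set_theme(style="whitegrid", context="notebook", palette="colorblind")

# Synthetic data (assumption): mean scores for four categories with long names
rng = np.random.default_rng(0)
categories = ["Category Alpha", "Category Beta", "Category Gamma", "Category Delta"]
means = rng.uniform(20, 80, size=len(categories))

fig, ax = plt.subplots(figsize=(10, 5))  # explicit figsize for readability
ax.bar(categories, means, label="Mean score")
ax.set_title("Mean Score by Category", fontsize=16, fontweight="bold")
ax.set_xlabel("Category")
ax.set_ylabel("Mean score")
ax.legend(loc="best")
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")  # rotate long labels
fig.tight_layout()  # avoid label cutoff

out_path = os.path.join(tempfile.mkdtemp(), "figure.png")
fig.savefig(out_path, dpi=300, bbox_inches="tight")  # high DPI for publication
plt.close(fig)
```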
Plotly for Interactive Visualizations
Plotly Quick Reference
Plotly creates interactive plots with zoom, pan, and hover capabilities - ideal for dashboards and presentations.
Basic Plotly Syntax
import plotly.graph_objects as go
import plotly.express as px
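# Quick express plots (df is assumed to be a pandas DataFrame with these columns;
# each call returns a new figure, so keep only the one you need)
fig = px.scatter(df, x='col1', y='col2', color='category')
fig = px.line(df, x='date', y='value')
fig = px.box(df, x='category', y='value')
# Graph objects for customization (x_data and y_data are assumed to be array-likes)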
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_data, y=y_data, mode='lines+markers', name='Series'))
fig.update_layout(title='Interactive Plot', xaxis_title='X', yaxis_title='Y')
fig.show()
Appendix: Additional Resources
See each library's official documentation for full API details.
Library Comparison
| Library | Best For | Key Strengths | Limitations |
|---|---|---|---|
| Matplotlib | Publications, full control | Industry standard, highly customizable | Verbose, requires more code |
| Seaborn | Statistical analysis, EDA | Beautiful defaults, statistical plots | Limited interactivity |
| Pandas | Quick EDA | Direct from DataFrame, minimal code | Basic functionality only |
| Plotly | Dashboards, presentations | Interactive, web-based | Larger files, browser required |
Pro Tips Summary Table
| Task | Plot | Key Parameter/Technique |
|---|---|---|
| Detect non-linearity | Scatter + LOWESS | sns.regplot(..., lowess=True) - curved line = non-linear |
| Check normality | Q-Q Plot, Histogram | Points close to the diagonal line = approximately normal |
| Find outliers | Box Plot, Scatter | Points beyond whiskers or far from cluster |
| Check model fit | Residual Plot | Random scatter around zero = good fit |
| Compare groups | Box/Violin Plot | Non-overlapping boxes suggest a difference; confirm with a statistical test |
| Find correlations | Heatmap, Pair Plot | annot=True, center=0 with diverging colormap |
| Smooth noisy data | Line + Rolling Avg | df['rolling'] = df['val'].rolling(window=7).mean() |
| Multi-class separation | Pair Plot with hue | sns.pairplot(df, hue='target') |
| Distribution shape | Histogram + KDE | kde=True - symmetric bell shape suggests approximate normality |
| Time trends | Line Plot | Add grid with plt.grid(True, alpha=0.3) |
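The "Smooth noisy data" row can be sketched as follows, assuming a daily series with a column named `val` as in the table. The data is synthetic and the 7-day window is just the value used in the table, not a recommendation.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (assumption): a noisy upward trend over 60 days
rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", periods=60, freq="D")
df = pd.DataFrame({"date": dates,
                   "val": np.linspace(10, 20, 60) + rng.normal(0, 2, 60)})

# 7-day rolling mean; the first 6 rows are NaN because the window is incomplete
df["rolling"] = df["val"].rolling(window=7).mean()

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(df["date"], df["val"], alpha=0.4, label="Raw")
ax.plot(df["date"], df["rolling"], linewidth=2, label="7-day rolling mean")
ax.grid(True, alpha=0.3)
ax.set_xlabel("Date")
ax.set_ylabel("Value")
ax.legend()
fig.autofmt_xdate()  # tilt date labels so they do not overlap
```

Passing `min_periods=1` to `rolling` would fill the leading rows, at the cost of averaging over fewer points there.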
Common Mistakes to Avoid
🚫 Using scatter plots for large datasets without transparency/sampling
🚫 Forgetting to scale features before Andrews curves
🚫 Using rainbow colormaps for continuous data (use sequential or diverging)
🚫 Not checking assumptions before statistical tests (use Q-Q plots!)
🚫 Ignoring outliers visible in plots
🚫 Creating plots without proper labels and titles
🚫 Using 3D plots when 2D would be clearer
🚫 Not considering your audience's expertise level
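To illustrate the point about scaling before Andrews curves, here is a sketch that z-scores the features with plain pandas and then calls `pandas.plotting.andrews_curves`. The `income`, `age`, and `label` columns are synthetic and hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves

# Synthetic data (assumption): two classes, features on very different scales
rng = np.random.default_rng(1)
n = 50
df = pd.DataFrame({
    "income": np.concatenate([rng.normal(40_000, 5_000, n), rng.normal(70_000, 5_000, n)]),
    "age": np.concatenate([rng.normal(30, 4, n), rng.normal(45, 4, n)]),
    "label": ["A"] * n + ["B"] * n,
})

features = ["income", "age"]
scaled = df.copy()
# z-score each feature so none dominates the curves purely because of its units
scaled[features] = (df[features] - df[features].mean()) / df[features].std()

fig, ax = plt.subplots(figsize=(10, 5))
andrews_curves(scaled, "label", ax=ax, alpha=0.5)
ax.set_title("Andrews Curves on Standardized Features")
```

Without the scaling step, the `income` term would be thousands of times larger than the `age` term, and the curves would mostly reflect income alone.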
Real-World Applications
Feature Engineering
- Use scatter + LOWESS to decide if you need polynomial features
- Use Q-Q plots to determine if transformation is needed
- Use heatmaps to detect multicollinearity before modeling
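As a sketch of the multicollinearity check, the following builds a correlation heatmap on synthetic features and lists highly correlated pairs. The 0.9 threshold and the column names are assumptions for illustration, not a fixed rule.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (assumption): x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 * 0.95 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

corr = df.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# Diverging colormap centered at zero so positive and negative correlations read symmetrically
sns.heatmap(corr, annot=True, fmt=".2f", center=0, vmin=-1, vmax=1,
            cmap="coolwarm", ax=ax)
ax.set_title("Feature Correlation Matrix")

# Flag pairs above an illustrative threshold
threshold = 0.9
high_pairs = [(a, b) for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if abs(corr.loc[a, b]) > threshold]
print(high_pairs)
```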
Model Selection
- Use residual plots to validate linear regression assumptions
- Use pair plots with target to assess separability for classification
- Use box plots to check if feature distributions differ across classes
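The residual-plot check can be sketched with NumPy alone. Here a straight line is deliberately fitted to synthetic quadratic data, so the residuals are expected to show a U-shaped pattern instead of random scatter around zero.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (assumption): the true relationship is quadratic
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 100)
y = 2.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

# Least-squares straight line; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(fitted, residuals, alpha=0.6)
ax.axhline(0, color="gray", linestyle="--")
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
ax.set_title("Residuals vs Fitted (linear fit to curved data)")
```

Residuals that are positive at both ends and negative in the middle indicate that the linear model is missing curvature.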
Data Quality
- Use box plots to detect outliers systematically
- Use line plots to spot anomalies in time series
- Use histograms to identify data entry errors (unexpected modes)
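To make the box-plot outlier check systematic, one option is to apply the same 1.5 × IQR rule that Matplotlib's default whiskers use. The values below are made up for illustration.

```python
import pandas as pd

# Synthetic data (assumption): one obviously extreme value
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95], name="val")

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
# Same 1.5 * IQR fences that default box plot whiskers are based on
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # [95]
```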