Decision Tree: Which Plot to Use?

---
config:
  theme: base
  themeVariables:
    primaryColor: "#6366f1"
    primaryTextColor: "#fff"
    primaryBorderColor: "#4f46e5"
    lineColor: "#64748b"
    secondaryColor: "#8b5cf6"
    tertiaryColor: "#06b6d4"
    noteTextColor: "#1e293b"
    noteBkgColor: "#e0e7ff"
    edgeLabelBackground: "#f8fafc"
---
flowchart TD
    Start([What are you analyzing?]):::startStyle

    Start --> Univariate[Single Variable<br/>Univariate]:::univariateStyle
    Start --> Bivariate[Two Variables<br/>Bivariate]:::bivariateStyle
    Start --> Multivariate[Multiple Variables<br/>Multivariate]:::multivariateStyle
    Start --> Validation[Model Validation]:::validationStyle

    %% Univariate Branch
    Univariate --> Continuous1[Continuous<br/>Data]:::dataTypeStyle
    Univariate --> Categorical1[Categorical<br/>Data]:::dataTypeStyle
    Continuous1 --> UniPlots1[📊 Histogram<br/>📈 KDE Plot<br/>📦 Box Plot<br/>📉 Q-Q Plot]:::plotStyle
    Categorical1 --> UniPlots2[📊 Bar Plot<br/>🔢 Count Plot]:::plotStyle

    %% Bivariate Branch
    Bivariate --> BothCont[Both<br/>Continuous]:::dataTypeStyle
    Bivariate --> Mixed[One Continuous<br/>One Categorical]:::dataTypeStyle
    Bivariate --> TimeSeries[Time Series]:::dataTypeStyle
    BothCont --> BiPlots1[📌 Scatter Plot<br/>📈 LOWESS Plot<br/>🎯 Joint Plot]:::plotStyle
    Mixed --> BiPlots2[📦 Box Plot<br/>🎻 Violin Plot<br/>🦟 Strip/Swarm Plot]:::plotStyle
    TimeSeries --> BiPlots3[📈 Line Plot]:::plotStyle

    %% Multivariate Branch
    Multivariate --> Pairwise[All Pairwise<br/>Relationships]:::dataTypeStyle
    Multivariate --> Corr[Correlations]:::dataTypeStyle
    Multivariate --> HighDim[High<br/>Dimensional<br/>Patterns]:::dataTypeStyle
    Multivariate --> MultipleDistributions[Multiple<br/>Distribution]:::dataTypeStyle
    Pairwise --> MultiPlots1[🎯 Pair Plot]:::plotStyle
    Corr --> MultiPlots2[🔥 Heatmap]:::plotStyle
    HighDim --> MultiPlots3[📈 Andrews Curves]:::plotStyle
    MultipleDistributions --> MultiPlots4[Multiple<br/>📈 KDE Plots,<br/>📦 Box Plots]:::plotStyle

    %% Model Validation Branch
    Validation --> Regression[Regression<br/>Diagnostics]:::dataTypeStyle
    Validation --> Normality[Normality<br/>Testing]:::dataTypeStyle
    Regression --> ValPlots1[📉 Residual<br/>Plot]:::plotStyle
    Normality --> ValPlots2[📉 Q-Q Plot<br/>📊 Histogram<br/>with KDE]:::plotStyle

    %% Style Definitions
    classDef startStyle fill:#6366f1,stroke:#4f46e5,stroke-width:3px,color:#fff,font-weight:bold,font-size:16px
    classDef univariateStyle fill:#8b5cf6,stroke:#7c3aed,stroke-width:2px,color:#fff,font-weight:bold
    classDef bivariateStyle fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#fff,font-weight:bold
    classDef multivariateStyle fill:#10b981,stroke:#059669,stroke-width:2px,color:#fff,font-weight:bold
    classDef validationStyle fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#fff,font-weight:bold
    classDef dataTypeStyle fill:#e0e7ff,stroke:#6366f1,stroke-width:2px,color:#1e293b,font-weight:bold
    classDef plotStyle fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#1e293b,font-size:13px

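Color Legend:

  • Indigo (#6366f1): starting question
  • Violet (#8b5cf6): univariate analysis
  • Cyan (#06b6d4): bivariate analysis
  • Green (#10b981): multivariate analysis
  • Amber (#f59e0b): model validation
  • Light indigo (#e0e7ff): data type or analysis goal
  • Light yellow (#fef3c7): recommended plots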

Summary Table: Plot Selection Guide

| Plot Type | Analysis Type | Best Use Case | Key Insight |
|---|---|---|---|
| 📊 Bar Plot, 🔢 Count Plot | Univariate/Bivariate | Categorical frequencies | Class imbalance, category comparisons |
| 📦 Box Plot | Univariate/Bivariate | Distribution summary, outliers | Median, quartiles, outliers across groups |
| 🔥 Heatmap | Multivariate | Correlation matrix | Multicollinearity, feature relationships |
| 📊 Histogram Plot | Univariate | Distribution shape, normality | Frequency distribution, skewness, outliers |
| 🎯 Joint Plot | Bivariate | Relationship + marginals | Correlation with univariate distributions |
| 📈 KDE Plot | Univariate | Smooth distribution | Distribution shape, group comparisons |
| 📈 Line Plot | Bivariate | Time series, trends | Temporal patterns, seasonality, trends |
| 📈 LOWESS Plot | Bivariate | Non-linear trend detection | True relationship without assuming linearity |
| 🎯 Pair Plot | Multivariate | All pairwise relationships | Correlations, patterns across all variables |
| 📉 Q-Q Plot | Univariate | Normality testing | Distribution comparison to normal |
| 📉 Residual Plot | Bivariate | Regression diagnostics | Model fit, assumptions validation |
| 📌 Scatter Plot | Bivariate | Relationship between two continuous vars | Correlation, linearity, outliers |
| 🎻 Violin Plot | Univariate/Bivariate | Detailed distribution shape | Density, multimodality, group comparisons |
| 📈 Andrews Curves | Multivariate | High-dimensional patterns | Cluster separation, class distinction |
| 🦟 Strip/Swarm Plot | Bivariate | Individual points by category | Actual data points, distribution by group |

Quick Reference: Choosing the Right Plot

I. For Understanding Single Variables (Univariate)

II. For Understanding Relationships (Bivariate)
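For two variables, the choice depends on the data types: a scatter plot when both are continuous, and a box, violin, or strip plot when one is categorical. A minimal sketch, assuming a hypothetical DataFrame with columns `x`, `y`, and `group` (all names and values are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two related continuous columns and one categorical column.
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=n),
    "x": rng.normal(50, 10, size=n),
})
df["y"] = 0.8 * df["x"] + rng.normal(0, 5, size=n)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Both continuous: scatter plot, with transparency to reveal overlap.
sns.scatterplot(data=df, x="x", y="y", alpha=0.6, ax=axes[0])
axes[0].set_title("Scatter Plot")

# One continuous, one categorical: violin plot shows density per group.
sns.violinplot(data=df, x="group", y="y", ax=axes[1])
axes[1].set_title("Violin Plot")

# Same pairing: strip plot shows the individual observations.
sns.stripplot(data=df, x="group", y="y", size=3, alpha=0.5, ax=axes[2])
axes[2].set_title("Strip Plot")

fig.tight_layout()
```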

III. For Understanding Multiple Variables (Multivariate)
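For several variables at once, the tree suggests a heatmap for correlations and Andrews curves for high-dimensional patterns. The sketch below shows both on synthetic data. The feature names (`f1` to `f4`), the class labels, and the z-score scaling step are illustrative assumptions; the scaling is there because Andrews curves are sensitive to feature scale.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: three classes, four features on very different scales.
rng = np.random.default_rng(2)
n = 150
labels = np.repeat(["class_0", "class_1", "class_2"], n // 3)
offsets = np.repeat([0.0, 2.0, 4.0], n // 3)
features = pd.DataFrame({
    "f1": rng.normal(0, 1, n) + offsets,
    "f2": (rng.normal(0, 1, n) + offsets) * 100,
    "f3": rng.normal(0, 1, n),
})
features["f4"] = 0.9 * features["f1"] + rng.normal(0, 0.3, n)  # nearly collinear with f1

corr = features.corr()

# Standardize each feature (z-score) before drawing Andrews curves.
scaled = (features - features.mean()) / features.std()
scaled["target"] = labels

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Diverging colormap centered at zero, with the coefficients annotated.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            center=0, vmin=-1, vmax=1, ax=axes[0])
axes[0].set_title("Correlation Heatmap")

# One curve per row, colored by class.
pd.plotting.andrews_curves(scaled, "target", ax=axes[1])
axes[1].set_title("Andrews Curves (scaled features)")

fig.tight_layout()
```

The `f1`/`f4` cell of the heatmap should stand out as strongly positive, which is the kind of pair to review for multicollinearity.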

IV. For Feature Engineering Decisions


Common Customization Patterns (Matplotlib & Seaborn)

import matplotlib.pyplot as plt
import seaborn as sns

# --- Seaborn Styling (apply before creating the figure) ---
# set_theme() sets style, context, and palette together. Calling it after
# set_palette() resets the palette, so configure everything in one call.
sns.set_theme(style="whitegrid", context="notebook", palette="Set2")
# style:   whitegrid, darkgrid, white, dark, ticks
# context: paper, notebook, talk, poster ("talk" suits larger plots)
# palette: Set1, Set2, husl, colorblind, etc.
# Piecewise alternatives: sns.set_style(), sns.set_context(), sns.set_palette()
# Matplotlib-only alternative: plt.style.use('seaborn-v0_8-darkgrid')

# --- Figure Setup ---
fig, ax = plt.subplots(figsize=(12, 6))  # one figure; no separate plt.figure() call needed
# ... draw the plot here, passing label=... to artists that belong in the legend ...

# --- Titles and Labels ---
plt.title("Professional Title", fontsize=18, fontweight='bold', pad=20)
plt.xlabel("X Axis Label", fontsize=14)
plt.ylabel("Y Axis Label", fontsize=14)

# --- Grid and Axes ---
plt.grid(True, alpha=0.3, linestyle='--')
plt.xlim(0, 100)
plt.ylim(0, 50)

# --- Legends ---
# Requires at least one labeled artist on the axes.
plt.legend(
    title='Legend Title', loc='best', frameon=True,
    shadow=True, fontsize=11, title_fontsize=12
)

# --- Ticks and Layout ---
plt.xticks(rotation=45, ha='right')
plt.tight_layout()  # Prevents label cutoff

# --- Save Figure ---
# The output/ directory must already exist; save before calling plt.show().
plt.savefig('output/figure.png', dpi=300, bbox_inches='tight',
            facecolor='white', edgecolor='none')
plt.show()

Best Practices

  • Use figsize to ensure readability in presentations and papers.
  • Choose a color palette that is colorblind-friendly for accessibility.
  • Always add clear titles, axis labels, and legends.
  • Use tight_layout() to avoid label cutoff.
  • Rotate x-axis labels for long category names.
  • Save figures with high DPI for publication-quality output.
  • Use sns.set_theme() for consistent styling across all plots.
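
The practices above can be combined in one short script. This is a sketch under stated assumptions: the category names and counts are made up, the figure is written to a temporary directory rather than a real project path, and it uses the object-oriented `Axes` methods instead of the `plt.*` calls.

```python
from pathlib import Path
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Consistent styling with a colorblind-friendly palette.
sns.set_theme(style="whitegrid", context="notebook", palette="colorblind")

# Hypothetical data with long category names.
categories = ["Enterprise customers", "Small businesses",
              "Individual users", "Public sector"]
counts = [23, 17, 35, 29]

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(categories, counts, label="Accounts")

# Clear title, axis labels, and legend.
ax.set_title("Accounts by Segment", fontsize=16, fontweight="bold", pad=15)
ax.set_xlabel("Segment")
ax.set_ylabel("Number of accounts")
ax.legend(title="Series", loc="best")

# Rotate long x-axis labels and prevent cutoff.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
fig.tight_layout()

# Create the target directory first, then save at high DPI.
out_dir = Path(tempfile.mkdtemp()) / "output"
out_dir.mkdir(parents=True, exist_ok=True)
out_path = out_dir / "figure.png"
fig.savefig(out_path, dpi=300, bbox_inches="tight", facecolor="white")
```

Creating the directory with `mkdir(parents=True, exist_ok=True)` avoids the `FileNotFoundError` that `savefig` raises when the folder is missing.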

Plotly for Interactive Visualizations

Plotly Quick Reference

Plotly creates interactive plots with zoom, pan, and hover capabilities - ideal for dashboards and presentations.

Basic Plotly Syntax

import plotly.express as px
import plotly.graph_objects as go

# Assumes a pandas DataFrame `df` with columns col1, col2, category, date, value.

# Quick express plots (each call returns its own figure)
scatter_fig = px.scatter(df, x='col1', y='col2', color='category')
line_fig = px.line(df, x='date', y='value')
box_fig = px.box(df, x='category', y='value')

# Graph objects for customization
# (x_data and y_data are any two sequences of equal length)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_data, y=y_data, mode='lines+markers', name='Series'))
fig.update_layout(title='Interactive Plot', xaxis_title='X', yaxis_title='Y')
fig.show()

Appendix: Additional Resources

Library's official documentation

Documentation Links

Library Comparison

| Library | Best For | Key Strengths | Limitations |
|---|---|---|---|
| Matplotlib | Publications, full control | Industry standard, highly customizable | Verbose, requires more code |
| Seaborn | Statistical analysis, EDA | Beautiful defaults, statistical plots | Limited interactivity |
| Pandas | Quick EDA | Direct from DataFrame, minimal code | Basic functionality only |
| Plotly | Dashboards, presentations | Interactive, web-based | Larger files, browser required |

Pro Tips Summary Table

| Task | Plot | Key Parameter/Technique |
|---|---|---|
| Detect non-linearity | Scatter + LOWESS | sns.regplot(..., lowess=True); a curved line suggests a non-linear relationship |
| Check normality | Q-Q Plot, Histogram | Points close to the diagonal line suggest an approximately normal distribution |
| Find outliers | Box Plot, Scatter | Points beyond the whiskers or far from the main cluster |
| Check model fit | Residual Plot | Random scatter around zero suggests a good fit |
| Compare groups | Box/Violin Plot | Non-overlapping boxes suggest a group difference (confirm with a statistical test) |
| Find correlations | Heatmap, Pair Plot | annot=True, center=0 with a diverging colormap |
| Smooth noisy data | Line + Rolling Avg | df['rolling'] = df['val'].rolling(window=7).mean() |
| Multi-class separation | Pair Plot with hue | sns.pairplot(df, hue='target') |
| Distribution shape | Histogram + KDE | kde=True; a bell shape suggests normality |
| Time trends | Line Plot | Add a grid with plt.grid(True, alpha=0.3) |

Common Mistakes to Avoid

🚫 Using scatter plots for large datasets without transparency/sampling
🚫 Forgetting to scale features before Andrews curves
🚫 Using rainbow colormaps for continuous data (use sequential or diverging)
🚫 Not checking assumptions before statistical tests (use Q-Q plots!)
🚫 Ignoring outliers visible in plots
🚫 Creating plots without proper labels and titles
🚫 Using 3D plots when 2D would be clearer
🚫 Not considering your audience's expertise level
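
The first mistake in the list, overplotted scatter plots, can be handled as sketched below. The data is synthetic, and the sample size of 5,000 and `alpha=0.2` are arbitrary illustrative choices. The hexbin panel is an alternative not covered in this guide, included for comparison; it uses a sequential colormap, in line with the colormap advice above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical large dataset: 100,000 correlated points.
rng = np.random.default_rng(4)
n = 100_000
big = pd.DataFrame({"x": rng.normal(size=n)})
big["y"] = 0.5 * big["x"] + rng.normal(scale=0.8, size=n)

# Option 1: plot a random sample with transparency.
sample = big.sample(n=5_000, random_state=0)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].scatter(sample["x"], sample["y"], s=5, alpha=0.2)
axes[0].set_title("Sampled scatter (alpha=0.2)")

# Option 2 (swapped-in alternative): bin all points and color by count.
hb = axes[1].hexbin(big["x"], big["y"], gridsize=50, cmap="viridis")
fig.colorbar(hb, ax=axes[1], label="Points per bin")
axes[1].set_title("Hexbin of all points")

fig.tight_layout()
```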

Real-World Applications

Feature Engineering

  • Use scatter + LOWESS to decide if you need polynomial features
  • Use Q-Q plots to determine if transformation is needed
  • Use heatmaps to detect multicollinearity before modeling
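
As a sketch of the first point, the example below overlays a LOWESS curve on a scatter plot and then compares linear and quadratic fits numerically. The data is synthetic with a known quadratic relationship, and the R² comparison via `np.polyfit` is an illustrative add-on, not a method prescribed by this guide. Note that `sns.regplot(..., lowess=True)` needs the `statsmodels` package installed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical data with a quadratic relationship plus noise.
rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, size=200)
y = x ** 2 + rng.normal(0, 1, size=200)

# Visual check: a clearly curved LOWESS line suggests non-linearity.
fig, ax = plt.subplots(figsize=(8, 5))
sns.regplot(x=x, y=y, lowess=True,
            scatter_kws={"alpha": 0.4}, line_kws={"color": "red"}, ax=ax)
ax.set_title("Scatter + LOWESS")


def r_squared(degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    pred = np.polyval(coefs, x)
    ss_res = ((y - pred) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot


# Numeric check: a large gain from degree 1 to degree 2 supports
# adding a squared feature.
r2_linear = r_squared(1)
r2_quadratic = r_squared(2)
```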

Model Selection

  • Use residual plots to validate linear regression assumptions
  • Use pair plots with target to assess separability for classification
  • Use box plots to check if feature distributions differ across classes
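
A residual plot for the first point can be built by hand, as in this minimal sketch. The data is synthetic and truly linear, and the fit uses `np.polyfit` as a stand-in for whatever regression model is actually in use.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data from a linear model: y = 3 + 2x + noise.
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=200)

# Ordinary least-squares line (polyfit returns highest degree first).
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x
residuals = y - fitted

# Residuals vs fitted values: look for random scatter around zero.
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(fitted, residuals, alpha=0.6)
ax.axhline(0, color="red", linestyle="--")
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
ax.set_title("Residual Plot")
fig.tight_layout()
```

A funnel shape (spread growing with the fitted value) or a curve in this plot would point to non-constant variance or a missing non-linear term.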

Data Quality

  • Use box plots to detect outliers systematically
  • Use line plots to spot anomalies in time series
  • Use histograms to identify data entry errors (unexpected modes)
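
To make box-plot outlier detection systematic, the whisker rule can be applied directly to the data. The sketch below uses the 1.5 × IQR rule, which is the default whisker length in Matplotlib and seaborn box plots. The values are synthetic, with two planted anomalies (5.0 and 120.0).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical measurements with two planted anomalies.
rng = np.random.default_rng(7)
values = np.concatenate([rng.normal(50, 5, size=200), [5.0, 120.0]])

# 1.5 * IQR fences, matching the default box-plot whisker rule.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

# The same points appear as fliers beyond the whiskers.
fig, ax = plt.subplots(figsize=(4, 6))
ax.boxplot(values)
ax.set_title(f"{outliers.size} points outside the whiskers")
fig.tight_layout()
```

Flagged points are candidates for review, not automatic removal; some may be valid extreme observations.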