Data Visualization — Matplotlib, Seaborn & Interactive Charts
Why Visualization Matters in Data Science
A chart can communicate in seconds what a table of numbers cannot communicate in minutes. But bad charts can also mislead — the same dataset can "show" opposite conclusions depending on axis limits, bin sizes, and color choices. This lesson teaches you to produce honest, clear, and beautiful visualizations.
The Python visualization ecosystem:
We focus on matplotlib and seaborn: together they cover the large majority of day-to-day data science plotting needs, and understanding them makes every other library easier to learn.
Matplotlib Architecture: The Artist Hierarchy
Matplotlib has two APIs. Most tutorials mix them up, creating confusion. Understanding the architecture makes both clear.
Every matplotlib plot is a tree of Artist objects:
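As a rough illustration of that tree, the sketch below walks a small figure and prints the class name of each Artist it finds. The helper name describe_artists and the depth limit are assumptions made for this example, not part of matplotlib; get_children() is the standard Artist method it relies on.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist


def describe_artists(artist, depth=0, max_depth=2):
    """Return an indented outline (list of strings) of an Artist and its children.

    Hypothetical helper for illustration; max_depth keeps the output short.
    """
    lines = ["  " * depth + type(artist).__name__]
    if depth < max_depth:
        for child in artist.get_children():
            lines.extend(describe_artists(child, depth + 1, max_depth))
    return lines


# Figure is the root, Axes is its child, and the plotted line is a child of the Axes.
fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4], label="y = x^2")
ax.set_title("Artist tree demo")

outline = describe_artists(fig)
print("\n".join(outline))
```

The exact list of children (spines, axis objects, background patches) varies between matplotlib versions, so treat the printed outline as a map of the structure rather than a fixed specification.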
Two APIs, one hierarchy:
The pyplot interface (plt.plot(...), plt.title(...)) is a state machine that tracks the "current figure" and "current axes". It operates implicitly on the current Axes, which is usually the most recently created or activated one. This is convenient for one-off plots in a Jupyter cell but error-prone in scripts and functions, where the current Axes may not be the one you intended.
The object-oriented (OO) API (fig, ax = plt.subplots()) gives you explicit handles. ax.plot() operates on that specific Axes, not whatever happens to be current. Always use this in production code, functions, and any code with more than one subplot.
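The difference is easiest to see with two subplots. The following is a small sketch with made-up data: the OO calls name their target Axes, while the single pyplot call lands on whichever Axes is current, which here is the last one created.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]

# OO style: every call names the Axes it draws on.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))
ax_left.plot(x, [v ** 2 for v in x])
ax_left.set_title("quadratic")
ax_right.plot(x, [v ** 3 for v in x])
ax_right.set_title("cubic")

# pyplot style: the call goes to the *current* Axes. After plt.subplots(1, 2)
# that is the last Axes created (ax_right), so this silently replaces "cubic".
plt.title("set via pyplot")

# The current Axes can be changed explicitly, which is the hidden state that
# makes pyplot code fragile once a figure has more than one subplot.
plt.sca(ax_left)
current_axes = plt.gca()
```

Nothing in the plt.title call says which subplot it affects; that information lives in pyplot's global state, which is why the OO form is the safer default inside functions.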
Core Chart Types with Deep Examples
Line Charts — Trends Over Time
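A minimal line-chart sketch for a trend over time, using the OO API. The series is synthetic (a seeded random walk), and the 7-day window, colors, and date format are illustrative choices rather than recommendations from this lesson.

```python
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Synthetic daily series: a random walk around 100 (illustrative data only).
rng = np.random.default_rng(0)
days = np.arange("2024-01-01", "2024-04-01", dtype="datetime64[D]")
values = 100 + np.cumsum(rng.normal(0, 1, size=days.size))

# Trailing 7-day mean; "valid" mode drops the first window - 1 points.
window = 7
rolling = np.convolve(values, np.ones(window) / window, mode="valid")

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(days, values, color="0.7", linewidth=1, label="daily")
ax.plot(days[window - 1:], rolling, color="C0", linewidth=2, label="7-day mean")

# One tick per month, labeled like "Jan 2024".
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))
ax.set_xlabel("date")
ax.set_ylabel("value (synthetic)")
ax.legend(frameon=False)
```

Drawing the raw series in light gray and the smoothed series in a saturated color keeps the noise visible without letting it compete with the trend. Call plt.show() or fig.savefig(...) to view the result.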
Distributions: Histograms, KDE, and Violin Plots
Understanding the shape of your data is the first step in any analysis. These charts answer: Is the distribution normal? Are there outliers? Are groups different?
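The sketch below puts the three chart types side by side on synthetic data. The KDE is written by hand in NumPy so the bandwidth is visible; it uses Silverman's rule of thumb, h = 0.9 * min(std, IQR / 1.34) * n^(-1/5). The helper name gaussian_kde_1d and both sample groups are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Group A is unimodal; group B is a mix of two normals (bimodal).
group_a = rng.normal(50, 5, size=400)
group_b = np.concatenate([rng.normal(45, 4, size=300), rng.normal(70, 3, size=100)])


def gaussian_kde_1d(sample, grid):
    """Gaussian kernel density estimate on `grid` (hypothetical helper).

    Bandwidth follows Silverman's rule of thumb.
    """
    n = sample.size
    q75, q25 = np.percentile(sample, [75, 25])
    h = 0.9 * min(sample.std(ddof=1), (q75 - q25) / 1.34) * n ** (-1 / 5)
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))


grid = np.linspace(25, 85, 300)
fig, (ax_hist, ax_violin) = plt.subplots(1, 2, figsize=(9, 3.5))

# Histogram with density=True so bars and KDE curve share the same y scale.
for sample, color, name in [(group_a, "C0", "A"), (group_b, "C1", "B")]:
    ax_hist.hist(sample, bins=30, density=True, alpha=0.4, color=color, label=f"group {name}")
    ax_hist.plot(grid, gaussian_kde_1d(sample, grid), color=color)
ax_hist.set_xlabel("value")
ax_hist.set_ylabel("density")
ax_hist.legend(frameon=False)

# Violin plot: the same density idea, mirrored, one violin per group.
ax_violin.violinplot([group_a, group_b], showmedians=True)
ax_violin.set_xticks([1, 2])
ax_violin.set_xticklabels(["A", "B"])
ax_violin.set_ylabel("value")

density_b = gaussian_kde_1d(group_b, grid)
```

With these settings the second mode of group B shows up as a bump near 70 in the KDE and as a bulge in the upper part of the violin; a box plot of the same data would hide it.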
Bar Charts — Categorical Comparisons
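A small sketch of a categorical comparison, assuming matplotlib 3.4 or newer (for ax.bar_label and list indexing on ax.spines). The category totals are made up. Bars are sorted by value and the y-axis starts at zero so bar length stays proportional to the quantity.

```python
import matplotlib.pyplot as plt

# Hypothetical category totals, invented for illustration.
sales = {"North": 120, "South": 95, "East": 143, "West": 78}

# Sort descending so the ranking can be read straight off the chart.
ordered = sorted(sales.items(), key=lambda item: item[1], reverse=True)
labels = [name for name, _ in ordered]
heights = [value for _, value in ordered]

fig, ax = plt.subplots(figsize=(6, 3))
bars = ax.bar(labels, heights, color="C0")

# Direct value labels above each bar.
ax.bar_label(bars, padding=2)

# Start at zero and leave headroom for the labels.
ax.set_ylim(0, max(heights) * 1.15)
ax.set_ylabel("units sold (made-up)")
ax.spines[["top", "right"]].set_visible(False)
```

Sorting only makes sense for unordered categories; for ordered ones such as months or age bands, keep the natural order.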
Advanced Layout: GridSpec, Inset Axes, and Annotations
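The three tools named in this heading can be combined in one figure. The following sketch uses a synthetic damped sine wave; the grid shape, inset position, and annotation text are arbitrary choices for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 500)
y = np.sin(x) * np.exp(-x / 5)  # damped sine, synthetic

# GridSpec: a 2x3 grid where the top panel spans all three columns.
fig = plt.figure(figsize=(9, 5), constrained_layout=True)
gs = fig.add_gridspec(2, 3)
ax_main = fig.add_subplot(gs[0, :])
ax_hist = fig.add_subplot(gs[1, 0])
ax_phase = fig.add_subplot(gs[1, 1:])

ax_main.plot(x, y)
ax_hist.hist(y, bins=25)
ax_phase.plot(y[:-1], y[1:], linewidth=0.8)

# Annotation: an arrow pointing at the global maximum of the curve.
peak = int(np.argmax(y))
ax_main.annotate(
    "global max",
    xy=(x[peak], y[peak]),
    xytext=(x[peak] + 2, y[peak]),
    arrowprops=dict(arrowstyle="->"),
    va="center",
)

# Inset axes: position is [x0, y0, width, height] in fractions of ax_main.
inset = ax_main.inset_axes([0.65, 0.55, 0.3, 0.4])
inset.plot(x, y)
inset.set_xlim(0, 3)
inset.set_title("first cycle", fontsize=8)
```

The inset is a child of ax_main rather than a separate grid cell, so it moves with the main panel if the layout changes.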
Colormaps, Styling, and Tick Formatting
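Choosing a colormap is a scientific decision, not just an aesthetic one, because the wrong colormap misrepresents magnitude. Rainbow maps such as jet are not perceptually uniform, so equal steps in the data do not look like equal steps in color. Use a sequential map (for example viridis) for values that run from low to high, and a diverging map (for example RdBu) for signed values, with the neutral color pinned to the meaningful midpoint.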
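As a sketch of matching the colormap to the data, the example below shows a sequential map for non-negative values and a diverging map for signed values. Both arrays are random and purely illustrative. The diverging panel uses symmetric vmin/vmax so that the neutral color sits exactly at zero.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
magnitudes = rng.random((8, 8))                # all values in [0, 1): use a sequential map
anomalies = rng.normal(0.5, 1.0, size=(8, 8))  # signed and off-center: use a diverging map

fig, (ax_seq, ax_div) = plt.subplots(1, 2, figsize=(9, 3.5))

# Sequential: lightness increases steadily with the value.
im_seq = ax_seq.imshow(magnitudes, cmap="viridis")
ax_seq.set_title("sequential (viridis)")
fig.colorbar(im_seq, ax=ax_seq)

# Diverging: symmetric limits pin the neutral midpoint of the map to 0,
# even though the data itself is not symmetric around 0.
limit = float(np.abs(anomalies).max())
im_div = ax_div.imshow(anomalies, cmap="RdBu_r", vmin=-limit, vmax=limit)
ax_div.set_title("diverging (RdBu_r), centered at 0")
fig.colorbar(im_div, ax=ax_div)
```

If vmin and vmax were left at their defaults, the neutral color would land at the midpoint of the data range instead of at zero, and small positive values could be drawn in the "negative" color.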
Seaborn: Statistical Graphics
Seaborn provides plot types that directly answer statistical questions — not just "what are the values" but "how are the distributions different?" and "are these variables correlated?"
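A minimal sketch of those two questions on one synthetic DataFrame. The column names (group, hours, score) and the effect sizes are invented for this example. violinplot compares the distributions between groups, and scatterplot with hue shows whether two variables move together.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data: score rises with hours, plus a fixed bump for the treatment group.
rng = np.random.default_rng(7)
n = 150
df = pd.DataFrame({
    "group": np.repeat(["control", "treatment"], n),
    "hours": rng.uniform(0, 10, size=2 * n),
})
df["score"] = (
    50
    + 3 * df["hours"]
    + np.where(df["group"] == "treatment", 5, 0)
    + rng.normal(0, 4, size=2 * n)
)

fig, (ax_dist, ax_rel) = plt.subplots(1, 2, figsize=(10, 4))

# "How are the distributions different?"
sns.violinplot(data=df, x="group", y="score", ax=ax_dist)

# "Are these variables correlated?"
sns.scatterplot(data=df, x="hours", y="score", hue="group", ax=ax_rel)

correlation = df["hours"].corr(df["score"])
```

Passing ax= to each seaborn call keeps the figure under the control of the OO API, so seaborn plots can be mixed freely with plain matplotlib panels.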
Real Project: EDA Dashboard
Chart Selection Guide
Exercises
Exercise 1 — Histogram Bin Size Effect
Generate 1000 points from a bimodal distribution (mix of two normals). Plot the same data with bin counts: 5, 15, 30, 100, 500. Observe how bin choice can hide or reveal the bimodality. Add a KDE overlay using Silverman's rule on each.
Exercise 2 — Truncated Y-axis (Deliberately Misleading Chart)
Plot a bar chart where the differences look huge because the y-axis starts at 99.5% instead of 0%. Then fix it to start at 0%. Observe how axis truncation creates false urgency. Write a rule about when truncating the y-axis is acceptable (spoiler: almost never for bar charts).
Exercise 3 — Custom Colormap
Create a custom diverging colormap from #ef4444 (red) → white → #6366f1 (purple). Apply it to a correlation heatmap of 5 features from a random dataset. Ensure the color center is exactly at 0 using vmin/vmax.
Exercise 4 — Animated Chart
Using matplotlib.animation.FuncAnimation, create an animation of a sine wave where the frequency increases from 0.5 Hz to 5 Hz over 50 frames. Save as a GIF. Note: FuncAnimation works in notebook environments; the playground saves frames instead.
Exercise 5 — Bubble Chart
Create a bubble chart showing countries: x = GDP per capita, y = Life expectancy, bubble size = population, color = continent. Use np.log to compress the population scale for bubble sizes. Include labels for the 5 most populous countries.
Exercise 6 — Facet Grid
Given a dataset with 3 numerical features and 1 categorical (4 levels), create a 4×3 grid of histograms — one row per category, one column per feature. Use plt.subplots(4, 3, sharex="col") so x-axes align by feature. Add a column title and row title using fig.text.
Exercise 7 — Publication-Quality Figure
Reproduce a Tufte-style minimalist chart: sparse gridlines, no chart border (remove all 4 spines), direct labels instead of legend, a reference line with a text annotation, and high contrast. Save at 300 DPI. The chart shows A/B test conversion rates with 95% confidence intervals over 14 days.
Exercise 8 — Heatmap from Scratch
Without seaborn, implement a correlation heatmap from a NumPy matrix. Annotate every cell with the value, color cells with a diverging colormap centered at 0, add row/column labels, and include a colorbar. Handle the edge case where corr[i,i] = 1.0 (diagonal) specially with a different annotation.