GadaaLabs
Python Mastery — From Zero to AI Engineering
Lesson 10

Pandas — DataFrames, Data Cleaning, Analysis & Transformation

38 min

Part 1: Series — The 1D Building Block

A Series is a one-dimensional labeled array. Think of it as a column in a spreadsheet where every row has a name (the index). Unlike a plain Python list, operations on two Series automatically align on the index — a property that makes pandas both powerful and occasionally surprising.

Every Series has:

  • values — a NumPy array holding the actual data
  • index — a label array (default: 0, 1, 2, … RangeIndex)
  • dtype — the data type (int64, float64, object, datetime64, category, …)
  • name — optional string used as the column header when the Series is inside a DataFrame
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Part 2: DataFrame Fundamentals

A DataFrame is a 2D table: a dict of Series that all share the same index. Every column is a Series; every row is a cross-section with a label from the index.

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Part 3: Data Cleaning — Production Techniques

Real data is messy. Missing values, wrong types, duplicate rows, inconsistent casing, hidden whitespace — all of these exist in production datasets. Systematic cleaning is the difference between analysis you can trust and analysis that quietly lies.

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Part 4: Indexing Mastery

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Part 5: Apply, Map, and Transform

Pandas provides several function-application methods. Choosing correctly between them matters for correctness and performance.

| Method | Applied to | Input | Output shape | Use when | |--------|-----------|-------|-------------|----------| | Series.map() | Series | one element | same length | element-wise with dict or function | | Series.apply() | Series | one element | any | element-wise when no vectorized op exists | | DataFrame.apply(axis=0) | DataFrame | one column (Series) | flexible | column-wise aggregation | | DataFrame.apply(axis=1) | DataFrame | one row (Series) | same rows | row-wise custom logic | | DataFrame.map() | DataFrame | one cell | same shape | element-wise on entire DataFrame | | groupby.transform() | GroupBy | group (Series) | same rows as input | add group statistics as a column | | .pipe() | any | object itself | any | method chaining |

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Part 6: GroupBy — Split-Apply-Combine

GroupBy is one of pandas' most important abstractions. It lets you split a DataFrame into groups, apply a function to each, and combine the results — all in one expressive call.

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Part 7: Merging, Joining, and Concatenating

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Part 8: Reshaping Data

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Part 9: Time Series

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Part 10: Performance

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

PROJECT: Full Real-Estate Data Analysis Pipeline

This project loads a messy real-estate dataset, cleans it systematically, analyzes it with groupby and merging, and generates a polished summary report.

Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly
Python
Click Run to execute — Python runs in your browser via WebAssembly

Exercises

Exercise 1 (Easy): Create a DataFrame with columns name, score, grade for 6 students. Use .loc to select students with score > 80. Add a column passed that is True when score >= 60. Print the result.

Python
Click Run to execute — Python runs in your browser via WebAssembly

Exercise 2 (Easy): Given a Series of transaction amounts as strings (e.g., "$1,200.50"), clean them and compute the total, mean, and max.

Python
Click Run to execute — Python runs in your browser via WebAssembly

Exercise 3 (Medium): Load the CSV below into a DataFrame. Clean it: fix types, handle missing values, remove duplicates. Then group by category and compute total and average amount.

Python
Click Run to execute — Python runs in your browser via WebAssembly

Exercise 4 (Medium): Implement a z-score outlier removal function. Given a DataFrame and column name, return a copy with rows removed where the z-score of that column exceeds a threshold.

Python
Click Run to execute — Python runs in your browser via WebAssembly

Exercise 5 (Medium): Create two DataFrames — orders and products — and merge them. Compute revenue (quantity * price) per product category.

Python
Click Run to execute — Python runs in your browser via WebAssembly

Exercise 6 (Medium): Build a pivot table from sales data showing total sales by region (rows) and product (columns). Add a totals row and column.

Python
Click Run to execute — Python runs in your browser via WebAssembly

Exercise 7 (Hard): Implement a complete time-series feature engineering pipeline: given daily price data, compute 7-day and 30-day moving averages, daily returns, 7-day rolling volatility, and a boolean flag for "golden cross" (MA7 crosses above MA30).

Python
Click Run to execute — Python runs in your browser via WebAssembly

Exercise 8 (Hard): Build a data quality checker class that takes a DataFrame and produces a structured report: column types, missing percentages, unique counts, outlier counts (IQR), and a quality score.

Python
Click Run to execute — Python runs in your browser via WebAssembly

Key Takeaways

  • Index alignment is pandas' superpower and footgun: operations between Series/DataFrames automatically align on the index — invaluable for correctness, but produces all-NaN results when your indexes don't match unexpectedly
  • .loc is label-based, .iloc is position-based: always use .loc[mask, col] = value for assignment — chained indexing (df["col"][mask] = val) may silently operate on a copy
  • pd.to_numeric(errors='coerce') and pd.to_datetime() are your first cleaning moves: they convert bad data to NaN rather than crashing, giving you explicit control over the repair
  • Named aggregations are more readable: agg(total=("sales","sum"), count=("sales","count")) produces clean, named columns in one call instead of chaining .rename()
  • transform keeps the original shape: use it when you want to add a group statistic back to the original DataFrame without reducing rows; .agg() collapses rows
  • Categorical dtype is a free 5–20x memory saving: any string column with low cardinality (status, region, type) should be astype("category")
  • merge vs concat: merge joins on column values (SQL JOIN semantics); concat stacks DataFrames along an axis (UNION ALL semantics) — they solve completely different problems
  • Prefer vectorized ops over apply(): apply(axis=1) is a Python loop over rows — it's 50–200x slower than equivalent NumPy/pandas vectorized operations