GadaaLabs
Python Mastery — From Zero to AI Engineering
Lesson 10

Pandas — DataFrames, Cleaning & Analysis

30 min

The Core Abstractions: Series and DataFrame

Pandas has two fundamental data structures:

  • Series: a 1D labeled array — like a column in a spreadsheet with an index
  • DataFrame: a 2D table of Series — like a spreadsheet, where every column is a Series sharing the same index

The index is what sets pandas apart from NumPy. Every row has a label (the index), and operations align on these labels automatically. When you add two DataFrames, pandas matches rows by index label and columns by column name, not by position.
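
A minimal sketch of label alignment, using made-up Series named prices and bonus (illustrative names, not part of the lesson's original code). The two Series share only some labels, so the sum is defined only where both have a value:

```python
import pandas as pd

# A Series is a 1D labeled array; the labels are its index.
prices = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "c"])
bonus = pd.Series([1.0, 2.0, 3.0], index=["b", "c", "d"])

# Addition aligns on labels, not positions. Labels present in only one
# Series ("a" and "d") have no partner, so the result there is NaN.
total = prices + bonus
print(total)

# A DataFrame is a table whose columns are Series sharing one index.
# Building it from two Series uses the union of their labels.
df = pd.DataFrame({"price": prices, "bonus": bonus})
print(df)
print(type(df["price"]))  # each column is itself a Series
```

If the result of an arithmetic operation is unexpectedly full of NaN, comparing the two indexes is usually the first thing to check.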

Indexing: .loc, .iloc, Boolean, .at, .iat
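
The accessors named in this heading can be compared side by side. The sketch below uses a small made-up DataFrame with string labels (u1, u2, u3) so that labels and positions cannot be confused; all names and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ada", "Grace", "Linus"], "age": [36, 45, 28]},
    index=["u1", "u2", "u3"],
)

row_by_label = df.loc["u2"]               # .loc: select by index label
row_by_position = df.iloc[1]              # .iloc: select by integer position
subset = df.loc[["u1", "u3"], ["name"]]   # rows and columns by label

# Label slices include the end label; position slices exclude the end.
label_slice = df.loc["u1":"u2"]           # rows u1 and u2
position_slice = df.iloc[0:2]             # positions 0 and 1

# Boolean indexing: a True/False mask keeps the rows where it is True.
over_30 = df[df["age"] > 30]

# .at / .iat: access a single scalar by label / by position.
age_label = df.at["u3", "age"]
age_position = df.iat[2, 1]

# Assignment through .loc with a mask modifies df itself.
df.loc[df["age"] > 40, "age"] = 40
```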

Data Inspection
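
A short sketch of the calls commonly used for a first look at a DataFrame. The data is made up for illustration, and the selection of methods is an assumption about what this section covers:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
    "temp_c": [4.5, 19.0, None, 31.2],
    "visits": [3, 7, 2, 5],
})

print(df.shape)        # (rows, columns)
print(df.head(2))      # first rows; tail() shows the last ones
print(df.dtypes)       # dtype of each column
df.info()              # prints dtypes, non-null counts, and memory use
print(df.describe())   # summary statistics for numeric columns

missing_per_column = df.isna().sum()      # missing values per column
city_counts = df["city"].value_counts()   # frequency of each distinct value
n_cities = df["city"].nunique()           # number of distinct values
```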

Handling Missing Data

Missing data in pandas is represented as NaN (for floats), NaT (for datetimes), or pd.NA (for nullable integer/string types). Understanding where missing values come from and how to handle them is one of the most important practical skills.
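
The usual workflow is detect, then drop or fill. A minimal sketch on made-up data (the column names score and label are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": [90.0, np.nan, 75.0, np.nan],
    "label": ["a", None, "c", "d"],
})

# Detect: isna() returns a boolean mask. NaN never equals itself,
# so a comparison like df["score"] == np.nan finds nothing.
n_missing = df["score"].isna().sum()

# Drop rows with a missing value in any column, or only in chosen columns.
complete_rows = df.dropna()
has_score = df.dropna(subset=["score"])

# Fill: with a constant, or with a statistic of the column.
filled_zero = df["score"].fillna(0)
filled_mean = df["score"].fillna(df["score"].mean())

# A nullable integer dtype keeps integers and marks gaps with pd.NA,
# instead of converting the whole column to float.
ids = pd.Series([1, None, 3], dtype="Int64")
```

Whether to drop or fill depends on the analysis; filling with the mean keeps the row count but shrinks the column's variance.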

Data Types, Conversion & String Operations
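
Data loaded from text files often arrives as strings. The sketch below, on a made-up frame called raw, shows one way to convert columns and clean text with the .str accessor:

```python
import pandas as pd

raw = pd.DataFrame({
    "price": ["10.5", "n/a", "7"],
    "qty": ["1", "2", "3"],
    "name": ["  alice ", "BOB", "Carol"],
})

# errors="coerce" turns unparseable values into NaN instead of raising.
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# astype converts directly; it raises if any value is invalid for the target.
raw["qty"] = raw["qty"].astype("int64")

# The .str accessor applies string methods element-wise and can be chained.
raw["name"] = raw["name"].str.strip().str.title()
starts_with_a = raw["name"].str.startswith("A")

print(raw.dtypes)
```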

GroupBy: Split-Apply-Combine

GroupBy is one of pandas' most powerful features. The pattern has three steps: split the DataFrame into groups, apply a function to each group, and combine the results.
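
A minimal sketch of split-apply-combine on made-up sales data. It shows a named aggregation, which reduces each group to one row, and transform, which returns one value per original row:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [100, 200, 300, 400, 500],
})

# Split by region, apply sum/mean/count to each group, combine into a table.
# Named aggregation syntax: new_column=(source_column, function).
summary = sales.groupby("region").agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
    orders=("amount", "count"),
)
print(summary)

# transform keeps the original shape, so the result fits as a new column.
sales["region_total"] = sales.groupby("region")["amount"].transform("sum")
sales["share"] = sales["amount"] / sales["region_total"]
print(sales)
```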

Merging, Joining, and Concatenating
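
The three operations in this heading solve different problems. A sketch using two made-up tables, orders and customers, where one order refers to a customer that does not exist:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 99]})
customers = pd.DataFrame({"customer_id": [10, 20, 30], "name": ["Ada", "Bo", "Cy"]})

# merge: match rows on column values, like a SQL JOIN.
inner = orders.merge(customers, on="customer_id", how="inner")  # matches only
left = orders.merge(customers, on="customer_id", how="left")    # keep all orders

# indicator=True adds a _merge column showing which side each row came from.
audit = orders.merge(customers, on="customer_id", how="outer", indicator=True)

# concat: stack frames along an axis, like UNION ALL.
more_orders = pd.DataFrame({"order_id": [4], "customer_id": [30]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)

# join: similar to merge, but matches on the index by default.
joined = orders.set_index("customer_id").join(customers.set_index("customer_id"))
```

Checking the row count before and after a merge is a cheap way to catch unintended duplicates or dropped rows.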

Time Series
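
A sketch of common time series operations on a made-up daily Series. The choice of operations (date slicing, resample, rolling, shift, and the .dt accessor) is an assumption about what this section covers:

```python
import pandas as pd

# One reading per day for ten days, indexed by date.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
ts = pd.Series(range(10), index=dates, dtype="float64")

# A DatetimeIndex allows selection by date strings (both ends included).
first_week = ts.loc["2024-01-01":"2024-01-07"]

# resample groups rows into time buckets (here: weeks ending on Sunday).
weekly_total = ts.resample("W").sum()

# rolling computes a statistic over a sliding window of rows.
rolling_mean = ts.rolling(window=3).mean()

# shift moves values along the index; useful for day-over-day changes.
daily_change = ts - ts.shift(1)

# The .dt accessor exposes datetime parts on a column of timestamps.
frame = pd.DataFrame({"when": pd.to_datetime(["2024-03-15", "2024-12-01"])})
frame["month"] = frame["when"].dt.month
```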

Performance Tips

python
# 1. Avoid chained indexing: the assignment may land on a temporary copy
df["col"][mask] = value        # BAD: may not modify df (never does with Copy-on-Write enabled)
df.loc[mask, "col"] = value    # GOOD: a single indexing call that modifies df

# 2. Use vectorized string ops (.str accessor) instead of apply
df["upper"] = df["name"].apply(lambda x: x.upper())  # SLOW
df["upper"] = df["name"].str.upper()                   # FAST

# 3. Use categorical for low-cardinality string columns
df["status"] = df["status"].astype("category")  # saves memory, speeds groupby

# 4. Prefer agg/transform over apply: apply calls a Python function per group
df.groupby("col").agg("mean")               # fast: built-in optimized aggregation
df.groupby("col").apply(lambda g: g.mean()) # slow: Python-level call for each group

# 5. Read only what you need
df = pd.read_csv("big.csv", usecols=["a", "b", "c"])
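
Tip 3 can be checked directly. The sketch below builds a made-up low-cardinality column and compares its memory use before and after conversion; the exact savings depend on the pandas version, string lengths, and row count, so no ratio is claimed here:

```python
import pandas as pd

# Made-up column: 300,000 rows but only three distinct values.
status = pd.Series(["open", "closed", "pending"] * 100_000)
as_category = status.astype("category")

# deep=True includes the memory held by the string values themselves.
plain_bytes = status.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"plain: {plain_bytes:,} bytes, category: {category_bytes:,} bytes")
print(as_category.cat.categories.tolist())  # distinct values, stored once
```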

PROJECT: Sales Data Analyzer
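
The lesson's original project code is not reproduced here. As a rough sketch of what a sales analyzer along these lines might look like, the example below cleans a small made-up CSV and summarizes it. The column names, the sample rows, and the helper names load_sales and summarize are all hypothetical:

```python
import io
import pandas as pd

# Made-up sample data standing in for a real sales export.
RAW_CSV = """date,region,product,units,unit_price
2024-01-05,north,widget,3,9.99
2024-01-17,south,gadget,1,24.50
2024-02-02,north,gadget,two,24.50
2024-02-11,south,widget,5,9.99
2024-02-20,north,widget,2,
"""

def load_sales(source):
    """Read the CSV and coerce bad values to NaN/NaT instead of failing."""
    df = pd.read_csv(source)
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df["units"] = pd.to_numeric(df["units"], errors="coerce")
    df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
    # Drop rows that cannot produce a revenue figure.
    df = df.dropna(subset=["date", "units", "unit_price"]).copy()
    df["revenue"] = df["units"] * df["unit_price"]
    df["region"] = df["region"].astype("category")
    return df

def summarize(df):
    """Revenue and order count per region, using named aggregations."""
    return df.groupby("region", observed=True).agg(
        revenue=("revenue", "sum"),
        orders=("revenue", "count"),
    )

sales = load_sales(io.StringIO(RAW_CSV))
by_region = summarize(sales)
# "MS" buckets rows by calendar month, labeled with the month's first day.
monthly = sales.set_index("date")["revenue"].resample("MS").sum()
print(by_region)
print(monthly)
```

Two of the five sample rows are dropped during cleaning: one has a non-numeric units value and one has no price. In a real project it would be worth logging how many rows are discarded and why.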

Key Takeaways

  • Index alignment is pandas' superpower and footgun: operations between DataFrames/Series automatically align on the index — great for correctness, but unexpected when your indexes don't match and you get all-NaN results
  • .loc is label-based, .iloc is position-based: never mix them up, and always use .loc[mask, col] = value to avoid the chained-assignment warning
  • pd.to_numeric(errors='coerce') and pd.to_datetime(errors='coerce') are your first cleaning steps: with errors='coerce' they turn unparseable values into NaN (or NaT for dates) rather than raising exceptions, giving you explicit control over what to do next
  • groupby().agg() with named aggregations is more readable than chaining: agg(total=("col", "sum"), count=("col", "count")) produces a clean, named-column result in one call
  • transform keeps the original shape: use it when you want to add a group statistic as a column to the original DataFrame without reducing rows
  • Categorical dtype is a cheap optimization: string columns with low cardinality (status, region, category) typically use far less memory as category (often 5–20x less, depending on string length and row count) and usually sort and group faster
  • merge vs concat: merge joins on column values (like SQL JOIN); concat stacks DataFrames along an axis (like UNION ALL) — they solve different problems
  • Chained indexing (df["col"][mask] = val) may silently fail: pandas can't tell if you're working on a copy or a view; always use .loc for assignment