Python Mastery — From Zero to AI Engineering
Lesson 10
Pandas — DataFrames, Cleaning & Analysis
30 min
The Core Abstractions: Series and DataFrame
Pandas has two fundamental data structures:
- Series: a 1D labeled array — like a column in a spreadsheet with an index
- DataFrame: a 2D table of Series — like a spreadsheet, where every column is a Series sharing the same index
The index is what sets pandas apart from NumPy. Every row has a label (the index), and operations align on these labels automatically. When you add two DataFrames, pandas lines them up by row and column labels, not by position.
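A minimal sketch of the two structures and of label alignment. The data and variable names (prices, stock, a, b) are illustrative assumptions, not taken from the lesson.

```python
import pandas as pd

# A Series is a 1D labeled array: values plus an index.
prices = pd.Series([1.50, 0.25, 3.00], index=["apple", "banana", "cherry"])

# A DataFrame is a table whose columns are Series sharing one index.
stock = pd.DataFrame(
    {"price": [1.50, 0.25, 3.00], "qty": [10, 40, 5]},
    index=["apple", "banana", "cherry"],
)
price_column = stock["price"]  # selecting one column returns a Series

# Alignment: these two Series list their labels in different orders and
# only partly overlap. Addition matches on label, not on position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])
total = a + b  # labels present on only one side become NaN
print(total)
```

Here "y" and "z" appear in both Series, so their values are added (2 + 20 and 3 + 10). The labels "x" and "w" each appear on one side only, so the result holds NaN for them.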
Indexing: .loc, .iloc, Boolean, .at, .iat
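A short sketch of the five access styles named in the heading, on a small made-up DataFrame (the column names and values are assumptions for illustration).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cy"],
        "age": [31, 25, 47],
        "city": ["Oslo", "Rome", "Lima"],
    },
    index=["a", "b", "c"],
)

# .loc selects by label; label slices include BOTH endpoints.
by_label = df.loc["a":"b", ["name", "age"]]

# .iloc selects by integer position; slices exclude the stop position.
by_position = df.iloc[0:2, 0:2]

# Boolean indexing keeps the rows where the mask is True.
over_30 = df[df["age"] > 30]

# .at and .iat read (or write) a single scalar, by label and by position.
ben_age = df.at["b", "age"]
first_name = df.iat[0, 0]

# Assign through a single .loc call rather than chained indexing.
df.loc[df["age"] > 30, "city"] = "Unknown"
```

Note that df.loc["a":"b"] and df.iloc[0:2] return the same two rows here even though one slice is inclusive and the other is not.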
Data Inspection
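A sketch of common first-look calls on a new DataFrame. The sample data is made up; which calls matter most depends on the dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["pen", "book", "pen", "lamp"],
    "price": [1.2, 12.0, 1.5, 30.0],
    "qty": [10, 2, 7, 1],
})

print(df.head(2))   # first rows
print(df.shape)     # (rows, columns)
print(df.dtypes)    # dtype of each column
df.info()           # index, non-null counts, memory usage (prints directly)

summary = df.describe()               # summary statistics for numeric columns
counts = df["product"].value_counts() # frequency of each distinct value
n_products = df["product"].nunique()  # number of distinct values
print(summary)
print(counts)
```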
Handling Missing Data
Missing data in pandas is represented as NaN (for floats), NaT (for datetimes), or pd.NA (for nullable integer/string types). Knowing where missing values come from and how to handle them is one of the most important practical skills.
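A minimal sketch of detecting, dropping, and filling missing values, assuming a small made-up table. Filling with the column mean is only one possible strategy, chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "temp": [21.5, np.nan, 19.0, np.nan],
    "city": ["Oslo", "Rome", None, "Lima"],
})

missing_per_column = df.isna().sum()   # count missing values per column

complete_rows = df.dropna()            # drop rows with any missing value
has_temp = df.dropna(subset=["temp"])  # drop only rows where temp is missing

# Fill with a per-column value; fillna returns a new DataFrame.
filled = df.fillna({"temp": df["temp"].mean(), "city": "unknown"})

# Nullable integer dtype: the missing entry is pd.NA and the column stays integer.
counts = pd.Series([1, None, 3], dtype="Int64")
```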
Data Types, Conversion & String Operations
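A sketch of typical conversions on messy text input. The raw frame and its column names are assumptions for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "price": ["10.5", "n/a", "7"],
    "date": ["2024-01-05", "not a date", "2024-03-01"],
    "name": ["  alice SMITH", "Bob jones ", "carol White"],
    "region": ["north", "south", "north"],
})

# errors="coerce" turns unparseable entries into NaN / NaT instead of raising.
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

# The .str accessor applies string methods element by element.
raw["name"] = raw["name"].str.strip().str.title()

# astype converts explicitly; category suits low-cardinality strings.
raw["region"] = raw["region"].astype("category")

print(raw.dtypes)
```

After coercion, the bad entries are visible through isna(), so the decision to drop or fill them is explicit.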
GroupBy: Split-Apply-Combine
GroupBy is pandas' most powerful feature. The pattern: split the DataFrame into groups, apply a function to each group, combine the results.
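The split-apply-combine pattern can be sketched as follows, using a made-up sales table (the region and amount columns are assumptions).

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "north"],
    "amount": [100, 80, 50, 120, 30],
})

# Split by region, apply sum and count, combine into one row per group.
summary = sales.groupby("region").agg(
    total=("amount", "sum"),
    orders=("amount", "count"),
)

# transform returns one value per ORIGINAL row, so it can be stored as a column.
sales["region_total"] = sales.groupby("region")["amount"].transform("sum")
sales["share"] = sales["amount"] / sales["region_total"]

print(summary)
print(sales)
```

agg reduces each group to a single row, while transform broadcasts the group result back to every row of that group.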
Merging, Joining, and Concatenating
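A minimal sketch contrasting the three operations in the heading, assuming two small made-up tables that share a customer_id column.

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 99]})
customers = pd.DataFrame({"customer_id": [10, 20, 30], "name": ["Ana", "Ben", "Cy"]})

# merge matches rows on column values, like a SQL JOIN.
inner = orders.merge(customers, on="customer_id", how="inner")  # matches only
left = orders.merge(customers, on="customer_id", how="left")    # keep all orders

# join matches on the index (a left join by default).
joined = orders.set_index("customer_id").join(customers.set_index("customer_id"))

# concat stacks frames along an axis; ignore_index renumbers the rows.
more_orders = pd.DataFrame({"order_id": [4], "customer_id": [30]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)
```

Customer 99 has no match, so the inner merge drops that order while the left merge and the join keep it with a missing name.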
Time Series
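A sketch of a few common time-series operations on a DatetimeIndex. The six daily readings are made up, and the choice of operations is an assumption about what this section covers.

```python
import pandas as pd

# Six daily readings on a DatetimeIndex.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([10, 12, 9, 14, 11, 13], index=idx)

# Partial-string indexing selects by date label (both endpoints included).
first_three = ts.loc["2024-01-01":"2024-01-03"]

# resample groups by time bucket: here 2-day bins, summed.
two_day = ts.resample("2D").sum()

# rolling computes a moving-window statistic; the first two windows are incomplete.
rolling_mean = ts.rolling(window=3).mean()

# diff is the change from the previous row.
day_change = ts.diff()

# The .dt accessor exposes datetime fields on a datetime Series.
weekday = idx.to_series().dt.day_name()
```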
Performance Tips
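Two widely used tips, sketched on synthetic data: prefer vectorized column operations over row-wise apply, and store low-cardinality strings as category. The frame below is an assumption for illustration, and actual speed and memory gains depend on the data.

```python
import numpy as np
import pandas as pd

n = 2_000
df = pd.DataFrame({
    "price": np.arange(n, dtype="float64"),
    "qty": np.full(n, 2, dtype="int64"),
    "status": np.tile(["open", "closed", "pending", "shipped"], n // 4),
})

# Vectorized: one operation over whole columns.
df["total"] = df["price"] * df["qty"]

# Same result row by row; apply(axis=1) calls a Python function per row
# and is typically far slower on large frames.
slow_total = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Low-cardinality strings as category: integer codes plus a small lookup table.
as_object = df["status"].astype(object).memory_usage(deep=True)
as_category = df["status"].astype("category").memory_usage(deep=True)
print(as_object, as_category)
```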
PROJECT: Sales Data Analyzer
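One possible shape for such an analyzer, as a rough sketch: clean the raw records, derive revenue, and summarize by month and region. The function name analyze_sales, the column names, and the sample rows are all hypothetical.

```python
import pandas as pd

# Hypothetical raw sales records; column names and values are made up.
raw = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-15", "2024-02-02", "bad", "2024-02-20"],
    "region": ["north", "south", "north", "south", "south"],
    "units": ["3", "5", "x", "2", "4"],
    "unit_price": [10.0, 8.0, 10.0, 8.0, 8.0],
})

def analyze_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the records, then total revenue per month and region."""
    clean = df.copy()
    clean["date"] = pd.to_datetime(clean["date"], errors="coerce")
    clean["units"] = pd.to_numeric(clean["units"], errors="coerce")
    clean = clean.dropna(subset=["date", "units"])  # drop unparseable rows
    clean["revenue"] = clean["units"] * clean["unit_price"]
    clean["month"] = clean["date"].dt.to_period("M").astype(str)
    return (
        clean.groupby(["month", "region"])
        .agg(revenue=("revenue", "sum"), orders=("revenue", "count"))
        .reset_index()
    )

report = analyze_sales(raw)
print(report)
```

Dropping unparseable rows is a simplification; a real analysis might log or repair them instead.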
Key Takeaways
- Index alignment is pandas' superpower and footgun: operations between DataFrames/Series automatically align on the index — great for correctness, but unexpected when your indexes don't match and you get all-NaN results
- .loc is label-based, .iloc is position-based: never mix them up, and always use .loc[mask, col] = value to avoid the chained-assignment warning
- pd.to_numeric(errors='coerce') and pd.to_datetime(errors='coerce') are your first cleaning steps: they turn bad data into NaN (or NaT for dates) rather than raising exceptions, giving you explicit control over what to do next
- groupby().agg() with named aggregations is more readable than chaining: agg(total=("col", "sum"), count=("col", "count")) produces a clean, named-column result in one call
- transform keeps the original shape: use it when you want to add a group statistic as a column to the original DataFrame without reducing rows
- Categorical dtype is a nearly free optimization: string columns with low cardinality (status, region, category) often use 5–20x less memory as category, and sort and group faster
- merge vs concat: merge joins on column values (like SQL JOIN); concat stacks DataFrames along an axis (like UNION ALL). They solve different problems
- Chained indexing (df["col"][mask] = val) may silently fail: pandas can't tell if you're working on a copy or a view; always use .loc for assignment