Pandas DataFrames
Pandas introduces two central data structures: the Series (a labelled 1-D array) and the DataFrame (a labelled 2-D table whose columns are Series). Together they provide the ergonomic, label-aware interface that makes Python data analysis feel as natural as working in a spreadsheet, while remaining far more powerful and programmable.
Series vs DataFrame
A Series is essentially a NumPy array with an index. Each element has both a positional location (integer 0, 1, 2 …) and a label (the index value, which can be strings, dates, or anything hashable).
A DataFrame is a dictionary of Series that share the same index. Every column is a Series; the DataFrame simply aligns them on a common row index.
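The relationship between the two structures can be sketched with a small example. The names and values here (`ages`, `salaries`, `people`) are invented for illustration.

```python
import pandas as pd

# A Series: values plus an index of labels.
ages = pd.Series([34, 28, 45], index=["alice", "bob", "carol"], name="age")

print(ages["bob"])    # access by label
print(ages.iloc[1])   # access by position; same element as above

# A second Series whose labels arrive in a different order.
salaries = pd.Series({"carol": 91000, "alice": 72000, "bob": 58000}, name="salary")

# Building a DataFrame from a dict of Series aligns rows on the index labels,
# not on the order in which the values were supplied.
people = pd.DataFrame({"age": ages, "salary": salaries})
print(people)

# Each column of the DataFrame is itself a Series.
age_column = people["age"]
print(type(age_column))
```

Because alignment happens on labels, carol's salary lands in carol's row even though it was listed first in `salaries`.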
Reading Data from Multiple Formats
Pandas ships with first-class readers for the most common data formats, including CSV, Excel, JSON, Parquet, and SQL.
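A minimal sketch of a few of these readers follows. To keep it self-contained, it reads from in-memory text and an in-memory SQLite database rather than real files; the data and the `users` table are invented for illustration. The Excel and Parquet calls are shown only as comments, since they need optional dependencies and the file paths are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# CSV: any file path or file-like object works; StringIO stands in for a file.
csv_text = "id,name,age\n1,Ada,36\n2,Linus,29\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records becomes one row per record.
json_text = '[{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}]'
df_json = pd.read_json(io.StringIO(json_text))

# SQL: run a query against a database connection (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Linus")])
df_sql = pd.read_sql("SELECT * FROM users", conn)
conn.close()

# Excel and Parquet follow the same pattern but need optional packages
# (for example openpyxl or pyarrow). Paths below are hypothetical.
# df_xlsx = pd.read_excel("report.xlsx", sheet_name=0)
# df_parquet = pd.read_parquet("events.parquet")

print(df_csv)
print(df_json)
print(df_sql)
```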
Common read_csv parameters to know:
| Parameter | Purpose | Example |
|---|---|---|
| sep | Delimiter character | sep="\t" for TSV |
| parse_dates | Auto-parse these columns as dates | ["created_at"] |
| index_col | Use this column as the row index | "id" |
| usecols | Load only these columns | ["name", "age", "salary"] |
| dtype | Force column dtypes | {"zip": str} |
| na_values | Treat these strings as NaN | ["N/A", "none", "–"] |
| nrows | Read only first N rows (useful for inspection) | 1000 |
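The parameters in the table can be combined in a single call. The sketch below assumes a small tab-separated dataset invented for illustration; the column names mirror the examples in the table.

```python
import io

import pandas as pd

# Hypothetical tab-separated data with a leading-zero ZIP code and a missing salary.
tsv_text = (
    "id\tname\tzip\tsalary\tcreated_at\tnotes\n"
    "1\tAda\t02139\t120000\t2023-01-15\tx\n"
    "2\tLinus\t00501\tN/A\t2023-02-20\ty\n"
    "3\tGrace\t94105\t98000\t2023-03-05\tz\n"
)

df = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",                          # tab-delimited input
    index_col="id",                    # use the id column as the row index
    usecols=["id", "name", "zip", "salary", "created_at"],  # skip "notes"
    dtype={"zip": str},                # keep leading zeros in ZIP codes
    parse_dates=["created_at"],        # parse this column as datetimes
    na_values=["N/A", "none"],         # treat these strings as missing
    nrows=2,                           # read only the first two data rows
)

print(df)
print(df.dtypes)
```

Without `dtype={"zip": str}`, the ZIP codes would be inferred as integers and lose their leading zeros.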
First-Pass Inspection
Before doing anything analytical, always run a quick inspection suite: .shape, .info(), .describe(), and .head().
df.info() is particularly valuable because it shows the count of non-null values per column, immediately revealing which columns have missing data and whether dtypes were inferred correctly (a column of integers stored as float64 is a signal that nulls snuck in).
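As an illustration, the sketch below runs such an inspection pass on a small invented DataFrame in which one age is missing. The extra `isna().sum()` call is an addition for convenience, not part of the suite described above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "age" column has one missing value.
df = pd.DataFrame({
    "name": ["Ada", "Linus", "Grace", "Guido"],
    "age": [36, 29, np.nan, 41],
    "salary": [120000, 95000, 98000, 110000],
})

print(df.shape)        # (rows, columns)
df.info()              # dtypes and non-null counts per column
print(df.describe())   # summary statistics for numeric columns
print(df.head())       # first rows (up to five)

# Optional extra: missing-value count per column.
print(df.isna().sum())
```

Here `df.info()` reports `age` as float64 with three non-null values out of four, which is the signal described above: the ages are whole numbers, but the missing entry forced a float dtype.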
Boolean Indexing
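Boolean indexing is the primary filtering mechanism. You build a boolean Series (a mask) and pass it to the bracket operator. Combine conditions with & (and), | (or), and ~ (not), wrapping each condition in parentheses because these operators bind more tightly than comparisons.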
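A short sketch of masks in practice, using an invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ada", "Linus", "Grace", "Guido"],
    "age": [36, 29, 45, 41],
    "dept": ["eng", "eng", "research", "ops"],
})

# A mask is a boolean Series with one True/False per row.
over_35 = df["age"] > 35
older = df[over_35]

# Combine masks with & (and), | (or), ~ (not); parenthesize each comparison.
senior_eng = df[(df["age"] > 35) & (df["dept"] == "eng")]
eng_or_ops = df[(df["dept"] == "eng") | (df["dept"] == "ops")]
not_eng = df[~(df["dept"] == "eng")]

print(older)
print(senior_eng)
print(eng_or_ops)
print(not_eng)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the symbolic operators are used instead.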
.loc and .iloc
Always use .loc (label-based) or .iloc (position-based) when you want to select both rows and columns simultaneously.
Mixing label access and positional access (e.g., using plain [] with a row integer when the index is a string) is a common source of bugs. Using .loc/.iloc explicitly eliminates ambiguity.
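The difference between the two accessors can be sketched on a DataFrame with a string index. The data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [36, 29, 45], "salary": [120000, 95000, 98000]},
    index=["ada", "linus", "grace"],
)

# .loc selects by label: row label first, then column label.
ada_salary = df.loc["ada", "salary"]

# Label slices include the end label.
first_two = df.loc["ada":"linus", ["age", "salary"]]

# .iloc selects by integer position: row 0, column 1 is the same cell.
same_cell = df.iloc[0, 1]

# Positional slices exclude the end position, like ordinary Python slices.
top_left = df.iloc[0:2, 0:1]

# .loc also accepts a boolean mask for the rows.
high_paid_ages = df.loc[df["salary"] > 96000, "age"]

print(ada_salary, same_cell)
print(first_two)
print(top_left)
print(high_paid_ages)
```

Note the asymmetry: `.loc["ada":"linus"]` returns both rows, while `.iloc[0:2]` stops before position 2.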
Summary
- A Series is a labelled 1-D array; a DataFrame is a dictionary of aligned Series sharing a common index.
- Pandas can read CSV, Excel, JSON, Parquet, SQL, and more; each reader has parameters for parsing dates, specifying delimiters, and controlling dtypes.
- Always run .shape, .info(), .describe(), and .head() immediately after loading a dataset to understand its structure before touching the data.
- Boolean indexing with &, |, and ~ is the standard row-filtering approach; use .loc and .iloc for combined row-and-column selection to avoid ambiguous indexing.