GadaaLabs
Data Analysis with Python
Lesson 4

Cleaning Messy Data

14 min

Raw data is almost always messy. Fields are missing, values are stored as the wrong type, strings are inconsistently capitalised, and dates come in six different formats. Cleaning is not glamorous, but it is where most of the analytical risk lives: a miscounted null or a silently failed type cast will corrupt every downstream result.

Identifying and Handling Null Values

Start by quantifying the null landscape:

python
import pandas as pd
import numpy as np

df = pd.read_csv("customers.csv")

# Count nulls per column
null_counts = df.isnull().sum()
null_pct    = (df.isnull().mean() * 100).round(2)

null_report = pd.DataFrame({
    "missing_count": null_counts,
    "missing_pct":   null_pct
}).sort_values("missing_pct", ascending=False)

print(null_report[null_report["missing_count"] > 0])

Once you understand where nulls live, choose a strategy per column:

python
# Drop rows missing a critical field
df = df.dropna(subset=["customer_id", "purchase_date"])

# Drop columns where more than 60% of values are missing
# (thresh is the minimum number of non-null values a column needs to be kept)
threshold = 0.6
df = df.dropna(axis=1, thresh=int(len(df) * (1 - threshold)))

# Fill with a constant
df["country"] = df["country"].fillna("Unknown")

# Fill numeric columns with the column median (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())

# Forward-fill ordered time series data
df["price"] = df["price"].ffill()

# Fill with group-specific means (more contextually accurate)
df["salary"] = df.groupby("department")["salary"].transform(
    lambda x: x.fillna(x.mean())
)

| Strategy | When to use |
|---|---|
| Drop row | Missing value is mandatory; row is unusable without it |
| Drop column | >50–70% of rows are missing; column adds no signal |
| Fill with median | Numeric column; distribution is skewed |
| Fill with mode | Categorical column; a clear most-common value exists |
| Fill with mean | Numeric column; distribution is approximately normal |
| Forward/back fill | Ordered time series; values persist until changed |
| Group-conditional fill | Meaningful subgroups exist (e.g., fill salary by job title) |
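The mode-fill strategy from the table is not shown in the code above; here is a minimal sketch (the `plan` column and its values are hypothetical):

```python
import pandas as pd

plans = pd.DataFrame({"plan": ["basic", "pro", None, "basic", None]})

# Fill a categorical column with its most frequent value (the mode)
mode_value = plans["plan"].mode()[0]            # "basic" (2 occurrences)
plans["plan"] = plans["plan"].fillna(mode_value)

print(plans["plan"].tolist())
# → ['basic', 'pro', 'basic', 'basic', 'basic']
```

Note that `.mode()` returns a Series (there can be ties), so the `[0]` picks an arbitrary winner when two values are equally frequent.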

Removing Duplicates

Duplicate rows silently inflate counts and skew aggregations:

python
# Check total duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

# Check duplicates on specific columns (logical duplicates)
print(f"Duplicate orders: {df.duplicated(subset=['order_id']).sum()}")

# Inspect duplicate records before removing
dups = df[df.duplicated(subset=["order_id"], keep=False)]
print(dups.sort_values("order_id").head(10))

# Drop duplicates, keeping the first occurrence
df = df.drop_duplicates(subset=["order_id"], keep="first")
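`keep="first"` is arbitrary when duplicates differ in content. A common alternative is to keep the most recent record; a sketch assuming a hypothetical `updated_at` timestamp column:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id":   [1, 1, 2],
    "updated_at": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-01-15"]),
    "status":     ["pending", "shipped", "pending"],
})

# Sort so the newest record comes last, then keep="last" retains it
latest = (
    orders.sort_values("updated_at")
          .drop_duplicates(subset=["order_id"], keep="last")
          .sort_index()
)

print(latest)
```

Order 1 survives with status "shipped" (its latest version) rather than whichever copy happened to appear first in the file.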

Dtype Casting

Pandas infers dtypes on read, but it often gets them wrong. Storing a zip code as float64 or a Boolean flag as object wastes memory and breaks downstream logic:

python
print(df.dtypes)

# Cast to the correct types
df["age"]        = df["age"].astype(int)
df["is_premium"] = df["is_premium"].astype(bool)
df["zip_code"]   = df["zip_code"].astype(str).str.zfill(5)  # keep leading zeros
df["product_id"] = df["product_id"].astype("category")       # low-cardinality strings

# Safe numeric casting — coerce errors to NaN instead of raising
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Downcast to save memory
df["quantity"] = pd.to_numeric(df["quantity"], downcast="integer")

String Normalisation

Free-text fields are a cleaning minefield — inconsistent capitalisation, whitespace, and special characters create false duplicates:

python
# Chain string operations
df["name"] = (
    df["name"]
    .str.strip()          # remove leading/trailing whitespace
    .str.lower()          # normalise case
    .str.replace(r"\s+", " ", regex=True)   # collapse internal whitespace
    .str.replace(r"[^a-z0-9 ]", "", regex=True)  # strip non-alphanumeric
)

# Standardise categorical values
df["status"] = df["status"].str.upper().str.strip()
df["status"] = df["status"].replace({
    "ACTV": "ACTIVE",
    "INACT": "INACTIVE",
    "PNDG": "PENDING"
})

# Extract structured data from a messy string
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

Parsing Dates with pd.to_datetime

Dates stored as strings are a constant source of errors. Parse them explicitly:

python
# Simple ISO format — Pandas handles automatically
df["created_at"] = pd.to_datetime(df["created_at"])

# Explicit format — much faster for large datasets (no format sniffing)
df["purchase_date"] = pd.to_datetime(df["purchase_date"], format="%d/%m/%Y")

# Mixed or ambiguous formats — use errors='coerce' to surface bad values
# (keep the raw strings first so failed parses can be inspected)
df["event_time_raw"] = df["event_time"]
df["event_time"] = pd.to_datetime(df["event_time"], errors="coerce")
bad_dates = df[df["event_time"].isna() & df["event_time_raw"].notna()]

# Extract date components for feature engineering
df["year"]  = df["created_at"].dt.year
df["month"] = df["created_at"].dt.month
df["dow"]   = df["created_at"].dt.dayofweek   # 0=Monday
df["is_weekend"] = df["dow"].isin([5, 6])

A Cleaning Checklist

Run through this mental checklist on every new dataset:

  1. Null audit — count, locate, decide strategy per column.
  2. Duplicate audit — exact and logical duplicates on key columns.
  3. Dtype audit — every column should have its correct Pandas dtype.
  4. String audit — strip, lower, standardise, resolve aliases.
  5. Date audit — parse all date columns; inspect failures with errors='coerce'.
  6. Range audit — check numeric columns for impossible values (negative ages, 200% discounts).
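The range audit (step 6) can be sketched as boolean masks; the column names and bounds below are illustrative, not universal rules:

```python
import pandas as pd

records = pd.DataFrame({"age": [34, -2, 210], "discount": [0.1, 1.5, 0.25]})

# Flag impossible values rather than silently dropping them
bad_age      = ~records["age"].between(0, 120)
bad_discount = ~records["discount"].between(0, 1)

print(records[bad_age | bad_discount])
```

Flagging first, then deciding whether to drop, cap, or investigate, keeps the cleaning decision explicit and auditable.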

Summary

  • Start every cleaning session with a null percentage report; choose a fill or drop strategy based on missingness rate and the column's analytical role.
  • Duplicate detection requires checking both exact row duplicates and logical duplicates on business-key columns.
  • Incorrect dtypes silently corrupt aggregations; always cast explicitly using .astype(), pd.to_numeric(), and pd.to_datetime().
  • String normalisation (strip, lower, regex replace, alias mapping) is essential before any groupby or merge on text columns.
  • pd.to_datetime(errors='coerce') turns unparseable dates into NaN, making bad values visible rather than raising an exception and halting the pipeline.