GadaaLabs
Python Mastery — From Zero to AI Engineering
Lesson 15

Deep Learning with PyTorch

34 min

Why Neural Networks Work

The Universal Approximation Theorem states that a neural network with a single hidden layer and enough neurons can approximate any continuous function on a compact domain to arbitrary precision. This is the mathematical guarantee deep learning rests on, with one caveat: it promises that such a network exists, not that gradient descent will find it.

In practice, a network is a composition of differentiable operations. Because every operation has a known derivative, we can compute the gradient of the loss with respect to every parameter, then move each parameter in the direction that decreases the loss. That loop is the core of deep learning, made feasible at scale by GPUs and automatic differentiation.
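That loop can be seen in miniature with plain Python, minimizing a toy one-dimensional loss f(w) = (w - 3)² whose gradient we know analytically (an illustrative sketch; the function and names are made up for this example):

```python
# Toy example: gradient descent on f(w) = (w - 3)^2, whose gradient is 2*(w - 3)
def gradient_descent(w=0.0, lr=0.1, steps=100):
    for _ in range(steps):
        grad = 2 * (w - 3)   # analytic gradient of the loss
        w -= lr * grad       # step in the direction that decreases the loss
    return w

print(gradient_descent())    # converges to ~3.0
```

Real networks have millions of parameters instead of one, but the update rule is the same.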

Anatomy of a Neural Network

Input Layer → [Hidden Layer 1] → [Hidden Layer 2] → ... → Output Layer
             (weights + bias)    (weights + bias)          (weights + bias)
             + activation        + activation              + activation

Each layer computes:

output = activation(W @ input + b)

Where W is the weight matrix and b is the bias vector. The activation introduces nonlinearity — without it, stacking layers would collapse to a single linear transformation.

Common activations:

  • ReLU: max(0, x) — fast, mitigates vanishing gradients, the default choice (watch for "dead" neurons stuck at zero)
  • Sigmoid: 1/(1+e^-x) — outputs 0 to 1, used in binary classification output layers
  • Tanh: (e^x - e^-x)/(e^x + e^-x) — outputs -1 to 1, zero-centered
  • Softmax: normalizes a vector into a probability distribution, used in multi-class output layers
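These activations, together with the layer formula above, can be sketched in a few lines of NumPy (illustrative only; tanh is already available as np.tanh, and the shapes here are arbitrary choices):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

# One layer: output = activation(W @ input + b)
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))      # 3 inputs -> 4 units
b = np.zeros(4)
x = np.array([1.0, -2.0, 0.5])
h = relu(W @ x + b)

print(h.shape)                       # (4,)
print(softmax(h).sum())              # sums to 1 (up to float rounding)
```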

Loss Functions

Regression — Mean Squared Error:

MSE = (1/n) * Σ(y_pred - y_true)²

Binary Classification — Binary Cross-Entropy:

BCE = -(1/n) * Σ[y*log(p) + (1-y)*log(1-p)]

Multi-class — Categorical Cross-Entropy:

CCE = -(1/n) * Σ Σ y_true[i,c] * log(p[i,c])
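These formulas translate directly to NumPy. A sketch follows; the epsilon clipping guards against log(0) and is standard numerical practice, not part of the formulas above:

```python
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

def bce(p, y, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                  # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cce(p, y_onehot, eps=1e-12):
    # sum over classes, average over examples
    return -np.mean(np.sum(y_onehot * np.log(p + eps), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([0.0, 2.0])))   # 0.5
```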

Backpropagation — The Chain Rule

Given a network loss = L(f3(f2(f1(x)))), the gradient of loss with respect to layer 1 weights is:

∂L/∂W1 = ∂L/∂f3 · ∂f3/∂f2 · ∂f2/∂f1 · ∂f1/∂W1

This is the chain rule applied repeatedly. Backpropagation is just an efficient algorithm that computes this by working backward from the output, reusing intermediate values.
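A scalar example makes the mechanics concrete: for loss = (tanh(w·x))², applying the chain rule by hand and checking the result against a finite-difference estimate (toy values chosen for this sketch, not from the lesson):

```python
import numpy as np

def loss(w, x=0.5):
    return np.tanh(w * x) ** 2           # L(f2(f1(w))) with f1 = w*x, f2 = tanh

def grad_chain(w, x=0.5):
    h = np.tanh(w * x)
    # dL/dw = dL/dh * dh/da * da/dw = 2h * (1 - h^2) * x
    return 2 * h * (1 - h ** 2) * x

w = 1.3
numeric = (loss(w + 1e-6) - loss(w - 1e-6)) / 2e-6   # finite-difference check
print(grad_chain(w), numeric)                        # the two agree closely
```

This "compare against finite differences" trick is also how autograd implementations are tested.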

Every modern deep learning framework (PyTorch, JAX, TensorFlow) implements automatic differentiation — you define the forward pass, and gradients are computed for free.

Neural Network from Scratch with NumPy

Neural Network from Scratch — XOR Problem

Gradient Descent Variants

SGD vs Adam Optimizer
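As a rough sketch of what the playground compares, here are the plain SGD update and the Adam update rule implemented side by side on the same toy quadratic (the function is made up for illustration; the hyperparameters are the usual Adam defaults):

```python
import numpy as np

def grad(w):
    return 2 * (w - 3)                   # gradient of f(w) = (w - 3)^2

# Plain SGD: w <- w - lr * grad
w_sgd = 0.0
for _ in range(200):
    w_sgd -= 0.1 * grad(w_sgd)

# Adam: adaptive step sizes from running moment estimates
w_adam, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 201):
    g = grad(w_adam)
    m = b1 * m + (1 - b1) * g            # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * g ** 2       # second moment (mean of squares)
    m_hat = m / (1 - b1 ** t)            # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    w_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w_sgd, w_adam)                     # both approach 3.0
```

Adam rescales each parameter's step by its gradient history, which is why it needs far less learning-rate tuning than SGD on most problems.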

PyTorch: Tensors and Autograd

python
import torch

# Tensor creation
x = torch.tensor([1.0, 2.0, 3.0])
W = torch.randn(3, 4, requires_grad=True)   # Track gradients
b = torch.zeros(4, requires_grad=True)

# Forward pass
out = x @ W + b
loss = out.sum()

# Backward pass — PyTorch computes all gradients automatically
loss.backward()
print(W.grad)    # dL/dW
print(b.grad)    # dL/db

# GPU support — just move tensors
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)
W = W.to(device)

Building Models with nn.Module

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size, dropout=0.3):
        super().__init__()
        layers = []
        prev_size = input_size
        for h in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, h),
                nn.BatchNorm1d(h),
                nn.ReLU(),
                nn.Dropout(dropout),
            ])
            prev_size = h
        layers.append(nn.Linear(prev_size, output_size))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

model = MLP(input_size=784, hidden_sizes=[256, 128], output_size=10)
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

The PyTorch Training Loop

python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_model(model, train_loader, val_loader, epochs=20, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
    criterion = nn.CrossEntropyLoss()
    best_val_loss = float('inf')

    for epoch in range(epochs):
        # ── Training phase ──
        model.train()
        train_loss = 0.0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()          # Clear gradients from previous step
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()                # Compute gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()               # Update weights
            train_loss += loss.item()

        # ── Validation phase ──
        model.eval()
        val_loss = 0.0
        correct = 0
        with torch.no_grad():              # No gradient computation
            for X_batch, y_batch in val_loader:
                outputs = model(X_batch)
                val_loss += criterion(outputs, y_batch).item()
                correct += (outputs.argmax(1) == y_batch).sum().item()

        val_loss /= len(val_loader)
        scheduler.step(val_loss)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_model.pt")   # Save best weights

        if epoch % 5 == 0:
            acc = correct / len(val_loader.dataset)
            print(f"Epoch {epoch}: train={train_loss/len(train_loader):.4f}, "
                  f"val={val_loss:.4f}, acc={acc:.3f}")

    model.load_state_dict(torch.load("best_model.pt"))
    return model

CNN Architecture

python
class CNN(nn.Module):
    """Convolutional network for image classification."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(1, 32, kernel_size=3, padding=1),   # (1,28,28) -> (32,28,28)
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),                               # -> (32,14,14)

            # Block 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # -> (64,14,14)
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),                               # -> (64,7,7)

            # Block 3
            nn.Conv2d(64, 128, kernel_size=3, padding=1), # -> (128,7,7)
            nn.BatchNorm2d(128),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # -> (128,1,1)
            nn.Flatten(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
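A quick way to verify the shape comments in the architecture above is to push a dummy batch through a block; here block 1 is rebuilt standalone for the check (batch size 8 is an arbitrary choice):

```python
import torch
import torch.nn as nn

# Rebuild block 1 and confirm the annotated shapes
block1 = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),   # (1,28,28) -> (32,28,28)
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # -> (32,14,14)
)
x = torch.randn(8, 1, 28, 28)                     # dummy batch of 8 images
print(block1(x).shape)                            # torch.Size([8, 32, 14, 14])
```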

PROJECT: Neural Network from Scratch — Decision Boundary

Neural Net from Scratch — Spiral Classification
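The playground generates the spiral data for you. For reference, one common way to synthesize such a dataset in NumPy (the function name and constants are this sketch's own, not the playground's):

```python
import numpy as np

def make_spirals(n_per_class=100, n_classes=3, noise=0.2, seed=0):
    """Interleaved-spirals toy dataset: 2D points, one spiral arm per class."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for c in range(n_classes):
        r = np.linspace(0.05, 1.0, n_per_class)               # radius grows
        t = np.linspace(c * 4, (c + 1) * 4, n_per_class)      # angle sweeps
        t = t + rng.normal(scale=noise, size=n_per_class)     # jitter the angle
        X.append(np.column_stack([r * np.sin(t), r * np.cos(t)]))
        y.append(np.full(n_per_class, c))
    return np.vstack(X), np.concatenate(y)

X, y = make_spirals()
print(X.shape, y.shape)   # (300, 2) (300,)
```

The spirals are not linearly separable, so the learned decision boundary makes hidden-layer nonlinearity visible.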

PyTorch Digit Classifier on MNIST

This code runs in a Python environment with PyTorch installed, not in the browser playground:

python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Data
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
train_data = datasets.MNIST('.', train=True,  download=True, transform=transform)
test_data  = datasets.MNIST('.', train=False, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=128, shuffle=True,  num_workers=4)
test_loader  = DataLoader(test_data,  batch_size=256, shuffle=False, num_workers=4)

# Model
class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128*4*4, 256), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(256, 10)
        )
    def forward(self, x): return self.fc(self.conv(x))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DigitClassifier().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(10):
    model.train()
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        loss.backward()
        optimizer.step()

    # Evaluate
    model.eval()
    correct = 0
    with torch.no_grad():
        for X, y in test_loader:
            X, y = X.to(device), y.to(device)
            correct += (model(X).argmax(1) == y).sum().item()
    print(f"Epoch {epoch+1}: accuracy={correct/len(test_data):.4f}")
# Typically reaches >99% accuracy in 10 epochs

Debugging Neural Networks

When training stalls or fails, work through this checklist:

Loss not decreasing:

  • Check learning rate — too high causes oscillation, too low causes glacial progress
  • Check that optimizer.zero_grad() is called every step
  • Verify input normalization — unnormalized inputs slow convergence dramatically
  • Print the loss before calling .backward() to confirm the forward pass runs

Vanishing gradients (deep networks):

  • Add BatchNorm after each layer
  • Use ReLU (or leaky ReLU) instead of sigmoid/tanh in hidden layers
  • Use residual connections (skip connections)
  • Gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

Overfitting:

  • Add Dropout(0.3-0.5) after dense layers
  • L2 regularization: weight_decay=1e-4 in optimizer
  • Reduce model capacity
  • Add data augmentation (images) or collect more data

Key Takeaways

  • Backpropagation is the chain rule applied recursively — every modern framework automates this
  • ReLU is the default activation; it avoids the vanishing gradient problem that plagued sigmoid/tanh networks
  • The training loop structure is always: zero_grad → forward → loss → backward → step
  • BatchNorm stabilizes training by normalizing layer inputs; add it after every linear/conv layer
  • Use model.eval() and torch.no_grad() during validation — these disable dropout and gradient tracking
  • Learning rate is the most important hyperparameter; use ReduceLROnPlateau or cosine annealing
  • Transfer learning: freeze pretrained feature layers, replace and train only the classification head
  • When debugging, simplify — overfit a single batch first to confirm the model and loss function are correct
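The last point deserves code. A minimal single-batch overfitting check might look like this (the model and data here are throwaway stand-ins; the point is that a correctly wired model plus loss should drive the loss toward zero on one fixed batch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
X = torch.randn(64, 10)                    # one fixed batch of fake data
y = torch.randint(0, 3, (64,))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.4f}")    # should approach 0
```

If the loss refuses to drop even here, the bug is in the model or the loss wiring, not in the data or the hyperparameters.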