The Universal Approximation Theorem states that a neural network with a single hidden layer and enough neurons can approximate any continuous function on a compact domain to arbitrary precision. This is the mathematical guarantee that deep learning rests on, though it is an existence result: it says a good approximation exists, not how many neurons are needed or how to find the weights.
In practice: a network is a composition of differentiable operations. Because every operation has a known derivative, we can compute the gradient of the loss with respect to every parameter — then move parameters in the direction that decreases the loss. This is the entirety of deep learning, made feasible by GPUs and automatic differentiation.
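That update loop is easy to see in miniature. Below is a sketch in which a hand-derived gradient of a one-parameter quadratic loss stands in for what automatic differentiation provides; the function and learning rate are illustrative, not from any particular demo.

```python
# Minimize L(w) = (w - 3)^2 by following the negative gradient.
# dL/dw = 2 * (w - 3), derived by hand; autodiff would supply this.
w = 0.0
lr = 0.1  # step size

for _ in range(100):
    grad = 2 * (w - 3)   # gradient of the loss at the current w
    w -= lr * grad       # move against the gradient

print(round(w, 4))  # converges to the minimum at w = 3
```

Every training run in this article is this loop at scale: more parameters, a harder loss surface, and gradients computed by autodiff instead of by hand.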
Each layer computes a = σ(Wx + b), where W is the weight matrix, b is the bias vector, and σ is the activation function. The activation introduces nonlinearity; without it, stacking layers would collapse to a single linear transformation.
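The collapse claim can be checked numerically: two stacked linear layers without an activation are exactly one linear layer whose weight is the matrix product. The shapes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
W1 = rng.standard_normal((4, 5))
W2 = rng.standard_normal((3, 4))

# Two linear layers with no activation between them...
two_layers = W2 @ (W1 @ x)
# ...equal one linear layer with weight W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```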
Softmax: normalizes a vector of logits into a probability distribution; used in multi-class output layers
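As a sketch, a numerically stable softmax is only a few lines of NumPy (subtracting the max before exponentiating prevents overflow without changing the result):

```python
import numpy as np

def softmax(z):
    """Map a vector of logits z to a probability distribution."""
    z = z - np.max(z)        # shift for numerical stability; output is unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)        # the largest logit gets the largest probability
print(p.sum())  # ≈ 1.0
```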
Loss Functions
Regression — Mean Squared Error:
MSE = (1/n) * Σ(y_pred - y_true)²
Binary Classification — Binary Cross-Entropy:
BCE = -(1/n) * Σ[y*log(p) + (1-y)*log(1-p)]
Multi-class — Categorical Cross-Entropy:
CCE = -(1/n) * Σ Σ y_true[i,c] * log(p[i,c])
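The three formulas translate almost directly into NumPy. The predictions and targets below are invented for illustration; `p` holds predicted probabilities and `y_true` one-hot labels:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean squared error for regression."""
    return np.mean((y_pred - y_true) ** 2)

def bce(p, y):
    """Binary cross-entropy; p is the predicted probability of the positive class."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def cce(p, y_true):
    """Categorical cross-entropy; y_true is one-hot with shape (n, classes)."""
    return -np.mean(np.sum(y_true * np.log(p), axis=1))

print(mse(np.array([1.0, 2.0]), np.array([1.0, 3.0])))   # 0.5
print(bce(np.array([0.9, 0.1]), np.array([1.0, 0.0])))   # small: confident and correct
y = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cce(p, y))
```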
Backpropagation — The Chain Rule
Given a network loss = L(f3(f2(f1(x)))), the gradient of loss with respect to layer 1 weights is:
∂L/∂W1 = ∂L/∂f3 · ∂f3/∂f2 · ∂f2/∂f1 · ∂f1/∂W1
This is the chain rule applied repeatedly. Backpropagation is just an efficient algorithm that computes this by working backward from the output, reusing intermediate values.
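For scalar functions the backward pass can be carried out by hand and checked against a finite difference. The three-function chain below is a toy example, not from the original text:

```python
# Chain: loss = f3(f2(f1(w))) with f1(w) = 2w, f2(a) = a + 1, f3(b) = b**2
w = 3.0
a = w * 2          # f1
b = a + 1          # f2
loss = b ** 2      # f3

# Backward pass: multiply local derivatives from the output inward
dloss_db = 2 * b   # d(b^2)/db
db_da = 1.0        # d(a+1)/da
da_dw = 2.0        # d(2w)/dw
grad = dloss_db * db_da * da_dw

# Numerical check with a finite difference
eps = 1e-6
loss_eps = ((w + eps) * 2 + 1) ** 2
print(grad)                     # analytic gradient: 28.0
print((loss_eps - loss) / eps)  # ≈ 28.0
```

Backpropagation does exactly this, but caches each intermediate value (a, b) during the forward pass so no derivative is recomputed.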
Every modern deep learning framework (PyTorch, JAX, TensorFlow) implements automatic differentiation — you define the forward pass, and gradients are computed for free.
Neural Network from Scratch with NumPy
Neural Network from Scratch — XOR Problem
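The interactive demo is not reproduced here, but a minimal from-scratch version of the same idea, a small sigmoid network trained on XOR with hand-written backpropagation, might look like the following (architecture, seed, and hyperparameters are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR is not linearly separable, so a hidden layer is essential
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 -> 8 -> 1 network; hidden width 8 is an arbitrary choice
W1 = rng.standard_normal((2, 8))
b1 = np.zeros(8)
W2 = rng.standard_normal((8, 1))
b2 = np.zeros(1)
lr = 0.5

for _ in range(10_000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)

    # Backward pass; sigmoid output + binary cross-entropy gives delta = p - y
    d_out = p - y
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1 - h)   # chain rule through the hidden layer
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)

    # Gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p.ravel(), 2))  # values near [0, 1, 1, 0] once trained
```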
Gradient Descent Variants
SGD vs Adam Optimizer
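The runnable demo is likewise omitted, so here is a sketch of the two update rules on a toy quadratic. The objective and learning rates are invented; Adam's constants are the usual defaults from the original paper:

```python
import numpy as np

def grad(w):
    # Gradient of the toy objective L(w) = (w - 4)^2
    return 2 * (w - 4)

# Plain SGD: w -= lr * g
w_sgd, lr = 0.0, 0.05
for _ in range(200):
    w_sgd -= lr * grad(w_sgd)

# Adam: adapts the step size using running moments of the gradient
w_adam, m, v = 0.0, 0.0, 0.0
alpha, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 201):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)           # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    w_adam -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(round(w_sgd, 4), round(w_adam, 4))  # both approach the minimum at 4
```

On a real loss surface the trade-off is more interesting: Adam's per-parameter scaling usually converges faster out of the box, while well-tuned SGD (often with momentum) can generalize as well or better.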
PyTorch: Tensors and Autograd
```python
import torch

# Tensor creation
x = torch.tensor([1.0, 2.0, 3.0])
W = torch.randn(3, 4, requires_grad=True)  # Track gradients
b = torch.zeros(4, requires_grad=True)

# Forward pass
out = x @ W + b
loss = out.sum()

# Backward pass — PyTorch computes all gradients automatically
loss.backward()
print(W.grad)  # dL/dW
print(b.grad)  # dL/db

# GPU support — just move tensors
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)
W = W.to(device)
```
Building Models with nn.Module
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size, dropout=0.3):
        super().__init__()
        layers = []
        prev_size = input_size
        for h in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, h),
                nn.BatchNorm1d(h),
                nn.ReLU(),
                nn.Dropout(dropout),
            ])
            prev_size = h
        layers.append(nn.Linear(prev_size, output_size))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

model = MLP(input_size=784, hidden_sizes=[256, 128], output_size=10)
print(model)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
```
The PyTorch Training Loop
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_model(model, train_loader, val_loader, epochs=20, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
    criterion = nn.CrossEntropyLoss()
    best_val_loss = float('inf')

    for epoch in range(epochs):
        # ── Training phase ──
        model.train()
        train_loss = 0.0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()  # Clear gradients from previous step
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()  # Compute gradients
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()  # Update weights
            train_loss += loss.item()

        # ── Validation phase ──
        model.eval()
        val_loss = 0.0
        correct = 0
        with torch.no_grad():  # No gradient computation
            for X_batch, y_batch in val_loader:
                outputs = model(X_batch)
                val_loss += criterion(outputs, y_batch).item()
                correct += (outputs.argmax(1) == y_batch).sum().item()

        val_loss /= len(val_loader)
        scheduler.step(val_loss)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), "best_model.pt")  # Save best weights

        if epoch % 5 == 0:
            acc = correct / len(val_loader.dataset)
            print(f"Epoch {epoch}: train={train_loss/len(train_loader):.4f}, "
                  f"val={val_loss:.4f}, acc={acc:.3f}")

    model.load_state_dict(torch.load("best_model.pt"))
    return model
```