Serving a model is a software engineering problem, not a machine learning one. A well-designed inference server handles malformed requests gracefully, reports its own health, batches requests to maximise GPU utilisation, and scales horizontally without state. This lesson builds that server from scratch.
## API Contract Design
Define the request/response schemas before writing any model code — they are your API's public contract:
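For example, a minimal pair of Pydantic schemas (a sketch, assuming Pydantic v2; the field names and vector length here are illustrative):

```python
from pydantic import BaseModel, Field

class PredictRequest(BaseModel):
    # A fixed-length feature vector; FastAPI rejects malformed
    # payloads with a 422 before any model code runs.
    features: list[float] = Field(min_length=4, max_length=4)

class PredictResponse(BaseModel):
    prediction: float
    model_version: str
```

Health and readiness endpoints belong to the same public contract: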
```python
from fastapi import HTTPException

@app.get("/health")
async def health():
    # Liveness: the process is up and responding.
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    # Readiness: refuse traffic until the model is loaded.
    # `session` and MODEL_VERSION are module globals set at startup.
    if session is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ready", "model_version": MODEL_VERSION}
```
Kubernetes uses `/health` (liveness probe) and `/ready` (readiness probe) separately. A slow model load should fail readiness without killing the pod.
## Async Batching Under Load
```python
import asyncio
from collections import deque

import numpy as np

BATCH_SIZE = 16
BATCH_WAIT_MS = 20

request_queue: deque = deque()

async def batch_processor():
    while True:
        # Wait a short window so requests can accumulate into a batch.
        await asyncio.sleep(BATCH_WAIT_MS / 1000)
        if not request_queue:
            continue
        batch = [request_queue.popleft()
                 for _ in range(min(BATCH_SIZE, len(request_queue)))]
        inputs = np.concatenate([b["input"] for b in batch], axis=0)
        results = session.run(None, {"input": inputs})[0]
        # Resolve each caller's future with its row of the batched output.
        for i, b in enumerate(batch):
            b["future"].set_result(results[i])
```
Batching trades a small latency increase (up to `BATCH_WAIT_MS`) for significantly higher throughput and GPU utilisation.
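The processor is only one side of the exchange: each request handler must enqueue its input alongside a future and await the result. A self-contained sketch of the round trip (a dummy doubling function stands in for the ONNX session, and the batching parameters mirror those above):

```python
import asyncio
from collections import deque

BATCH_SIZE = 16
BATCH_WAIT_MS = 20

request_queue: deque = deque()

async def enqueue_and_wait(item):
    # Pair the input with a future the batch processor will resolve.
    future = asyncio.get_running_loop().create_future()
    request_queue.append({"input": item, "future": future})
    return await future

async def batch_processor(run_model):
    while True:
        await asyncio.sleep(BATCH_WAIT_MS / 1000)
        if not request_queue:
            continue
        batch = [request_queue.popleft()
                 for _ in range(min(BATCH_SIZE, len(request_queue)))]
        results = run_model([b["input"] for b in batch])
        for b, r in zip(batch, results):
            b["future"].set_result(r)

async def main():
    # A dummy "model" that doubles each input in the batch.
    task = asyncio.create_task(batch_processor(lambda xs: [x * 2 for x in xs]))
    outs = await asyncio.gather(*(enqueue_and_wait(i) for i in range(5)))
    task.cancel()
    return outs

print(asyncio.run(main()))  # → [0, 2, 4, 6, 8]
```

`asyncio.gather` preserves submission order, so each caller gets back exactly the result for its own input even though inference ran on the whole batch.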
## Deployment and Scaling
| Concern | Solution |
|---|---|
| Multiple replicas | Stateless server — any replica handles any request |
| Load balancing | Kubernetes Service with `sessionAffinity: None` |
| Resource limits | `requests`: `cpu: 500m`, `memory: 1Gi` per replica |
| Zero-downtime deploys | Rolling update strategy with readiness probe |
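Wired together, these concerns land in a Deployment manifest along the following lines (a sketch; the names, image, port, and replica count are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 3
  strategy:
    type: RollingUpdate            # zero-downtime deploys
  selector:
    matchLabels: {app: model-server}
  template:
    metadata:
      labels: {app: model-server}
    spec:
      containers:
        - name: model-server
          image: model-server:latest
          resources:
            requests: {cpu: 500m, memory: 1Gi}
          livenessProbe:
            httpGet: {path: /health, port: 8000}
          readinessProbe:
            httpGet: {path: /ready, port: 8000}
```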
## Summary
- Define Pydantic request/response schemas first; they act as contracts and provide free input validation.
- Load the model in FastAPI's lifespan hook, not at module level, to support graceful startup and shutdown.
- Expose separate `/health` (liveness) and `/ready` (readiness) endpoints for Kubernetes probes.
- Implement async batching when serving GPU models to improve throughput without unbounded latency.
- Keep the server stateless so it scales horizontally by simply adding more identical replicas.