GadaaLabs
RAG Engineering
Lesson 6

Production RAG

18 min

A RAG pipeline that scores 0.9 on RAGAS but takes 8 seconds to respond will not survive in production. This lesson covers the engineering work that takes a functional RAG prototype to a reliable, fast, observable service.

Latency Decomposition

Before optimising, measure each stage individually:

| Stage | Typical latency | Optimisation lever |
|---|---|---|
| Query embedding | 20–80 ms | Smaller model, batching |
| Vector search (ANN) | 5–30 ms | Index tuning, caching |
| Cross-encoder re-rank | 100–500 ms | Limit candidate set to 20 |
| LLM generation (first token) | 300–2000 ms | Streaming, smaller model |
| LLM generation (full response) | 500–5000 ms | Streaming output |
| Total p50 | ~800 ms | |
| Total p99 | ~3000 ms | |

Stream LLM output to the user as soon as tokens arrive — this hides most of the generation latency.
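To decompose latency in practice, wrap each pipeline stage in a timer and aggregate percentiles across requests. The sketch below uses only the standard library; the stage name and the `time.sleep` stand-in are placeholders for your real pipeline calls:

```python
import time
import statistics
from contextlib import contextmanager
from collections import defaultdict

# Accumulates per-stage latencies (in ms) across many requests
timings: dict[str, list[float]] = defaultdict(list)

@contextmanager
def timed(stage: str):
    """Record the wall-clock time of one pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append((time.perf_counter() - start) * 1000)

def report(stage: str) -> tuple[float, float]:
    """Return (p50, p99) latency for a stage in milliseconds."""
    qs = statistics.quantiles(timings[stage], n=100)  # 99 cut points
    return qs[49], qs[98]

# Usage: wrap each stage of the pipeline in the timer
for _ in range(200):
    with timed("vector_search"):
        time.sleep(0.001)  # stand-in for the real ANN query

p50, p99 = report("vector_search")
```

Tracking p99 alongside p50 matters because tail latency, not the median, is what users complain about.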

Semantic Query Cache

Cache identical or near-identical queries to skip vector search and LLM calls entirely:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.model     = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.store: list[tuple[np.ndarray, str]] = []  # (embedding, cached_answer)

    def get(self, query: str) -> str | None:
        # With normalised embeddings, the dot product equals cosine similarity
        q_emb = self.model.encode(query, normalize_embeddings=True)
        for cached_emb, answer in self.store:
            sim = float(q_emb @ cached_emb)
            if sim >= self.threshold:
                return answer
        return None

    def set(self, query: str, answer: str):
        q_emb = self.model.encode(query, normalize_embeddings=True)
        self.store.append((q_emb, answer))
```

A cosine threshold of 0.95 captures paraphrases without over-caching. In production, back the cache with Redis and set a TTL to avoid serving stale answers.

Output Guardrails

Guard against prompt injection and out-of-scope responses:

```python
import re

BLOCK_PATTERNS = [
    r"ignore previous instructions",
    r"you are now",
    r"disregard your",
    r"forget everything",
]

def check_for_injection(query: str) -> bool:
    """Return True if the query matches a known prompt-injection pattern."""
    lower = query.lower()
    return any(re.search(p, lower) for p in BLOCK_PATTERNS)

def validate_response(response: str, retrieved_chunks: list[str]) -> dict:
    """
    Cheap sanity checks before a response is returned to the user.
    Returns a dict with a 'safe' bool and a 'reason' string.
    """
    if len(response) < 10:
        return {"safe": False, "reason": "Response too short"}
    if response.lower().startswith("i cannot"):
        return {"safe": True, "reason": "Model declined — may be out of scope"}
    # Grounding heuristic: does any 50-char chunk prefix appear verbatim in
    # the response? Too brittle to block on, so flag it for monitoring and
    # rely on RAGAS faithfulness scores to catch real hallucinations.
    grounded = any(chunk[:50] in response for chunk in retrieved_chunks)
    if not grounded:
        return {"safe": True, "reason": "No verbatim overlap with context; monitor"}
    return {"safe": True, "reason": "ok"}
```

LlamaIndex Pipeline with Tracing

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

# Enable tracing
debug_handler = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([debug_handler])

# Build index from documents
documents = SimpleDirectoryReader("./docs").load_data()
index     = VectorStoreIndex.from_documents(documents)

# Create a streaming query engine
query_engine = index.as_query_engine(
    similarity_top_k = 10,
    response_mode    = "compact",
    streaming        = True,
)

response = query_engine.query("What is the Gadaa governance cycle?")
for token in response.response_gen:
    print(token, end="", flush=True)
```

The LlamaDebugHandler logs every retrieval and LLM call with timing information — essential for finding the bottleneck in a slow query.

Deployment Architecture

```yaml
# docker-compose.yml (simplified)
services:
  rag-api:
    image: gadaalabs/rag-api:latest
    environment:
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    ports: ["8000:8000"]
    depends_on: [qdrant, redis]

  qdrant:
    image: qdrant/qdrant:latest
    volumes: ["./qdrant_data:/qdrant/storage"]

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
```

Summary

  • Decompose latency by stage before optimising — the bottleneck is usually LLM generation, not vector search.
  • Stream LLM responses to users immediately; this alone hides most of the generation latency and is the single cheapest perceived-latency win.
  • Add a semantic cache with cosine similarity thresholding to skip expensive pipeline calls for repeat queries.
  • Defend against prompt injection with pattern matching and enforce output schema validation before returning responses.
  • Use LlamaIndex tracing or OpenTelemetry to instrument every pipeline stage in production for debugging and SLA monitoring.