GadaaLabs
RAG Engineering
Lesson 6

Production RAG

18 min

A RAG pipeline that scores 0.9 on RAGAS but takes 8 seconds to respond will not survive in production. This lesson covers the engineering work that takes a functional RAG prototype to a reliable, fast, observable service.

Latency Decomposition

Before optimising, measure each stage individually:

| Stage | Typical latency | Optimisation lever |
|---|---|---|
| Query embedding | 20–80 ms | Smaller model, batching |
| Vector search (ANN) | 5–30 ms | Index tuning, caching |
| Cross-encoder re-rank | 100–500 ms | Limit candidate set to 20 |
| LLM generation (first token) | 300–2000 ms | Streaming, smaller model |
| LLM generation (full response) | 500–5000 ms | Streaming output |
| Total p50 | ~800 ms | |
| Total p99 | ~3000 ms | |

Stream LLM output to the user as soon as tokens arrive — this hides most of the generation latency.
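To decompose latency in practice, wrap each pipeline stage in a timer and aggregate percentiles across requests. The sketch below uses only the standard library; the stage name and the `time.sleep` stand-in are placeholders for your real pipeline calls:

```python
import time
import statistics
from contextlib import contextmanager
from collections import defaultdict

# Accumulates per-stage latencies (in ms) across many requests
timings: dict[str, list[float]] = defaultdict(list)

@contextmanager
def timed(stage: str):
    """Record the wall-clock time of one pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append((time.perf_counter() - start) * 1000)

def report(stage: str) -> tuple[float, float]:
    """Return (p50, p99) latency for a stage in milliseconds."""
    qs = statistics.quantiles(timings[stage], n=100)  # 99 cut points
    return qs[49], qs[98]

# Usage: wrap each stage of the pipeline in the timer
for _ in range(200):
    with timed("vector_search"):
        time.sleep(0.001)  # stand-in for the real ANN query

p50, p99 = report("vector_search")
```

Tracking p99 alongside p50 matters because tail latency, not the median, is what users complain about.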

Semantic Query Cache

Cache identical or near-identical queries to skip vector search and LLM calls entirely:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.model     = SentenceTransformer("all-MiniLM-L6-v2")
        self.threshold = threshold
        self.store: list[tuple[np.ndarray, str]] = []  # (embedding, cached_answer)

    def get(self, query: str) -> str | None:
        # With normalised embeddings, the dot product equals cosine similarity
        q_emb = self.model.encode(query, normalize_embeddings=True)
        for cached_emb, answer in self.store:
            sim = float(q_emb @ cached_emb)
            if sim >= self.threshold:
                return answer
        return None

    def set(self, query: str, answer: str):
        q_emb = self.model.encode(query, normalize_embeddings=True)
        self.store.append((q_emb, answer))
```

A cosine threshold of 0.95 captures paraphrases without over-caching. In production, back the cache with Redis and set a TTL to avoid serving stale answers.

Output Guardrails

Guard against prompt injection and out-of-scope responses:

```python
import re

BLOCK_PATTERNS = [
    r"ignore previous instructions",
    r"you are now",
    r"disregard your",
    r"forget everything",
]

def check_for_injection(query: str) -> bool:
    """Return True if the query matches a known prompt-injection pattern."""
    lower = query.lower()
    return any(re.search(p, lower) for p in BLOCK_PATTERNS)

def validate_response(response: str, retrieved_chunks: list[str]) -> dict:
    """
    Cheap sanity checks before a response is returned to the user.
    Returns a dict with a 'safe' bool and a 'reason' string.
    """
    if len(response) < 10:
        return {"safe": False, "reason": "Response too short"}
    if response.lower().startswith("i cannot"):
        return {"safe": True, "reason": "Model declined — may be out of scope"}
    # Grounding heuristic: does any 50-char chunk prefix appear verbatim in
    # the response? Too brittle to block on, so flag it for monitoring and
    # rely on RAGAS faithfulness scores to catch real hallucinations.
    grounded = any(chunk[:50] in response for chunk in retrieved_chunks)
    if not grounded:
        return {"safe": True, "reason": "No verbatim overlap with context; monitor"}
    return {"safe": True, "reason": "ok"}
```

LlamaIndex Pipeline with Tracing

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

# Enable tracing
debug_handler = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([debug_handler])

# Build index from documents
documents = SimpleDirectoryReader("./docs").load_data()
index     = VectorStoreIndex.from_documents(documents)

# Create a streaming query engine
query_engine = index.as_query_engine(
    similarity_top_k = 10,
    response_mode    = "compact",
    streaming        = True,
)

response = query_engine.query("What is the Gadaa governance cycle?")
for token in response.response_gen:
    print(token, end="", flush=True)
```

The LlamaDebugHandler logs every retrieval and LLM call with timing information — essential for finding the bottleneck in a slow query.

Deployment Architecture

```yaml
# docker-compose.yml (simplified)
services:
  rag-api:
    image: gadaalabs/rag-api:latest
    environment:
      - QDRANT_URL=http://qdrant:6333
      - REDIS_URL=redis://redis:6379
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    ports: ["8000:8000"]
    depends_on: [qdrant, redis]

  qdrant:
    image: qdrant/qdrant:latest
    volumes: ["./qdrant_data:/qdrant/storage"]

  redis:
    image: redis:7-alpine
    command: redis-server --maxmemory 512mb --maxmemory-policy allkeys-lru
```

Summary

  • Decompose latency by stage before optimising — the bottleneck is usually LLM generation, not vector search.
  • Stream LLM responses to users immediately; this alone hides most of the generation latency and is the single cheapest perceived-latency win.
  • Add a semantic cache with cosine similarity thresholding to skip expensive pipeline calls for repeat queries.
  • Defend against prompt injection with pattern matching and enforce output schema validation before returning responses.
  • Use LlamaIndex tracing or OpenTelemetry to instrument every pipeline stage in production for debugging and SLA monitoring.