A RAG pipeline that scores 0.9 on RAGAS but takes 8 seconds to respond will not survive in production. This lesson covers the engineering work that takes a functional RAG prototype to a reliable, fast, observable service.
## Latency Decomposition
Before optimising, measure each stage individually:
| Stage | Typical latency | Optimisation lever |
|---|---|---|
| Query embedding | 20–80 ms | Smaller model, batching |
| Vector search (ANN) | 5–30 ms | Index tuning, caching |
| Cross-encoder re-rank | 100–500 ms | Limit candidate set to 20 |
| LLM generation (first token) | 300–2000 ms | Streaming, smaller model |
| LLM generation (full response) | 500–5000 ms | Streaming output |
| Total p50 | ~800 ms | |
| Total p99 | ~3000 ms | |
Stream LLM output to the user as soon as tokens arrive — this hides most of the generation latency.
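To produce a breakdown like the table above, wrap each stage in a timer and compute percentiles over many requests. The sketch below is illustrative, not part of any framework — the stage names and the `percentile` helper are assumptions:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Maps stage name -> list of observed latencies in milliseconds.
stage_timings = defaultdict(list)

@contextmanager
def timed(stage: str):
    """Record wall-clock latency for one pipeline stage, in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage].append((time.perf_counter() - start) * 1000)

def percentile(values: list[float], p: float) -> float:
    """Simple nearest-rank percentile; adequate for dashboards."""
    s = sorted(values)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]
```

In the request handler, wrap each stage (`with timed("embed"): ...`, `with timed("search"): ...`) and report `percentile(stage_timings["embed"], 50)` and `percentile(..., 99)` per stage to find the bottleneck.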
## Semantic Query Cache
Cache identical or near-identical queries to skip vector search and LLM calls entirely:
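A minimal version is an in-memory store of (query embedding, answer) pairs, with a linear scan for the nearest cached query. This sketch assumes a caller-supplied `embed_fn` that returns unit-normalised vectors (the class name and `embed_fn` are illustrative):

```python
import numpy as np

class SemanticCache:
    """In-memory semantic cache keyed by query embedding similarity."""

    def __init__(self, embed_fn, threshold: float = 0.95):
        self.embed_fn = embed_fn        # must return a unit-normalised vector
        self.threshold = threshold      # cosine similarity cutoff for a hit
        self.entries = []               # list of (embedding, answer) pairs

    def get(self, query: str):
        """Return a cached answer if a stored query is similar enough."""
        q = self.embed_fn(query)
        for emb, answer in self.entries:
            # For unit vectors, the dot product is the cosine similarity.
            if float(np.dot(q, emb)) >= self.threshold:
                return answer
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed_fn(query), answer))
```

On a cache hit the pipeline skips retrieval and generation entirely; for large caches, replace the linear scan with the same ANN index used for document search.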
A cosine threshold of 0.95 captures paraphrases without over-caching. In production, back the cache with Redis and set a TTL to avoid serving stale answers.
## Output Guardrails
Guard against prompt injection and out-of-scope responses:
```python
import re

BLOCK_PATTERNS = [
    r"ignore previous instructions",
    r"you are now",
    r"disregard your",
    r"forget everything",
]

def check_for_injection(query: str) -> bool:
    """Return True if the query matches a known prompt-injection pattern."""
    lower = query.lower()
    return any(re.search(p, lower) for p in BLOCK_PATTERNS)

def validate_response(response: str, retrieved_chunks: list[str]) -> dict:
    """
    Lightweight heuristic checks on a generated response.
    Returns a dict with a 'safe' bool and a 'reason' string.
    """
    if len(response) < 10:
        return {"safe": False, "reason": "Response too short"}
    if response.lower().startswith("i cannot"):
        return {"safe": True, "reason": "Model declined — may be out of scope"}
    if not any(chunk[:50] in response for chunk in retrieved_chunks):
        # Heuristic: no phrase copied verbatim from context; don't block,
        # rely on RAGAS faithfulness monitoring instead.
        pass
    return {"safe": True, "reason": "ok"}
```
## LlamaIndex Pipeline with Tracing
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.callbacks import CallbackManager, LlamaDebugHandler

# Enable tracing
debug_handler = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([debug_handler])

# Build index from documents
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create a streaming query engine
query_engine = index.as_query_engine(
    similarity_top_k=10,
    response_mode="compact",
    streaming=True,
)

response = query_engine.query("What is the Gadaa governance cycle?")
for token in response.response_gen:
    print(token, end="", flush=True)
```
The LlamaDebugHandler logs every retrieval and LLM call with timing information — essential for finding the bottleneck in a slow query.