Python Mastery — From Zero to AI Engineering

Lesson 18

Building AI Applications — LLMs, RAG & Agents

34 min

LLM Fundamentals

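Large Language Models are probability distributions over token sequences. Given a prefix, a model assigns a probability to every possible next token, and the output is sampled from that distribution rather than always taking the single most likely token. That sampling is why the same prompt can produce different, sometimes creative, completions.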

Key parameters you control:

| Parameter | Effect | Typical range |
|---|---|---|
| temperature | Randomness. 0 = deterministic, 2 = chaotic | 0.0 – 1.0 |
| top_p | Nucleus sampling: only consider top-p probability mass | 0.9 – 1.0 |
| max_tokens | Hard cap on output length | 100 – 8192 |
| stop | Stop sequences: the model halts when it generates one of these | e.g. ["\n\n"] |
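
To make temperature and top_p concrete, here is a minimal sketch of how a sampler could apply them to raw next-token scores. The function names (`apply_temperature`, `top_p_filter`, `sample_token`) and the toy vocabulary are illustrative assumptions, not part of any SDK; real inference engines implement the same idea over vocabularies of many thousands of tokens.

```python
import math
import random

def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize the surviving tokens so they sum to 1 again.
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

def sample_token(vocab, logits, temperature=0.7, top_p=0.9, rng=None):
    """Sample one token after applying temperature and nucleus filtering."""
    rng = rng or random.Random()
    probs = top_p_filter(apply_temperature(logits, temperature), top_p)
    return rng.choices(vocab, weights=probs, k=1)[0]
```

Calling `sample_token(["cat", "dog", "fish"], [2.0, 1.0, 0.0], temperature=0)` always returns the top-scoring token, while higher temperatures spread the choices across the vocabulary.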

Tokens are not words. The rule of thumb: 1 token ≈ 0.75 English words, or roughly 4 characters. The word "tokenization" is typically 3-4 tokens. Code is denser — more tokens per character.

The context window is the maximum combined length of prompt plus response, measured in tokens. Models cannot access information outside their context window during inference. For llama-3.3-70b-versatile on Groq, the context window is 128K tokens.
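
Because prompt and response share one budget, long conversations eventually have to drop old turns. The sketch below shows one possible policy: keep the system message and as many of the newest messages as fit. `estimate_tokens` uses the 4-characters-per-token rule of thumb as an approximation, and `trim_history` is a hypothetical helper, not an SDK function.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 characters per token rule of thumb."""
    return max(1, len(text) // 4)

def trim_history(messages: list[dict], context_window: int,
                 reserve_for_reply: int) -> list[dict]:
    """Drop the oldest non-system messages until the prompt fits the budget."""
    budget = context_window - reserve_for_reply
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(m["content"])
        if used + cost > budget:
            break
        kept.append(m)
        used += cost
    # Restore chronological order after the reversed walk.
    return system + list(reversed(kept))
```

For exact counts you would use the model's real tokenizer; the estimate is only meant to keep requests safely under the limit.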

Groq API Integration

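Groq exposes an OpenAI-compatible API, and its Python SDK mirrors the OpenAI client's interface. Existing OpenAI code typically needs only two changes, the client import and constructor plus the model name: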

python

from groq import Groq

client = Groq(api_key="your-groq-api-key")   # or reads GROQ_API_KEY from env

# Basic completion
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are an expert Python engineer."},
        {"role": "user",   "content": "Explain Python's GIL in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=500,
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Streaming Responses

For long responses, streaming dramatically improves perceived latency — the user sees text appear token-by-token instead of waiting for the full response:

python

from groq import Groq

client = Groq()

def stream_completion(prompt: str, system: str = "You are a helpful assistant."):
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": system},
            {"role": "user",   "content": prompt},
        ],
        stream=True,        # Key change: enable streaming
        temperature=0.7,
    )

    full_response = ""
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)   # Print as it arrives
            full_response += delta.content

    print()   # Newline at end
    return full_response

# FastAPI streaming endpoint with Server-Sent Events
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

def generate_stream(prompt: str):
    """Generator that yields SSE-formatted events.

    A plain (sync) generator is used because the Groq client call blocks;
    StreamingResponse iterates sync generators in a thread pool.
    """
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        content = chunk.choices[0].delta.content or ""
        if content:
            # JSON-encode so newlines in the text cannot break SSE framing
            yield f"data: {json.dumps(content)}\n\n"
    yield "data: [DONE]\n\n"

@app.post("/chat/stream")
async def chat_stream(request: ChatRequest):
    return StreamingResponse(
        generate_stream(request.message),
        media_type="text/event-stream",
    )
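
On the receiving side, a client has to split the stream back into events. The following is a minimal sketch of an SSE reader that handles only `data:` fields and the `[DONE]` sentinel used above; `parse_sse` is a hypothetical helper, and a production client would also handle `event:`, `id:`, and reconnection.

```python
def parse_sse(lines):
    """Yield the payload of each `data:` event until the [DONE] sentinel."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # SSE strips exactly one leading space after the colon,
            # so tokens that begin with a space are preserved.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        elif line == "" and buffer:
            # A blank line ends the event; multi-line data is joined with "\n".
            payload = "\n".join(buffer)
            buffer = []
            if payload == "[DONE]":
                return
            yield payload
```

`parse_sse` accepts any iterable of text lines, so it can be fed from a file, a test list, or the line iterator of an HTTP client. If the server JSON-encodes each payload, pass the yielded strings through `json.loads`.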

Prompt Engineering Patterns

System prompt — establishes the model's role, constraints, and output format:

python

SYSTEM_PROMPT = """You are a Python code reviewer. When given code:
1. Identify bugs (critical issues first)
2. Suggest improvements with specific code examples
3. Rate overall quality: poor / fair / good / excellent

Format your response as JSON:
{
  "quality": "good",
  "bugs": ["description1", "description2"],
  "improvements": ["improvement1 with code example"]
}"""

Few-shot: in-context examples show the model the format to follow, with no fine-tuning required:

python

messages = [
    {"role": "system", "content": "Classify Python errors."},
    {"role": "user",   "content": "NameError: name 'pd' is not defined"},
    {"role": "assistant", "content": "IMPORT_ERROR: pandas not imported. Fix: import pandas as pd"},
    {"role": "user",   "content": "IndexError: list index out of range"},
    {"role": "assistant", "content": "INDEX_ERROR: accessing index beyond list length. Fix: check len() before indexing"},
    {"role": "user",   "content": "RecursionError: maximum recursion depth exceeded"},  # New query
]
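
When the examples live in a list or a file, the message array can be assembled programmatically. This is a small sketch; `build_few_shot_messages` is a hypothetical helper name.

```python
def build_few_shot_messages(system: str,
                            examples: list[tuple[str, str]],
                            query: str) -> list[dict]:
    """Interleave (input, expected output) pairs as user/assistant turns."""
    messages = [{"role": "system", "content": system}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real query goes last, so the model continues the established pattern.
    messages.append({"role": "user", "content": query})
    return messages
```

Keeping examples as data makes it easy to swap them per task or to select the most relevant ones for each query.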

Chain-of-thought — prompting the model to reason step-by-step before answering improves accuracy on complex tasks:

python

cot_prompt = """
<problem>
Estimate the number of tokens in a Python module that's 400 lines long
with typical line length of 50 characters.
</problem>
<thinking>
Walk through the calculation step by step before giving a final answer.
</thinking>
"""
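
For reference, the step-by-step calculation the prompt asks for can be worked out with the 4-characters-per-token rule of thumb. This is an estimate only; since code tends to produce more tokens per character, the true count is likely somewhat higher.

```python
lines = 400
chars_per_line = 50
chars_per_token = 4  # rule of thumb for English text; an approximation

# Step 1: total characters in the module
total_chars = lines * chars_per_line

# Step 2: convert characters to tokens
estimated_tokens = total_chars // chars_per_token

print(f"{total_chars} characters is roughly {estimated_tokens} tokens")
```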

RAG Architecture

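Retrieval-Augmented Generation addresses a core LLM limitation: the model knows only what was in its training data, so it misses anything after the knowledge cutoff and anything private. Instead of relying on training data alone, RAG retrieves relevant documents at query time and places them in the prompt: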

Document corpus
    ↓ Chunking
Text chunks (512 tokens each, 50-token overlap)
    ↓ Embedding model
Vector store (FAISS, Pinecone, Chroma, pgvector)
    ↓
At query time:
Query → Embed → k-NN search → Top-k chunks → Format context → LLM → Answer

The retrieval quality determines the answer quality. Garbage in, garbage out.
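
The k-NN search step in the pipeline above reduces to comparing the query vector with each chunk vector and keeping the closest ones. Here is a brute-force sketch using cosine similarity on toy 2-dimensional vectors; real embeddings have hundreds of dimensions, and vector stores such as FAISS use indexes to avoid scanning everything. `cosine_similarity` and `top_k` are illustrative names.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]],
          k: int = 2) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```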

Text Chunking Algorithm
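
A sliding window with overlap, as in the "512 tokens each, 50-token overlap" step of the pipeline, can be sketched as follows. As a stated simplification, whitespace-separated words stand in for tokens here; a real pipeline would count tokens with the embedding model's tokenizer. `chunk_text` is an illustrative name.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words, each sharing
    `overlap` words with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing some text twice.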

TF-IDF Retrieval System
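
The lesson's interactive example is not reproduced in the text, so the following is a stand-in sketch of a TF-IDF retriever in pure Python. The class name `TfidfRetriever` is an assumption; the method names `retrieve` and `format_context` are chosen to match the calls made by `search_docs` in the tool-use section. It uses smoothed IDF and cosine similarity, which is one common variant among several.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())

class TfidfRetriever:
    def __init__(self, documents: list[str]):
        self.documents = documents
        doc_tokens = [tokenize(d) for d in documents]
        n = len(documents)
        # Document frequency: in how many documents each term appears.
        df = Counter(t for tokens in doc_tokens for t in set(tokens))
        # Smoothed IDF: rare terms get a higher weight than common ones.
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        self.doc_vectors = [self._vectorize(tokens) for tokens in doc_tokens]

    def _vectorize(self, tokens: list[str]) -> dict[str, float]:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Terms never seen in the corpus are ignored.
        return {t: (c / total) * self.idf[t]
                for t, c in counts.items() if t in self.idf}

    @staticmethod
    def _cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def retrieve(self, query: str, k: int = 3) -> list[tuple[str, float]]:
        """Return up to k (document, score) pairs, best match first."""
        query_vec = self._vectorize(tokenize(query))
        scored = [(self._cosine(query_vec, dv), i)
                  for i, dv in enumerate(self.doc_vectors)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(self.documents[i], s) for s, i in scored[:k] if s > 0]

    def format_context(self, results: list[tuple[str, float]]) -> str:
        """Number the retrieved documents for inclusion in a prompt."""
        return "\n\n".join(f"[{n}] {doc}"
                           for n, (doc, _) in enumerate(results, 1))
```

TF-IDF matches on shared words only, so it misses paraphrases that an embedding model would catch; it is still a useful keyword baseline.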

Tool Use and Function Calling

python

import json
from groq import Groq

client = Groq()

# Define tools as JSON schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "execute_python",
            "description": "Execute Python code and return the output",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {
                        "type": "string",
                        "description": "Python code to execute",
                    },
                    "timeout": {
                        "type": "integer",
                        "description": "Timeout in seconds",
                        "default": 30,
                    },
                },
                "required": ["code"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_docs",
            "description": "Search the documentation for a topic",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "top_k": {"type": "integer", "default": 3},
                },
                "required": ["query"],
            },
        },
    },
]

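# Tool implementations
import subprocess
import sys

def execute_python(code: str, timeout: int = 30) -> str:
    # WARNING: this runs model-generated code on the host; sandbox it in production
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return f"Error: execution timed out after {timeout}s"
    if result.returncode == 0:
        return result.stdout
    return f"Error: {result.stderr}"

def search_docs(query: str, top_k: int = 3) -> str:
    # `retriever` is assumed to be the retriever object built in the RAG retrieval section
    results = retriever.retrieve(query, k=top_k)
    return retriever.format_context(results)

TOOL_MAP = {"execute_python": execute_python, "search_docs": search_docs}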

The Agent Loop

python

def run_agent(user_message: str, max_iterations: int = 10) -> str:
    """
    Agent loop:
    1. LLM decides action (tool call or final answer)
    2. Execute tool
    3. Return result to LLM
    4. Repeat until LLM produces final answer
    """
    messages = [
        {"role": "system", "content": "You are a Python coding assistant. Use tools to answer questions."},
        {"role": "user",   "content": user_message},
    ]

    for iteration in range(max_iterations):
        response = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=messages,
            tools=tools,
            tool_choice="auto",     # Let model decide when to use tools
        )

        message = response.choices[0].message
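        # Store the assistant turn as plain dicts; omit tool_calls when there are none
        assistant_msg = {"role": "assistant", "content": message.content}
        if message.tool_calls:
            assistant_msg["tool_calls"] = [tc.model_dump() for tc in message.tool_calls]
        messages.append(assistant_msg)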

        # If no tool calls: model is done
        if not message.tool_calls:
            return message.content

        # Execute all requested tool calls
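        for tool_call in message.tool_calls:
            fn_name = tool_call.function.name
            print(f"[Agent] Calling {fn_name}({tool_call.function.arguments})")
            try:
                fn_args = json.loads(tool_call.function.arguments)
                result = str(TOOL_MAP[fn_name](**fn_args))
            except Exception as exc:
                # Unknown tool, malformed arguments, or a tool failure:
                # report it back so the model can correct itself
                result = f"Error: {type(exc).__name__}: {exc}"
            print(f"[Agent] Result: {result[:100]}...")

            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": result,
            })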

    return "Max iterations reached"

# Usage
answer = run_agent("What's the sum of prime numbers below 100? Show me the code.")

Agent Loop Simulation (In Browser)
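
The loop can be exercised without an API key by swapping the LLM for a scripted stand-in. In the sketch below, `fake_model` is a hypothetical stub that requests one tool call and then answers from the tool result; the message shapes are simplified and do not match the Groq response objects exactly.

```python
import json

def add(a: float, b: float) -> str:
    return str(a + b)

TOOLS = {"add": add}

def fake_model(messages: list[dict]) -> dict:
    """Stand-in for the LLM: ask for a tool once, then give a final answer."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"content": f"The result is {last['content']}."}
    return {
        "content": None,
        "tool_call": {"name": "add", "arguments": json.dumps({"a": 2, "b": 3})},
    }

def run_simulated_agent(model, user_message: str, max_iterations: int = 5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):
        reply = model(messages)                      # 1. model decides
        messages.append({"role": "assistant", **reply})
        if "tool_call" not in reply:                 # no tool call: final answer
            return reply["content"], messages
        call = reply["tool_call"]
        args = json.loads(call["arguments"])
        result = TOOLS[call["name"]](**args)         # 2. execute the tool
        messages.append({"role": "tool",             # 3. feed the result back
                         "name": call["name"], "content": result})
    return "Max iterations reached", messages

answer, transcript = run_simulated_agent(fake_model, "What is 2 + 3?")
print(answer)
for m in transcript:
    print(m["role"], "->", m.get("content"))
```

Printing the transcript shows the four-message shape of a single tool round trip: user, assistant (tool request), tool, assistant (final answer).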

Token Counter and Cost Estimation
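
A pre-flight estimate can be as simple as the sketch below. It uses the 4-characters-per-token rule of thumb, and the prices in the usage example are placeholders, not real Groq rates; check the provider's current pricing page before relying on any number. `Pricing` and `estimate_cost` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class Pricing:
    """Prices in USD per one million tokens (placeholder values only)."""
    input_per_million: float
    output_per_million: float

def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token."""
    return max(1, round(len(text) / 4))

def estimate_cost(prompt: str, expected_output_tokens: int,
                  pricing: Pricing) -> dict:
    """Estimate token usage and cost for one request before sending it."""
    input_tokens = estimate_tokens(prompt)
    cost = (input_tokens * pricing.input_per_million
            + expected_output_tokens * pricing.output_per_million) / 1_000_000
    return {
        "input_tokens": input_tokens,
        "output_tokens": expected_output_tokens,
        "cost_usd": cost,
    }

# Usage with made-up prices: $1 per million input tokens, $2 per million output
report = estimate_cost("x" * 4000, expected_output_tokens=500,
                       pricing=Pricing(1.0, 2.0))
print(report)
```

After a real call, compare the estimate with `response.usage.total_tokens` to see how far the rule of thumb drifts for your content.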

Complete Groq Streaming Chat

python

# Python backend for the Next.js route app/api/ai/chat:
# a FastAPI endpoint that proxies to Groq with streaming
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from groq import Groq
import json

app = FastAPI()
client = Groq()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: list[Message]
    model: str = "llama-3.3-70b-versatile"
    temperature: float = 0.7
    max_tokens: int = 1024

def stream_chat(request: ChatRequest):
    # Sync generator: the Groq client blocks, and StreamingResponse
    # iterates sync generators in a thread pool
    stream = client.chat.completions.create(
        model=request.model,
        messages=[m.model_dump() for m in request.messages],
        temperature=request.temperature,
        max_tokens=request.max_tokens,
        stream=True,
    )

    for chunk in stream:
        content = chunk.choices[0].delta.content or ""
        if content:
            # OpenAI-compatible SSE format
            data = json.dumps({"choices": [{"delta": {"content": content}}]})
            yield f"data: {data}\n\n"

    # Always terminate the stream, whatever the finish_reason ("stop", "length", ...)
    yield "data: [DONE]\n\n"

@app.post("/api/chat")
async def chat_endpoint(request: ChatRequest):
    return StreamingResponse(
        stream_chat(request),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",    # Disable Nginx buffering
        },
    )

PROJECT: Mini RAG System
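
The project code is not reproduced in the text, so here is a compact stand-in that wires the whole pipeline together: split, retrieve, build the prompt, call the model. To stay short it makes deliberate simplifications, all of them assumptions: sentences serve as chunks, retrieval uses Jaccard word overlap instead of embeddings, and `llm` is any callable that maps a prompt string to an answer string (a real version would call the Groq client there).

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter used as the chunking step."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by Jaccard overlap with the query; drop non-matches."""
    q = words(query)
    scored = []
    for chunk in chunks:
        c = words(chunk)
        union = q | c
        score = len(q & c) / len(union) if union else 0.0
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Format retrieved chunks as numbered context ahead of the question."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, 1))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def answer(query: str, corpus: str, llm, k: int = 2) -> str:
    chunks = split_sentences(corpus)
    return llm(build_prompt(query, retrieve(query, chunks, k)))

# Usage with an echo stub in place of a real model call
corpus = ("FAISS is a library for vector search. "
          "Pydantic validates data. "
          "Streaming improves perceived latency.")
print(answer("What is FAISS used for?", corpus, llm=lambda prompt: prompt))
```

Swapping in token-based chunking, an embedding model, and a real LLM call upgrades each stage independently without changing the overall shape.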

Guardrails

python

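import re
from pydantic import BaseModel, field_validator

class GuardrailedChat(BaseModel):
    message: str
    user_id: str

    # Pydantic v2 API (matches model_dump() used elsewhere); on v1 use @validator
    @field_validator("message")
    @classmethod
    def check_length(cls, v: str) -> str:
        if len(v) > 10000:
            raise ValueError("Message too long")
        return v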

# PII detection (basic — use presidio or similar in production)
PII_PATTERNS = {
    "email":       r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "phone":       r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",
    "ssn":         r"\b\d{3}-\d{2}-\d{4}\b",
    "credit_card": r"\b(?:\d{4}[-\s]?){3}\d{4}\b",
}

def detect_pii(text: str) -> dict[str, list[str]]:
    found = {}
    for pii_type, pattern in PII_PATTERNS.items():
        matches = re.findall(pattern, text)
        if matches:
            found[pii_type] = matches
    return found

def sanitize_pii(text: str) -> str:
    for pii_type, pattern in PII_PATTERNS.items():
        replacement = f"[REDACTED_{pii_type.upper()}]"
        text = re.sub(pattern, replacement, text)
    return text

# Output validation
def validate_json_output(text: str, required_keys: list[str]) -> bool:
    import json
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    # Valid JSON can still be a list or a scalar; require an object
    return isinstance(data, dict) and all(k in data for k in required_keys)
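
Output validation is most useful when a failed check triggers a retry. The sketch below shows one possible retry loop; `call_model` is an assumed callable that takes a prompt string and returns the raw model reply, and `generate_validated` is a hypothetical helper name.

```python
import json
from typing import Callable

def generate_validated(call_model: Callable[[str], str], prompt: str,
                       required_keys: list[str], max_attempts: int = 3) -> dict:
    """Call the model until it returns a JSON object with all required keys."""
    last_error = "no attempts made"
    current_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc.msg}"
        else:
            if not isinstance(data, dict):
                last_error = "top-level value is not an object"
            else:
                missing = [k for k in required_keys if k not in data]
                if not missing:
                    return data
                last_error = f"missing keys: {missing}"
        # Tell the model what went wrong so the next attempt can fix it.
        current_prompt = (
            f"{prompt}\n\nYour previous reply was rejected ({last_error}). "
            "Reply with valid JSON only."
        )
    raise ValueError(f"No valid output after {max_attempts} attempts: {last_error}")
```

Capping the attempts keeps a persistently malformed reply from turning into an unbounded loop of paid API calls.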

What to Build Next

You now have the foundation to build serious AI-powered applications. Here are high-value projects that cement this knowledge:

Beginner-to-intermediate:

- AI code reviewer: FastAPI endpoint + Groq + streaming. Review Python files, identify bugs, suggest improvements. Add a VS Code extension as the frontend.
- Document Q&A: RAG pipeline over your own PDF library. PyMuPDF for extraction, sentence-transformers for embeddings, FAISS for vector search.
- Semantic search engine: Replace keyword search on any dataset with embedding-based search. Build a demo on the scikit-learn dataset docs.

Intermediate-to-advanced:

- Auto-grading system: Upload homework, grade against a rubric, provide feedback. Use structured outputs (JSON mode) so grades come back in a consistent, machine-parseable form.
- AI data analyst: Upload a CSV, ask questions in natural language, the agent writes and executes Pandas code, returns visualizations.
- Multi-agent pipeline: Research agent gathers information, writing agent synthesizes it, critic agent reviews and requests revisions.

Learning path:

- LangChain / LlamaIndex: Higher-level RAG frameworks. Learn after you understand the primitives (which you now do).
- DSPy: Programmatic LLM optimization that replaces hand-written prompts with optimized ones.
- vLLM / Ollama: Run models locally. Critical for privacy-sensitive applications.
- RLHF and fine-tuning: Once you have labeled data, fine-tuning can beat prompt engineering for specialized tasks.
- Evals: Learn to evaluate LLM outputs systematically. RAGAS for RAG, custom harnesses for structured tasks.

Key Takeaways

- LLMs predict token probabilities: temperature controls randomness, top-p limits the candidate pool, and neither makes the model more capable
- Stream responses for all user-facing applications: time to first token matters more to perceived responsiveness than total latency
- RAG quality depends on chunking and retrieval: combine multiple retrieval strategies (semantic + keyword) and rerank results
- The agent loop is simple: LLM responds → if tool call, execute and return result → repeat; complexity comes from error handling and state management
- Count tokens before making API calls to prevent budget surprises; estimate with the 4-chars-per-token rule
- Guardrails (input validation, PII detection, output schema enforcement) belong in every production AI application
- Structured outputs (JSON mode + Pydantic validation) make LLM responses reliable enough to parse programmatically
- Cost optimization order: smaller model first, then caching, then prompt compression, then fine-tuning; never start with fine-tuning
- RAG beats fine-tuning for knowledge that changes; fine-tuning beats RAG for style, format, and domain-specific behavior
- The skills from this course (data structures, OOP, async, ML, profiling, API design) are the foundation every AI engineer needs
