Large Language Models are probability distributions over token sequences. Given a prefix, they predict the most likely continuation — sampled rather than deterministic, which is why they can be creative.
Key parameters you control:
| Parameter | Effect | Typical range |
|---|---|---|
| temperature | Randomness. 0 = deterministic, 2 = chaotic | 0.0 – 1.0 |
| top_p | Nucleus sampling: only consider top-p probability mass | 0.9 – 1.0 |
| max_tokens | Hard cap on output length | 100 – 8192 |
| stop | Stop sequences — model halts when it generates these | e.g. ["\n\n"] |
Tokens are not words. The rule of thumb: 1 token ≈ 0.75 English words, or roughly 4 characters. The word "tokenization" is typically 3-4 tokens. Code is denser — more tokens per character.
Context window is the maximum combined length of prompt + response. Models cannot access information outside their context window during inference. For llama-3.3-70b-versatile on Groq: 128K tokens.
Groq API Integration
Groq uses the OpenAI-compatible SDK. Your existing OpenAI code runs against Groq with two line changes:
python
from groq import Groqclient = Groq(api_key="your-groq-api-key") # or reads GROQ_API_KEY from env# Basic completionresponse = client.chat.completions.create( model="llama-3.3-70b-versatile", messages=[ {"role": "system", "content": "You are an expert Python engineer."}, {"role": "user", "content": "Explain Python's GIL in 3 sentences."}, ], temperature=0.7, max_tokens=500,)print(response.choices[0].message.content)print(f"Tokens used: {response.usage.total_tokens}")
Streaming Responses
For long responses, streaming dramatically improves perceived latency — the user sees text appear token-by-token instead of waiting for the full response:
python
from groq import Groqclient = Groq()def stream_completion(prompt: str, system: str = "You are a helpful assistant."): stream = client.chat.completions.create( model="llama-3.3-70b-versatile", messages=[ {"role": "system", "content": system}, {"role": "user", "content": prompt}, ], stream=True, # Key change: enable streaming temperature=0.7, ) full_response = "" for chunk in stream: delta = chunk.choices[0].delta if delta.content: print(delta.content, end="", flush=True) # Print as it arrives full_response += delta.content print() # Newline at end return full_response# FastAPI streaming endpoint with Server-Sent Eventsfrom fastapi import FastAPIfrom fastapi.responses import StreamingResponseimport asyncioapp = FastAPI()async def generate_stream(prompt: str): """Async generator that yields SSE-formatted events.""" stream = client.chat.completions.create( model="llama-3.3-70b-versatile", messages=[{"role": "user", "content": prompt}], stream=True, ) for chunk in stream: content = chunk.choices[0].delta.content or "" if content: yield f"data: {content}\n\n" yield "data: [DONE]\n\n"@app.post("/chat/stream")async def chat_stream(request: ChatRequest): return StreamingResponse( generate_stream(request.message), media_type="text/event-stream", )
Prompt Engineering Patterns
System prompt — establishes the model's role, constraints, and output format:
python
SYSTEM_PROMPT = """You are a Python code reviewer. When given code:1. Identify bugs (critical issues first)2. Suggest improvements with specific code examples3. Rate overall quality: poor / fair / good / excellentFormat your response as JSON:{ "quality": "good", "bugs": ["description1", "description2"], "improvements": ["improvement1 with code example"]}"""
Few-shot — examples train the model to follow a specific format without fine-tuning:
python
messages = [ {"role": "system", "content": "Classify Python errors."}, {"role": "user", "content": "NameError: name 'pd' is not defined"}, {"role": "assistant", "content": "IMPORT_ERROR: pandas not imported. Fix: import pandas as pd"}, {"role": "user", "content": "IndexError: list index out of range"}, {"role": "assistant", "content": "INDEX_ERROR: accessing index beyond list length. Fix: check len() before indexing"}, {"role": "user", "content": "RecursionError: maximum recursion depth exceeded"}, # New query]
Chain-of-thought — prompting the model to reason step-by-step before answering improves accuracy on complex tasks:
python
cot_prompt = """<problem>Estimate the number of tokens in a Python module that's 400 lines longwith typical line length of 50 characters.</problem><thinking>Walk through the calculation step by step before giving a final answer.</thinking>"""
RAG Architecture
Retrieval-Augmented Generation solves LLMs' core limitation: knowledge cutoff. Instead of relying on training data, RAG retrieves relevant documents at query time:
def run_agent(user_message: str, max_iterations: int = 10) -> str: """ Agent loop: 1. LLM decides action (tool call or final answer) 2. Execute tool 3. Return result to LLM 4. Repeat until LLM produces final answer """ messages = [ {"role": "system", "content": "You are a Python coding assistant. Use tools to answer questions."}, {"role": "user", "content": user_message}, ] for iteration in range(max_iterations): response = client.chat.completions.create( model="llama-3.3-70b-versatile", messages=messages, tools=tools, tool_choice="auto", # Let model decide when to use tools ) message = response.choices[0].message messages.append({"role": "assistant", "content": message.content, "tool_calls": message.tool_calls}) # If no tool calls: model is done if not message.tool_calls: return message.content # Execute all requested tool calls for tool_call in message.tool_calls: fn_name = tool_call.function.name fn_args = json.loads(tool_call.function.arguments) print(f"[Agent] Calling {fn_name}({fn_args})") result = TOOL_MAP[fn_name](**fn_args) print(f"[Agent] Result: {result[:100]}...") messages.append({ "role": "tool", "tool_call_id": tool_call.id, "content": result, }) return "Max iterations reached"# Usageanswer = run_agent("What's the sum of prime numbers below 100? Show me the code.")
Agent Loop Simulation (In Browser)
Agent Loop Simulation
Click Run to execute — Python runs in your browser via WebAssembly
Token Counter and Cost Estimation
Token Counter and Cost Estimator
Click Run to execute — Python runs in your browser via WebAssembly
Complete Groq Streaming Chat
python
# app/api/ai/chat/route.py (Next.js API Route using Python backend)# Python FastAPI endpoint that proxies to Groq with streamingfrom fastapi import FastAPIfrom fastapi.responses import StreamingResponsefrom pydantic import BaseModelfrom groq import Groqimport jsonapp = FastAPI()client = Groq()class Message(BaseModel): role: str content: strclass ChatRequest(BaseModel): messages: list[Message] model: str = "llama-3.3-70b-versatile" temperature: float = 0.7 max_tokens: int = 1024async def stream_chat(request: ChatRequest): stream = client.chat.completions.create( model=request.model, messages=[m.model_dump() for m in request.messages], temperature=request.temperature, max_tokens=request.max_tokens, stream=True, ) for chunk in stream: content = chunk.choices[0].delta.content or "" finish = chunk.choices[0].finish_reason if content: # OpenAI-compatible SSE format data = json.dumps({"choices": [{"delta": {"content": content}}]}) yield f"data: {data}\n\n" if finish == "stop": yield "data: [DONE]\n\n" return@app.post("/api/chat")async def chat_endpoint(request: ChatRequest): return StreamingResponse( stream_chat(request), media_type="text/event-stream", headers={ "Cache-Control": "no-cache", "X-Accel-Buffering": "no", # Disable Nginx buffering }, )
PROJECT: Mini RAG System
Mini RAG System — Full Project
Click Run to execute — Python runs in your browser via WebAssembly
Guardrails
python
import refrom pydantic import BaseModel, validatorclass GuardrailedChat(BaseModel): message: str user_id: str @validator("message") def check_length(cls, v): if len(v) > 10000: raise ValueError("Message too long") return v# PII detection (basic — use presidio or similar in production)PII_PATTERNS = { "email": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "phone": r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "ssn": r"\b\d{3}-\d{2}-\d{4}\b", "credit_card": r"\b(?:\d{4}[-\s]?){3}\d{4}\b",}def detect_pii(text: str) -> dict[str, list[str]]: found = {} for pii_type, pattern in PII_PATTERNS.items(): matches = re.findall(pattern, text) if matches: found[pii_type] = matches return founddef sanitize_pii(text: str) -> str: for pii_type, pattern in PII_PATTERNS.items(): replacement = f"[REDACTED_{pii_type.upper()}]" text = re.sub(pattern, replacement, text) return text# Output validationdef validate_json_output(text: str, required_keys: list[str]) -> bool: import json try: data = json.loads(text) return all(k in data for k in required_keys) except json.JSONDecodeError: return False
What to Build Next
You now have the foundation to build serious AI-powered applications. Here are high-value projects that cement this knowledge:
Beginner-to-intermediate:
AI code reviewer: FastAPI endpoint + Groq + streaming. Review Python files, identify bugs, suggest improvements. Add a VS Code extension as the frontend.
Document Q&A: RAG pipeline over your own PDF library. PyMuPDF for extraction, sentence-transformers for embeddings, FAISS for vector search.
Semantic search engine: Replace keyword search on any dataset with embedding-based search. Build a demo on the scikit-learn dataset docs.
Intermediate-to-advanced:
Auto-grading system: Upload homework, grade against a rubric, provide feedback. Use structured outputs (JSON mode) for deterministic grading.
AI data analyst: Upload a CSV, ask questions in natural language, the agent writes and executes Pandas code, returns visualizations.
Multi-agent pipeline: Research agent gathers information, writing agent synthesizes it, critic agent reviews and requests revisions.
Learning path:
LangChain / LlamaIndex: Higher-level RAG frameworks. Learn after you understand the primitives (which you now do).