NLP & Transformers — From Tokenization to Fine-Tuning
32 min
The Text Pipeline
Every NLP system starts with raw text and ends with numbers. The pipeline is always:
Raw text → Normalize → Tokenize → Vectorize → Model → Decode
Understanding each step matters because each introduces tradeoffs: information loss, computational cost, and sensitivity to domain.
Text Preprocessing
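The normalize → tokenize steps can be sketched in plain Python. This is a minimal illustration; the stopword set here is a small hand-picked subset, not NLTK's full list:

```python
import re

# Illustrative stopword subset (real pipelines use NLTK's or spaCy's full lists)
STOPWORDS = {"the", "a", "an", "is", "are", "and", "or", "of", "to", "in"}

def preprocess(text):
    """Normalize case, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()                        # normalize case
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # strip punctuation
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The quick brown fox is jumping over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumping', 'over', 'lazy', 'dog']
```

Each step loses information (case, punctuation, function words), which is exactly the tradeoff the pipeline section warns about.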
Bag of Words and TF-IDF
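To make the weighting concrete, here is a from-scratch sketch of TF-IDF using the basic idf = log(N / df) variant (library implementations such as scikit-learn's TfidfVectorizer differ in smoothing and normalization details):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

N = len(docs)
# document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    return {term: (count / len(doc)) * math.log(N / df[term])
            for term, count in tf.items()}

weights = tfidf(docs[0])
# 'cat' and 'mat' appear only in this document and score highest;
# 'the' has high tf but a low idf, so it is down-weighted
print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

Rare, document-specific words rise to the top while common words sink — the whole point of the idf term.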
Word Embeddings — Intuition
Before transformers, Word2Vec (2013) was the standard way to represent words as dense vectors. The key insight: words that appear in similar contexts should have similar vectors.
- king - man + woman ≈ queen (the famous word analogy)
- paris - france + italy ≈ rome (geographic analogy)
This emerges from training on billions of words with a simple self-supervised objective: predict neighboring words.
GloVe extends this by factorizing the word co-occurrence matrix. Both produce vectors where semantic relationships become geometric relationships.
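The analogy arithmetic can be sketched with toy vectors. The 3-dimensional embeddings below are invented for illustration; real Word2Vec/GloVe vectors have 100–300 dimensions learned from data:

```python
import math

# Toy embeddings, hand-crafted so "royalty" and "gender" are separate directions
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.8, 0.9],
    "man":   [0.1, 0.2, 0.1],
    "woman": [0.1, 0.2, 0.9],
    "apple": [0.5, 0.0, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# king - man + woman, component-wise
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]

# nearest neighbor by cosine similarity, excluding the query words themselves
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

Excluding the query words is standard practice: the nearest neighbor of king - man + woman is usually king itself.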
The critical limitation: one vector per word. "Bank" (financial) and "bank" (river) get the same embedding. Transformers solve this with contextual embeddings.
The Attention Mechanism
Attention lets each token directly access information from every other token in the sequence:
```python
import math
import torch

# Scaled dot-product attention (the core operation in every transformer)
def attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # scale by sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)   # block disallowed positions
    weights = torch.softmax(scores, dim=-1)            # probability distribution over keys
    return weights @ V, weights                        # weighted sum of values, plus the weights
```
Interpretation: for each token (Q), compute how much it should attend to every other token (K), then retrieve a weighted mixture of their values (V).
Multi-head attention runs this in parallel with different learned projections, letting the model attend to different types of relationships simultaneously.
Positional encoding injects position information since attention is order-invariant. The original transformer uses sinusoidal functions; modern models learn position embeddings.
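The sinusoidal scheme maps position pos and even dimension 2i to sin(pos / 10000^(2i/d_model)), with cos for the odd dimension alongside. A minimal sketch:

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):          # i steps over even dimensions
            freq = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(freq)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(freq)
    return pe

pe = positional_encoding(seq_len=50, d_model=8)
print(pe[0][:4])  # position 0 encodes as [sin 0, cos 0, ...] = [0.0, 1.0, 0.0, 1.0]
```

Each dimension oscillates at a different frequency, so every position gets a unique, bounded signature that the model can add to token embeddings.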
BERT vs GPT:
| Property | BERT | GPT |
|---|---|---|
| Direction | Bidirectional | Left-to-right only |
| Training | Masked LM + NSP | Next token prediction |
| Best for | Classification, NER, QA | Generation, completion |
| Key idea | See full context | Autoregressive |
Tokenization Deep Dive
Modern tokenizers operate on subword units, not whole words:
- BPE (Byte-Pair Encoding): merge frequent byte pairs iteratively. Used by GPT, RoBERTa.
- WordPiece: similar to BPE but picks merges that maximize language-model likelihood. Used by BERT.
- SentencePiece: language-agnostic, works on raw text. Used by T5, LLaMA.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The quick brown fox jumped over the lazy dogs."
tokens = tokenizer(text, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(tokens.input_ids[0]))
# ['[CLS]', 'the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dogs', '.', '[SEP]']

# Subword splitting for unknown/rare words
rare_text = "Pneumonoultramicroscopicsilicovolcanoconiosis is challenging."
rare_tokens = tokenizer.tokenize(rare_text)
print(rare_tokens)
# ['pneumon', '##oul', '##tra', '##micro', '##scop', '##ics', '##ili', ...]
```
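The BPE merge loop itself is simple enough to sketch from scratch. This is a toy version on a classic example corpus; real tokenizers operate on bytes, record merges in learned order, and handle word boundaries more carefully:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of space-separated symbol strings."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Fuse every occurrence of the pair into one symbol (string replace is fine for this toy)."""
    merged, joined = " ".join(pair), "".join(pair)
    return {word.replace(merged, joined): freq for word, freq in words.items()}

# Toy corpus: words pre-split into characters, with corpus counts
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair)
# Frequent fragments such as 'es' and 'est' quickly become single tokens
```

Run for thousands of merges on billions of words and the symbol inventory becomes the subword vocabulary.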
Special tokens encode structural information:
- [CLS] — classification token (BERT); its final-layer representation is used for classification
- [SEP] — separates sequences in BERT
- [PAD] — padding to equalize sequence lengths
- [MASK] — masked token in BERT pre-training
TF-IDF Sentiment Classifier
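A dependency-free sketch of the same idea: TF-IDF features fed to a logistic regression trained by plain gradient descent. The six-sentence corpus and hyperparameters are illustrative; in practice use scikit-learn's TfidfVectorizer and LogisticRegression:

```python
import math
from collections import Counter

# Tiny illustrative corpus (label 1 = positive, 0 = negative)
train = [
    ("great movie loved it", 1), ("wonderful acting great plot", 1),
    ("loved the wonderful ending", 1), ("terrible movie hated it", 0),
    ("awful acting terrible plot", 0), ("hated the awful ending", 0),
]

docs = [text.split() for text, _ in train]
vocab = sorted({t for doc in docs for t in doc})
idx = {t: i for i, t in enumerate(vocab)}
N = len(docs)
df = Counter(t for doc in docs for t in set(doc))

def vectorize(tokens):
    """TF-IDF vector; tokens outside the training vocabulary are ignored."""
    v = [0.0] * len(vocab)
    known = [t for t in tokens if t in idx]
    for t, c in Counter(known).items():
        v[idx[t]] = (c / len(known)) * math.log(N / df[t])
    return v

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logistic regression via per-example gradient descent on log loss
X = [vectorize(doc) for doc in docs]
y = [label for _, label in train]
w, b, lr = [0.0] * len(vocab), 0.0, 1.0
for _ in range(200):
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
        b -= lr * err

def predict(text):
    z = sum(wj * xj for wj, xj in zip(w, vectorize(text.split()))) + b
    return "positive" if sigmoid(z) >= 0.5 else "negative"

print(predict("loved the great plot"))       # positive
print(predict("hated the terrible acting"))  # negative
```

Words like "loved" and "great" appear only in positive examples, so their learned weights become strongly positive — this is the baseline the Key Takeaways recommend building first.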
Naive Bayes Multi-Class Text Classifier
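A from-scratch sketch of multinomial Naive Bayes with Laplace (add-one) smoothing. The three-class support-ticket corpus is illustrative; scikit-learn's MultinomialNB is the production route:

```python
import math
from collections import Counter, defaultdict

train = [
    ("the model crashed with an error", "technical"),
    ("server error and stack trace", "technical"),
    ("payment declined on my card", "billing"),
    ("refund my card charge", "billing"),
    ("love the new interface design", "feedback"),
    ("the design looks great", "feedback"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)        # per-class word frequencies
vocab = set()
for text, label in train:
    for token in text.split():
        word_counts[label][token] += 1
        vocab.add(token)

def predict(text):
    scores = {}
    for label in class_counts:
        # log prior + sum of smoothed log likelihoods
        log_prob = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for token in text.split():
            log_prob += math.log((word_counts[label][token] + 1) /
                                 (total + len(vocab)))
        scores[label] = log_prob
    return max(scores, key=scores.get)

print(predict("card payment error"))  # billing
```

Working in log space avoids underflow, and the +1 in every numerator ensures unseen words never zero out a class.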
Fine-Tuning BERT with HuggingFace
This code runs in a Python environment with PyTorch and HuggingFace libraries installed. Before committing to full fine-tuning, try sentence-transformers: pre-trained sentence embeddings often solve similarity and retrieval tasks with no training at all:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 80MB, fast, excellent quality

docs = [
    "Python is a high-level programming language",
    "Machine learning requires lots of data",
    "Neural networks are inspired by the brain",
    "The stock market crashed yesterday",
    "Interest rates affect mortgage payments",
]
doc_embeddings = model.encode(docs, normalize_embeddings=True)

query = "deep learning with neural nets"
query_embedding = model.encode(query, normalize_embeddings=True)

# Cosine similarity is just dot product when vectors are normalized
similarities = doc_embeddings @ query_embedding

print("Semantic similarity to:", query)
for doc, sim in sorted(zip(docs, similarities), key=lambda x: -x[1]):
    print(f"  {sim:.4f}: {doc}")
# Output correctly ranks neural networks > ML > Python > finance docs
```
PROJECT: Sentiment Analyzer
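As a starting skeleton for the project, here is a lexicon-based scorer with simple negation handling. The word lists are small illustrative samples; a trained classifier (TF-IDF + logistic regression, or a fine-tuned transformer) is the natural upgrade:

```python
POSITIVE = {"good", "great", "excellent", "loved", "amazing", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "hated", "boring", "weak"}
NEGATORS = {"not", "never", "no"}

def sentiment(text):
    """Score text by counting lexicon hits; a preceding negator flips polarity."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    score = 0
    for i, token in enumerate(tokens):
        polarity = 1 if token in POSITIVE else -1 if token in NEGATIVE else 0
        if polarity and i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity          # "not good" counts as negative
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The visuals were great, but the plot was not good."))  # neutral
```

The mixed-sentiment example lands on neutral (+1 for "great", -1 for negated "good"), which shows both the method's appeal and its bluntness — a good baseline to beat with the classifiers above.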
Prompt Engineering for Generative Models
```python
from groq import Groq

client = Groq()

# ── System prompt pattern ─────────────────────────────────────────────
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system",
         "content": "You are a sentiment analysis expert. Analyze text and return JSON only."},
        {"role": "user",
         "content": "Analyze: 'The film had stunning visuals but a weak plot.'"},
    ],
    response_format={"type": "json_object"},
)

# ── Few-shot pattern ──────────────────────────────────────────────────
few_shot = [
    {"role": "user", "content": "Classify: 'The model crashed.' → "},
    {"role": "assistant", "content": "TECHNICAL_ISSUE"},
    {"role": "user", "content": "Classify: 'Payment declined.' → "},
    {"role": "assistant", "content": "BILLING_ISSUE"},
    {"role": "user", "content": "Classify: 'Cannot login to account.' → "},
]

# ── Chain-of-thought ──────────────────────────────────────────────────
cot_prompt = """<task>Classify the customer support ticket below.</task>

<instructions>
1. Identify the core problem described
2. Consider which department handles this type of issue
3. Rate urgency from 1-5
4. Output JSON: {"category": "...", "urgency": N, "summary": "..."}
</instructions>

<ticket>My API key stopped working and I have a demo in 2 hours!</ticket>"""
```
Key Takeaways
- TF-IDF + logistic regression remains competitive for short-text classification; always establish this baseline before reaching for transformers
- BERT-style models produce contextual embeddings — the same word gets different vectors in different sentences, which resolves polysemy
- Byte-level BPE makes out-of-vocabulary tokens impossible — every byte sequence can be tokenized; WordPiece still falls back to [UNK] for characters missing from its vocabulary
- Fine-tuning typically needs only 2-5 epochs; more risks catastrophic forgetting of pre-trained knowledge
- The [CLS] token's final-layer embedding is the standard representation for classification tasks
- Cosine similarity between normalized embeddings is equivalent to dot product — normalize once, then use matrix multiplication for bulk similarity search
- Use sentence-transformers for semantic similarity; avoid full fine-tuning of BERT unless you have >10k labeled examples
- Prompt engineering (system prompts, few-shot examples, chain-of-thought) can substitute for fine-tuning when labeled data is scarce