# Embedding Models
An embedding model converts text into a dense vector where semantic similarity becomes geometric proximity. The choice of model determines the ceiling of your retrieval quality — no amount of clever chunking or re-ranking recovers from a poorly calibrated embedding space. This lesson covers how embedding models are trained, how to select one from public benchmarks, and when to fine-tune.
## How Embedding Models Learn
Modern embedding models are trained with contrastive learning on (query, positive passage, negative passages) triplets: the model is rewarded for scoring the positive passage above every negative.
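The standard objective is an InfoNCE-style loss: score the positive and the negatives against the query, then apply softmax cross-entropy with the positive as the target class. A minimal numpy sketch (the function name and the default temperature value are illustrative, not from any particular library):

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE loss for one (query, positive, negatives) triplet.
    All vectors are assumed L2-normalised, so a dot product is a cosine similarity."""
    candidates = np.vstack([positive, negatives])   # positive passage sits at row 0
    logits = candidates @ query / temperature       # similarity scores, sharpened
    logits -= logits.max()                          # stabilise the softmax numerically
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                        # cross-entropy with target = row 0
```

Training drives `probs[0]` toward 1: the positive's similarity to the query rises while the negatives' similarities fall.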
The temperature parameter controls sharpness: lower values push the model to produce more discriminative embeddings.
## MTEB Benchmark Comparison
The Massive Text Embedding Benchmark (MTEB) evaluates models across 56 tasks. For RAG, focus on the Retrieval category:
| Model | MTEB Retrieval (avg NDCG@10) | Dimensions | Max tokens | License |
|---|---|---|---|---|
| bge-large-en-v1.5 | 54.29 | 1024 | 512 | MIT |
| text-embedding-3-large | 55.44 | 3072 | 8191 | Commercial |
| e5-large-v2 | 50.56 | 1024 | 512 | MIT |
| all-MiniLM-L6-v2 | 40.69 | 384 | 256 | Apache 2.0 |
| nomic-embed-text-v1 | 53.61 | 768 | 8192 | Apache 2.0 |
For most production systems, bge-large-en-v1.5 or nomic-embed-text-v1 give excellent quality-to-cost ratios. Use all-MiniLM-L6-v2 for prototyping only.
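Dimensionality drives serving cost as well as quality: raw float32 vector storage is roughly `num_vectors × dimensions × 4` bytes. A quick back-of-envelope comparison for one million chunks (raw vector storage only; ANN index structures such as HNSW add overhead on top):

```python
# Embedding dimensions from the table above.
DIMS = {"all-MiniLM-L6-v2": 384, "nomic-embed-text-v1": 768,
        "bge-large-en-v1.5": 1024, "text-embedding-3-large": 3072}

def index_bytes(n_vectors, dims, bytes_per_float=4):
    # Raw float32 storage only; index structures add overhead on top of this.
    return n_vectors * dims * bytes_per_float

for name, d in DIMS.items():
    gib = index_bytes(1_000_000, d) / 2**30
    print(f"{name}: {gib:.2f} GiB per 1M vectors")
```

At 3072 dimensions, text-embedding-3-large needs eight times the storage (and eight times the dot-product work per similarity) of all-MiniLM-L6-v2, which is part of why the smaller open models are attractive in production.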
## Using sentence-transformers
Always normalise embeddings to unit length before computing similarity: for unit vectors, cosine similarity reduces to a plain dot product, which is cheaper to compute than the full cosine formula.
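With sentence-transformers this is a single flag: `model.encode(texts, normalize_embeddings=True)` returns unit-length vectors directly. The equivalence itself is easy to verify with plain numpy:

```python
import numpy as np

def normalize(v):
    # Scale to unit L2 norm so that a dot product equals cosine similarity.
    return v / np.linalg.norm(v)

a, b = np.array([3.0, 4.0]), np.array([1.0, 2.0])
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_normalized = normalize(a) @ normalize(b)
assert np.isclose(cosine, dot_of_normalized)
```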
## Domain Fine-Tuning with sentence-transformers
When the pre-trained model underperforms on your domain, fine-tune with your own (query, relevant passage) pairs.
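sentence-transformers trains on such pairs with its `MultipleNegativesRankingLoss`: within a batch, each query's positive passage doubles as a negative for every other query. The core computation can be sketched in numpy; this is an illustrative reimplementation of the idea (with an assumed similarity scale factor), not the library's code, and a real run would use the library's training loop rather than this toy function:

```python
import numpy as np

def in_batch_negatives_loss(queries, passages, scale=20.0):
    """Mean cross-entropy over a batch of (query, positive passage) pairs.
    queries, passages: (batch, dim) L2-normalised embeddings; passages[i] is
    the positive for queries[i], and every other row acts as a negative."""
    scores = scale * queries @ passages.T            # (batch, batch) similarity matrix
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return -np.mean(np.log(np.diag(probs)))          # targets lie on the diagonal
```

Because the rest of the batch supplies the negatives for free, you only need to collect (query, positive) pairs, not explicit negatives.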
You need at least 500–1000 high-quality pairs for fine-tuning to improve over the base model. Mine them from user search logs: queries that led to clicks are positive pairs.
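A minimal mining pass over such a log might look like this (the log format and field names here are hypothetical): normalise the query text, count repeated clicks, and keep only pairs clicked often enough to be trustworthy.

```python
from collections import Counter

def mine_pairs(click_log, min_clicks=2):
    """Turn (query, clicked_doc_id) events into deduplicated training pairs.
    Requiring a minimum click count filters out accidental one-off clicks."""
    counts = Counter((query.strip().lower(), doc) for query, doc in click_log)
    return [pair for pair, n in counts.items() if n >= min_clicks]

log = [("reset password", "doc_17"), ("reset password", "doc_17"),
       ("reset password", "doc_03"),   # single click: likely noise
       ("Reset Password", "doc_17")]   # case-folds into the first pair
pairs = mine_pairs(log)                # [("reset password", "doc_17")]
```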
## Summary
- Embedding models learn via contrastive loss: minimise the distance to positive passages and maximise distance to negatives.
- Use MTEB Retrieval scores to compare models objectively; do not rely on model marketing claims.
- `bge-large-en-v1.5` and `nomic-embed-text-v1` are strong open-source choices; `text-embedding-3-large` leads on benchmarks but requires a commercial API.
- Always normalise embeddings to unit length before computing similarity; this makes cosine similarity equivalent to dot product.
- Fine-tune on domain-specific (query, passage) pairs when general-purpose models underperform; source pairs from click logs or expert annotations.