Every LLM has a knowledge cutoff. Ask Claude about something that happened last week, and it won't know. Ask it about your company's internal API documentation, and it definitely won't know. RAG (Retrieval-Augmented Generation) is the solution — and it's the most widely deployed AI pattern in production systems today.
The Problem RAG Solves
LLMs are trained on a static snapshot of the internet. They're powerful reasoning engines, but their knowledge is frozen at training time. For most real-world applications, this is a problem:
- Your support chatbot needs to know about last week's product update
- Your code assistant needs to understand your internal codebase
- Your research agent needs to process documents you uploaded today
You have three options: fine-tuning (expensive, slow, requires ML expertise), prompt stuffing (limited by context window), or RAG (fast, cheap, and surprisingly effective).
How RAG Works
RAG has two phases:
Phase 1: Indexing (done once)
- Take your source documents (PDFs, docs, code files, database records)
- Split them into chunks (typically 500–1,500 tokens each)
- Run each chunk through an embedding model to get a numerical vector
- Store those vectors in a vector database alongside the original text
Phase 2: Retrieval + Generation (done per query)
- Take the user's question
- Embed it using the same embedding model
- Search the vector database for the most similar chunks
- Inject those chunks into the LLM prompt as context
- Let the LLM generate an answer grounded in the retrieved content
The magic is in step 3 of the query phase: semantic similarity search. Instead of keyword matching ("find documents that contain the word 'authentication'"), vector search finds documents that are conceptually similar to the question, even if they use different words.
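Under the hood, "similar" usually means cosine similarity between embedding vectors. Here's a toy sketch — the 3-dimensional vectors are made up for illustration; a real model like all-MiniLM-L6-v2 produces 384-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (illustrative values, not from a real model)
login_doc  = np.array([0.9, 0.1, 0.0])   # a doc about signing in
auth_query = np.array([0.8, 0.2, 0.1])   # "authentication", phrased differently
pricing    = np.array([0.0, 0.1, 0.9])   # an unrelated topic

# The login doc is close to the auth query despite sharing no keywords
assert cosine_similarity(login_doc, auth_query) > cosine_similarity(login_doc, pricing)
```

This is why vector search surfaces a password-reset doc for the query "I can't authenticate" — the vectors are near each other even though the words differ.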
A Simple Implementation
```python
from anthropic import Anthropic
import chromadb
from sentence_transformers import SentenceTransformer

# Setup
client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
db = chromadb.Client()
collection = db.create_collection("docs")

# Index your documents
def index_documents(documents: list[dict]):
    texts = [doc["text"] for doc in documents]
    embeddings = embedder.encode(texts).tolist()
    collection.add(
        documents=texts,
        embeddings=embeddings,
        ids=[doc["id"] for doc in documents],
    )

# Query with RAG
def ask(question: str) -> str:
    # Retrieve the most relevant chunks
    query_embedding = embedder.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=3)
    context = "\n\n".join(results["documents"][0])

    # Generate an answer grounded in the retrieved context
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

# Usage (illustrative):
# index_documents([{"id": "1", "text": "..."}])
# print(ask("How do I reset my password?"))
```
This is the core of RAG: in about 30 lines of Python, you've given an LLM access to your documents.
Vector Databases: Your Options
| Database | Best For | Deployment |
|----------|----------|------------|
| Chroma | Local dev, prototyping | In-process or local server |
| Pinecone | Managed cloud, low ops | SaaS |
| Weaviate | Open source, complex filtering | Self-hosted or cloud |
| Supabase pgvector | If you're already on Postgres | Supabase or Postgres |
| Qdrant | High performance, Rust-based | Self-hosted or cloud |
For prototyping, start with Chroma — it runs in-memory with zero setup. For production, Pinecone or Supabase pgvector (if you're already using Postgres) are the most common choices.
The Chunking Problem
RAG quality depends heavily on how you chunk your documents. Too small and you lose context. Too large and you dilute relevance scores and risk hitting context limits.
Common chunking strategies:
- Fixed size (500 tokens): simple but breaks mid-sentence
- Recursive text splitting: splits on paragraphs → sentences → characters, in order
- Semantic chunking: splits at topic boundaries using embeddings (more complex, better results)
- Document structure: respects markdown headers, code blocks, etc.
For most use cases, recursive splitting with 512 tokens and 50-token overlap works well as a starting point.
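As a minimal illustration of the fixed-size-with-overlap baseline, here's a splitter that uses whitespace-separated words as a rough stand-in for tokens (real pipelines count actual tokens, and libraries like LangChain ship a recursive splitter):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap, using words as a rough token proxy."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # each chunk repeats the last `overlap` words
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk reached the end of the text
    return chunks
```

The overlap means a sentence split at a chunk boundary still appears whole in at least one chunk, which is the main reason overlap improves retrieval quality.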
What RAG Can't Do
RAG is powerful but has real limitations:
The "lost in the middle" problem: LLMs tend to ignore context in the middle of a long prompt. If you retrieve 20 chunks, the ones in positions 5–15 are often underweighted.
The recall-precision tradeoff: Retrieve too few chunks and you miss relevant information. Retrieve too many and you add noise that confuses the model.
Outdated embeddings: If your documents change, your embeddings go stale. You need a pipeline to detect changes and re-embed.
Structured data doesn't chunk well: Tables, code, and databases often need specialized retrieval strategies (SQL generation, code-specific embeddings).
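For the stale-embeddings problem above, one common approach is to hash each document's content and re-embed only what changed. A minimal sketch — the function name and dict shapes are illustrative, not from any particular library:

```python
import hashlib

def stale_ids(documents: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents whose content changed since they were last embedded.

    documents: id -> current text; stored_hashes: id -> hash recorded at index time.
    """
    changed = []
    for doc_id, text in documents.items():
        current_hash = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(doc_id) != current_hash:
            changed.append(doc_id)  # new or modified: needs re-embedding
    return changed
```

Hashing is cheap, so you can run this on every sync and pay the embedding cost only for documents that actually changed.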
RAG in Production
Real production RAG systems add:
- Hybrid search: combine dense (vector) and sparse (BM25) retrieval for better coverage
- Reranking: use a cross-encoder model to reorder retrieved chunks by relevance
- Metadata filtering: filter by date, source, author before vector search
- Query rewriting: have the LLM rephrase the question for better retrieval
- Evaluation: measure retrieval recall and answer correctness automatically
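As a taste of hybrid search, here's reciprocal rank fusion (RRF), a standard way to merge a BM25 ranking with a vector ranking without having to calibrate their incompatible scores; k=60 is the conventional constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids (e.g., BM25 + vector results)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # A doc scores higher the nearer the top it appears in each list
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well by both retrievers beats one ranked well by only one
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])
```

Because RRF uses only ranks, not raw scores, it sidesteps the fact that BM25 scores and cosine similarities live on different scales.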
These techniques are what separate "RAG demo" from "RAG that works at 2am when your VP is showing it to a customer." Phase 4 of MindloomHQ covers all of them — the implementation details, the common failure modes, and the production patterns.