Every time someone asks an AI chatbot about something that happened recently, or about their specific company policy, or about a document they uploaded — and the AI gets it right — there is a good chance RAG is involved.
RAG stands for Retrieval-Augmented Generation. It sounds like jargon, but the idea is simple. This guide explains what it is, why it matters, and how to build a basic version yourself.
The Problem RAG Solves
Large language models — ChatGPT, Claude, Gemini — are trained on enormous amounts of text scraped from the internet up to a certain date. They know a lot about the world in general. But they do not know:
- What happened after their training cutoff
- Anything about your company's internal documentation
- The contents of files you have not shown them
- Real-time data of any kind
Ask an LLM "What is our refund policy?" and it has no idea, because your refund policy was never on the internet. Ask it about news from last week and it will either make something up or tell you it does not know.
This is the knowledge gap RAG is designed to fill.
The Pizza Restaurant Analogy
Here is a way to think about it.
Imagine a world-class chef. This chef has trained in culinary schools for decades and knows how to cook essentially any dish from any cuisine. They know French technique, Italian tradition, Japanese precision. Ask them to make something complex and they will figure it out.
But if you open a pizza restaurant and hire this chef, they do not automatically know your restaurant's specific menu. They do not know that your house special uses a sourdough base, or that your signature sauce is roasted tomatoes with smoked garlic, or that table 12 always gets extra cheese.
You have two options. Option A: put the chef through additional training until your menu is baked into their memory. That is called fine-tuning, and it is expensive, slow, and needs to happen every time the menu changes.
Option B: before the chef starts cooking each order, hand them the relevant page from your menu. Now they have your specific information right in front of them when they prepare the dish. They apply their general expertise to your specific context.
That is RAG. The LLM is the chef. RAG is the system that fetches the right page from your menu and hands it to the chef before they cook.
How RAG Actually Works
There are three steps:
Step 1: Indexing (happens once, at setup)
You take all your documents — PDFs, text files, web pages, database records — and chop them into chunks. Each chunk gets converted into a numerical representation called an embedding. These embeddings capture the semantic meaning of the text — not just the words, but what the text is about. All the chunks and their embeddings get stored in a vector database.
Step 2: Retrieval (happens at query time)
When a user asks a question, you convert their question into an embedding using the same model. Then you search the vector database for chunks whose embeddings are closest to the question's embedding. "Closest" means most semantically similar — chunks that are about the same topic, even if they use different words.
The top results — say, the 3-5 most relevant chunks — get pulled out. These are the pages from the menu you are about to hand to the chef.
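The "closest" comparison in Step 2 is usually cosine similarity: the cosine of the angle between two embedding vectors. A minimal sketch with toy 3-dimensional vectors standing in for real embeddings (actual models produce hundreds of dimensions; the numbers here are made up for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" — imagine each axis loosely tracks a topic
question = np.array([0.9, 0.1, 0.0])    # e.g. "Can I get my money back?"
refund_doc = np.array([0.8, 0.2, 0.1])  # the refund-policy chunk
hours_doc = np.array([0.1, 0.2, 0.9])   # the support-hours chunk

print(cosine_similarity(question, refund_doc))  # high: same topic
print(cosine_similarity(question, hours_doc))   # low: different topic
```

The retriever simply keeps the chunks with the highest scores, which is exactly what the argsort in the example below does.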
Step 3: Generation (the LLM answers)
You send the user's question plus the retrieved chunks to the LLM in a single prompt. The prompt says something like: "Using only the context below, answer the question. Context: [retrieved chunks]. Question: [user's question]."
The LLM reads the context and writes its answer based on the specific information you provided, not just its general training knowledge.
The result: accurate, grounded answers that cite your actual documents.
A Simple Python Example
Here is a minimal RAG implementation. No LangChain, no complex setup — just the core pattern.
```python
from anthropic import Anthropic
from sentence_transformers import SentenceTransformer
import numpy as np

client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# --- Your documents ---
documents = [
    "Our refund policy allows returns within 30 days of purchase with a valid receipt.",
    "Customer support is available Monday through Friday, 9am to 6pm EST.",
    "Pro plan subscribers get priority support with a 2-hour response time guarantee.",
    "To cancel your subscription, go to Settings > Billing > Cancel Subscription.",
    "We accept Visa, Mastercard, American Express, and PayPal.",
]

# --- Step 1: Index — embed all documents ---
doc_embeddings = embedder.encode(documents)


def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Find the most relevant documents for a question."""
    question_embedding = embedder.encode([question])[0]
    # Cosine similarity
    similarities = np.dot(doc_embeddings, question_embedding) / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(question_embedding)
    )
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [documents[i] for i in top_indices]


def ask(question: str) -> str:
    """RAG pipeline: retrieve relevant context, then generate an answer."""
    context_chunks = retrieve(question)
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"Answer the question using only the context provided. "
                f"If the context does not contain the answer, say so.\n\n"
                f"Context:\n{context}\n\n"
                f"Question: {question}"
            )
        }]
    )
    return response.content[0].text


# --- Test it ---
questions = [
    "Can I return something I bought last week?",
    "How do I cancel my subscription?",
    "What payment methods do you accept?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {ask(q)}\n")
```
This is the complete pattern. In production you replace sentence_transformers with an API-based embedding model, replace the in-memory list with a proper vector database like pgvector, and add chunking logic to handle large documents. But the structure — embed, retrieve, generate — stays the same.
When You Need RAG
RAG is the right tool when:
Your data changes frequently. Product catalogs, support documentation, news articles, internal wikis — anything that updates regularly. Fine-tuning a new model every time your docs change is impractical. RAG retrieves current data at query time.
You need source attribution. RAG lets you tell the user which documents informed the answer. This is critical for enterprise applications where people need to verify claims.
Your knowledge is domain-specific. An AI that can answer questions about your company's specific systems, your codebase, your policy documents — none of that exists in an LLM's training data.
Your context exceeds what fits in a prompt. If you have 10,000 pages of documentation, you cannot paste all of it into every request. RAG selects the relevant pages instead.
When You Do Not Need RAG
RAG adds complexity. Do not add it when you do not need it.
Skip RAG when: the LLM already knows what it needs to know from its training data. General coding questions, explanations of well-documented technologies, writing assistance — these do not require RAG.
Skip RAG when: you can just include the relevant text directly in the prompt. If the total context is small (a few hundred lines), put it in the system prompt and skip the retrieval machinery.
Skip RAG when: the task is pure generation — creative writing, summarizing a document the user just pasted, reformatting text. Retrieval adds nothing when the user has already provided the input.
The rule of thumb: add RAG when the LLM needs to know something it could not have been trained on.
Common RAG Pitfalls
Chunk size matters more than people expect. Chunks that are too large pull in irrelevant context. Chunks that are too small lose surrounding context that the LLM needs to understand the answer. A good starting point is 500-800 tokens with 10-20% overlap between chunks.
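A minimal chunking sketch in that spirit, using whitespace-separated words as a rough proxy for tokens (production pipelines count real tokens with the model's tokenizer; the sizes and overlap here are illustrative assumptions):

```python
def chunk_text(text: str, chunk_size: int = 600, overlap: int = 90) -> list[str]:
    """Split text into word-based chunks; each shares `overlap` words with the previous."""
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks

doc = " ".join(f"w{i}" for i in range(1500))  # stand-in for a long document
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0].split()))  # 3 600
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.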
Retrieval quality determines answer quality. If your retriever pulls the wrong documents, the LLM will either answer incorrectly or say it does not know. Embedding quality, chunk strategy, and the number of retrieved chunks all affect retrieval accuracy.
The LLM must be told to stay grounded. Without explicit instruction, LLMs will sometimes blend retrieved context with their general knowledge. Your prompt needs to tell the model to use only the provided context — and to say when the context is insufficient.
Vector similarity is not perfect. Semantic search finds conceptually similar text, but it can miss exact matches for specific IDs, dates, or technical terms. For these cases, combine vector search with keyword search (hybrid retrieval).
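A hedged sketch of hybrid scoring: blend the vector-similarity score with a toy keyword-overlap score so that exact matches on IDs still surface. The 50/50 weighting and the overlap function are illustrative assumptions, not a standard (real systems typically use BM25 for the keyword side):

```python
def keyword_score(question: str, doc: str) -> float:
    """Fraction of question words that appear verbatim in the document."""
    q_words = set(question.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def hybrid_score(vector_sim: float, question: str, doc: str,
                 alpha: float = 0.5) -> float:
    """Blend semantic similarity with exact keyword overlap."""
    return alpha * vector_sim + (1 - alpha) * keyword_score(question, doc)

# An exact ID match rescues a document that embeddings alone might rank poorly
question = "status of order ORD-8841"
doc = "Order ORD-8841 shipped on March 3 via standard delivery."
print(hybrid_score(0.2, question, doc))  # 0.35 — lifted above the weak vector score
```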
Going Deeper
The simple pattern above scales further than most people expect. Production RAG systems add query rewriting, re-ranking, multi-vector retrieval, and evaluation pipelines — but the core idea never changes: retrieve, then generate.
Phase 4 of the Agentic AI course at MindloomHQ covers memory and RAG in depth. The 10 lessons go from basic embedding and retrieval to production-grade RAG systems with re-ranking, multi-document synthesis, and evaluation strategies. Every lesson includes full code implementations.
Phases 0 and 1 are completely free. No credit card required.