Every LLM has a knowledge cutoff. Ask Claude about something that happened last week, and it won't know. Ask it about your company's internal API documentation, and it definitely won't know. RAG (Retrieval-Augmented Generation) is the solution — and it's the most widely deployed AI pattern in production systems today.
The Problem RAG Solves
LLMs are trained on a static snapshot of the internet. They're powerful reasoning engines, but their knowledge is frozen at training time. For most real-world applications, this is a problem:
- Your support chatbot needs to know about last week's product update
- Your code assistant needs to understand your internal codebase
- Your research agent needs to process documents you uploaded today
You have three options: fine-tuning (expensive, slow, requires ML expertise), prompt stuffing (limited by context window), or RAG (fast, cheap, and surprisingly effective).
How RAG Works
RAG has two phases:
Phase 1: Indexing (done once)
- Take your source documents (PDFs, docs, code files, database records)
- Split them into chunks (typically 500–1,500 tokens each)
- Run each chunk through an embedding model to get a numerical vector
- Store those vectors in a vector database alongside the original text
Phase 2: Retrieval + Generation (done per query)
- Take the user's question
- Embed it using the same embedding model
- Search the vector database for the most similar chunks
- Inject those chunks into the LLM prompt as context
- Let the LLM generate an answer grounded in the retrieved content
The magic is in step 3 of the query phase: semantic similarity search. Instead of keyword matching ("find documents that contain the word 'authentication'"), vector search finds documents that are conceptually similar to the question, even if they use different words.
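Under the hood, "similar" usually means cosine similarity between embedding vectors. Here's a toy sketch — the 3-dimensional vectors are made up for illustration; a real model like all-MiniLM-L6-v2 produces 384-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (illustrative values, not from a real model)
login_doc  = np.array([0.9, 0.1, 0.0])   # a doc about signing in
auth_query = np.array([0.8, 0.2, 0.1])   # "authentication", phrased differently
pricing    = np.array([0.0, 0.1, 0.9])   # an unrelated topic

# The login doc is close to the auth query despite sharing no keywords
assert cosine_similarity(login_doc, auth_query) > cosine_similarity(login_doc, pricing)
```

This is why vector search surfaces a password-reset doc for the query "I can't authenticate" — the vectors are near each other even though the words differ.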
A Simple Implementation
```python
from anthropic import Anthropic
import chromadb
from sentence_transformers import SentenceTransformer

# Setup
client = Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
db = chromadb.Client()
collection = db.create_collection("docs")

# Index your documents
def index_documents(documents: list[dict]):
    texts = [doc["text"] for doc in documents]
    embeddings = embedder.encode(texts).tolist()
    collection.add(
        documents=texts,
        embeddings=embeddings,
        ids=[doc["id"] for doc in documents],
    )

# Query with RAG
def ask(question: str) -> str:
    # Retrieve the most relevant chunks
    query_embedding = embedder.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=3)
    context = "\n\n".join(results["documents"][0])

    # Generate an answer grounded in the retrieved context
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

# Usage (illustrative):
# index_documents([{"id": "1", "text": "..."}])
# print(ask("How do I reset my password?"))
```
This is the core of RAG: in about 30 lines of Python, you've given an LLM access to your documents.
Vector Databases: Your Options
| Database | Best For | Deployment |
|----------|----------|------------|
| Chroma | Local dev, prototyping | In-process or local server |
| Pinecone | Managed cloud, low ops | SaaS |
| Weaviate | Open source, complex filtering | Self-hosted or cloud |
| Supabase pgvector | If you're already on Postgres | Supabase or Postgres |
| Qdrant | High performance, Rust-based | Self-hosted or cloud |
For prototyping, start with Chroma — it runs in-memory with zero setup. For production, Pinecone or Supabase pgvector (if you're already using Postgres) are the most common choices.
The Chunking Problem
RAG quality depends heavily on how you chunk your documents. Too small and you lose context. Too large and you dilute relevance scores and risk hitting context limits.
Common chunking strategies:
- Fixed size (500 tokens): simple but breaks mid-sentence
- Recursive text splitting: splits on paragraphs → sentences → characters, in order
- Semantic chunking: splits at topic boundaries using embeddings (more complex, better results)
- Document structure: respects markdown headers, code blocks, etc.
For most use cases, recursive splitting with 512 tokens and 50-token overlap works well as a starting point.
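As a minimal illustration of the fixed-size-with-overlap baseline, here's a splitter that uses whitespace-separated words as a rough stand-in for tokens (real pipelines count actual tokens, and libraries like LangChain ship a recursive splitter):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap, using words as a rough token proxy."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # each chunk repeats the last `overlap` words
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk reached the end of the text
    return chunks
```

The overlap means a sentence split at a chunk boundary still appears whole in at least one chunk, which is the main reason overlap improves retrieval quality.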
What RAG Can't Do
RAG is powerful but has real limitations:
The "lost in the middle" problem: LLMs tend to ignore context in the middle of a long prompt. If you retrieve 20 chunks, the ones in positions 5–15 are often underweighted.
The recall-precision tradeoff: Retrieve too few chunks and you miss relevant information. Retrieve too many and you add noise that confuses the model.
Outdated embeddings: If your documents change, your embeddings go stale. You need a pipeline to detect changes and re-embed.
Structured data doesn't chunk well: Tables, code, and databases often need specialized retrieval strategies (SQL generation, code-specific embeddings).
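For the stale-embeddings problem above, one common approach is to hash each document's content and re-embed only what changed. A minimal sketch — the function name and dict shapes are illustrative, not from any particular library:

```python
import hashlib

def stale_ids(documents: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents whose content changed since they were last embedded.

    documents: id -> current text; stored_hashes: id -> hash recorded at index time.
    """
    changed = []
    for doc_id, text in documents.items():
        current_hash = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(doc_id) != current_hash:
            changed.append(doc_id)  # new or modified: needs re-embedding
    return changed
```

Hashing is cheap, so you can run this on every sync and pay the embedding cost only for documents that actually changed.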
RAG in Production
Real production RAG systems add:
- Hybrid search: combine dense (vector) and sparse (BM25) retrieval for better coverage
- Reranking: use a cross-encoder model to reorder retrieved chunks by relevance
- Metadata filtering: filter by date, source, author before vector search
- Query rewriting: have the LLM rephrase the question for better retrieval
- Evaluation: measure retrieval recall and answer correctness automatically
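As a taste of hybrid search, here's reciprocal rank fusion (RRF), a standard way to merge a BM25 ranking with a vector ranking without having to calibrate their incompatible scores; k=60 is the conventional constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids (e.g., BM25 + vector results)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # A doc scores higher the nearer the top it appears in each list
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked well by both retrievers beats one ranked well by only one
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])
```

Because RRF uses only ranks, not raw scores, it sidesteps the fact that BM25 scores and cosine similarities live on different scales.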
These techniques are what separate "RAG demo" from "RAG that works at 2am when your VP is showing it to a customer." Phase 4 of MindloomHQ covers all of them — the implementation details, the common failure modes, and the production patterns.