Retrieval-Augmented Generation (RAG) is one of the most practical LLM architectures for production applications. Instead of relying only on what the model was trained on, you retrieve relevant context at query time and include it in the prompt. The result is a chatbot that answers questions about your data accurately, without fine-tuning.
This tutorial builds a working RAG chatbot from scratch. By the end you'll have a system that ingests documents, embeds and stores them, retrieves relevant chunks at query time, and generates grounded answers.
What You'll Build
A Python chatbot that:
- Ingests documents (text files, PDFs)
- Splits them into chunks and embeds each chunk
- Stores embeddings in PostgreSQL with pgvector
- At query time, embeds the question, retrieves the closest chunks
- Sends the question + retrieved context to the LLM and returns the answer
We'll use Anthropic's API for generation and OpenAI's text-embedding-3-small for embeddings. You can swap either for any compatible provider.
Prerequisites
```bash
pip install anthropic openai psycopg2-binary pgvector pypdf python-dotenv
```
You'll need PostgreSQL with the pgvector extension. For local development:
```bash
docker run -d \
  --name pgvector-demo \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```
Step 1: Set Up the Database
Connect to the database and create your schema:
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB DEFAULT '{}',
    embedding vector(1536)
);

CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
The ivfflat index makes similarity search fast at scale. Because ivfflat computes its cluster centers from the rows present at build time, create it after loading your data (or reindex afterwards). For fewer than 10,000 rows, you can skip the index entirely; an exact scan is fast enough and gives perfect recall.
Step 2: Chunk Your Documents
Chunking determines retrieval quality. Too small: not enough context per chunk. Too large: irrelevant content dilutes the signal.
```python
from pathlib import Path
from typing import Generator


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> Generator[str, None, None]:
    """
    Split text into overlapping chunks of roughly chunk_size characters.
    Overlap preserves context across chunk boundaries.
    """
    words = text.split()
    if not words:
        return

    current_chunk = []
    current_length = 0
    for word in words:
        current_chunk.append(word)
        current_length += len(word) + 1  # +1 for space
        if current_length >= chunk_size:
            yield " ".join(current_chunk)
            # Keep roughly `overlap` characters' worth of words for the next
            # chunk (assuming ~6 characters per word, space included)
            overlap_words = current_chunk[-overlap // 6:]
            current_chunk = overlap_words
            current_length = sum(len(w) + 1 for w in overlap_words)

    if current_chunk:
        yield " ".join(current_chunk)


def load_document(file_path: str) -> str:
    path = Path(file_path)
    if path.suffix == ".pdf":
        from pypdf import PdfReader

        reader = PdfReader(file_path)
        # extract_text() can return None for image-only pages
        return "\n\n".join(page.extract_text() or "" for page in reader.pages)
    return path.read_text(encoding="utf-8")
```
For production, use a smarter chunker that respects sentence and paragraph boundaries. Libraries like semantic-chunkers or LangChain's RecursiveCharacterTextSplitter handle this well.
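To make the idea concrete, here's a minimal sentence-aware chunker. It's a sketch: the regex sentence split is a naive stand-in for a proper sentence tokenizer, and the greedy packing means a single sentence longer than `max_chars` still becomes its own oversized chunk.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars,
    never splitting mid-sentence."""
    # Naive split on ., !, ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because chunks end on sentence boundaries, each one reads as a coherent unit, which tends to embed better than a chunk that stops mid-sentence.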
Step 3: Embed and Store
```python
import json

import psycopg2
from openai import OpenAI

openai_client = OpenAI()


def get_embedding(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text.replace("\n", " "),
    )
    return response.data[0].embedding


def ingest_document(file_path: str, db_conn, metadata: dict | None = None):
    """Load, chunk, embed, and store a document."""
    text = load_document(file_path)
    chunks = list(chunk_text(text))
    print(f"Ingesting {file_path}: {len(chunks)} chunks")

    with db_conn.cursor() as cur:
        for i, chunk in enumerate(chunks):
            embedding = get_embedding(chunk)
            cur.execute(
                """
                INSERT INTO documents (content, metadata, embedding)
                VALUES (%s, %s, %s::vector)
                """,
                (
                    chunk,
                    json.dumps({**(metadata or {}), "source": file_path, "chunk": i}),
                    # psycopg2 has no adapter for the vector type, so pass the
                    # pgvector literal (e.g. "[0.1, 0.2, ...]") and cast it.
                    # Alternatively, use register_vector from the pgvector package.
                    str(embedding),
                ),
            )
    db_conn.commit()
    print(f"Done. {len(chunks)} chunks stored.")
```
Batch your embedding API calls in production. Sending 500 individual requests is slow and burns through rate limits; OpenAI's embeddings endpoint accepts up to 2,048 inputs per request.
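A batched version of `get_embedding` might look like this. It's a sketch: `batch_size=256` is an arbitrary choice below the API limit, and the client is passed in as a parameter rather than created at import time.

```python
def get_embeddings_batch(client, texts: list[str], batch_size: int = 256) -> list[list[float]]:
    """Embed texts in batches of batch_size using an openai.OpenAI client.

    Results are returned in the same order as the input texts.
    """
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        # The API returns one embedding per input, in input order
        embeddings.extend(item.embedding for item in response.data)
    return embeddings
```

In `ingest_document`, you'd then replace the per-chunk `get_embedding` loop with a single `get_embeddings_batch(openai_client, chunks)` call and zip the results back to the chunks.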
Step 4: Retrieve Relevant Chunks
```python
def retrieve(query: str, db_conn, top_k: int = 5) -> list[dict]:
    """Embed the query and return the top-k most similar chunks."""
    # str() produces the pgvector literal form, e.g. "[0.1, 0.2, ...]"
    query_embedding = str(get_embedding(query))
    with db_conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, metadata,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (query_embedding, query_embedding, top_k),
        )
        rows = cur.fetchall()
    return [
        {"content": row[0], "metadata": row[1], "similarity": float(row[2])}
        for row in rows
    ]
```
The <=> operator is pgvector's cosine distance. Since 1 - cosine_distance = cosine_similarity, a higher similarity score means a more relevant chunk.
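The distance/similarity relationship is easy to verify in plain Python (pgvector computes this server-side; the snippet below is just the definition):

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's <=> operator defines it: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)


# Same direction: distance 0, so similarity 1
print(1 - cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 1.0
# Orthogonal: distance 1, so similarity 0
print(1 - cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Note that cosine similarity ignores vector magnitude, only direction matters, which is why `[1, 0]` and `[2, 0]` count as identical.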
Step 5: Generate the Answer
```python
from anthropic import Anthropic

anthropic_client = Anthropic()

SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using only the provided context.
If the context doesn't contain enough information to answer the question, say so clearly.
Do not make up information that isn't in the context."""


def answer(question: str, db_conn) -> str:
    # Retrieve relevant chunks
    chunks = retrieve(question, db_conn, top_k=5)
    if not chunks:
        return "I don't have any documents loaded to answer from."

    # Build context string
    context = "\n\n---\n\n".join(
        f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['content']}"
        for c in chunks
    )

    # Generate answer
    response = anthropic_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": f"Context:\n\n{context}\n\nQuestion: {question}",
            }
        ],
    )
    return response.content[0].text
```
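One caveat: with a large top_k or big chunks, the context string can blow past the model's context window. A simple guard is a character budget as a rough proxy for tokens (an assumption for this sketch; a real system would count tokens with the provider's tokenizer):

```python
def build_context(chunks: list[dict], max_chars: int = 8000) -> str:
    """Join retrieved chunks into a context string, stopping before max_chars."""
    separator = "\n\n---\n\n"
    parts: list[str] = []
    total = 0
    for c in chunks:
        block = f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['content']}"
        if total + len(block) > max_chars:
            # Chunks arrive ordered by similarity, so we drop the least relevant
            break
        parts.append(block)
        total += len(block) + len(separator)
    return separator.join(parts)
```

You'd call `build_context(chunks)` in place of the inline join above; because retrieval returns chunks in similarity order, truncation always discards the weakest matches first.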
Step 6: Wire It Together
```python
import psycopg2
from dotenv import load_dotenv

load_dotenv()  # loads ANTHROPIC_API_KEY and OPENAI_API_KEY from .env


def main():
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="postgres",
        user="postgres",
        password="postgres",
    )

    # Ingest documents (run once, or when documents change)
    # ingest_document("docs/product_manual.pdf", conn)
    # ingest_document("docs/faq.txt", conn, metadata={"type": "faq"})

    print("RAG Chatbot ready. Type 'quit' to exit.\n")
    while True:
        question = input("You: ").strip()
        if question.lower() in ("quit", "exit"):
            break
        if not question:
            continue
        response = answer(question, conn)
        print(f"\nAssistant: {response}\n")

    conn.close()


if __name__ == "__main__":
    main()
```
What to Improve for Production
Hybrid search. Pure vector search misses exact keyword matches. Combine vector similarity with keyword search: pgvector's <=> for the semantic side and PostgreSQL's built-in full-text search (ranked with ts_rank) or a true BM25 ranker for the lexical side, then merge the two ranked lists. Hybrid search significantly improves retrieval precision.
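A common way to merge the two result lists is Reciprocal Rank Fusion, which needs only ranks, not comparable scores. A minimal sketch (the constant k=60 is the conventional default from the RRF literature):

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each id by the sum of 1/(k + rank)
    over every ranked list it appears in, then sort by total score."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks well in both the vector list and the keyword list accumulates score from each, so it rises above documents that only one retriever found.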
Reranking. After retrieving the top-20 chunks by embedding similarity, run a cross-encoder reranker (Cohere Rerank, or a local model like ms-marco-MiniLM) to reorder by actual relevance. The top-5 after reranking are substantially better than top-5 from embedding alone.
Streaming. For a responsive UI, stream the answer token-by-token instead of waiting for the full response. Anthropic's streaming API (client.messages.stream()) handles this cleanly.
Metadata filtering. When users ask about a specific document or date range, filter by metadata before doing similarity search. This makes retrieval faster and more accurate:
```python
cur.execute(
    """
    SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    WHERE metadata->>'type' = %s
    ORDER BY embedding <=> %s::vector
    LIMIT %s
    """,
    (query_embedding, "faq", query_embedding, top_k),
)
```
Chunking strategy. Experiment with chunk size. For technical documentation, 300–400 character chunks with 50-character overlap often work better than larger chunks. For narrative text, 600–800 characters preserves more context per chunk.
Go Further
This tutorial covers the core architecture. Production RAG systems have additional layers: query rewriting, multi-query retrieval, answer grounding verification, and evaluation pipelines to measure retrieval quality over time.
Phase 4 (Memory and RAG) of the MindloomHQ Agentic AI course covers all of it in depth — 10 lessons with full working implementations, pgvector in production, evaluation strategies, and a real project that builds a production-grade document chatbot from scratch.
Start Phase 4: Memory and RAG →
Phases 0 and 1 are completely free.