Retrieval-Augmented Generation (RAG) is one of the most practical LLM architectures for production applications. Instead of relying only on what the model was trained on, you retrieve relevant context at query time and include it in the prompt. The result is a chatbot that answers questions about your data accurately, without fine-tuning.
This tutorial builds a working RAG chatbot from scratch. By the end you'll have a system that ingests documents, embeds and stores them, retrieves relevant chunks at query time, and generates grounded answers.
What You'll Build
A Python chatbot that:
- Ingests documents (text files, PDFs)
- Splits them into chunks and embeds each chunk
- Stores embeddings in PostgreSQL with pgvector
- At query time, embeds the question, retrieves the closest chunks
- Sends the question + retrieved context to the LLM and returns the answer
We'll use Anthropic's API for generation and OpenAI's text-embedding-3-small for embeddings. You can swap either for any compatible provider.
Prerequisites
```bash
pip install anthropic openai psycopg2-binary pgvector pypdf python-dotenv
```
You'll need PostgreSQL with the pgvector extension. For local development:
```bash
docker run -d \
  --name pgvector-demo \
  -e POSTGRES_PASSWORD=postgres \
  -p 5432:5432 \
  pgvector/pgvector:pg16
```
Step 1: Set Up the Database
Connect to the database and create your schema:
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB DEFAULT '{}',
    embedding vector(1536)
);

CREATE INDEX ON documents
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
The ivfflat index makes similarity search fast at scale. Because ivfflat computes its cluster centers from the rows present at build time, create it after loading your data (or reindex afterwards). For fewer than 10,000 rows, you can skip the index entirely; an exact scan is fast enough and gives perfect recall.
Step 2: Chunk Your Documents
Chunking determines retrieval quality. Too small: not enough context per chunk. Too large: irrelevant content dilutes the signal.
```python
from pathlib import Path
from typing import Generator


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> Generator[str, None, None]:
    """
    Split text into overlapping chunks of roughly chunk_size characters.
    Overlap preserves context across chunk boundaries.
    """
    words = text.split()
    if not words:
        return

    current_chunk = []
    current_length = 0
    for word in words:
        current_chunk.append(word)
        current_length += len(word) + 1  # +1 for space
        if current_length >= chunk_size:
            yield " ".join(current_chunk)
            # Keep roughly `overlap` characters' worth of words for the next
            # chunk (assuming ~6 characters per word, space included)
            overlap_words = current_chunk[-overlap // 6:]
            current_chunk = overlap_words
            current_length = sum(len(w) + 1 for w in overlap_words)

    if current_chunk:
        yield " ".join(current_chunk)


def load_document(file_path: str) -> str:
    path = Path(file_path)
    if path.suffix == ".pdf":
        from pypdf import PdfReader

        reader = PdfReader(file_path)
        # extract_text() can return None for image-only pages
        return "\n\n".join(page.extract_text() or "" for page in reader.pages)
    return path.read_text(encoding="utf-8")
```
For production, use a smarter chunker that respects sentence and paragraph boundaries. Libraries like semantic-chunkers or LangChain's RecursiveCharacterTextSplitter handle this well.
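To make the idea concrete, here's a minimal sentence-aware chunker. It's a sketch: the regex sentence split is a naive stand-in for a proper sentence tokenizer, and the greedy packing means a single sentence longer than `max_chars` still becomes its own oversized chunk.

```python
import re


def chunk_by_sentence(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars,
    never splitting mid-sentence."""
    # Naive split on ., !, ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Because chunks end on sentence boundaries, each one reads as a coherent unit, which tends to embed better than a chunk that stops mid-sentence.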
Step 3: Embed and Store
```python
import json

import psycopg2
from openai import OpenAI

openai_client = OpenAI()


def get_embedding(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text.replace("\n", " "),
    )
    return response.data[0].embedding


def ingest_document(file_path: str, db_conn, metadata: dict | None = None):
    """Load, chunk, embed, and store a document."""
    text = load_document(file_path)
    chunks = list(chunk_text(text))
    print(f"Ingesting {file_path}: {len(chunks)} chunks")

    with db_conn.cursor() as cur:
        for i, chunk in enumerate(chunks):
            embedding = get_embedding(chunk)
            cur.execute(
                """
                INSERT INTO documents (content, metadata, embedding)
                VALUES (%s, %s, %s::vector)
                """,
                (
                    chunk,
                    json.dumps({**(metadata or {}), "source": file_path, "chunk": i}),
                    # psycopg2 has no adapter for the vector type, so pass the
                    # pgvector literal (e.g. "[0.1, 0.2, ...]") and cast it.
                    # Alternatively, use register_vector from the pgvector package.
                    str(embedding),
                ),
            )
    db_conn.commit()
    print(f"Done. {len(chunks)} chunks stored.")
```
Batch your embedding API calls in production. Sending 500 individual requests is slow and burns through rate limits; OpenAI's embeddings endpoint accepts up to 2,048 inputs per request.
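A batched version of `get_embedding` might look like this. It's a sketch: `batch_size=256` is an arbitrary choice below the API limit, and the client is passed in as a parameter rather than created at import time.

```python
def get_embeddings_batch(client, texts: list[str], batch_size: int = 256) -> list[list[float]]:
    """Embed texts in batches of batch_size using an openai.OpenAI client.

    Results are returned in the same order as the input texts.
    """
    embeddings: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        # The API returns one embedding per input, in input order
        embeddings.extend(item.embedding for item in response.data)
    return embeddings
```

In `ingest_document`, you'd then replace the per-chunk `get_embedding` loop with a single `get_embeddings_batch(openai_client, chunks)` call and zip the results back to the chunks.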
Step 4: Retrieve Relevant Chunks
```python
def retrieve(query: str, db_conn, top_k: int = 5) -> list[dict]:
    """Embed the query and return the top-k most similar chunks."""
    # str() produces the pgvector literal form, e.g. "[0.1, 0.2, ...]"
    query_embedding = str(get_embedding(query))
    with db_conn.cursor() as cur:
        cur.execute(
            """
            SELECT content, metadata,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (query_embedding, query_embedding, top_k),
        )
        rows = cur.fetchall()
    return [
        {"content": row[0], "metadata": row[1], "similarity": float(row[2])}
        for row in rows
    ]
```
The <=> operator is pgvector's cosine distance. Since 1 - cosine_distance = cosine_similarity, a higher similarity score means a more relevant chunk.
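The distance/similarity relationship is easy to verify in plain Python (pgvector computes this server-side; the snippet below is just the definition):

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's <=> operator defines it: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)


# Same direction: distance 0, so similarity 1
print(1 - cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 1.0
# Orthogonal: distance 1, so similarity 0
print(1 - cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Note that cosine similarity ignores vector magnitude, only direction matters, which is why `[1, 0]` and `[2, 0]` count as identical.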
Step 5: Generate the Answer
```python
from anthropic import Anthropic

anthropic_client = Anthropic()

SYSTEM_PROMPT = """You are a helpful assistant. Answer the user's question using only the provided context.
If the context doesn't contain enough information to answer the question, say so clearly.
Do not make up information that isn't in the context."""


def answer(question: str, db_conn) -> str:
    # Retrieve relevant chunks
    chunks = retrieve(question, db_conn, top_k=5)
    if not chunks:
        return "I don't have any documents loaded to answer from."

    # Build context string
    context = "\n\n---\n\n".join(
        f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['content']}"
        for c in chunks
    )

    # Generate answer
    response = anthropic_client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": f"Context:\n\n{context}\n\nQuestion: {question}",
            }
        ],
    )
    return response.content[0].text
```
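One caveat: with a large top_k or big chunks, the context string can blow past the model's context window. A simple guard is a character budget as a rough proxy for tokens (an assumption for this sketch; a real system would count tokens with the provider's tokenizer):

```python
def build_context(chunks: list[dict], max_chars: int = 8000) -> str:
    """Join retrieved chunks into a context string, stopping before max_chars."""
    separator = "\n\n---\n\n"
    parts: list[str] = []
    total = 0
    for c in chunks:
        block = f"[Source: {c['metadata'].get('source', 'unknown')}]\n{c['content']}"
        if total + len(block) > max_chars:
            # Chunks arrive ordered by similarity, so we drop the least relevant
            break
        parts.append(block)
        total += len(block) + len(separator)
    return separator.join(parts)
```

You'd call `build_context(chunks)` in place of the inline join above; because retrieval returns chunks in similarity order, truncation always discards the weakest matches first.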
Step 6: Wire It Together
```python
import psycopg2
from dotenv import load_dotenv

load_dotenv()  # loads ANTHROPIC_API_KEY and OPENAI_API_KEY from .env


def main():
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="postgres",
        user="postgres",
        password="postgres",
    )

    # Ingest documents (run once, or when documents change)
    # ingest_document("docs/product_manual.pdf", conn)
    # ingest_document("docs/faq.txt", conn, metadata={"type": "faq"})

    print("RAG Chatbot ready. Type 'quit' to exit.\n")
    while True:
        question = input("You: ").strip()
        if question.lower() in ("quit", "exit"):
            break
        if not question:
            continue
        response = answer(question, conn)
        print(f"\nAssistant: {response}\n")

    conn.close()


if __name__ == "__main__":
    main()
```
What to Improve for Production
Hybrid search. Pure vector search misses exact keyword matches. Combine vector similarity with keyword search: pgvector's <=> for the semantic side and PostgreSQL's built-in full-text search (ranked with ts_rank) or a true BM25 ranker for the lexical side, then merge the two ranked lists. Hybrid search significantly improves retrieval precision.
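A common way to merge the two result lists is Reciprocal Rank Fusion, which needs only ranks, not comparable scores. A minimal sketch (the constant k=60 is the conventional default from the RRF literature):

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each id by the sum of 1/(k + rank)
    over every ranked list it appears in, then sort by total score."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that ranks well in both the vector list and the keyword list accumulates score from each, so it rises above documents that only one retriever found.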
Reranking. After retrieving the top-20 chunks by embedding similarity, run a cross-encoder reranker (Cohere Rerank, or a local model like ms-marco-MiniLM) to reorder by actual relevance. The top-5 after reranking are substantially better than top-5 from embedding alone.
Streaming. For a responsive UI, stream the answer token-by-token instead of waiting for the full response. Anthropic's streaming API (client.messages.stream()) handles this cleanly.
Metadata filtering. When users ask about a specific document or date range, filter by metadata before doing similarity search. This makes retrieval faster and more accurate:
```python
cur.execute(
    """
    SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    WHERE metadata->>'type' = %s
    ORDER BY embedding <=> %s::vector
    LIMIT %s
    """,
    (query_embedding, "faq", query_embedding, top_k),
)
```
Chunking strategy. Experiment with chunk size. For technical documentation, 300–400 character chunks with 50-character overlap often work better than larger chunks. For narrative text, 600–800 characters preserves more context per chunk.
Go Further
This tutorial covers the core architecture. Production RAG systems have additional layers: query rewriting, multi-query retrieval, answer grounding verification, and evaluation pipelines to measure retrieval quality over time.
Phase 4 (Memory and RAG) of the MindloomHQ Agentic AI course covers all of it in depth — 10 lessons with full working implementations, pgvector in production, evaluation strategies, and a real project that builds a production-grade document chatbot from scratch.
Start Phase 4: Memory and RAG →
Phases 0 and 1 are completely free.