The basic RAG tutorial is everywhere: split your documents into chunks, embed them, store them in a vector database, retrieve the top-k results at query time, and pass them to an LLM. It takes twenty minutes to build and it works for demos.
Production is different. The gap between a demo RAG system and one that reliably serves real users is substantial, and most of the problems are not obvious until you have hit them. This guide covers the decisions that actually matter and the failure modes that actually occur.
Chunking: The Most Underestimated Decision
Every basic tutorial says to split text into chunks of N tokens with some overlap. This produces terrible retrieval quality for most real documents.
The problem is that semantic coherence does not align with fixed token counts. A 512-token chunk that cuts a paragraph in half, or splits a table from its header, or separates a code example from its explanation, is not a retrieval unit — it is an artifact.
Sentence-window chunking stores individual sentences but retrieves surrounding context. You embed the sentence for precise matching, but pass the surrounding 3-5 sentences to the LLM for answer generation. This gives you the precision of small chunks with the context of larger ones.
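A minimal sketch of the sentence-window idea, assuming sentences have already been split (the function and field names here are illustrative, not any framework's API):

```python
def make_windows(sentences, window=2):
    """Pair each sentence (the retrieval unit) with its surrounding context."""
    units = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window)
        hi = min(len(sentences), i + window + 1)
        units.append({
            "embed_text": sent,                     # embedded for precise matching
            "context": " ".join(sentences[lo:hi]),  # passed to the LLM instead
        })
    return units

sentences = ["A.", "B.", "C.", "D.", "E."]
units = make_windows(sentences, window=1)
```

You embed `embed_text` but feed `context` to the generator, decoupling the matching unit from the generation unit.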
Hierarchical (parent-document) chunking creates small chunks for retrieval and large chunks for generation. The small chunks index well; the parent document provides context. LangChain's ParentDocumentRetriever and LlamaIndex's auto-merging retriever both implement this pattern directly.
Semantic chunking splits documents where topic changes occur, not at fixed intervals. This requires a pass to identify natural breakpoints — usually by comparing the embedding similarity between adjacent sentences and splitting where similarity drops significantly. It is slower to index but produces more coherent retrieval units.
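The breakpoint logic can be sketched as follows. The bag-of-words `toy_embed` is a stand-in for a real embedding model, used only so the example is self-contained; the threshold value is illustrative:

```python
import math
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding model: bag-of-words token counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk wherever adjacent-sentence similarity drops."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(sent)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

With real embeddings the comparison is the same; only the vectors and the threshold tuning change.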
Document-specific strategies are often the best answer. PDF tables should be extracted and chunked differently than paragraph text. Code blocks should stay intact. FAQ formats (question + answer pairs) should chunk by Q&A pair, not by token count. Generic chunking ignores document structure; good chunking preserves it.
Embedding Model Selection
OpenAI's default embedding model (text-embedding-ada-002, since superseded by text-embedding-3-small and text-embedding-3-large) is not always the right choice. Embedding models differ meaningfully on domain-specific retrieval tasks, and the best model for general web text is not necessarily the best for legal documents, medical literature, or code.
The MTEB (Massive Text Embedding Benchmark) leaderboard is the standard benchmark for embedding model evaluation. It covers multiple retrieval tasks across different domains. For any serious production deployment, run your own evaluation on a sample of your actual query-document pairs; aggregate benchmarks are a starting point, not a decision.
Practical considerations beyond accuracy: latency (larger models are slower), cost (proprietary APIs vs. open-source self-hosted), context window (some embedding models handle long documents better than others), and dimensionality (higher dimensions mean larger storage and slower similarity search at scale).
For cost-sensitive applications, a smaller fast embedding model for initial retrieval followed by a larger cross-encoder for re-ranking often outperforms using a large embedding model alone — at lower cost and similar latency.
Hybrid Search: Dense + Sparse
Pure semantic (vector) search misses exact keyword matches. If a user asks about "CVE-2024-44308" and your document contains that exact string, a semantic search may not retrieve it because the embedding of a CVE number is not semantically meaningful.
Hybrid search combines dense vector retrieval (semantic similarity) with sparse keyword retrieval (BM25 or similar). Most vector databases now support hybrid search natively — Pinecone, Weaviate, Qdrant, and Elasticsearch all have implementations.
The combination is controlled by a weighting parameter (often called alpha): 0 is pure sparse, 1 is pure dense. A value of 0.7-0.8 works well as a starting point for most domains, with exact-match-heavy domains (legal, medical, technical documentation with specific identifiers) benefiting from lower alpha values.
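As a sketch of what the weighting does, assuming both retrievers return per-document scores (the min-max normalization and function name are illustrative; databases implement fusion differently, some using rank-based methods instead):

```python
def hybrid_scores(dense, sparse, alpha=0.75):
    """Convex combination of normalized dense and sparse scores per document."""
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = norm(dense), norm(sparse)
    docs = set(d) | set(s)
    # A document missing from one retriever scores 0 on that side.
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in docs}
```

Normalization matters: BM25 scores and cosine similarities live on different scales, so combining them raw silently skews the blend.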
# Qdrant hybrid search via prefetch + reciprocal-rank fusion
# (assumes the collection was created with named vectors "dense" and "sparse")
from qdrant_client import models

results = client.query_points(
    collection_name="documents",
    prefetch=[
        models.Prefetch(query=dense_vector, using="dense", limit=20),
        models.Prefetch(
            query=models.SparseVector(indices=sparse_indices, values=sparse_values),
            using="sparse",
            limit=20,
        ),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    limit=20,
)
Do not add hybrid search complexity without measuring whether it improves your specific retrieval task. On purely semantic queries over narrative text, pure dense search often performs comparably.
Re-Ranking: Improving Precision After Retrieval
Retrieve more, then rerank to fewer. This two-stage approach consistently outperforms trying to retrieve precisely in one shot.
The pattern: retrieve top-20 or top-30 candidates using fast vector search, then re-rank using a cross-encoder model that scores each candidate against the query directly (not via embedding similarity). Return the top 5-7 after re-ranking.
Cross-encoders are slower than embedding similarity (they run a full forward pass for each query-document pair) but significantly more accurate. The tradeoff is acceptable because re-ranking runs on tens of documents, not millions.
Cohere's Rerank API and the cross-encoder models from sentence-transformers are the two most common production choices. For latency-sensitive applications, a lightweight cross-encoder running locally often beats an API call.
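The two-stage pattern itself is simple. Here the `overlap_score` function is a deliberately crude stand-in for a real cross-encoder, used only to keep the sketch self-contained; in production `score_fn` would wrap something like a sentence-transformers CrossEncoder or Cohere's Rerank API:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Second stage: score each candidate directly against the query."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc),
                  reverse=True)[:top_n]

def overlap_score(query, doc):
    # Toy stand-in for a cross-encoder: fraction of query tokens in the doc.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)
```

The design point: the first stage only needs enough recall to get the right documents into the candidate pool; precision is the second stage's job.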
Evaluation Metrics That Actually Matter
The most common mistake in RAG evaluation is eyeballing outputs. You need quantitative metrics you can track over time and regression-test against.
Retrieval metrics:
- Context precision: Of the retrieved chunks, what fraction were actually relevant to answering the question? Low precision means noise in the context.
- Context recall: Of the relevant information in your corpus, what fraction did the retrieval actually surface? Low recall means answers will be missing information.
Generation metrics:
- Faithfulness: Does the generated answer only state things that are supported by the retrieved context? This is the hallucination metric for RAG specifically.
- Answer relevance: Is the generated answer actually responsive to the question? A faithful but off-topic answer is still a failure.
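With hand-labeled chunk IDs, the two retrieval metrics reduce to set arithmetic (automated frameworks instead estimate relevance with an LLM judge, but the definitions are the same):

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant."""
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that retrieval surfaced."""
    return sum(1 for c in relevant if c in retrieved) / len(relevant)
```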
RAGAS is the standard library for automated RAG evaluation. It implements all of these metrics using LLM-as-judge scoring, which makes it scalable to large evaluation sets without requiring human annotation for every run.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# eval_dataset holds one row per example: question, generated answer,
# and the retrieved contexts (context_precision also expects a reference answer)
result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)
Build an evaluation set of 50-100 representative question-answer pairs before launching. Run this evaluation every time you change your chunking strategy, embedding model, retrieval parameters, or generation prompt. Treat any regression as a bug.
What Breaks in Production
Query-document mismatch. Users write short queries; documents contain long prose. The embedding space between these is not as aligned as it looks in demos. Query expansion (rewriting the query in multiple forms before retrieval) and hypothetical document embeddings (HyDE — generating a hypothetical answer and embedding that instead of the query) both address this. Test which works better for your distribution.
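Once you retrieve with several query variants (the variants themselves would come from an LLM rewrite step, not shown here), the ranked lists need merging. Reciprocal-rank fusion is a common choice; this is a self-contained sketch, with `k=60` as the conventional damping constant:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal-rank fusion of ranked lists from several query variants."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            # Documents ranked highly in any list accumulate the most score.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```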
Stale embeddings. When you update source documents, you need a re-indexing strategy. Partial re-indexing (only changed documents), full re-indexing, and online update strategies all have different complexity and consistency tradeoffs. Plan this before launch, not after.
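A minimal version of partial re-indexing is content hashing: store a hash per document at index time, and on the next run re-embed only what changed. A sketch (the storage layer for `stored_hashes` is up to you):

```python
import hashlib

def docs_to_reindex(docs, stored_hashes):
    """Return only the documents whose content changed since the last run."""
    changed = {}
    for doc_id, text in docs.items():
        h = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(doc_id) != h:
            changed[doc_id] = h  # re-embed these; skip everything else
    return changed
```

Deleted documents need the inverse check (IDs in `stored_hashes` but not in `docs`) so their vectors are removed, not just left stale.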
Retrieval winning, generation losing. Your retrieval is perfect — the right context is in the retrieved documents — but the LLM generates an incorrect or incomplete answer anyway. This is a generation problem, not a retrieval problem. Inspect the two failure modes separately; they have different fixes.
Context window overflow. At large chunk sizes and high top-k values, the combined retrieved context exceeds the LLM's context window or becomes so long that the model loses focus on the relevant parts. Lost-in-the-middle degradation is real: LLMs perform best when the relevant information appears at the beginning or end of a long context, not in the middle.
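One common mitigation is to reorder retrieved documents so the strongest hits sit at the edges of the prompt rather than ranked top-to-bottom (LangChain ships a similar transformer as LongContextReorder; this sketch is a from-scratch version):

```python
def reorder_for_context(docs_by_relevance):
    """Place the strongest documents at the edges of the prompt and the
    weakest in the middle, countering lost-in-the-middle degradation."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```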
Missing metadata filtering. A retrieval system that cannot filter by document type, date, department, or other metadata will return irrelevant documents when users ask department-specific or time-bounded questions. Add metadata at indexing time; it is very hard to add later.
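The semantics are just a conjunction of equality checks over chunk metadata, sketched below with illustrative field names. In production, push the filter into the vector database as a pre-filter at query time (Qdrant, Pinecone, and Weaviate all support this) rather than post-filtering results, or recall silently drops:

```python
def filter_candidates(candidates, **required):
    """Drop retrieved chunks whose metadata fails the requested filters."""
    return [c for c in candidates
            if all(c["metadata"].get(k) == v for k, v in required.items())]
```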
Going Deeper
If you want to go from these concepts to building and evaluating a production RAG system end-to-end, Phase 4 of the Agentic AI course at MindloomHQ covers the complete implementation.
The 10 lessons cover vector store architecture, chunking pipelines, hybrid search implementations, re-ranking patterns, evaluation frameworks, and a full production-grade RAG system as the phase project. Every lesson has complete code implementations, not snippets. Phase 0 and Phase 1 are completely free to start, no credit card required.