Building a RAG demo takes an afternoon. Building a RAG system that actually answers user questions correctly in production takes a quarter. The gap is mostly about retrieval quality, and almost everything that matters happens after you “throw documents into a vector DB and call it a day”.
Here’s the playbook we use on every client RAG build.
Chunking Is The Most Underrated Lever
Bad chunking ruins retrieval no matter how good your embedder or reranker is. Three rules:
- Chunk by semantic unit, not character count. A paragraph is a unit. A section is a unit. A 1,000-character window cut mid-sentence is not.
- Overlap chunks by 10–20% so context that spans a boundary isn’t lost.
- Store the parent alongside each chunk. When retrieval hits, you can return the parent section, not the isolated chunk, to the LLM.
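For PDFs and HTML, parse the structural elements (headings, sections) first and chunk within them. LlamaIndex ships parsers that help here: HTMLNodeParser and MarkdownNodeParser split on document structure, while SemanticSplitterNodeParser picks breakpoints from the embedding similarity between adjacent sentences rather than from headings. Use them instead of hand-rolling a splitter.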
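The three chunking rules can be sketched in a few lines of plain Python. This is a minimal illustration, not a production splitter: `Chunk`, `chunk_section`, `max_chars`, and `overlap_paragraphs` are hypothetical names, paragraphs are assumed to be separated by blank lines, and the overlap is expressed as whole paragraphs rather than as a 10–20% character budget.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    parent_id: str  # key of the section this chunk was cut from


def chunk_section(section_id: str, section_text: str,
                  max_chars: int = 800,
                  overlap_paragraphs: int = 1) -> list[Chunk]:
    """Group whole paragraphs into chunks; never cut inside a paragraph."""
    paragraphs = [p.strip() for p in section_text.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    current: list[str] = []
    for para in paragraphs:
        # Flush when adding this paragraph would exceed the size budget.
        if current and len("\n\n".join(current + [para])) > max_chars:
            chunks.append(Chunk("\n\n".join(current), section_id))
            # Carry the trailing paragraph(s) forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        # A single paragraph longer than max_chars stays whole (oversized chunk).
        current.append(para)
    if current:
        chunks.append(Chunk("\n\n".join(current), section_id))
    return chunks
```

At query time, the `parent_id` on a retrieved chunk is the lookup key for the full section text, so the LLM can be handed the parent instead of the fragment.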
Reranking Is Cheap And Massively Improves Quality
Standard vector search returns the top-K chunks by cosine similarity. That’s a coarse filter. A reranker (Cohere Rerank, BGE-Reranker, or a small fine-tuned model) takes those K candidates, scores each one against the query directly (typically with a cross-encoder that reads the query and the chunk together), and re-orders them by that score.
The cost is one extra API call. The quality gain is often the difference between “demo works” and “users trust it”. Always rerank. Retrieve K=50, rerank to K=5, send those to the LLM.
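The retrieve-wide, rerank-narrow step looks roughly like this. It is a sketch under assumptions: `rerank` and `token_overlap` are hypothetical names, and `token_overlap` is only a toy stand-in so the example runs without a model. In a real build the scoring function would call a cross-encoder or a hosted rerank API.

```python
from typing import Callable


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float],
           top_n: int = 5) -> list[str]:
    """Re-order first-stage candidates by a (query, chunk) relevance score."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def token_overlap(query: str, chunk: str) -> float:
    """Toy scorer (NOT a real reranker): share of query tokens in the chunk."""
    q_tokens = set(query.lower().split())
    return len(q_tokens & set(chunk.lower().split())) / max(len(q_tokens), 1)
```

Keeping the scorer behind a plain callable makes it easy to swap rerankers and compare them on the same candidate set.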
Hybrid Search Beats Pure Vector Search
Vector search is good at concepts. It’s bad at exact terms — product codes, model numbers, function names. Hybrid search runs vector search and keyword (BM25) search in parallel and fuses the results.
Pinecone, Weaviate, and Qdrant all support hybrid. The fusion is usually reciprocal rank fusion (RRF) or a weighted score. Hybrid wins on almost every benchmark and almost every real query distribution we’ve seen.
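Reciprocal rank fusion is small enough to write out. Each document scores the sum of 1 / (k + rank) over every ranking it appears in, so it needs only ranks, not comparable scores. In this sketch the function name is ours, and `k=60` is a commonly used default rather than something tuned for any particular corpus.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranked list."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]  # illustrative ids from vector search
bm25_hits = ["c", "a", "d"]    # illustrative ids from keyword search
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that appear in both lists ("a" and "c" here) rise above documents that only one retriever found.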
Query Rewriting Closes The User-Vocabulary Gap
Users ask questions in their own words. Your documents use the author’s words. Query rewriting bridges the gap.
The cheap version: before retrieval, ask the LLM to rewrite the query into 2–3 alternative phrasings and union the results. The expensive version: fine-tune a small rewriter on your domain.
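Add HyDE (Hypothetical Document Embeddings) as a complement: generate a hypothetical “ideal answer” with the LLM and embed that, instead of the raw question, for retrieval. It sounds counter-intuitive, but a plausible answer tends to sit closer to real answer passages in embedding space than a short question does. It is often worth a 10–15% recall improvement.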
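Both the cheap rewriting approach and HyDE reduce to a thin wrapper around whatever retriever you already have. The sketch below assumes three hypothetical callables that you supply: `rewrite` (LLM call returning alternative phrasings), `generate_answer` (LLM call returning a hypothetical answer), and `retrieve` (embeds the given text and returns document ids).

```python
from typing import Callable


def multi_query_retrieve(query: str,
                         rewrite: Callable[[str], list[str]],
                         retrieve: Callable[[str], list[str]]) -> list[str]:
    """Retrieve for the query and its rewrites; union in first-seen order."""
    seen: dict[str, None] = {}
    for variant in [query, *rewrite(query)]:
        for doc_id in retrieve(variant):
            seen.setdefault(doc_id, None)
    return list(seen)


def hyde_retrieve(query: str,
                  generate_answer: Callable[[str], str],
                  retrieve: Callable[[str], list[str]]) -> list[str]:
    """HyDE: search with a generated hypothetical answer, not the question."""
    return retrieve(generate_answer(query))
```

The hypothetical answer is only used as a search key. It is never shown to the user, so it does not matter if it contains invented details.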
Evaluate With RAGAS Or You’re Flying Blind
You can’t improve what you don’t measure. RAGAS is a widely used eval framework for RAG. It scores:
- Faithfulness — is the answer grounded in the retrieved context?
- Answer relevance — does it actually address the question?
- Context precision — how much of what we retrieved was useful?
- Context recall — did we miss anything important?
Build a golden set of 50–200 question + expected-answer pairs. Run RAGAS in CI on every prompt or retrieval change. When the numbers go up or down, you know which direction your changes pushed quality.
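A CI gate can be as simple as comparing the new scores against a stored baseline. This sketch does not call RAGAS itself; it assumes you have already turned an eval run into a plain dict of metric name to score. The function name, the metric keys, and the 0.02 tolerance are illustrative assumptions.

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.02) -> dict[str, float]:
    """Return metrics that dropped by more than `tolerance`, with their delta."""
    return {
        name: round(current.get(name, 0.0) - base, 4)
        for name, base in baseline.items()
        # A metric missing from the current run counts as a score of 0.0.
        if base - current.get(name, 0.0) > tolerance
    }


baseline = {"faithfulness": 0.90, "context_recall": 0.80}  # made-up numbers
current = {"faithfulness": 0.91, "context_recall": 0.70}   # made-up numbers
regressions = find_regressions(baseline, current)
```

In CI, a non-empty `regressions` dict would fail the job and print which metric moved, so a prompt tweak that helps faithfulness but hurts recall is caught before merge.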
The Architecture That Wins
This is the pipeline we ship on most client builds:
- User question → query rewriting (LLM, 2–3 variants)
- Hybrid retrieval (vector + BM25) → top 50 candidates
- Cross-encoder reranker → top 5
- LLM answer generation with explicit “cite sources” instruction
- Post-hoc faithfulness check (LLM-as-judge) → if low, fall back to “I don’t know”
Each layer compounds. Skip the rerank and quality drops 20%. Skip the faithfulness check and you ship hallucinations to prod.
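Wired together, the pipeline is a short function over five pluggable stages. This is a sketch under assumptions, not our production code: every stage is a hypothetical callable you provide, and the 0.7 faithfulness threshold is a placeholder you would tune against your golden set.

```python
from typing import Callable

FALLBACK = "I don't know"


def answer_question(question: str, *,
                    rewrite: Callable[[str], list[str]],
                    hybrid_retrieve: Callable[[str], list[str]],
                    rerank: Callable[[str, list[str]], list[str]],
                    generate: Callable[[str, list[str]], str],
                    judge_faithfulness: Callable[[str, list[str]], float],
                    min_faithfulness: float = 0.7) -> str:
    # Query rewriting + hybrid retrieval, de-duplicated, capped at 50 candidates.
    variants = [question, *rewrite(question)]
    candidates = list(dict.fromkeys(
        chunk for q in variants for chunk in hybrid_retrieve(q)
    ))[:50]
    if not candidates:
        return FALLBACK
    # Rerank against the original question and keep the top 5.
    top_chunks = rerank(question, candidates)[:5]
    # Generate; the prompt inside `generate` carries the "cite sources" instruction.
    answer = generate(question, top_chunks)
    # Post-hoc faithfulness check; refuse rather than ship a weak answer.
    if judge_faithfulness(answer, top_chunks) < min_faithfulness:
        return FALLBACK
    return answer
```

Because each stage is injected, you can stub any of them in tests or disable one to measure how much it contributes.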
What’s New In 2026
- GraphRAG — graph-based retrieval is finally cheap enough to use, and it shines on multi-hop questions
- ColBERT-style late-interaction retrievers are competitive with dense + rerank pipelines
- Agent-augmented RAG — let the agent decide whether to retrieve, what to retrieve, and whether the retrieved chunks are sufficient
The Agentic AI Bootcamp covers each of these in Module 2 and revisits them in the Capstone 1 build.