RAG In Production: The Playbook

Building a RAG demo takes an afternoon. Building a RAG system that actually answers user questions correctly in production takes a quarter. The gap is mostly retrieval quality, and almost everything that matters happens after you “throw documents into a vector DB and call it a day”.

Here’s the playbook we use on every client RAG build.

Chunking Is The Most Underrated Lever

Bad chunking ruins retrieval no matter how good your embedder or reranker is. Three rules:

Chunk by semantic unit, not character count. A paragraph is a unit. A section is a unit. A 1,000-character window cut mid-sentence is not.
Overlap chunks by 10–20% so context that spans a boundary isn’t lost.
Store the parent alongside each chunk. When retrieval hits, you can return the parent section, not the isolated chunk, to the LLM.
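
A minimal sketch of the three rules, assuming sections have already been extracted and paragraphs are separated by blank lines. The names Chunk, chunk_section and expand_to_parents are hypothetical, and overlap is expressed in whole paragraphs rather than as a percentage to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    parent_id: str  # key of the section this chunk was cut from


def chunk_section(parent_id: str, section: str, max_chars: int = 800,
                  overlap_paragraphs: int = 1) -> list[Chunk]:
    """Group whole paragraphs into chunks; never cut inside a paragraph."""
    paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append(Chunk("\n\n".join(current), parent_id))
            # Carry the trailing paragraph(s) forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
        current.append(para)
    if current:
        chunks.append(Chunk("\n\n".join(current), parent_id))
    return chunks


def expand_to_parents(hits: list[Chunk], parents: dict[str, str]) -> list[str]:
    """Swap retrieved chunks for their parent sections, de-duplicated in hit order."""
    ordered_ids = list(dict.fromkeys(hit.parent_id for hit in hits))
    return [parents[pid] for pid in ordered_ids]
```

A single paragraph longer than max_chars is kept whole here; a real build would likely fall back to sentence-level splitting for that case.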

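For PDFs and HTML, parse the structural elements (headings, sections) first and chunk within them. LlamaIndex ships structure-aware parsers such as HTMLNodeParser and MarkdownNodeParser for this. Its SemanticSplitterNodeParser is a different tool: it picks breakpoints by embedding similarity between adjacent sentences, which helps inside a long section that has no structure of its own. Use them rather than hand-rolling a splitter.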

Reranking Is Cheap And Massively Improves Quality

Standard vector search returns the top-K chunks by cosine similarity. That’s a coarse filter. A reranker (Cohere Rerank, BGE-Reranker, or a small fine-tuned model) takes those K candidates, scores each one jointly with the query, and re-orders them by that relevance score.

The cost is one extra API call. The quality gain is often the difference between “demo works” and “users trust it”. Always rerank. Retrieve K=50, rerank to K=5, send those to the LLM.
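
The retrieve-wide, rerank-narrow step might look like the sketch below. Both callables are assumptions: retrieve stands in for your vector store query and score_pair for whichever reranker you call, so no specific vendor API is implied.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], Sequence[str]],
    score_pair: Callable[[str, str], float],
    retrieve_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Fetch a wide candidate set, then keep only the best-scoring chunks."""
    candidates = retrieve(query, retrieve_k)
    # Score every (query, chunk) pair with the reranker.
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    # Highest relevance first; ties keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:final_k]]
```

Hosted rerankers usually accept the whole candidate list in one request, so in practice score_pair would be batched rather than called once per chunk.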

Hybrid Search Beats Pure Vector Search

Vector search is good at concepts. It’s bad at exact terms — product codes, model numbers, function names. Hybrid search runs vector search and keyword (BM25) search in parallel and fuses the results.

Pinecone, Weaviate, and Qdrant all support hybrid. The fusion is usually reciprocal rank fusion (RRF) or a weighted score. Hybrid wins on almost every benchmark and almost every real query distribution we’ve seen.
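
If you fuse the two result lists yourself rather than relying on the database, reciprocal rank fusion is only a few lines. This is a sketch: the function name is hypothetical, and k=60 is the conventional smoothing constant, not a value taken from any particular vendor.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]
bm25_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF only looks at ranks, so it needs no score normalisation between the vector and BM25 sides. A weighted-score fusion does need that normalisation.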

Query Rewriting Closes The User-Vocabulary Gap

Users ask questions in their own words. Your documents use the author’s words. Query rewriting bridges the gap.

The cheap version: before retrieval, ask the LLM to rewrite the query into 2–3 alternative phrasings and union the results. The expensive version: fine-tune a small rewriter on your domain.

Add HyDE (Hypothetical Document Embeddings) as a complement: generate a fake “ideal answer” with the LLM and embed that, instead of the raw query, for retrieval. It is counter-intuitive, but often a 10–15% recall improvement.
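
Both ideas can be sketched with injected callables. Here rewrite and draft_answer stand in for LLM calls and retrieve for an embed-and-search call; all three, and the function names, are assumptions made for illustration.

```python
from typing import Callable, Sequence


def multi_query_retrieve(
    query: str,
    rewrite: Callable[[str], Sequence[str]],
    retrieve: Callable[[str], Sequence[str]],
) -> list[str]:
    """Retrieve for the original query plus its rewrites and union the hits."""
    seen: set[str] = set()
    merged: list[str] = []
    for variant in [query, *rewrite(query)]:
        for doc_id in retrieve(variant):
            if doc_id not in seen:  # keep first-seen order, drop duplicates
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def hyde_retrieve(
    query: str,
    draft_answer: Callable[[str], str],
    retrieve: Callable[[str], Sequence[str]],
) -> list[str]:
    """HyDE: search with a hypothetical answer instead of the raw question."""
    hypothetical = draft_answer(query)
    return list(retrieve(hypothetical))
```

The union has no meaningful order across variants, so it is best passed to a reranker or a rank-fusion step before anything reaches the LLM.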

Evaluate With RAGAS Or You’re Flying Blind

You can’t improve what you don’t measure. RAGAS is a widely used eval framework for RAG. It scores:

Faithfulness — is the answer grounded in the retrieved context?
Answer relevance — does it actually address the question?
Context precision — how much of what we retrieved was useful?
Context recall — did we miss anything important?

Build a golden set of 50–200 question + expected-answer pairs. Run RAGAS in CI on every prompt or retrieval change. When the numbers go up or down, you know which direction your change pushed quality.
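
One way to turn those numbers into a CI gate is to compare each run against a stored baseline. The sketch below assumes the metric scores have already been computed on the golden set and are passed in as plain dictionaries. The function name, the metric keys and the 0.02 tolerance are illustrative assumptions, not RAGAS API.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics that dropped by more than `tolerance`, with the size of the drop."""
    regressions: dict[str, float] = {}
    for metric, base_score in baseline.items():
        # A metric missing from the current run counts as a score of 0.
        drop = base_score - current.get(metric, 0.0)
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


baseline = {"faithfulness": 0.90, "answer_relevance": 0.85,
            "context_precision": 0.70, "context_recall": 0.80}
current = {"faithfulness": 0.91, "answer_relevance": 0.84,
           "context_precision": 0.60, "context_recall": 0.80}
failed = find_regressions(baseline, current)
```

A CI job would fail the build when the returned dictionary is non-empty. The tolerance absorbs run-to-run noise from LLM-based metrics.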

The Architecture That Wins

This is the pipeline we ship on most client builds:

User question → query rewriting (LLM, 2–3 variants)
Hybrid retrieval (vector + BM25) → top 50 candidates
Cross-encoder reranker → top 5
LLM answer generation with explicit “cite sources” instruction
Post-hoc faithfulness check (LLM-as-judge) → if low, fall back to “I don’t know”

Each layer compounds. Skip the rerank and quality drops 20%. Skip the faithfulness check and you ship hallucinations to prod.
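
The five stages above can be wired together as in the sketch below. Every stage is an injected callable, so nothing here is tied to a specific vendor. The class name, the signatures and the 0.7 faithfulness threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

FALLBACK = "I don't know."


@dataclass
class RagPipeline:
    rewrite: Callable[[str], Sequence[str]]                      # LLM query variants
    hybrid_retrieve: Callable[[str, int], Sequence[str]]         # vector + BM25
    rerank: Callable[[str, Sequence[str], int], Sequence[str]]   # cross-encoder
    generate: Callable[[str, Sequence[str]], str]                # answer with citations
    judge_faithfulness: Callable[[str, Sequence[str]], float]    # LLM-as-judge, 0..1
    min_faithfulness: float = 0.7                                # assumed threshold

    def answer(self, question: str) -> str:
        # Stages 1 and 2: retrieve for the question and each rewrite, union the hits.
        # The union can exceed 50 chunks when several variants are used.
        candidates: list[str] = []
        for variant in [question, *self.rewrite(question)]:
            for chunk in self.hybrid_retrieve(variant, 50):
                if chunk not in candidates:
                    candidates.append(chunk)
        # Stage 3: narrow to the five most relevant chunks.
        context = self.rerank(question, candidates, 5)
        # Stage 4: generate the answer from that context.
        draft = self.generate(question, context)
        # Stage 5: refuse rather than ship an ungrounded answer.
        if self.judge_faithfulness(draft, context) < self.min_faithfulness:
            return FALLBACK
        return draft
```

Keeping each stage behind a plain callable makes it straightforward to swap one layer out and re-run the eval set to see what that layer contributes.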

What’s New In 2026

GraphRAG — graph-based retrieval is finally cheap enough to use, and it shines on multi-hop questions
ColBERT-style late-interaction retrievers are competitive with dense + rerank pipelines
Agent-augmented RAG — let the agent decide whether to retrieve, what to retrieve, and whether the retrieved chunks are sufficient
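
For readers new to late interaction, the ColBERT-style scoring rule is small enough to sketch in plain Python. The sketch assumes each query and document has already been encoded into one vector per token, and that the vectors are unit-normalised so a dot product behaves like cosine similarity. The function name is hypothetical.

```python
def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction relevance: sum over query tokens of the best-matching doc token."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    # For each query token, take its maximum similarity to any document token,
    # then add those maxima up.
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]       # two query tokens
doc_a = [[1.0, 0.0], [0.0, 1.0]]       # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]       # covers only the first
score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Because matching happens per token, a document has to cover every part of the query to score well, which is where this approach recovers some of the exact-term sensitivity a single dense vector loses.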

The Agentic AI Bootcamp covers each of these in Module 2 and revisits them in the Capstone 1 build.
