RAG architecture patterns: when to chunk, when to chunk differently

Rahul Patel

Head of Engineering · Apr 2025

Retrieval-augmented generation is deceptively simple in theory: find relevant text, stuff it into the prompt, ask the model. In practice, the quality of a RAG system lives and dies on retrieval, and retrieval lives and dies on how you chunk.

Chunking is a modelling decision, not a config value

Teams treat chunk size as a number to tune. It is really a question about your documents. Dense legal contracts want small, overlapping chunks anchored on clause boundaries. Long-form articles want larger semantic chunks. Tables and structured data should not be chunked as prose at all.
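To make the idea of boundary-aware chunking concrete, here is a minimal sketch. It assumes blank lines mark the natural boundaries, and the function name `chunk_by_boundaries` and its parameters `max_chars` and `overlap_ratio` are illustrative choices, not something prescribed above. For contracts you would likely split on clause markers instead.

```python
import re

def chunk_by_boundaries(text: str, max_chars: int = 800,
                        overlap_ratio: float = 0.12) -> list[str]:
    # Blank lines stand in for "natural boundaries"; swap the pattern for
    # heading or clause markers that fit your documents.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            # Carry the tail of the finished chunk forward as overlap.
            # This is a character-level cut, so it may start mid-word.
            overlap_len = int(len(current) * overlap_ratio)
            tail = current[-overlap_len:] if overlap_len else ""
            current = f"{tail}\n\n{para}" if tail else para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Note that this sketch never splits inside a paragraph, so a single paragraph longer than `max_chars` comes out as one oversized chunk. That is usually the right trade for clause-level text, and the wrong one for tables, which need their own handling.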

Hybrid retrieval beats pure vector search

Dense embeddings are great at meaning and terrible at exact matches like part numbers, names, and acronyms. Combining dense vector search with sparse keyword search (BM25) and then re-ranking the merged set consistently outperforms either approach alone.
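One common way to merge the dense and sparse result lists before re-ranking is reciprocal rank fusion (RRF). RRF is a choice made for this sketch, not a method named above. The document IDs and the two hit lists are hypothetical, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking is a list of document IDs, best first.
    # A document earns 1 / (k + rank) from every list it appears in.
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from the two retrievers, best first.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_d", "doc_a", "doc_b"]
merged = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF works on ranks only, so you never have to reconcile cosine similarities with BM25 scores, which live on different scales.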

Re-ranking is where the magic is

Retrieve broadly, then re-rank tightly. Pulling the top 50 candidates and re-ranking them down to the best 5 with a cross-encoder routinely lifts answer quality more than any prompt change. It costs a little latency and pays for itself in accuracy.
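The retrieve-then-re-rank step can be sketched as below. The scorer is passed in as a plain function so the example runs without a model. `token_overlap` is a toy stand-in for a cross-encoder, not a real one, and all names here are illustrative.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 5) -> list[str]:
    # Score every (query, candidate) pair, then keep the best top_n.
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_n]]

def token_overlap(query: str, passage: str) -> float:
    # Toy stand-in for a cross-encoder: the fraction of query tokens that
    # also appear in the passage. Replace with a real model in practice.
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)

# Hypothetical candidates, as if returned by a broad first-stage retrieval.
candidates = [
    "how to bake bread",
    "router warranty terms",
    "reset the router to factory settings",
]
top = rerank("reset router", candidates, token_overlap, top_n=2)
```

In a real system `candidates` would hold the 50 or so first-stage hits and `score_pair` would wrap a cross-encoder that reads the query and passage together.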

Most "the AI is hallucinating" bugs are actually "the retrieval gave it nothing useful" bugs.

A practical default

Chunk on natural boundaries (headings, clauses) with 10–15% overlap.
Store metadata per chunk so you can filter before you search.
Run hybrid dense + sparse retrieval.
Re-rank the merged candidates with a cross-encoder.
Always cite the source chunks back to the user.
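The metadata and citation steps in the list above might look like the following sketch. The `Chunk` shape, the `doc_type` and `source` metadata keys, and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict[str, str] = field(default_factory=dict)

def filter_chunks(chunks: list[Chunk], **required: str) -> list[Chunk]:
    # Keep only chunks whose metadata matches every required key/value,
    # so the search runs over a smaller, more relevant pool.
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in required.items())
    ]

def build_prompt(question: str, chunks: list[Chunk]) -> str:
    # Number each chunk so the model can cite it as [1], [2], ...
    context = "\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered sources and cite them like [1].\n\n"
        f"{context}\n\nQuestion: {question}"
    )

def format_citations(chunks: list[Chunk]) -> list[str]:
    # The same numbering, resolved back to a source the user can check.
    return [
        f"[{i}] {c.metadata.get('source', 'unknown')} ({c.chunk_id})"
        for i, c in enumerate(chunks, start=1)
    ]
```

Keeping the numbering identical in the prompt and in the citation list is what lets a `[2]` in the answer point back to a specific chunk.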

Start there, measure with a real eval set, and only add complexity when the numbers tell you to.
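For the measurement step, recall@k over a labelled set of questions is one simple retrieval metric to start with. It is a choice made for this sketch rather than a recommendation from the text above. The eval-set shape (a question paired with the set of relevant chunk IDs) and the `retrieve` callable are hypothetical.

```python
from typing import Callable

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the relevant chunk IDs that appear in the top k results.
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mean_recall_at_k(eval_set: list[tuple[str, set[str]]],
                     retrieve: Callable[[str], list[str]],
                     k: int = 5) -> float:
    # eval_set pairs each question with the IDs a good retriever should find.
    scores = [recall_at_k(retrieve(question), relevant, k)
              for question, relevant in eval_set]
    return sum(scores) / len(scores) if scores else 0.0
```

Run it before and after each change (chunk size, hybrid weighting, re-ranker) so every added piece of complexity has a number attached to it.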

Written by Rahul Patel

Head of Engineering at Satvix Tech Solutions