RAG architecture patterns: when to chunk, when to chunk differently
Retrieval-augmented generation is deceptively simple in theory: find relevant text, stuff it into the prompt, ask the model. In practice, the quality of a RAG system lives and dies on retrieval — and retrieval lives and dies on how you chunk.
Chunking is a modelling decision, not a config value
Teams treat chunk size as a number to tune. It is really a question about your documents. Dense legal contracts want small, overlapping chunks anchored on clause boundaries. Long-form articles want larger semantic chunks. Tables and structured data should not be chunked as prose at all, because a split in the middle of a table separates values from the headers that give them meaning.
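To make boundary-aware chunking concrete, here is a minimal sketch. It assumes the documents use Markdown-style heading lines starting with `#`, which will not hold for every corpus. The function name `chunk_on_headings` and the `overlap_ratio` parameter are made up for illustration. Overlap is taken as the last few words of the previous section.

```python
import re
from typing import List


def chunk_on_headings(text: str, overlap_ratio: float = 0.1) -> List[str]:
    """Split text at heading lines and carry a small overlap between chunks.

    Assumption: a heading is any line that starts with '#'.
    """
    # The lookahead splits *before* each heading, so a heading stays
    # attached to the section it introduces.
    parts = re.split(r"^(?=#)", text, flags=re.MULTILINE)
    sections = [p.strip() for p in parts if p.strip()]

    chunks: List[str] = []
    for i, section in enumerate(sections):
        if i == 0:
            chunks.append(section)
            continue
        # Overlap: the trailing words of the previous section (at least one).
        prev_words = sections[i - 1].split()
        n_overlap = max(1, int(len(prev_words) * overlap_ratio))
        tail = " ".join(prev_words[-n_overlap:])
        chunks.append(tail + "\n" + section)
    return chunks


doc = "# Intro\none two three four five six seven eight nine ten\n# Body\nalpha beta"
for chunk in chunk_on_headings(doc):
    print(repr(chunk))
```

For contracts, the same shape would apply with a clause pattern in place of the heading pattern. Tables would need a separate path that keeps rows together with their headers.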
Hybrid retrieval beats pure vector search
Dense embeddings are good at capturing meaning and weak at exact matches such as part numbers, names, and acronyms. Combining dense vector search with sparse keyword search (BM25), then re-ranking the merged set, usually outperforms either approach alone.
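One common way to merge the dense and sparse result lists before re-ranking is reciprocal rank fusion (RRF). The article does not prescribe a merge method, so treat this as one option among several. The sketch assumes each retriever returns document IDs ordered best-first. The constant `k = 60` is a conventional default, not a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several best-first rankings into one list of document IDs.

    Each document scores 1 / (k + rank) per list it appears in, so documents
    ranked well by more than one retriever rise to the top.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a vector index and a BM25 index.
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))  # ['a', 'c', 'b', 'd']
```

RRF uses only rank positions, which sidesteps the problem that cosine similarities and BM25 scores are on different scales.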
Re-ranking is where the magic is
Retrieve broadly, then re-rank tightly. Pulling the top 50 candidates and re-ranking them down to the best 5 with a cross-encoder often lifts answer quality more than prompt changes do. It adds some latency, since the cross-encoder scores every query-candidate pair separately, and usually pays for it in accuracy.
Most "the AI is hallucinating" bugs are actually "the retrieval gave it nothing useful" bugs.
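The retrieve-broadly, re-rank-tightly step might look like the sketch below. To stay self-contained it takes the scorer as a parameter and uses a toy token-overlap function, `token_overlap_score`, as a stand-in. That function is not a cross-encoder. In a real system you would pass in a function that scores each (query, passage) pair with a cross-encoder model. All names here are illustrative.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the top_n best."""
    scored = [(passage, score_fn(query, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def token_overlap_score(query: str, passage: str) -> float:
    """Toy stand-in scorer: fraction of query tokens found in the passage."""
    query_tokens = set(query.lower().split())
    passage_tokens = set(passage.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & passage_tokens) / len(query_tokens)


candidates = ["reset the router password", "router firmware notes", "lunch menu"]
top = rerank("reset router password", candidates, token_overlap_score, top_n=2)
print([passage for passage, _ in top])
```

Passing the scorer in keeps the pipeline unchanged when you swap the toy function for a real model, which also makes before-and-after comparisons straightforward.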
A practical default
- Chunk on natural boundaries (headings, clauses) with 10–15% overlap.
- Store metadata per chunk so you can filter before you search.
- Run hybrid dense + sparse retrieval.
- Re-rank the merged candidates with a cross-encoder.
- Always cite the source chunks back to the user.
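Per-chunk metadata and source citation, from the list above, can be sketched with a small record type. The `Chunk` class, the `doc_type` and `source` keys, and both helper functions are hypothetical. A real vector store will have its own filter syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_chunks(chunks: List[Chunk], **required: str) -> List[Chunk]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in required.items())
    ]


def format_citations(chunks: List[Chunk]) -> str:
    """Render a numbered source list to show alongside the answer."""
    return "\n".join(
        f"[{i}] {c.metadata.get('source', 'unknown')} ({c.chunk_id})"
        for i, c in enumerate(chunks, start=1)
    )


chunks = [
    Chunk("c1", "Clause 4.2 ...", {"source": "contract.pdf", "doc_type": "contract"}),
    Chunk("c2", "In this post ...", {"source": "blog.md", "doc_type": "article"}),
]
contracts = filter_chunks(chunks, doc_type="contract")
print(format_citations(contracts))  # [1] contract.pdf (c1)
```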
Start there, measure with a real eval set, and only add complexity when the numbers tell you to.
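For the measuring step, one simple retrieval metric is recall@k: of the chunks a human marked as relevant for a query, what fraction appears in the top k results? The sketch below assumes an eval set shaped as a mapping from query to a set of relevant chunk IDs. That format and the function names are illustrative choices.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    results: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every query in the gold set."""
    scores = [recall_at_k(results.get(q, []), rel, k) for q, rel in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical retriever output and hand-labelled relevance judgements.
results = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
gold = {"q1": {"a", "c"}, "q2": {"z", "w"}}
print(mean_recall_at_k(results, gold, k=2))  # 0.25
print(mean_recall_at_k(results, gold, k=3))  # 0.75
```

Tracking this number before and after each change (new chunking, hybrid retrieval, re-ranking) shows whether the added complexity is earning its keep.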