From r/MachineLearning

Three limitations I keep hitting with retrieval-augmented generation in production and I'm running out of ideas [D]

I've had a RAG system running in production for a few months now (legal domain, German regulatory documents). It handles 80% of queries well but there are three patterns where it fails predictably and I haven't found clean solutions.

The scatter problem.

Some questions need information from 8-10 different documents, each contributing just a small piece. Vector search finds chunks related to the query, but not chunks related to each other. So when someone asks something like "compare how notification deadlines work across different German federal states," the system finds the 2-3 state-specific documents that happen to match the query well and misses the rest. The answer looks complete but is actually partial. Cranking up k adds noise and burns tokens without reliably solving it, because the missing documents might use completely different terminology for the same concept.

I've thought about query decomposition (break the question into sub-queries per state) but that assumes the system knows upfront how many sub-queries to generate and what dimensions to decompose along. For a general-purpose research tool that feels brittle.
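For what it's worth, here's roughly the shape of the decomposition-and-merge pattern I've been considering. The keyword retriever and the hard-coded state list are illustrative stand-ins: in a real pipeline the retriever would be vector search, and the dimension values would have to come from an LLM or from metadata, which is exactly the brittle part.

```python
def keyword_retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query.

    Stand-in for the real vector search; only here to make the sketch runnable.
    """
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc_id: len(q_words & set(corpus[doc_id].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def decompose(query: str, dimension_values: list[str]) -> list[str]:
    """One sub-query per dimension value (federal state, in this example)."""
    return [f"{query} {value}" for value in dimension_values]

def scatter_retrieve(
    query: str,
    corpus: dict[str, str],
    dimension_values: list[str],
    k_per_sub: int = 2,
) -> list[str]:
    """Run each sub-query separately and merge results, deduplicated.

    The point is that every dimension value gets its own retrieval budget,
    so a state can't be crowded out just because its documents use
    different terminology than the original query.
    """
    seen: set[str] = set()
    merged: list[str] = []
    for sub_query in decompose(query, dimension_values):
        for doc_id in keyword_retrieve(sub_query, corpus, k=k_per_sub):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

This sidesteps the "how many sub-queries" question only when the dimension is enumerable upfront (16 federal states), which is why it doesn't generalize to an open-ended research tool.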

The negative knowledge problem.

When someone asks "do we have any guidance on employee monitoring" and the answer is genuinely no, the system can't cleanly say that. It retrieves whatever chunks are least irrelevant, and the LLM synthesizes something from them anyway. The user gets a confident-sounding answer about a tangentially related topic instead of a straightforward "this isn't covered in your knowledge base."

I've tried similarity score thresholds as a gate, but there's no clean boundary. A legitimate but unusual query might have low similarity scores, while a genuinely off-topic query might match some chunks reasonably well because of shared vocabulary. Every threshold I've tested either filters out too much or too little. The prompt instruction to admit uncertainty helps maybe 60% of the time; the other 40%, the model just reaches.
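One variant I haven't fully ruled out is combining two weak signals instead of one hard cutoff: require both a minimum number of chunks above a modest bar and a reasonably strong best match before answering at all. A minimal sketch; the threshold values are illustrative and would need calibration against labeled queries, which is the hard part.

```python
def should_abstain(
    scores: list[float],
    tau: float = 0.35,      # modest per-chunk similarity bar (illustrative)
    min_hits: int = 2,      # how many chunks must clear tau
    min_top: float = 0.5,   # higher bar the single best chunk must clear
) -> bool:
    """Return True when the system should say "not covered in the knowledge base".

    Combines two signals: enough chunks clear a modest bar, AND the best
    chunk clears a higher one. Either failing triggers abstention, which
    catches both "nothing relevant" and "one lucky vocabulary match".
    """
    if not scores:
        return True
    hits = [s for s in scores if s >= tau]
    return len(hits) < min_hits or max(scores) < min_top
```

This doesn't solve the no-clean-boundary problem, but two mediocre thresholds combined tend to misfire less often than one threshold doing all the work.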

The timeline problem.

Questions like "how did the interpretation of X change after the 2023 ruling" require the system to find pre-ruling documents, find post-ruling documents, understand the temporal relationship, and construct a comparative narrative. The metadata has document dates. The prompt says to respect temporal ordering. But the model struggles to build a coherent before/after story when the retrieved chunks don't explicitly reference each other. It tends to either merge everything into one flat answer or just cite the newer source and ignore the older interpretation.

This feels like it needs a fundamentally different retrieval approach (maybe temporal filtering at the search level, or separate retrievals for different time periods) rather than more prompt engineering.
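The separate-retrievals idea would look something like the sketch below: split the corpus on the ruling date using the document-date metadata that's already there, retrieve per period, and hand the model two explicitly labeled groups instead of one flat list. The word-overlap retriever is just a runnable stand-in for vector search; function names are mine, not from any library.

```python
from datetime import date

def keyword_retrieve(query: str, corpus: dict, k: int = 2) -> list[str]:
    """Toy retriever (word overlap) standing in for the real vector search."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc_id: len(q_words & set(corpus[doc_id]["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def before_after_retrieve(query: str, corpus: dict, pivot: date, k: int = 2) -> dict:
    """Split the corpus on the ruling date, then retrieve per period.

    Each document is {"text": ..., "date": date(...)}; `pivot` would be
    the 2023 ruling date. Downstream, the prompt can present the two
    groups explicitly ("pre-ruling sources" / "post-ruling sources")
    instead of hoping the model reconstructs the timeline from dates.
    """
    pre = {i: d for i, d in corpus.items() if d["date"] < pivot}
    post = {i: d for i, d in corpus.items() if d["date"] >= pivot}
    return {
        "before": keyword_retrieve(query, pre, k),
        "after": keyword_retrieve(query, post, k),
    }
```

Guaranteeing representation from both periods at least rules out the failure mode where the newer source crowds the older interpretation out of the context window entirely; it doesn't by itself fix the flat-merge problem.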

I've been reading about graph RAG approaches, agentic retrieval loops, and multi-hop reasoning chains but most of the literature is benchmarks on synthetic datasets, not production implementations. If anyone has actually deployed solutions for any of these three patterns I'd really like to hear what worked and what didn't. Especially interested in approaches that don't require restructuring the entire pipeline.

submitted by /u/Fabulous-Pea-5366


Tagged with

#retrieval-augmented generation
#legal domain
#German regulatory documents
#scatter problem
#vector search
#negative knowledge problem
#query decomposition
#employee monitoring