Tested chunking + embeddings data from 3 production websites. [P]
Tiered + page-role-aware RAG retrieval results across 3 corpora with very different content density:
| Workspace | Sources | Chunks | HIGH | MEDIUM | LOW | REJECTED |
|---|---|---|---|---|---|---|
| Intercom | 188 | 941 | 96 | 200 | 541 | 104 |
| HubSpot | 251 | 1705 | 40 | 508 | 1153 | 4 |
| KPMG | 53 | 209 | 3 | 14 | 127 | 65 |
(HIGH = avg operational score 0.84, MEDIUM = 0.55-0.65, LOW = 0, REJECTED = nav/legal/careers)
87 of Intercom's 96 HIGH chunks are help-center articles. HubSpot's HIGH chunks are concrete case studies ("23% increase in ACV"). KPMG's HIGH tier is basically empty (3 of 209 chunks) because the entire corpus is positioning prose.
Retrieval probes on KPMG (the worst-case corpus):
- "Family business succession" → /private-enterprise.html (cosine 0.721)
- "ESG and climate risk" → /our-insights/esg.html (cosine 0.794)
- "Cybersecurity for energy sector" → /energy-natural-resources-chemicals.html (cosine 0.656)
So semantic relevance routes correctly even on a thin corpus. Tier weighting (HIGH × 1.20) shifts the top-k composition meaningfully: on Q2, a HIGH chunk with cosine 0.535 gets reranked above LOW chunks with cosine 0.6+ (weighted 0.642 vs 0.51-0.59).
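For anyone who wants to poke at the reranking arithmetic, here is a minimal sketch of tier-weighted scoring. Only the HIGH × 1.20 multiplier comes from the numbers above. The MEDIUM and LOW weights, the `Hit` type, the chunk ids, and the choice to drop REJECTED chunks before ranking are assumptions for illustration. The 0.85 LOW weight is just one value consistent with the quoted 0.6+ cosine mapping to 0.51-0.59 weighted.

```python
from dataclasses import dataclass

# HIGH x 1.20 is from the post. MEDIUM and LOW are assumed values:
# 0.85 for LOW is merely consistent with the quoted weighted range.
TIER_WEIGHTS = {"HIGH": 1.20, "MEDIUM": 1.00, "LOW": 0.85}


@dataclass(frozen=True)
class Hit:
    chunk_id: str  # hypothetical identifier
    tier: str      # "HIGH", "MEDIUM", "LOW", or "REJECTED"
    cosine: float  # raw cosine similarity to the query


def weighted_score(hit: Hit) -> float:
    """Raw cosine similarity scaled by the tier multiplier."""
    return hit.cosine * TIER_WEIGHTS[hit.tier]


def rerank(hits: list[Hit], k: int = 3) -> list[Hit]:
    """Drop tiers without a weight (REJECTED here), then sort by weighted score."""
    eligible = [h for h in hits if h.tier in TIER_WEIGHTS]
    return sorted(eligible, key=weighted_score, reverse=True)[:k]


# Illustrative hits, not real retrieval output.
hits = [
    Hit("low-a", "LOW", 0.69),
    Hit("low-b", "LOW", 0.60),
    Hit("high-a", "HIGH", 0.535),
    Hit("nav", "REJECTED", 0.70),
]

top = rerank(hits)
for h in top:
    print(h.chunk_id, h.tier, round(weighted_score(h), 3))
```

With these assumed weights the 0.535-cosine HIGH chunk lands first at 0.642, ahead of both LOW chunks, which matches the behavior described above.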
Key takeaway: a "yield score" ((HIGH + MEDIUM chunks) / total chunks) is itself useful telemetry. For Intercom that ratio is 31%. For HubSpot it's 32%. For KPMG it's 8%. That ratio predicts, before any generation happens, which brands will need softer claims and more swap-resistant phrasing.
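The yield score is easy to reproduce from the tier counts in the table. A short sketch, where the `TIER_COUNTS` dict layout and the `yield_score` function name are illustrative choices rather than anything from the actual pipeline:

```python
# Tier counts copied from the table above.
TIER_COUNTS = {
    "Intercom": {"HIGH": 96, "MEDIUM": 200, "LOW": 541, "REJECTED": 104},
    "HubSpot": {"HIGH": 40, "MEDIUM": 508, "LOW": 1153, "REJECTED": 4},
    "KPMG": {"HIGH": 3, "MEDIUM": 14, "LOW": 127, "REJECTED": 65},
}


def yield_score(counts: dict[str, int]) -> float:
    """(HIGH + MEDIUM) / total chunks, with REJECTED included in the total."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["HIGH"] + counts["MEDIUM"]) / total


for workspace, counts in TIER_COUNTS.items():
    print(f"{workspace}: {yield_score(counts):.0%}")
```

Note that the denominator here includes REJECTED chunks, which is what reproduces the 31% / 32% / 8% figures. Excluding them would give a different ratio.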
Anyone publishing benchmarks on this kind of corpus-quality awareness? Most RAG benchmarks assume the source material is uniformly substantive, which is wildly untrue in the wild.