From Machine Learning

Tested chunking + embeddings data from 3 production websites. [P]

Tiered + page-role-aware RAG retrieval results across 3 corpora with very different content density:

Workspace   Sources   Chunks   HIGH   MEDIUM   LOW    REJECTED
Intercom    188       941      96     200      541    104
HubSpot     251       1705     40     508      1153   4
KPMG        53        209      3      14       127    65

(HIGH = avg operational score 0.84, MEDIUM = 0.55-0.65, LOW = 0, REJECTED = nav/legal/careers)

87 of Intercom's 96 HIGH chunks are help-center articles. HubSpot's HIGH chunks are concrete case studies ("23% increase in ACV"). KPMG's HIGH tier is nearly empty (3 chunks) because the entire corpus is positioning prose.

Retrieval probes on KPMG (the worst-case corpus):

  • "Family business succession" → /private-enterprise.html (cosine 0.721)
  • "ESG and climate risk" → /our-insights/esg.html (cosine 0.794)
  • "Cybersecurity for energy sector" → /energy-natural-resources-chemicals.html (cosine 0.656)
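A minimal sketch of how a probe like the ones above could be scored, assuming chunks are stored as (URL, embedding) pairs. The three-dimensional vectors are made-up toy values, not the real embeddings, and `cosine` / `top_k` are hypothetical helper names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=3):
    """Return the k (score, url) pairs with the highest cosine similarity."""
    scored = [(cosine(query_vec, vec), url) for url, vec in chunks]
    scored.sort(reverse=True)
    return scored[:k]

# Toy embeddings (assumption): one axis per topic, purely for illustration.
CHUNKS = [
    ("/private-enterprise.html", [0.9, 0.1, 0.0]),
    ("/our-insights/esg.html", [0.1, 0.9, 0.1]),
    ("/energy-natural-resources-chemicals.html", [0.0, 0.2, 0.9]),
]

esg_query = [0.2, 1.0, 0.0]  # stands in for the embedded "ESG and climate risk" probe
for score, url in top_k(esg_query, CHUNKS):
    print(f"{score:.3f}  {url}")
```

In a real pipeline the vectors would come from whatever embedding model the corpus was indexed with; only the ranking logic is shown here.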

So semantic relevance routes correctly even on a thin corpus. Tier weighting (HIGH × 1.20) shifts the top-k composition meaningfully: on Q2, a HIGH chunk with cosine 0.535 gets reranked above LOW chunks with cosine 0.6+ (weighted 0.642 vs 0.51-0.59).
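The reranking arithmetic can be sketched roughly as follows. Only the HIGH multiplier (1.20) is stated in the post; the MEDIUM (1.00) and LOW (0.85) multipliers are assumptions, with 0.85 chosen because it maps cosines of 0.60-0.69 into the quoted 0.51-0.59 weighted range. The chunk ids and the LOW cosines are hypothetical:

```python
# HIGH weight is from the post; MEDIUM and LOW weights are assumptions.
TIER_WEIGHTS = {"HIGH": 1.20, "MEDIUM": 1.00, "LOW": 0.85}

def rerank(hits, k=3):
    """Sort (chunk_id, tier, cosine) hits by cosine * tier weight, descending."""
    weighted = [
        (cos * TIER_WEIGHTS[tier], cos, tier, chunk_id)
        for chunk_id, tier, cos in hits
    ]
    weighted.sort(key=lambda row: row[0], reverse=True)
    return weighted[:k]

# Hypothetical hits: two LOW chunks with higher raw cosine, one HIGH chunk at 0.535.
hits = [
    ("low-a", "LOW", 0.69),
    ("low-b", "LOW", 0.62),
    ("high-a", "HIGH", 0.535),
]

for weighted_score, cos, tier, chunk_id in rerank(hits):
    print(f"{chunk_id:7s} {tier:6s} cosine={cos:.3f} weighted={weighted_score:.3f}")
```

Under these assumed weights the HIGH chunk moves from last place on raw cosine to first place after weighting, which is the shift in top-k composition described above.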

Key takeaway: a "yield score" (HIGH+MEDIUM chunks / total chunks) is itself useful telemetry. That ratio is 31% for Intercom, 32% for HubSpot, and 8% for KPMG. It predicts, before generation, which brands will need softer claims and more swap-resistant phrasing.
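For reference, the yield score can be recomputed from the tier counts in the table. This is a small sketch; the dictionary layout and the `yield_score` name are illustrative, and the total is taken to include REJECTED chunks, since the per-tier counts sum to the Chunks column:

```python
# Tier counts copied from the results table.
COUNTS = {
    "Intercom": {"HIGH": 96, "MEDIUM": 200, "LOW": 541, "REJECTED": 104},
    "HubSpot": {"HIGH": 40, "MEDIUM": 508, "LOW": 1153, "REJECTED": 4},
    "KPMG": {"HIGH": 3, "MEDIUM": 14, "LOW": 127, "REJECTED": 65},
}

def yield_score(tiers):
    """(HIGH + MEDIUM) / total chunks, where total includes REJECTED."""
    total = sum(tiers.values())
    return (tiers["HIGH"] + tiers["MEDIUM"]) / total

for workspace, tiers in COUNTS.items():
    print(f"{workspace}: {yield_score(tiers):.0%}")
```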

Is anyone publishing benchmarks on this kind of corpus-quality awareness? Most RAG benchmarks assume the source material is uniformly substantive, which is rarely true in the wild.

submitted by /u/Otherwise_Economy576