What's the theoretical basis for using LLM consensus as a probability estimator for real-world events? [R]

This is a genuine technical question. I've been looking at systems that use an ensemble of AI models to generate probability estimates for open-ended real-world events. The claim is that consensus across multiple models produces better-calibrated estimates than any single model.

This makes sense intuitively and has parallels to ensemble methods in traditional ML, but I'd like to examine the theoretical underpinnings more carefully.

The standard ensemble argument relies on errors being somewhat uncorrelated across models. But if all the models are trained on similar data distributions and share architectural similarities, how independent are their errors really? Are we just getting false confidence from models that all have the same blind spots?
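To make the concern concrete, here is a minimal sketch of the textbook variance argument. It assumes every model's error has the same variance `sigma2` and every pair of models has the same error correlation `rho`; both are simplifying assumptions for illustration, not measured properties of any real LLM ensemble, and the function name is hypothetical. Under those assumptions the variance of the averaged error is `sigma2 * (rho + (1 - rho) / n)`, so it never drops below `rho * sigma2` no matter how many models are added.

```python
def ensemble_error_variance(n_models: int, sigma2: float, rho: float) -> float:
    """Variance of the mean of n equally correlated errors.

    Assumes each error has variance sigma2 and each pair has correlation rho
    (an idealized equicorrelation model, not a claim about real LLMs).
    """
    if n_models < 1:
        raise ValueError("n_models must be at least 1")
    # Var(mean) = sigma2/n + ((n-1)/n) * rho * sigma2
    #           = sigma2 * (rho + (1 - rho) / n)
    return sigma2 * (rho + (1.0 - rho) / n_models)


if __name__ == "__main__":
    # Compare independent errors (rho = 0) with strongly shared errors (rho = 0.8).
    for rho in (0.0, 0.8):
        for n in (1, 5, 50):
            var = ensemble_error_variance(n, sigma2=1.0, rho=rho)
            print(f"rho={rho:.1f} n={n:3d} variance={var:.4f}")
```

With `rho = 0` the variance shrinks as `1/n`, which is the usual ensemble benefit. With a high `rho`, adding models barely helps, which is the "same blind spots" worry in quantitative form. Note that this only covers variance; a bias shared by all models is not reduced by averaging at all.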

I'm also curious how these systems handle events outside the distribution of their training data. Novel events are exactly where you'd want good probability estimates, and also exactly where you'd expect the least reliable performance.

submitted by /u/onlyJayal