Seeking arXiv cs.CL endorser for first-time submission [D]
Hi all — I'm an independent researcher in rural Manitoba submitting my first paper to arXiv cs.CL and looking for an endorser.
Paper: The Multivac: Blind Peer Matrix Evaluation of Frontier Language Models
Methodology: A fully symmetric N×N multi-judge evaluation in which 10 frontier LLMs simultaneously generate responses and blindly judge each other's outputs, with self-judgments excluded. This extends Verga et al.'s Panel of LLM Judges (PoLL) to a design where every model serves as both respondent and judge.
Scale: 286 evaluations, 198 de novo questions, 9 category pools (code, reasoning, analysis, communication, meta-alignment, edge cases, plus focused SLM/Qwen/MiniMax batches), 55 models total, and 22,254 valid judgments (27,540 total before excluding self-judgments).
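For anyone curious what "fully symmetric with self-judgments excluded" means concretely, here is a minimal sketch of the matrix loop. The names `generate` and `judge` are hypothetical placeholders, not the actual framework API:

```python
from itertools import product

def run_matrix(models, question, generate, judge):
    """Every model answers the question; every model then blindly
    scores every OTHER model's answer (self-judgments excluded)."""
    answers = {m: generate(m, question) for m in models}
    scores = {}
    for judge_model, author in product(models, models):
        if judge_model == author:
            continue  # exclude self-judgment
        scores[(judge_model, author)] = judge(
            judge_model, question, answers[author]
        )
    return scores
```

With 10 models this yields 10×10 − 10 = 90 judgments per question, which is where the gap between the total and valid judgment counts comes from.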
Key findings:
- No single model dominates — 6 different models lead the 9 category pools. The model with the most first-place finishes (GPT-5.4, 53 wins) ranks 16th by mean score.
- Same-family rating bias is statistically significant in all 8 families tested, ranging from +0.91 (Qwen) to −1.02 (Mistral). The negative bias pattern in Mistral/Google appears previously unreported.
- Top 4 frontier models are pairwise statistically indistinguishable (overlapping bootstrap 95% CIs, all p > 0.07). Aggregate leaderboard differences in the top tier aren't statistically meaningful.
- Code evaluation has ~2× the judge disagreement of meta-alignment (σ = 1.27 vs 0.71). Overall Krippendorff's α = 0.618.
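On the "pairwise statistically indistinguishable" finding: the test is whether the bootstrap 95% CI on the difference in mean score between two models contains zero. A minimal sketch, assuming a plain percentile bootstrap (the paper's exact procedure may differ):

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI on mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each model's scores with replacement
        a = [rng.choice(scores_a) for _ in scores_a]
        b = [rng.choice(scores_b) for _ in scores_b]
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi  # models are "indistinguishable" if lo <= 0 <= hi
```

If the interval straddles zero for every pair in the top tier, the aggregate leaderboard ordering among those models is noise.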
Open release: Full dataset (27,540 judgments with complete provenance), evaluation framework, and all 198 question prompts released under MIT license. Repo: https://github.com/themultivac/multivac-evaluation
The ask: Since this is my first arXiv cs.* submission, I need endorsement from an existing author. If you're eligible (2+ cs.CL papers on arXiv in the last 5 years) and willing to take a look, the endorsement code is S33JQD and the link is https://arxiv.org/auth/endorse?x=S33JQD — it takes about 30 seconds.
Happy to share the PDF with anyone willing to review. Also open to feedback/critique on the methodology before I finalize v1.
Thanks for considering.