From Machine Learning

DeepSWE: new benchmark looking at how well today's frontier models can actually write code [R]

DeepSWE delivers four advances over existing public benchmarks:

  • Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
  • High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
  • Real-world complexity: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
  • Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.
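To make the last point concrete, here is a minimal sketch of the difference between testing behavior and testing implementation details. This is not code from DeepSWE: `slugify`, `slugify_alt`, and both checker functions are hypothetical names made up for illustration.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical solution to a task: turn a title into a URL slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def slugify_alt(title: str) -> str:
    """A different implementation (no regex) with the same behavior on ASCII input."""
    chars = [c if c.isalnum() else " " for c in title.lower()]
    return "-".join("".join(chars).split())


def behavioral_verifier(fn) -> bool:
    """Checks only observable input/output behavior, so any correct solution passes."""
    cases = {
        "Hello, World!": "hello-world",
        "  DeepSWE  benchmark ": "deepswe-benchmark",
        "already-a-slug": "already-a-slug",
    }
    return all(fn(text) == expected for text, expected in cases.items())


def implementation_detail_check(fn) -> bool:
    """Brittle check (shown as an anti-pattern): requires the solution to call `sub`."""
    return "sub" in fn.__code__.co_names


for candidate in (slugify, slugify_alt):
    print(
        candidate.__name__,
        "behavior:", behavioral_verifier(candidate),
        "impl-detail:", implementation_detail_check(candidate),
    )
```

Both implementations pass the behavioral verifier, but the implementation-detail check rejects the regex-free version even though it is equally correct. That kind of false negative is what a behavior-focused verifier avoids.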

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.

https://preview.redd.it/lacvagyr159h1.png?width=1373&format=png&auto=webp&s=6514340a15d51d7f03da733f08fb3f6a302cac75

It's open-source: https://github.com/datacurve-ai/deep-swe

submitted by /u/we_are_mammals