Model/Coding Evals and the Converging Landscape
Cognition, the makers of Devin, dropped an interesting blog post on FrontierCode, their new coding eval framework, which they state will measure how easy the generated code is to maintain as well as whether it works.
This focus on maintenance is a really interesting approach, and has a lot of merit as we move into an era where a significant amount of AI-generated code will need to be sustained as part of production systems. Cognition get a lot of extra points in my mind for including Java and C/C++ in their evaluation. Not everything is greenfield or a rewrite from scratch.
Cognition’s blog follows interesting work from Cursor, which published Cursorbench a few months ago, and from DataCurve, which released DeepSWE.
Now all of these evals are trying to address gaps that exist in the widely used SWE-bench family. SWE-bench has encountered the age-old problem of all benchmarks: people will lay train tracks through their tooling to improve marketing metrics.
That said, SWE-bench Pro is very widely used, and provides a little glimpse into how everything is converging. I keep an ongoing chart of results and trends across major models, which I update every month or so. The trend is pretty clear: for many basic tasks we are fast hitting a convergence point. We are well past the point where you need the latest and greatest model for a lot of day-to-day work.