Most Coding Agents Introduce Regressions on 75%+ of Maintenance Tasks Over Time
SWE-CI is a new benchmark that evaluates coding agents on long-term codebase maintenance through continuous integration loops rather than one-shot bug fixes. Most models introduced regressions on 75% or more of tasks. Only Claude Opus exceeded a 50% zero-regression rate (the share of tasks completed without introducing any regression).
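To make the metric concrete, here is a minimal sketch of how a zero-regression rate could be computed. It is not SWE-CI's actual scoring code. It assumes a regression means a test that passed after one CI iteration fails after the next, and that a task counts as clean only if this never happens. The function names, data layout, and task data are all hypothetical.

```python
from typing import Dict, List, Set


def has_regression(passing_per_iteration: List[Set[str]]) -> bool:
    """Return True if any test that passed after one iteration no longer passes after the next.

    Assumption: each element is the set of test names passing after one CI iteration.
    """
    for before, after in zip(passing_per_iteration, passing_per_iteration[1:]):
        if before - after:  # tests that passed before but are missing now
            return True
    return False


def zero_regression_rate(tasks: Dict[str, List[Set[str]]]) -> float:
    """Fraction of tasks whose whole iteration history contains no regression."""
    if not tasks:
        raise ValueError("at least one task is required")
    clean = sum(1 for history in tasks.values() if not has_regression(history))
    return clean / len(tasks)


# Hypothetical histories for four tasks (illustrative, not benchmark data).
tasks = {
    "task_a": [{"t1"}, {"t1", "t2"}, {"t1", "t2", "t3"}],  # only gains passing tests
    "task_b": [{"t1", "t2"}, {"t1"}, {"t1", "t2"}],        # t2 breaks, later repaired
    "task_c": [{"t1"}, {"t1"}, set()],                     # t1 breaks at the end
    "task_d": [{"t1"}, {"t2"}],                            # t1 traded for t2
}

rate = zero_regression_rate(tasks)
print(f"zero-regression rate: {rate:.2f}")  # 0.25: only task_a is clean
```

Under this definition, a task that breaks a test and later repairs it (task_b) still counts as having regressed. An agent that introduces regressions on 75% of tasks therefore has a zero-regression rate of 25%.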