Coding Agents Collapse as Backend Rules Stack Up
A new study finds that LLM coding agents suffer 'constraint decay': performance drops 30+ points when forced to follow architectural patterns, use specific databases, and integrate ORMs. Data-layer defects drive 45% of logic failures.
LLM coding agents are great at building backend services when you let them do whatever they want. The moment you specify an architecture, a database, and an ORM, performance drops off a cliff.
A team from EURECOM and the University of Basilicata tested this systematically: 80 greenfield generation tasks and 20 feature-implementation tasks across eight web frameworks, all targeting the same API contract. They call the phenomenon constraint decay.
- Best models lose an average of 30 percentage points in assertion pass rate (A%) from unconstrained to fully constrained tasks
- Specifying a database engine alone costs 14 to 19 points in marginal performance
- Framework choice creates a 32-point gap: Flask and Express average ~50% A%; Django and FastAPI average ~25%
- 71% of failures are logic errors; 45% of those trace to data-layer defects (bad queries, ORM misuse)
- Even GPT-5.2 drops to 0% pass@1 at the hardest constraint level on some configurations
The Setup
The study fixes a single API contract (the RealWorld Conduit spec: 19 CRUD endpoints, five resource groups) and layers four constraint dimensions on top:
- Web framework (always specified): Flask, FastAPI, Django, aiohttp, Express, Fastify, Hono, Koa
- Architectural pattern: Clean Architecture with four layers and strict dependency direction
- Database backend: PostgreSQL or SQLite
- ORM integration: SQLAlchemy (Python) or Sequelize (Node.js)
These are combined into four levels of increasing constraint density (L0 through L3). Every implementation gets evaluated against the same 291-assertion behavioral test suite, plus static verifiers that check whether the agent actually followed the structural rules.
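To make the static-verifier idea concrete, here is a minimal sketch of one structural check: flagging imports that break dependency direction in a layered layout. The layer names, their ordering in `LAYERS`, and the `find_violations` helper are assumptions made for illustration. The article does not describe how the study's verifiers are implemented, and this sketch ignores relative imports.

```python
import ast

# Hypothetical layer names, ordered from innermost to outermost.
# Inner layers must not import from outer ones.
LAYERS = ["domain", "application", "adapters", "infrastructure"]


def layer_of(module_name):
    """Return the layer index for a dotted module name, or None if unlayered."""
    top = module_name.split(".")[0]
    return LAYERS.index(top) if top in LAYERS else None


def find_violations(sources):
    """Scan {module_name: source_code} and list imports that point outward."""
    violations = []
    for module_name, code in sources.items():
        own = layer_of(module_name)
        if own is None:
            continue
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            for target in targets:
                dep = layer_of(target)
                # A dependency on a more outward layer breaks the rule.
                if dep is not None and dep > own:
                    violations.append((module_name, target))
    return violations


# Toy project: the domain layer wrongly reaches into infrastructure.
sources = {
    "domain.article": "from infrastructure.db import session\n",
    "application.publish": "from domain.article import Article\n",
    "infrastructure.db": "import sqlite3\n",
}
print(find_violations(sources))
```

A check like this only looks at structure. An implementation can pass it and still fail the behavioral suite, which is why the study scores both.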
Two agent scaffolds were tested: Mini-SWE-Agent (a 100-line bash-only scaffold) and OpenHands (full-featured with file editing, code search, and task tracking). Models ranged from Devstral-Small (24B) to GPT-5.2.
Figure: Constraint decay, A% by level (select configurations). Assertion pass rate drops as structural requirements accumulate. L0 = framework only; L1 = one added constraint; L2 = two added constraints; L3 = architecture + database + ORM. MiniMax-M2.5 and GPT-5.2 were evaluated on a 16-task subset.
Database Is the Biggest Killer
Not all constraints hurt equally. Using a matched-pair design (comparing tasks that differ by exactly one constraint), the authors isolated the marginal cost of each:
Figure: Marginal performance cost per constraint, shown as the average A% drop when adding each constraint in isolation. Marginal effects are computed via matched-pair differences across all model-agent configurations. Error bars are omitted for clarity; the PostgreSQL and SQLite effects are statistically significant.
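A matched-pair comparison can be sketched in a few lines. The version below assumes scores are keyed by framework and by the set of active constraints, and it averages the difference over every pair that differs by exactly one constraint. The `scores` values and the `marginal_effect` helper are invented for illustration and are not the study's data or code.

```python
from statistics import mean

# Hypothetical A% scores keyed by (framework, active constraints).
# The numbers are made up for illustration and are not from the study.
scores = {
    ("flask", frozenset()): 70.0,
    ("flask", frozenset({"postgres"})): 52.0,
    ("flask", frozenset({"clean_arch"})): 60.0,
    ("flask", frozenset({"clean_arch", "postgres"})): 41.0,
    ("express", frozenset()): 68.0,
    ("express", frozenset({"postgres"})): 50.0,
}


def marginal_effect(scores, constraint):
    """Mean A% change across pairs that differ only by `constraint`."""
    diffs = []
    for (framework, active), with_score in scores.items():
        if constraint not in active:
            continue
        # The matched partner has the same framework and the same other constraints.
        baseline = scores.get((framework, active - {constraint}))
        if baseline is not None:
            diffs.append(with_score - baseline)
    return mean(diffs) if diffs else None


print(marginal_effect(scores, "postgres"))
print(marginal_effect(scores, "clean_arch"))
```

Because each pair shares everything except the one constraint, differences between frameworks or models cancel out of the estimate.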
Databases are the most expensive constraint by far. PostgreSQL costs nearly 20 points on its own. This isn’t surprising once you look at the error analysis: agents struggle with connection setup, schema creation, query composition, and dialect-specific SQL. Clean Architecture adds another 9 points, reflecting the overhead of splitting code across four layers with strict dependency rules.
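Dialect-specific SQL is easy to illustrate. The sketch below uses `ILIKE`, a PostgreSQL extension for case-insensitive matching that SQLite does not parse. The article does not say which operators tripped agents up in the study, so treat this as one plausible example; the `articles` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT)")
conn.execute("INSERT INTO articles VALUES ('Hello World')")

# ILIKE is a PostgreSQL extension; SQLite rejects it as a syntax error.
try:
    conn.execute("SELECT title FROM articles WHERE title ILIKE ?", ("%hello%",))
    ilike_supported = True
except sqlite3.OperationalError:
    ilike_supported = False

# Lowercasing both sides is accepted by both engines.
rows = conn.execute(
    "SELECT title FROM articles WHERE LOWER(title) LIKE LOWER(?)",
    ("%hello%",),
).fetchall()

print(ilike_supported, rows)
conn.close()
```

A query written for one engine can fail outright on the other, which is the kind of mismatch an unconstrained task never exposes.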
ORMs, somewhat counterintuitively, barely move the needle. The authors speculate that specifying “use SQLAlchemy” actually reduces ambiguity compared to leaving the data-access strategy open-ended. GPT-5.2 even performed slightly better with the ORM constraint active.
Flask and Express Win; Django and FastAPI Lose
Framework choice creates a massive performance spread under identical functional requirements. Express, Koa, and Flask form a clear top tier averaging around 50% A%. Django and FastAPI sit 25+ points lower.
Figure: Average A% by framework, across all models and constraint levels. Averaged across GPT-5-mini, Qwen3-Coder-Next, and Qwen3-235B with both Mini-SWE and OpenHands scaffolds.
The pattern maps cleanly to framework philosophy. Express, Koa, and Flask are minimal and explicit: you wire up routes, you choose your middleware, nothing happens by convention. Django requires you to navigate its auto-discovery, settings modules, and app registry. FastAPI’s type-hint-driven validation layer introduces implicit behavior that agents frequently misconfigure.
Hono trails despite having an API surface similar to Express, likely because it targets edge runtimes and needs a compatibility adapter to run on standard Node.js. That setup step is probably underrepresented in training data.
Where Agents Actually Break
The error analysis (on Qwen3-Coder-Next and MiniMax-M2.5 with Mini-SWE-Agent) reveals a consistent failure profile. Logic errors dominate at 71% of all failures. Server startup failures are second at 12 to 21%. Everything else (incomplete implementations, schema errors, infinite loops, constraint violations) totals under 17%.
Within logic errors, data-layer defects are the leading cause:
- Incorrect query logic (25.5%): SQL executes but returns wrong results from bad joins, filters, or dialect-incompatible operators
- Auth misconfiguration (22.6%): broken token handling or header parsing causing 401 responses everywhere
- DB/ORM runtime errors (21.2%): conceptually correct queries that crash from ORM API misuse
- Business logic defects (11.7%): correct infrastructure, wrong domain behavior
- Framework idiosyncrasies (9.5%): correct application code blocked by unhandled framework defaults
- State propagation failures (9.5%): mutations succeed but subsequent reads don’t reflect them
The two data-layer categories (incorrect query logic and DB/ORM runtime errors) sum to 46.7% of logic failures in the breakdown above, in line with the roughly 45% headline figure. This aligns with the marginal cost analysis: databases are expensive constraints because agents routinely write SQL that doesn’t work in the specified dialect, misuse ORM APIs, or fail to propagate state correctly through the data layer.
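One way a mutation can "succeed" while later reads miss it is an uncommitted transaction. The sketch below reproduces that with two SQLite connections to the same file. It is an assumed cause chosen for illustration, since the article does not say what lay behind the state propagation failures in the study; the `favorites` table and the `count_rows` helper are hypothetical.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")


def count_rows(conn):
    """Count rows in the hypothetical favorites table, releasing the cursor."""
    cur = conn.execute("SELECT COUNT(*) FROM favorites")
    n = cur.fetchone()[0]
    cur.close()
    return n


writer = sqlite3.connect(path)
writer.execute("CREATE TABLE favorites (slug TEXT)")
writer.commit()

# The INSERT runs without error, but its implicit transaction stays open.
writer.execute("INSERT INTO favorites VALUES ('how-to-train')")

# A second connection, like a later request handler, cannot see the row yet.
reader = sqlite3.connect(path)
before = count_rows(reader)

writer.commit()  # the step that was missing
after = count_rows(reader)

print(before, after)
writer.close()
reader.close()
```

The write path reports no error at any point, so a failure like this only surfaces when a test reads the data back.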
Caveats
The study uses a single API contract (the RealWorld Conduit spec). This is a deliberate design choice for internal validity, but it means we don’t know how the results generalize to different API shapes, especially those with more complex business logic or non-CRUD operations.
The gap between A% and pass@1 is consistently large. The best L3 configuration (OpenHands + MiniMax-M2.5) hits 78.6% A% but only 8.3% pass@1, because a single failed assertion out of 291 zeroes out a run. A% is the more informative metric for understanding partial progress, but pass@1 is what matters for real deployment.
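The gap is easier to see with the two metrics side by side. This sketch assumes A% is the mean share of assertions passed per run and pass@1 is the share of runs that pass all 291. The run counts are invented for illustration, and the function names are hypothetical.

```python
TOTAL_ASSERTIONS = 291  # size of the behavioral suite, per the article


def assertion_pass_rate(runs):
    """A%: mean share of assertions passed across runs."""
    return 100 * sum(passed / TOTAL_ASSERTIONS for passed in runs) / len(runs)


def pass_at_1(runs):
    """pass@1: share of runs that pass every assertion."""
    return 100 * sum(passed == TOTAL_ASSERTIONS for passed in runs) / len(runs)


# Hypothetical passed-assertion counts for four runs (not study data).
runs = [291, 290, 250, 180]
print(round(assertion_pass_rate(runs), 1), pass_at_1(runs))
```

The run that passes 290 of 291 assertions contributes almost fully to A% and nothing to pass@1, which is how a configuration can score high on one and near zero on the other.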
Three of the strongest models (MiniMax-M2.5, Kimi-K2.5, GPT-5.2) were evaluated on a 16-task subset due to cost (the full study consumed ~5 billion tokens). Subset and full-set scores correlate at Pearson r = 0.98, so the rankings hold, but exact numbers carry more uncertainty for those models.
The practical implication is clear: if you’re using coding agents for prototyping or demos with loose specs, they work. If you’re expecting production-quality backend code that follows your team’s architecture, connects to your database correctly, and uses your ORM idiomatically, you’re going to spend a lot of time debugging the output. The bottleneck isn’t functional logic. It’s the structural awareness that separates a working prototype from a deployable service.