I want to be upfront: this was not some carefully controlled research project. It was a deliberate internal experiment we ran on a non-client codebase, and parts of it went badly wrong in ways we found genuinely useful.
The premise was simple. For one sprint — two weeks — we handed every possible task to AI agents and documented what happened. Code generation, test writing, documentation, code review, bug triage, even the standup summary. If an agent could theoretically do it, we tried to make it do it.
Here is what we actually found.
The part that genuinely impressed us
CRUD endpoint generation is basically a solved problem. Give an agent a database schema, a clear spec, and the existing patterns from the codebase, and it will produce a working, tested API endpoint in under two minutes. Our human review time on that output averaged about eight minutes per endpoint. Roughly ten minutes from spec to a reviewed endpoint is a real efficiency gain.
Unit tests surprised us more. We had expected the agent to write happy-path tests and call it done. Instead, we consistently got edge cases we had not thought of. The agents were writing tests for null inputs, for unexpected data types, for race conditions in async functions. Not always, but often enough that we started treating the test output as a code review tool in its own right.
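To make "edge cases we had not thought of" concrete, here is a minimal sketch of that kind of test table. It is not output from the sprint: `parseQuantity`, the case list, and `runQuantityCases` are hypothetical names invented for illustration, and the table is run with a plain function rather than any particular test framework. It covers null inputs and unexpected types only, not the async race-condition cases.

```typescript
// Hypothetical function under test (not from the experiment's codebase).
// Accepts untrusted input and returns a positive integer quantity, or null if invalid.
function parseQuantity(input: unknown): number | null {
  if (typeof input === "number") {
    return Number.isInteger(input) && input > 0 ? input : null;
  }
  if (typeof input === "string" && /^\d+$/.test(input.trim())) {
    const parsed = Number(input.trim());
    return parsed > 0 ? parsed : null;
  }
  // null, undefined, arrays, objects, booleans and anything else are rejected.
  return null;
}

// One row per case, so the happy path sits next to the inputs people forget.
const quantityCases: Array<{ label: string; input: unknown; expected: number | null }> = [
  { label: "happy path", input: 3, expected: 3 },
  { label: "numeric string with spaces", input: " 12 ", expected: 12 },
  { label: "null input", input: null, expected: null },
  { label: "undefined input", input: undefined, expected: null },
  { label: "wrong type (array)", input: [3], expected: null },
  { label: "wrong type (boolean)", input: true, expected: null },
  { label: "NaN", input: Number.NaN, expected: null },
  { label: "negative number", input: -1, expected: null },
  { label: "fractional number", input: 1.5, expected: null },
  { label: "empty string", input: "", expected: null },
];

// Returns the labels of failing cases; an empty array means every case passed.
function runQuantityCases(): string[] {
  return quantityCases
    .filter((c) => parseQuantity(c.input) !== c.expected)
    .map((c) => c.label);
}
```

Only the first two rows are happy-path. The other eight are the part that doubles as code review: each one asks a question about the function that the original author may never have answered.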
Documentation: genuinely boring, genuinely good. Agents write JSDoc at the quality of a developer who is being careful and has unlimited time. That is not how most developers write documentation under sprint pressure.
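For readers who have not seen careful JSDoc recently, this is roughly the level of detail we mean. The function is a hypothetical example written for this post, not something the agents produced. The tags used (`@param`, `@returns`, `@throws`, `@example`) are standard JSDoc.

```typescript
/**
 * Applies a percentage discount to a price expressed in minor units (for example, cents).
 *
 * @param priceInCents - Original price as a non-negative integer number of cents.
 * @param discountPercent - Discount to apply, from 0 to 100 inclusive.
 * @returns The discounted price in cents, rounded to the nearest cent.
 * @throws {RangeError} If `priceInCents` is negative or not an integer, or if
 *   `discountPercent` is outside the range 0 to 100 (including NaN).
 *
 * @example
 * applyPercentageDiscount(1000, 25); // 750
 */
function applyPercentageDiscount(priceInCents: number, discountPercent: number): number {
  if (!Number.isInteger(priceInCents) || priceInCents < 0) {
    throw new RangeError("priceInCents must be a non-negative integer");
  }
  // Written as a negated range check so that NaN is rejected as well.
  if (!(discountPercent >= 0 && discountPercent <= 100)) {
    throw new RangeError("discountPercent must be between 0 and 100");
  }
  return Math.round((priceInCents * (100 - discountPercent)) / 100);
}
```

The units, the valid ranges, and the failure behaviour are all spelled out. Those are exactly the details that get dropped first under sprint pressure.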
The part that was mediocre but fixable
Complex business logic with multiple conditional branches was where output quality dropped. The code was syntactically correct. It was often logically wrong in ways that were not immediately obvious. This is actually the worst kind of failure — not a crash that surfaces immediately, but a bug that produces incorrect results quietly.
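A constructed illustration of this failure mode, assuming a made-up shipping rule: orders of 100 or more ship free, otherwise members pay 5 and everyone else pays 10. This is not code from the sprint, and the names `ShippingOrder`, `shippingFeeBuggy`, and `shippingFee` are hypothetical. Both versions compile and both look reasonable at a glance.

```typescript
type ShippingOrder = { total: number; isMember: boolean };

// Plausible but wrong: the member branch runs first, so a member with a large
// order is charged 5 instead of 0. Nothing crashes; the number is just incorrect.
function shippingFeeBuggy(order: ShippingOrder): number {
  if (order.isMember) return 5;
  if (order.total >= 100) return 0;
  return 10;
}

// Matches the stated rule: the free-shipping threshold is checked before membership.
function shippingFee(order: ShippingOrder): number {
  if (order.total >= 100) return 0;
  return order.isMember ? 5 : 10;
}

// The two versions disagree only where both conditions hold at once.
const memberWithLargeOrder: ShippingOrder = { total: 150, isMember: true };
const buggyFee = shippingFeeBuggy(memberWithLargeOrder); // 5
const correctFee = shippingFee(memberWithLargeOrder); // 0
```

Three of the four input combinations give identical results in both versions, which is why a quick manual check tends to miss this kind of bug.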
We found that output quality on these tasks tracked the quality of the task brief very closely. When we wrote detailed, specific, step-by-step specifications, output was good. When we wrote vague ones, output was bad. This was not surprising in retrospect, but it changed how we think about the skill of briefing agents. Writing a good brief is a real skill. We undervalued it before this experiment.
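One way to make "detailed, specific, step-by-step" checkable is to give the brief a shape. The sketch below is an assumption about what such a shape could look like, not the template we actually used: the `TaskBrief` fields, the `findBriefGaps` helper, and the thresholds inside it are all illustrative.

```typescript
// Hypothetical brief structure; field names are illustrative assumptions.
interface TaskBrief {
  goal: string; // one sentence: what should exist when the task is done
  steps: string[]; // ordered, specific implementation steps
  examples: string[]; // concrete input/output examples of expected behaviour
  constraints: string[]; // patterns to follow, files not to touch, and so on
  acceptanceChecks: string[]; // how a reviewer decides the output is correct
}

// Lists what is missing from a brief. The thresholds are arbitrary starting points.
function findBriefGaps(brief: TaskBrief): string[] {
  const gaps: string[] = [];
  if (brief.goal.trim().length === 0) gaps.push("goal is empty");
  if (brief.steps.length < 2) gaps.push("fewer than two explicit steps");
  if (brief.examples.length === 0) gaps.push("no input/output examples");
  if (brief.acceptanceChecks.length === 0) gaps.push("no acceptance checks");
  return gaps;
}

const vagueBrief: TaskBrief = {
  goal: "Add discounts",
  steps: [],
  examples: [],
  constraints: [],
  acceptanceChecks: [],
};

const detailedBrief: TaskBrief = {
  goal: "Apply a percentage discount to cart totals at checkout",
  steps: [
    "Validate that the discount is between 0 and 100",
    "Apply the discount to the cart total in cents",
    "Round to the nearest cent",
  ],
  examples: ["total 1000, discount 25 gives 750"],
  constraints: ["Follow the existing pricing module patterns"],
  acceptanceChecks: ["Unit tests cover 0, 100, and out-of-range discounts"],
};

const vagueGaps = findBriefGaps(vagueBrief); // three gaps reported
const detailedGaps = findBriefGaps(detailedBrief); // no gaps reported
```

A check like this cannot tell whether a brief is good. It can catch the brief that is obviously too thin before it reaches an agent.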
What broke completely
Two categories failed outright.
Architectural decisions. Anything requiring reasoning about the whole system, not just the file in front of the agent, produced unreliable results. We asked agents to propose database schema changes, and they optimised locally without understanding the query patterns that made the schema design meaningful. A human who knows the whole codebase catches this. An agent working on a single task does not.
Security-sensitive code. Auth flows, permission systems, encryption handling. We got output that looked correct on inspection and was subtly wrong under specific conditions. We found this during our review. In a real engagement, we might not have. This category is now on our permanent humans-only list. Non-negotiable.
What we changed
The main output from this experiment was a more precisely defined task allocation protocol. Not “AI agents do repetitive work” — that is too vague. A specific list: these categories go to agents, these stay with humans, these are human-led with agent assist. The distinction matters in practice.
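As a rough sketch, a protocol like this can live in code as a lookup table. The category names come from this write-up, but the identifiers and two of the placements are assumptions: only security-sensitive code is explicitly humans-only above, while the rows for complex business logic and architecture are inferred from the results. Code review, bug triage, and standup summaries are left out because their outcomes are not reported here.

```typescript
type TaskCategory =
  | "crud-endpoint"
  | "unit-tests"
  | "documentation"
  | "complex-business-logic"
  | "architecture"
  | "security-sensitive";

type TaskOwner = "agent" | "human-led-agent-assist" | "human-only";

const taskAllocation: Record<TaskCategory, TaskOwner> = {
  "crud-endpoint": "agent",
  "unit-tests": "agent",
  "documentation": "agent",
  // Assumption: mediocre but fixable with a detailed brief, so a human leads.
  "complex-business-logic": "human-led-agent-assist",
  // Assumption: reported as failing outright, so treated as human-only here.
  "architecture": "human-only",
  // Stated explicitly in the write-up: permanent humans-only list.
  "security-sensitive": "human-only",
};

// Anything not in the table falls back to human-only, the conservative default.
function ownerFor(category: string): TaskOwner {
  return Object.prototype.hasOwnProperty.call(taskAllocation, category)
    ? taskAllocation[category as TaskCategory]
    : "human-only";
}
```

Using `Record<TaskCategory, TaskOwner>` means the compiler complains if a category is added to the type without an explicit owner. That is the property we care about: no task category gets an owner by accident.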
We also now require more detailed task briefs before anything goes to an agent. Writing a good brief takes longer than writing a vague one. The output quality difference is worth it.
The conclusion was not “AI is overhyped” or “AI is amazing.” It was: AI agents are extremely good at specific, bounded tasks and unreliable at tasks requiring system-level reasoning or security judgment. Knowing that boundary precisely is what makes this way of working viable. We know it more precisely now than we did before the sprint.