Code written with AI agents gets substantially more robust when the agent has a feedback loop, like a good test suite. And agents like Claude Code are very good at writing big… long… super long… test suites. I find myself not only generating more application code than I’ve ever written in my life, but also many more lines of testing code.
Reading Tests is Heavy
AI is fast at writing tests, but these generated suites are vulnerable to the same class of quality issues as the application code. And tests can be very mentally taxing to read.
When writing or reviewing application code, there’s primarily one question the human has to keep in mind:
- What’s the system under test and how is it supposed to work?
But when reviewing test code, the mental burden is bigger:
- What’s the system under test and how is it supposed to work?
- What code is the test mocking, and what code is real?
- What test data is being constructed? Is it accurate enough to make a useful test? Or is the test constructing fake data that is predisposed to pass?
- Is this testing code itself well-abstracted? Or should it be broken down and made more DRY?
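The fake-data question deserves a concrete example. Here is a minimal TypeScript sketch (all names invented for illustration) of a test that mocks away the very logic it claims to verify, so it can never fail, next to an honest version of the same check:

```typescript
// System under test (hypothetical): applies a percentage discount.
function applyDiscount(price: number, percent: number): number {
  return price - price * (percent / 100);
}

// Over-mocked "test": the expected value comes from a stub that
// simply mirrors the implementation. If applyDiscount has a bug,
// the stub shares it, so this assertion always passes.
const stubDiscount = (price: number, percent: number): number =>
  price - price * (percent / 100);
console.assert(
  applyDiscount(100, 10) === stubDiscount(100, 10),
  "proves nothing: both sides share the same logic"
);

// Honest test: the expectation is a fixed, independently computed
// value, so a regression in applyDiscount would actually fail it.
console.assert(applyDiscount(100, 10) === 90, "10% off 100 should be 90");
console.assert(applyDiscount(200, 50) === 100, "50% off 200 should be 100");
```

The same trap shows up in real suites when a `jest.mock`-style mock replaces the very module under test; the fix is always the same: assert against values derived outside the code being exercised.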
These questions apply to human-written code, of course, but the sheer volume of code the AI writes in a short amount of time can make the review task feel impossible.
I’ve been feeling it.
That, along with the lows of realizing that the test code I’m manually reviewing isn’t good. When I see a test file in a PR, I want to skip over it and say to myself “the tests are probably fine!”. But when I look, I almost always find problems that need fixing.
Getting Help
Luckily, LLMs are here to help clean up their own messes. Before marking a big PR with a test suite as “ready”, I’ve been experimenting with getting help from Claude Code. Here’s an example of the most recent prompt I used when reviewing a colleague’s test suite:
please review the accuracy and logic of the test suite in src/__tests__/_factories/ChildSyncableTests.ts,
check for over-mocking, or logic that doesn't actually test anything.
And the result was helpful:
Good test suite overall - these are real integration tests with no mocking, which is great. But I found several issues:
---
Critical Bug
Line 411 - Wrong variable checked:
...
And it produced a list of about ten improvements that could be made to the suite, finding a few problems that would have seriously impacted the usefulness of the suite.
If I were working on my own code, at this point I would have said:
Please address all of these issues, then commit and push
But since I was reviewing someone else’s code:
Please make comments on this PR <PR URL> using the GitHub cli, make sure to mention that these comments came from Claude.
(Yes, I like to say please to my AI)
It created a new review, and added comments for each one of the issues it found:

Looping
When I’ve done these reviews on my own tests, I’ve done the review/address/commit loop 2-3 times in a row, starting with a new agent each time. If I see the agent mention new problems that are smaller in severity each time, it increases my confidence that we’re actually getting to a good spot.
Other Useful Prompts
The one simple prompt above, which asks about issues I’ve found in manual reviews (over-mocking, fake testing), has been helpful so far, but there are other prompts that can be useful, too. I just tried this one:
please review src/tests/_factories/ChildSyncableTests.ts for human understandability. Is the naming of variables and data clear? Is there a logical, readable flow to the code?
And I was encouraged to see that it identified problems that I’d already found in my manual review, plus more!
Perhaps the questions from the start of this post could make some good prompts? I haven’t tried these yet:
Please review <file>, what test data is being constructed? Is it accurate enough to make a useful test? Or is the test constructing fake data that is predisposed to pass?
Please review <file>, is this testing code well-abstracted? Or should it be broken down to reuse functionality?
Please review the test <file>, what code is the test mocking, and what code is real? Are there any mocks that are compromising the integrity of the test?
Manual Review
These prompts can help improve code reliability a lot, especially when used iteratively. But in the end, a manual review is still worth doing to catch the kinds of bugs that human brains are better at spotting.
The future?
Will we need to review our code at all in the future? I don’t know. Probably. Probably not. It seems possible to me that we’ll find ways to determine which changes are okay for fully-AI reviews.
If there were a way to reliably estimate whether a code patch:
- Has a risk of corrupting user data
- Has a risk of putting users in physical danger
- Poses potential security risks
…then relying on fully AI-reviewed code for safe changes would be fine. Sadly, an estimate that reliable is a big ask. Sneaky bugs that seem innocuous can cause big problems. And I find myself hoping that engineers working on software for airplanes, traffic lights, and self-driving cars are always reviewing their code manually alongside their AI assistants.