AI’s Not Trustworthy, But It Is Testworthy

Today, as I hammered away on a code refactor with Cursor AI, migrating some Express endpoints to Next.js, I was met with a sad truth: though I’d like to imagine the LLM capable of instantly digesting the fine meaning of an entire source file and spitting out a logically equivalent rewrite, we’re not at that point of magic computer reality yet.

The Confidence vs. Correctness Problem

When I first asked Cursor AI to rewrite my Express endpoints for Next.js, it produced code that looked pretty good and confidently assured me it was correct. But I could see that a lot of details from the original were missing.

Three times in a row I asked it to review its work, comparing against the original code. It did so, each time declaring with absolute certainty that the implementation was now correct.

Suspicious of these claims, I had Cursor generate a Bash shell script that would:

  1. Run both the original Express API and new Next.js API
  2. Make identical requests to both
  3. Compare the JSON responses for equality

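A minimal sketch of what such a comparison script might look like. To be clear, this is not the script Cursor generated for me; the base URLs, the endpoint paths, and the use of `python3` to sort JSON keys are all assumptions you’d adapt to your own project.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: diff JSON responses from the old and new APIs.
# Assumes the Express app listens on :3000 and the Next.js app on :3001.
set -u

EXPRESS_BASE="${EXPRESS_BASE:-http://localhost:3000}"   # original Express API
NEXT_BASE="${NEXT_BASE:-http://localhost:3001}"         # new Next.js API
ENDPOINTS="/api/users /api/orders"                      # hypothetical paths

# Sort keys so differing key order doesn't register as a false mismatch.
normalize() {
  python3 -c 'import json,sys; print(json.dumps(json.load(sys.stdin), sort_keys=True))'
}

compare_endpoint() {
  local path="$1" old new
  old=$(curl -s "$EXPRESS_BASE$path" | normalize)
  new=$(curl -s "$NEXT_BASE$path" | normalize)
  if [ "$old" = "$new" ]; then
    echo "PASS $path"
  else
    echo "FAIL $path"
    diff <(echo "$old") <(echo "$new")
  fi
}

# Guarded so the functions can be sourced without hitting live servers.
if [ "${RUN_COMPARE:-0}" = "1" ]; then
  for ep in $ENDPOINTS; do
    compare_endpoint "$ep"
  done
fi
```

The key design choice is normalizing before comparing: raw string equality on JSON is brittle, since two logically identical responses can serialize keys in different orders.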
This script took a while to get right, but still took less time than it would have taken me to write it on my own. When we ran the tests, we discovered the endpoints weren’t actually equivalent. Unsurprisingly, despite all the confident assurances, there were subtle but important differences in the responses.

A Better Workflow

Now that I had tests with clear data results to work from, I had a constructive feedback loop with the AI. I had Cursor AI examine the test results showing the actual differences, make adjustments, and run the tests again. Things changed dramatically: the AI could now see, reflected in the data, exactly what logic it had missed in the rewrite.

With each run of the tests, the AI was better able to see what it had to change. Ideally it would have converged on a correct solution without my prodding; it did get stuck, and I had to suggest the right direction for fixes. But in the end, this feedback loop got the endpoints to match.

The Takeaway

AI tools like Cursor are powerful, but their confidence doesn’t correlate with correctness. The most productive workflow wasn’t just instructing the AI to generate code and then asking it to check its own work, but creating a feedback loop where it could:

  1. Write code
  2. Test the code against real data
  3. See the results
  4. Improve based on concrete feedback

This approach leverages AI’s strengths while mitigating its tendency toward overconfidence. For your next AI coding project, consider building verification into your workflow from the start!
