I’m slowly watching, feeling out the shape of this wave that we’re all riding together.
I’m sensing an order-of-magnitude zoom-out when it comes to the relationship between our brains and the code we manage.
My colleagues, while working to understand our changing roles, have drawn comparisons between AI-produced code and compilers. Most programmers stopped paying attention to Assembly code when compilers became reliable, and reasoned only at the level of the source language. Could it be that AI models are the next step in that evolution? Will we stop reading the code soon? I want to expand on that.
When I started as a programmer, general software engineering work was already far abstracted from Assembly code. Most programmers were using compiled and interpreted languages. Later, package managers appeared and programmers generally adopted another layer of separation from the underlying operations.
We zoomed out. Not only were we using compiled or interpreted languages, our new reasoning level largely involved gluing pre-made packages together instead of authoring the whole system ourselves. In ten minutes we could add a package to our project that represented thousands of hours of effort from other developers.
I’m assuming that for most programmers, adding a package didn’t involve diving into the source to understand the full scope of what was being added. I’m assuming most of us checked out the API, the logo art on the package documentation page, and maybe the number of stars, closed PRs, or package downloads. Then we chose whether to trust whatever was inside.
Now with the advent of agentic AI, we’re zooming out again. Not only are we trusting packages that other people (or bots) have written, many of us are starting to trust agent code that we haven’t read ourselves. With each of these generations in developer tooling (compilers, package managers, AI agents), developers have zoomed out a step and become responsible for driving an order of magnitude more code.
Old Tools, New Tools
Perhaps we can better adapt to our new zoom-level now by examining how we adapted before. What mental tools did we use to begin incorporating large amounts of 3rd-party code into our software?
- Blind trust. We trusted that because the package said it worked, it would work. Many times this turned out fine. Sometimes it burned us in big ways.
- Open source. When incorporating someone else’s code, we often looked for open-source packages, and trusted the promise that if we needed to dive deeper at a later time, we could.
- Test Suites. Sometimes when evaluating an open-source package, we could read the test suite the package used. We didn’t have to understand the architecture of what we were using as long as we felt the results were properly asserted.
- Popularity. When we followed the crowd and used the tools that everyone else was using, things often worked out. If we chose a popular package or framework, the chances were high that someone else would find the problems and fix them before we had to.
- Our own adaptability. For many parts of our applications the cost of discovering a bug and fixing it in production was low. For example, we picked a markdown parsing library without much anxiety, and later swapped it out for a faster/better one when problems appeared. This allowed us to trust packages and move forward quickly without stressing about the guts of every package used.
How can this inform our new level of zoom-out?
- Blind trust: many of us are using this already when we’re vibe-coding a feature or hammering out a hobby project. We trust that the AI-written code is good enough for the circumstances.
- Open source: we know that the resulting code is available and transparent to our agents and our own eyes. The code being written is something we can dive into later if we need to.
- Test suites: This one is trickier, since we use the AI to write both the tests and the code being tested. But we’re adapting our methods. We’re exploring how to make work verifiable (browser use, smoke tests, ralph loops) so that the agent can prove its own work.
- Popularity: On an almost weekly basis the biggest AI providers ship updates to their frontier models and harnesses, giving better results for the same level of engineer effort. For example, Claude’s /review command gives better results now than it did months ago even without teaching it specifics about a codebase. As before, when we use mainstream tools someone else is likely to find and fix the problems before we do.
- Our own adaptability: As before, we need to judge how quickly we can trust 3rd-party code based on how much rides on that code, and how hard it is to recover from a failure. As before, many aspects of many products are resilient to bugs, and can be fixed ad-hoc when problems come up.
What new mental tools can we add to this collection? My colleague wrote an agent skill to give her readable outlines of high-level changes in a PR, so she could quickly grok the change. Another colleague and I recently spent focused time developing skills in one of our repositories for:
- Outlining test suites on-demand in an HTML document designed to make it easy to review the tested behavior of the new code
- Teaching the agent to smoke-test the app autonomously (through browser-use tools or curl requests), capturing output in the form of screenshots and curl responses, then producing a readable HTML report on what matched expectations and what didn’t
- Enriched adversarial review that teaches review agents specific quirks of our code and gotchas to watch out for, embedding historical domain knowledge into the AI review agents.
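To make the smoke-test report idea a little more concrete, here is a minimal Python sketch of what such a report step might look like. It is not the skill we wrote: the names `Check`, `Result`, `run_checks`, `render_report`, and `fake_fetch` are all hypothetical, and the fetch step is injected so that a browser-use tool or a curl wrapper could stand in for it. It only covers status codes and response text, not screenshots.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable, Iterable, List, Tuple


@dataclass
class Check:
    """One smoke-test expectation (hypothetical shape, for illustration)."""
    name: str
    url: str
    expected_status: int
    expected_text: str = ""  # empty string means "any body is fine"


@dataclass
class Result:
    """What actually came back for a given check."""
    check: Check
    status: int
    body: str

    @property
    def passed(self) -> bool:
        return (self.status == self.check.expected_status
                and self.check.expected_text in self.body)


def run_checks(checks: Iterable[Check],
               fetch: Callable[[str], Tuple[int, str]]) -> List[Result]:
    """Run every check through an injected fetch(url) -> (status, body)."""
    results = []
    for check in checks:
        status, body = fetch(check.url)
        results.append(Result(check, status, body))
    return results


def render_report(results: List[Result]) -> str:
    """Build a small HTML table of expected vs. actual outcomes."""
    rows = []
    for r in results:
        verdict = "PASS" if r.passed else "FAIL"
        rows.append(
            "<tr>"
            f"<td>{escape(r.check.name)}</td>"
            f"<td>{escape(r.check.url)}</td>"
            f"<td>{r.check.expected_status}</td>"
            f"<td>{r.status}</td>"
            f"<td>{verdict}</td>"
            "</tr>"
        )
    failed = sum(not r.passed for r in results)
    return (
        "<html><body>"
        f"<h1>Smoke test report: {failed} of {len(results)} checks failed</h1>"
        "<table><tr><th>Check</th><th>URL</th><th>Expected</th>"
        "<th>Actual</th><th>Verdict</th></tr>"
        + "".join(rows)
        + "</table></body></html>"
    )


def fake_fetch(url: str) -> Tuple[int, str]:
    """Stand-in for a real HTTP or browser call; returns canned responses."""
    pages = {"/health": (200, "ok"), "/login": (500, "error")}
    return pages.get(url, (404, ""))


if __name__ == "__main__":
    demo = [
        Check("health endpoint", "/health", 200, "ok"),
        Check("login page", "/login", 200, "Sign in"),
    ]
    print(render_report(run_checks(demo, fake_fetch)))
```

The point of a report like this is that the human reviews the expectations and the verdicts, not the code that produced them, which is the same move we made when we read a package’s test suite instead of its source.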
Our work now as engineers is to re-negotiate our trust boundaries, and specify our appetites for risk, an equation of stability vs velocity. And as happened in the past, we now need to become comfortable with responsibility over an order of magnitude more code than we had before.
This post was hand-written by Murphy. No AI used in this one.