Judgment Bottleneck – Jon C. Phillips

Then it aged strangely.

When METR went back to revisit the question in February of this year, they ran into a problem.

A large share of the developers they invited declined to participate without AI access at all, which is its own kind of result.

For the returning developers they did measure, the earlier slowdown had turned into an estimated speedup of around 18%.

Put the two studies next to each other and the swing is hard to ignore. The same kind of experienced developer went from roughly 19% slower in mid-2025 to roughly 18% faster by early 2026, a swing of nearly forty points in about a year.

Two things changed in that year.

The tools got materially better, with stronger models and tighter editor integration. And the developers got better at using them, which matters more than it sounds. The 2025 slowdown was largely a transition cost. Experienced people were being asked to move from writing code, which they could do half-asleep, to directing and reviewing it, which they'd never had to do at this volume.

That shift has a learning curve measured in hundreds of hours, and the first study mostly captured people partway up it. By the second study, they'd climbed it.

METR was careful about the caveats, since the people who refuse to work without the tools are exactly the people most helped by them, so the sample skews. But the direction was clear enough, and it matches what most of us feel day to day.

The finding flipped inside of a year, which tells you how fast the ground is moving.

I lead with that because it settles the boring argument so we can get to the interesting one. Whether AI helps you ship is mostly decided at this point.

The 2025 Stack Overflow survey of tens of thousands of developers put adoption at well over 80% using or planning to use these tools, with half using them every day, and the number has only climbed since.

The question worth a few thousand words is what happens to the job once writing the code is the easy part.

And this is the exact spot where every essay reaches for the word "taste," waves its hands, and collects its retweets. I'm going to try to do better than that.

The Word I'm Not Going To Lean On

You've read the "taste is the new skill" piece already. I've been guilty of it too. Probably a dozen versions of it. I get why people are sick of the genre.

The argument is almost always unfalsifiable, too.

It defines "taste" loosely enough that nobody can disagree, and then it does two convenient things at once. First, it flatters the person making it by implying they have the rare thing, and then it reassures a lot of worried engineers that whatever they happen to be good at is the exact part the machines can't touch.

When someone tells you taste is the moat and then can't say how you'd measure it or get better at it, they've told you nothing.

So I want to be precise about what I'm actually claiming, because there's a real shift underneath the lazy version of this argument, and the two are worth separating.

The real claim has a name and a shape. AI relocated the engineering instead of removing it. At least that's my reasoning at the moment.

What Actually Moved

Generation got cheap. Evaluation didn't.

That's the entire thing in two sentences.

Producing a plausible block of code now costs almost nothing, and deciding whether that code is actually right, whether it fits the system, whether it'll still make sense after the next three changes, costs about what it always did.

When one side of a process collapses in price and the other side holds, the expensive side becomes the bottleneck.

The scarce resource is no longer the typing. It's the judgment to tell good output from output that merely looks good.

Here's the shift laid out plainly:

Array	Before cheap generation	After cheap generation
The expensive part	Writing the code	Deciding what's worth writing and whether the result is right
Where junior time went	Producing volume under supervision	Reviewing volume they can't fully evaluate yet
What seniority meant	Knowing the most framework details	Owning the outcome and catching what looks right but isn't
The default failure	Shipping too slowly	Shipping subtly wrong things quickly

The line I keep coming back to is that typing got cheap and thinking didn't, and the entire story lives in the gap between those two facts.

This isn't a prediction about the far future. It's already showing up in how teams describe their work.

The Pragmatic Engineer ran a large survey of engineers in 2026 and the consensus was that AI amplifies tendencies that already existed. The engineers who cared about architecture and quality used the tools to go further on both.

The ones who were mostly producing volume without much concern for whether it was right produced more volume, still without the concern.

You see, the tools didn't supply the judgment, they multiplied whatever judgment was already there, including a multiplier of zero.

Architecture by Autocomplete

Let me make this concrete, because vague is how these arguments die.

There's a failure mode that engineers have started naming out loud, and the best name I've seen for it is architecture by autocomplete.

The model generates structure all day long, and it has no intuition whatsoever for sustainable boundaries. It produces something that runs, passes the tests you wrote, and looks completely reasonable in the diff.

One real example that stuck with me was an AI tool that built a fetch-download-reupload chain for an image file when the object's ID was already sitting right there, requiring none of it.

The code worked. It would have passed review from anyone skimming. It was also architecturally wrong in a way that only becomes visible if you can see the shape of the whole system rather than the twelve lines in front of you.

There are two kinds of correct.

There's code that works, meaning it produces the right output for the cases you checked. And there's code that fits, meaning it belongs in the system it lives in, respects the boundaries that are already there, and won't quietly become the reason a future change takes three days instead of three hours.

AI is remarkably good at the first kind.

It is structurally weak at the second, and the weakness is not a temporary gap that a bigger model closes. Fitting requires holding the entire system in your head at once, and the model mostly sees the local context you handed it.

It optimizes for the diff in front of it because that's the world it can see. Obviously you can give LLMs more context, get it to analyze a full codebase, but the larger the codebase, the more complex and inaccurate this process becomes.

The danger is that working code and fitting code look identical in the moment.

They diverge later. You can accumulate an entire codebase of individually working pieces that no single person understands as a whole, and the bill for that arrives on the next change, not this one.

There's data pointing the same direction. Google's DORA research found that every 25% increase in AI adoption was associated with a measurable drop in delivery stability, more speed bought with more breakage. I've written before about the cleanup tax you pay for shipping too fast, and this is the same tax with a faster printing press attached.

The Strongest Argument Against Everything I Just Said

Here's where I have to be honest, because the version of this essay that skips this part is the version you're right to distrust.

The obvious rebuttal is that judgment is just the current bottleneck, not a permanent one.

Every previous version of this argument confidently named some human layer as the irreplaceable one, and then watched that layer get automated in the next release.

People said AI would never handle multi-file changes, and now it does. People said it couldn't do real refactors across a system, and it's getting better at exactly that. The models are already improving at catching their own errors and reasoning about structure, not just spitting out local snippets.

So why would judgment be any different? Why isn't "the human provides the judgment" just this year's version of a claim that ages badly?

I think that's a strong argument, honestly, and I'm not going to pretend it away with a line about how humans are forever special.

They might not be, at least not in the way that's comfortable to believe.

Here's the narrower thing I'll actually defend. Even if models keep getting better at evaluation, the accountability doesn't move to the model. Someone still has to own the outcome when the system does the wrong thing in production.

You can't put a model on the incident call. You can't point a regulator, a customer, or your own team at a model and say it made the judgment, so the responsibility was its. The judgment stays attached to a person because the consequences do.

That's a weaker claim than "taste is the eternal human moat." It's also far more defensible, precisely because it's weaker.

The bottleneck has nothing to do with a mystical human spark. It's there because responsibility doesn't compress, and judgment is where responsibility lives.

If You're Earlier In Your Career

This is the part that actually worries me, and I don't have a clean answer.

The judgment this whole piece is about gets built by doing the production work that AI now does for you.

I can spot architecture by autocomplete because I spent years writing bad architecture by hand and feeling it break later. That's how I learned and the pattern recognition came from just doing the reps.

The reps were tedious and often felt pointless, and they were also the entire mechanism by which the judgment got installed. I've made this argument before about convenience in a broader sense, that every tool which does the work for you quietly removes the struggle that was teaching you something.

So there's a real problem.

If the tools now do the production work that used to build judgment as a side effect, where does the next generation's judgment come from? This worry is starting to show up in the research too, with recent work on how AI impacts skill formation asking what happens to learning when the struggle that used to build it gets optimized away.

The people best positioned to use AI well right now are the ones who learned the craft before it existed, and the ladder we climbed to get here is being pulled up behind us. I don't think that's doom, but I think pretending it isn't happening helps no one.

What I'd actually suggest, if you're earlier on and the work won't force the reps on you anymore, is to manufacture them on purpose:

Read code you didn't write and predict what will break before you run it. Then run it and find out how wrong you were.
Turn the assistant off sometimes, deliberately, on problems small enough that struggling through them costs you an hour and not a deadline.
Review AI output as if you'll personally be paged at 3am when it fails, because the habit of reviewing like that is the judgment.
Sit with confusion for a while before prompting your way out of it. The understanding that forms in that uncomfortable gap is the thing you're actually trying to build.
When the model gives you something that works, ask why it works and what it would cost in two years, even when nobody is making you.

None of that is about rejecting the tools. Hey, I use them every day. No, it's about keeping the muscle that the tools no longer force you to exercise.

What Judgment Actually Is

Let me define the thing plainly, so this lands as something you can argue with rather than another vibe.

Judgment is the ability to look at a working solution and say why it's the right one, what it will cost later, and what it quietly assumes. Generating the solution is now the cheap part. Knowing whether to trust it is the job.

That's a skill you can actually describe and point at.

You can watch someone exercise it, and you can look at their track record of calls that aged well or badly and get a real sense of how good their judgment is. You can get better at it on purpose.

That's what separates it from the hand-wave version of "taste," which can never tell you how you'd know if you had it.

The reason the bad version of this argument annoys people is that it treats judgment as innate and mystical, when really it's the accumulated residue of having been wrong a lot in ways you had to live with afterward.

Where This Leaves Us

Let's circle back to those developers who refused to work without AI.

They weren't wrong to lean on the tools. The tools are good and the speedup is definitely there, and the early study that suggested otherwise got overtaken by reality within a year. Nostalgia for hand-writing every line is not the takeaway here.

The thing worth noticing is what those developers still bring that the tool doesn't.

They point it at the right problem and catch it when it's confidently producing something that works without fitting, and they own the result when it ships. The code got cheap, which means the judgment is now the job, and if I'm honest it always quietly was.

We just used to be able to hide it inside all the typing.

That's not a comforting story about humans staying special forever. It's a practical observation about where the expensive part went.

The bottleneck moved, and it's worth knowing which side of it you're standing on.