Plausibility Debt

Bad code gets caught. It fails a test, breaks a build, throws in review. The feedback is loud and it's fast, which is exactly why bad code isn't the thing to worry about.

The thing to worry about is code that works. It's technically correct, passes the tests, does what was asked, and represents a worse path than the one a senior engineer would have taken. It clears review because every line is defensible. The cost is invisible now and it compounds later, which is what makes it debt. Call it plausibility debt: the code is plausible enough to ship, so the inquiry that should have happened never does.

Correctness is disarming. "It works" feels like the end of the conversation. With AI it's the sharpest version of an old trap, because the output isn't merely plausible anymore. It's often right. And right-enough-to-pass is precisely the condition under which nobody looks harder.

The underlying behavior was named decades ago by researchers studying pilots and process-control operators: automation complacency. When a system produces output that looks competent, people stop scrutinizing it. They slide from operator to spectator. Parasuraman and Riley's 1997 work on automation use and misuse reads like it was written about LLMs. The more reliable the automation appears, the less anyone monitors it, and the failures that slip through are the ones the automation was confident about.

The same slide happens in software. The model is capable but context-blind: it knows how to write code, not what your system needs from this particular change. Your judgment is the layer meant to sit on top of it. When that layer goes quiet, the engineer becomes a conduit. Output goes in one side from the model and out the other into the repo, and the person in the middle stops being the author of the decision and becomes a pipe for it. Conduit engineering.

The behavioral tell is deference. The engineer didn't push back on the tool; they took the path of least resistance and let its first answer stand. Ask why the code is the way it is and they can't say, because they never decided. They relayed. An author can tell you what they rejected and why. A conduit can only tell you what the model returned.

The same root cause produces two different failures, and they need different fixes, so separate them.

The Optimality Gap

The first is the optimality gap. The model returns a working solution, a simpler one existed, and nobody went looking. Herbert Simon called this satisficing in 1956: accepting the first answer that's good enough rather than searching for the best one. Pair it with anchoring, where the first solution you see becomes the frame for everything after, and you get first-draft-as-final-draft: the engineer treats the model's first draft as the design space instead of one point inside it.

The diagnostic question is "what else did you consider?" When the honest answer is nothing, the solution space was never explored. The engineer took the first exit. The model produced one continuation of the prompt and the engineer ratified it, and a ratification reads exactly like a decision until you ask what the alternatives were.

The Conformance Gap

The second is the conformance gap. The code is fine in isolation and wrong for your system. It doesn't match your conventions, your architecture, the implicit "how we do things here" that lives in people's heads and nowhere a model can read it. The model can't know your tacit standards. Enforcing them is the human's irreducible job, the part that doesn't transfer no matter how good the tool gets.

When the engineer doesn't enforce them, you get the classic shape: locally correct, globally wrong. Every individual PR is defensible. The codebase drifts anyway. This is where plausibility debt turns into the architecture nobody designed, the slop that accumulates one reasonable-looking change at a time. The per-decision failure is plausibility debt. The compounded result is a system nobody can explain.

Both gaps share a cause and a cure. The cause is that correctness ended the inquiry. The cure is a forcing function that restarts it, and the cheapest one is a question the author has to be able to answer: what would a senior engineer have done here, and why didn't you? What convention did you check it against? If they can't answer, they conduited it, and now is a cheaper time to find out than six months from now.

The line I'd put in front of an engineer: the model's job is to produce something that works; your job is to decide whether it's the right thing, and "it works" is not evidence that it is. That reframes correctness as the start of judgment rather than the end of it, which is exactly the step the failure mode skips.

The debt is quiet. It doesn't throw. It passes review, ships clean, and bills you later, in the debugging session where nobody understands the design and the answer to "why does it work this way" is that no one ever chose.

None of this stays purely individual. The conformance gap especially is an organizational problem with organizational tooling. The tacit standards a model can't read can be written down: a CLAUDE.md or a skill that encodes how we do things here, architecture decision records that capture not just what we chose but what we rejected and why, lint rules and architecture checks that enforce the invariants automatically so conformance stops depending on whether a reviewer remembered to look. These don't replace judgment. They capture judgment that already happened and hand it to the next agent, human or otherwise. The decision still had to be made once, by a person. The point is to not re-make it from scratch every session.
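To make "architecture checks that enforce the invariants automatically" concrete, here is a minimal sketch of one, assuming a layering convention I've invented for illustration: code in a `domain` package must not import from `infrastructure` or `web`. The package names, the `FORBIDDEN_IMPORTS` table, and both function names are hypothetical; only the standard-library `ast` module is real. A check of roughly this shape can run in CI so the rule no longer depends on a reviewer remembering it.

```python
import ast

# Hypothetical layering rule (an assumption for this sketch, not a real project):
# code living in the package named by the key must not import from any
# package listed in the value.
FORBIDDEN_IMPORTS = {
    "domain": {"infrastructure", "web"},
}


def imported_top_level_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a piece of Python source."""
    packages = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` -> record "a"
            for alias in node.names:
                packages.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # `from a.b import c` -> record "a"; relative imports are skipped
            packages.add(node.module.split(".")[0])
    return packages


def layering_violations(layer: str, source: str) -> list[str]:
    """List the forbidden packages that `source`, living in `layer`, imports."""
    forbidden = FORBIDDEN_IMPORTS.get(layer, set())
    return sorted(imported_top_level_packages(source) & forbidden)


if __name__ == "__main__":
    sample = "from infrastructure.db import session\nimport json\n"
    print(layering_violations("domain", sample))  # prints ['infrastructure']
```

The check itself carries no judgment. The judgment is in the table, which a person had to decide on once; the script only keeps that decision from being quietly re-litigated in every PR.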

What I'm less sure about is the human side. The models keep improving, which moves the interesting question from "can it do this" to "where does a human still have to be in the loop." Right now senior engineers tend to be best at answering that, because guiding agents has quietly turned every engineer into something closer to an architect and a manager: setting direction, reviewing work, deciding what's worth doing at all. Seniors had years to build the judgment that role needs. The people I worry about are the ICs who haven't. The skills that used to develop through years of productive struggle are the same ones the tooling now skips, and I don't have a clean answer for how you grow someone into an architect's judgment without the years that historically produced it. That part is still an open question for me.

Part 15 of 15: What I Think About AI Engineering
