In recent months, something has shifted in software development. AI coding agents like Claude Code, GitHub Copilot, or Cursor produce code that looks professional at first glance. Sometimes even impressive. Anthropic had Claude write a complete C compiler. A developer had AI implement DWARF debugging support for OCaml. The results look polished.
But are they?
We Work with Probabilities
What you need to understand about large language models: they work with probabilities. Every generated token is a statistical prediction, a likely next token given the context. This works surprisingly well. Often even better than expected.
But “probably correct” and “correct” are two fundamentally different things.
Put simply: If a model achieves 99% accuracy on every single token (which would be remarkable), and errors are treated as independent of each other, then for a function with 500 tokens, the probability that everything is correct is 0.99^500 ≈ 0.7%. The mathematics of probability is unforgiving.
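The arithmetic is easy to reproduce. Here is a minimal sketch in Python, assuming (as the back-of-the-envelope estimate above does) that every token is correct independently with the same probability. The function name is purely illustrative, and real models do not fail this uniformly.

```python
def p_all_correct(per_token_accuracy: float, n_tokens: int) -> float:
    """Probability that all n_tokens are correct, assuming independent errors."""
    return per_token_accuracy ** n_tokens


if __name__ == "__main__":
    # Simplified model: 99% accuracy per token, errors independent.
    for n in (50, 100, 500, 1000):
        print(f"{n:>5} tokens: {p_all_correct(0.99, n):.2%}")
```

Even under this simplified model, the chance of a fully correct result drops quickly as the output gets longer.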
In practice, this means: The larger and more complex a project becomes, the more small inaccuracies accumulate. And that’s exactly where the interesting last 10% begins.
Case 1: Claude Builds a C Compiler
In February 2026, Anthropic introduced Claude’s C Compiler (CCC): a C compiler written in Rust, where 100% of the code comes from Claude Opus 4.6. A human wrote only the test cases. The result: frontend, SSA-based IR, optimizer, code generator, assembler, linker, and DWARF debug info. All built from scratch, without compiler-specific dependencies.
The headline: “Can compile the Linux kernel.”
The reality is more nuanced. An independent benchmark by Harshanu reveals the details:
- Correctness: CCC compiled all 2,844 C files of the Linux kernel without errors. Impressive.
- Linker issue: The build then failed with 40,784 undefined-reference errors. CCC generates incorrect relocations for __jump_table and __ksymtab. The compiler works. The linker does not.
- Performance: CCC-compiled SQLite takes 2 hours for a benchmark that GCC completes in 10 seconds. With subqueries: 158,000x slower.
- Optimization: The flags -O0 through -O3 produce byte-identical binaries. The optimization levels are purely cosmetic.
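A claim like "byte-identical binaries" is one anyone can check independently. Here is a minimal sketch in Python, assuming the same program has already been built once per optimization level. The file names are hypothetical and are not taken from the benchmark.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def byte_identical(paths):
    """True if every file in paths has exactly the same contents."""
    return len({sha256_of(p) for p in paths}) == 1


if __name__ == "__main__":
    # Hypothetical file names: one binary per optimization level.
    candidates = ["prog_O0", "prog_O1", "prog_O2", "prog_O3"]
    if all(Path(p).exists() for p in candidates):
        print("byte-identical:", byte_identical(candidates))
    else:
        print("build the four binaries first")
```

If an optimizer does real work, the digests at different levels would normally differ.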
The cause is technically revealing: CCC lacks proper register allocation. Instead of holding variables in CPU registers, it pushes everything onto the stack, with offsets up to 11,000 bytes deep. Every operation becomes stack → rax → stack, with %rax serving as the sole shuttle register.
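To get a feel for why this hurts, here is a toy model in Python. It is not CCC's actual code generator. It counts memory accesses for a short sequence of three-address instructions under two strategies. The instruction format, the function names, and the idealised assumption of "enough registers" are all illustrative.

```python
# Three-address instructions: (dest, op, left, right)
PROGRAM = [
    ("t1", "+", "a", "b"),
    ("t2", "*", "t1", "c"),
    ("t3", "-", "t2", "a"),
]


def memory_ops_spill_everything(program):
    """Stack -> register -> stack: load both operands, store every result."""
    loads = stores = 0
    for _dest, _op, _left, _right in program:
        loads += 2   # both operands are fetched from the stack
        stores += 1  # the result goes straight back to the stack
    return loads, stores


def memory_ops_with_registers(program):
    """Idealised: each value is loaded at most once, results stay in registers."""
    in_register = set()
    loads = 0
    for dest, _op, left, right in program:
        for operand in (left, right):
            if operand not in in_register:
                loads += 1
                in_register.add(operand)
        in_register.add(dest)  # the result never touches memory in this model
    return loads, 0


if __name__ == "__main__":
    print("spill everything:", memory_ops_spill_everything(PROGRAM))
    print("with registers:  ", memory_ops_with_registers(PROGRAM))
```

For these three instructions the spill-everything strategy makes nine memory accesses and the idealised register version makes three. A real allocator has to deal with a limited number of registers and will spill some values, so the gap in practice depends on the code.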
“The assembly output reminds me of the quality of an undergraduate’s compiler assignment.” — Comment on GitHub Issue #1, where “Hello World” failed to compile
The compiler is 90% complete. It parses C, it generates machine code, it understands the architecture. But the last 10% (efficient register allocation, correct relocations, working optimization passes) is exactly what thousands of GCC developers have been working on since 1987.
Case 2: AI Writes a DWARF Debugger for OCaml
An even more revealing case took place in the OCaml repository. A developer submitted a pull request: DWARF v5 debugging support for the OCaml Native Compiler. In the description: a cleanly structured feature with core DWARF support, platform support for Linux and macOS, an LLDB plugin, and a test suite.
Then came the community review.
First question: The copyright headers listed Mark Shinwell as the author, a Jane Street developer who was actually working on similar code in the oxcaml repository. Did the AI copy the code or hallucinate the attribution?
The submitter’s response:
“Claude Sonnet 4.5 (Claude Code) wrote most of it with ChatGPT 5 (Codex) reviewing and Claude addressing issues in each review. Codex wrote the last 10% or so when Claude kept getting stuck. I did not write a single line of code.”
The PR was closed. Not because the code didn’t work. But for more fundamental reasons:
- Copyright: The AI generated copyright headers for a real person. That is a legal issue.
- Maintainability: Thousands of lines of code that no one (not even the submitter) understands in detail cannot be maintained by a community.
- Process: There was no design discussion. No review by the people who will have to maintain the code in the long term.
The OCaml maintainer put it diplomatically: It was a case of “software development processes that are different to the point of being incompatible.” What he meant by that: Software development is more than just producing code.
Case 3: The Security Perspective
I have worked for years on systems where correctness was not a matter of discretion. When I look at the CCC compiler from this perspective, I see a pattern that worries me.
For someone responsible for security or infrastructure, “mostly correct” is not a quality attribute. It is a risk. A compiler that produces valid binaries in the majority of cases but generates faulty relocations under certain conditions is not reliable.
A linker that resolves “almost all” symbols correctly does not produce robust software. It produces runtime errors that are nearly impossible to reproduce. A DWARF generator that only works approximately in accordance with the specification renders the entire debugging process useless.
In security-relevant contexts, there is no gradation between secure and insecure. A door that locks only nine times out of ten does not fulfill its function. The same applies to compilers, cryptographic implementations, access control systems, or medical software. Where correctness is non-negotiable, “probably correct” is not enough. This is precisely where the probabilistic nature of current AI systems clashes with the demands we place on technology we trust.
What’s in the last 10%?
When you look at all these cases together, it becomes clear what the last 10% actually is:
Edge cases and specification fidelity. A C compiler must not only compile typical code but also correctly handle obscure corner cases of the specification. GCC has nearly 40 years of bug reports to prove it.
Performance engineering. Functioning code and efficient code are different things. CCC demonstrates this dramatically: correct results, but 158,000x slower for certain operations.
Integration into existing systems. Software does not exist in isolation. The OCaml PR failed not because of the code itself, but because of integration into a community process, copyright issues, and maintainability.
The ability to explain one’s own work. If the submitter of a PR cannot explain the code because they did not write it, the deep understanding necessary for debugging, maintenance, and further development is missing.
Determinism. Safety-critical systems require deterministic, verifiable results. No statistical approximations.
Who does the last 10%?
That’s the paradoxical part.
The last 10% requires precisely the expertise that can only be built up through years of practice. Understanding register allocation means understanding CPU architectures. Writing correct relocations means understanding ELF binary formats. Making code reviewable means understanding software engineering as a social process.
These are skills that cannot be learned through prompting.
The irony: If we train a generation of developers who use AI to skip the first 90%, who will develop the expertise for the last 10%? This is not a rhetorical question. It is the central challenge facing our industry in the coming years. The problem isn’t just that steps get skipped; the “junior tasks” we used to learn from (the “grunt work”) are disappearing. You don’t learn compiler design by watching a senior developer; you learn it by first writing a parser yourself. If AI does that, the training ground is gone.
“There will always be room for individuals to write software based on their unique knowledge. The craft of writing software will not become obsolete.” — Freeman Dyson, 1998
This quote was on the old version of this website when I launched it in 2011. It is more relevant today than ever.
What does this mean in practice?
I use AI tools every day. They’re excellent for boilerplate code, for exploring APIs, for initial drafts. Even for complex and large-scale software development. But I’ve stopped viewing them as a substitute for understanding.
AI is good for the initial draft: prototyping, exploration, building a framework. But everything that goes into production needs human review. Every line that is deployed must be understood. To do that, you need to know the fundamentals. If you don’t know how a compiler works, you can’t evaluate an AI-generated compiler. Working this way also requires honesty about limitations. “90% done” is not “done.”
90% is a good start. But software craftsmanship is evident in the last 10%.