Field notes · 29 May 2026
A new model arrived this week. Our promise didn’t change.
Every time a new frontier model lands, we get the same question: does it change what you do?
So we checked, the only way we trust: we set the newest, strongest general-purpose models a real task — write a working piece of software — and then put what they produced to a mathematical proof of correctness. Not a test suite. A proof: the same bar every piece of software we ship has to clear.
Here is the honest result.
The models are good. Genuinely good. More often than not, the code they wrote was correct, and the proof went through. If you only wanted “probably fine”, you’d have been well served.
But “more often than not” is the whole problem.
Ask the same model the same question twice and you get two different programs — and one of them can be subtly, quietly wrong while the other is right. You cannot tell which you got by reading it; both look reasonable. Worse: we watched code that passed on one machine fail on another, unchanged. “It ran fine for me” turned out not to mean “it is correct”.
Predicting is not proving
This is not a knock on the models. It is what they are. A model predicts plausible code. It does not prove correct code, because it doesn’t prove anything — it produces the most likely-looking answer. The gap between code that looks right and code that is right is exactly where expensive mistakes live, and that gap does not close when the model gets cleverer. A better predictor is still a predictor.
It closes only one way: by proving the result.
What a new model doesn’t change
That is what we do, and a new model release doesn’t change it. A stronger model is welcome — it makes good software cheaper to make. It does not move the line that matters here: proven correct, or not shipped. When our software says a thing is done, it is provably done — not “passed the tests we happened to write”, not “worked on the machine we happened to use”. Proven.
New models will keep arriving, each a little better than the last. We’ll happily use the best of them. And every one will hit the same wall on the way out of the door — the proof — which is the only part we were ever really selling.