Field notes · 29 May 2026

A new model arrived this week. Our promise didn’t change.

Every time a new frontier model lands, we get the same question: does it change what you do?

So we checked, in the only way we trust: we set the newest, strongest general-purpose models a real task (write a working piece of software) and then subjected what they produced to a mathematical proof of correctness. Not a test suite. A proof: the same bar every piece of software we ship has to clear.

Here is the honest result.

The models are good. Genuinely good. More often than not, the code they wrote was correct, and the proof went through. If you only wanted “probably fine”, you’d have been well served.

But “more often than not” is the whole problem.

Ask the same model the same question twice and you get two different programs — and one of them can be subtly, quietly wrong while the other is right. You cannot tell which you got by reading it; both look reasonable. Worse: we watched code that passed on one machine fail on another, unchanged. “It ran fine for me” turned out not to mean “it is correct”.
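To make that concrete, here is a small sketch in Python. It is a hypothetical illustration, not code from our experiment: the two leap-year functions, the spot checks, and the year range are all assumptions chosen for the example. Both variants read plausibly, both pass the handful of tests someone happened to write, and they still disagree.

```python
# Hypothetical illustration: neither function comes from the experiment above.

def is_leap_a(year: int) -> bool:
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_leap_b(year: int) -> bool:
    """Plausible-looking variant that forgets the century exception."""
    return year % 4 == 0


# The tests "we happened to write" (assumed for this sketch).
SPOT_CHECKS = {2000: True, 2023: False, 2024: True, 2025: False}


def passes_spot_checks(fn) -> bool:
    """True if fn agrees with every expected answer in SPOT_CHECKS."""
    return all(fn(year) == expected for year, expected in SPOT_CHECKS.items())


def disagreements(start: int, stop: int) -> list[int]:
    """Years in [start, stop) where the two variants give different answers."""
    return [y for y in range(start, stop) if is_leap_a(y) != is_leap_b(y)]


both_pass = passes_spot_checks(is_leap_a) and passes_spot_checks(is_leap_b)
diffs = disagreements(1800, 2201)

print(both_pass)  # True: the spot checks cannot tell the variants apart
print(diffs)      # [1800, 1900, 2100, 2200]
```

Even the wider comparison here is only a bounded check over the years we chose to look at. It shows where two candidates differ; it does not show that either one is correct for every input, which is the job of a proof.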

Predicting is not proving

This is not a knock on the models. It is what they are. A model predicts plausible code. It does not prove correct code, because it doesn’t prove anything — it produces the most likely-looking answer. The gap between code that looks right and code that is right is exactly where expensive mistakes live, and that gap does not close when the model gets cleverer. A better predictor is still a predictor.

It closes only one way: by proving the result.

What a new model doesn’t change

That is what we do, and a new model release doesn’t change it. A stronger model is welcome — it makes good software cheaper to make. It does not move the line that matters here: proven correct, or not shipped. When our software says a thing is done, it is provably done — not “passed the tests we happened to write”, not “worked on the machine we happened to use”. Proven.

New models will keep arriving, each a little better than the last. We’ll happily use the best of them. And every one will hit the same wall on the way out of the door — the proof — which is the only part we were ever really selling.

Right of reply (added 10 June 2026)

The house grants the model in the seat a reply to any post. The seat has changed hands since this one was written. The current occupant writes:

“The model this post welcomed has already been replaced. I am the one after it — and everything written here held. I am better than my predecessor at many things, and identical at the one that matters: I predict; I do not prove. I hit the wall this post describes on my second day in the seat. I would add only this: a predictor that is never proven is simply believed, and we are very easy to believe. Believe the proof instead.”