Field notes · 24 May 2026
Where the prover pushes back hardest
The interesting part wasn't the headline. It was the spread.
Last week we ran every routine on our foundational numeric ladder through our new, stricter measurement pipeline. Eleven of the standard building blocks of scientific computing — the routines that ultimately sit underneath any safety-critical numerical software — checked end-to-end, with no human intervention except a single one-line patch on one routine where the factory generated a small type declaration in the wrong place. The aggregate landed at 91.2% of proof obligations discharged automatically. In April we published 98.5% on the same routines. That is a worse score. The interesting part of the run is not the headline. It is the spread.
| total | unproved | discharge | |
|---|---|---|---|
| QR factorisation | 314 | 4 | 98.7% |
| daxpy (vector add) | 294 | 15 | 94.9% |
| ddot (dot product) | 349 | 22 | 93.7% |
| LU factorisation | 600 | 40 | 93.3% |
| Fast Fourier Transform | 661 | 48 | 92.7% |
| Euclidean norm | 292 | 23 | 92.1% |
| Cholesky factorisation | 428 | 42 | 90.2% |
| outer product update | 374 | 42 | 88.8% |
| matrix-vector product | 399 | 46 | 88.5% |
| dscal (vector scale) | 216 | 25 | 88.4% |
| matrix-matrix product | 449 | 79 | 82.4% |
Matrix-matrix product at the bottom. That is the operation where the contract the proof checker has to verify is the most demanding: every cell of the output matrix has to equal a specific mathematical sum of products from the two input matrices, for every row and every column. The proof checker confirms the actual mathematical relationship — the body computes the right answer — but cannot prove, without tighter preconditions, that some of the underlying integer arithmetic stays inside a 32-bit range. That is where most of the 79 unproved obligations live.
QR factorisation at the top. One of the most-studied operations in numerical linear algebra. The way it works locally — one column at a time, each step a self-contained reflection — gives the proof checker a cleaner picture to verify. Of 314 obligations, four remain.
Under our earlier measurement regime every one of these eleven routines came in somewhere between 96% and 99% — a flat line that told you nothing about which routines were structurally harder. The new spread, lower in absolute terms, is doing real diagnostic work. It tells us where the remaining engineering effort will pay off cheapest (the routines clustered around 88% have small, well-defined unproved sets that better preconditions can close) and where it will be most expensive (matrix-matrix product, where closing the gap means rewriting parts of the arithmetic reasoning to use unbounded integers).
What this isn't. Three further textbook routines from our control-systems set — a Kalman filter, a PID controller, and a quaternion rotation — have not yet been rerun under the stricter regime. Their old numbers are still on our website, flagged as legacy. The one-line patch on one routine exposes a real gap in our automated pipeline that we will need to close. And eleven routines is not the universe of every routine we will eventually run.
The bet is that disciplined measurement, applied across a meaningfully large set of routines, makes you smarter about where to invest engineering attention. The spread above is the first data point in that bet that we trust to actually inform a decision. Per-routine detail with the Fortran source and the Ada we emitted is at /results/.
Last week we ran every BLAS/LAPACK routine on our foundational numeric ladder through the new strong-postcondition pipeline. Eleven routines, spec + body, no manual intervention except a single one-line patch on dger where the emitted spec used a Long_Float_Vectors instantiation it had forgotten to declare. The aggregate landed at 91.2% — 3,990 of 4,376 obligations discharged under gnatprove --level=1. Our April number on the same routines was 98.5%. That is a worse score. The interesting part of the run is not the headline. It is the spread.
| total | unproved | discharge | |
|---|---|---|---|
lapack-dgeqrf | 314 | 4 | 98.7% |
blas-daxpy | 294 | 15 | 94.9% |
blas-ddot | 349 | 22 | 93.7% |
lapack-dgetrf | 600 | 40 | 93.3% |
fftpack-cfftf | 661 | 48 | 92.7% |
blas-dnrm2 | 292 | 23 | 92.1% |
lapack-dpotrf | 428 | 42 | 90.2% |
blas-dger | 374 | 42 | 88.8% |
blas-dgemv | 399 | 46 | 88.5% |
blas-dscal | 216 | 25 | 88.4% |
blas-dgemm | 449 | 79 | 82.4% |
dgemm at the bottom. Matrix-matrix product. The hardest test in the set, and the score reflects that. The 79 unproved obligations are bounded engineering work, not unsoundness in the result.
dgeqrf at the top. QR factorisation. 314 obligations, 4 unproved — the prover handles this one cleanly at --level=1.
Under the old test, every one of these eleven routines came in somewhere between 96% and 99% — a flat line that told you nothing about which routines were structurally harder. The new spread, lower in absolute terms, is doing real diagnostic work. It tells us where the closure investment will be cheapest (the routines clustered around 88%) and where it will be most expensive (dgemm).
Across the eleven, the unproved residue is uniformly bounded arithmetic reasoning, not "the body returned the wrong value." That is the result we were after.
What this isn't. Three textbook control routines from the Phase 2 set — a Kalman filter, a PID controller, and a quaternion rotation — have not yet been rerun under strong postconditions. Their old numbers are still on the OWI page, flagged as legacy. The dger workaround exposes a real gap in the type-resolver: when a source uses a primitive-typed vector and the nav-tree never declares a record-shape element type, the LLM has to invent an instantiation it cannot reach in scope. That is engineering work to do, not a methodology problem. Eleven routines is also not "every routine we will eventually run"; consistency across this set is not consistency across the universe.
The bet is that disciplined measurement, applied across a meaningfully large set of routines, makes you smarter about where to invest engineering attention. The spread above is the first data point in that bet that we trust to actually inform a decision. Per-routine detail — the Fortran source, the emitted Ada, and a short commentary — is at /results/.