Across the benchmark (wrapped frontier models, distilled local models, blind LLM judges), the framework wins consistently. Human-judge results for v0.7 are still being compiled.
Versioned trajectory
LLM-blind judge · Human judge · win rate = % of paired comparisons where blinded judges preferred the Hammerstein response.
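As a minimal sketch of the win-rate definition above, assuming each paired comparison is stored as one label naming the preferred arm. The labels "hammerstein" and "control" and the function name are hypothetical, not taken from the benchmark spec:

```python
from collections import Counter


def win_rate(preferences, arm="hammerstein"):
    """Percentage of paired comparisons in which `arm` was preferred."""
    if not preferences:
        raise ValueError("no paired comparisons to score")
    counts = Counter(preferences)
    return 100.0 * counts[arm] / len(preferences)


# Hypothetical forced-choice outcomes, one label per blinded pair.
example = ["hammerstein", "control", "hammerstein", "hammerstein"]
print(f"{win_rate(example):.1f}%")  # 75.0%
```

A win rate of 50% means judges showed no preference between the arms, so the margin of interest is the distance above 50%.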
v0.7 — compilation in flight
The first generation under human judges. Methodology landed 2026-05-13 with a vocabulary-scrubbed control arm — the test of whether the v0 → v0.4 margin survives when judges score on what the response actually does, not the words it uses. Results compile from four raters across four batches.
Run profile (per spec)
- Treatment arm: wrapped in the Hammerstein system prompt + ethical-constraint rail
- Control arm: bare frontier model + a single line of role context
- Vocab-scrubbed control: Hammerstein arm with framework vocabulary stripped; isolates the doctrine-shape from the doctrine-vocabulary bonus
- Judges: 4 human raters · double-blind on arm assignment · pre-registered rubric
- Rubric: 5 axes scored 1–5 plus a forced binary preference per pair
- Batches: batch-a-results-2026-05-11.json, batch-b-results-2026-05-11.json, batch-d-results-2026-05-11.json (see Replicate it below)
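One way to picture a single rating under this run profile is sketched below. It is an illustration only: the record layout, the slot labels "A"/"B", the unblinding key, and the placeholder axis names are all assumptions, and the actual schema of the batch JSON files may differ.

```python
from dataclasses import dataclass
from statistics import mean

# Placeholder axis names (assumption); substitute the real rubric axes.
AXES = ("axis_1", "axis_2", "axis_3", "axis_4", "axis_5")


@dataclass(frozen=True)
class PairRating:
    rater: str
    pair_id: str
    scores: dict    # {"A": {axis: 1..5}, "B": {axis: 1..5}}
    preferred: str  # forced binary preference: "A" or "B"

    def __post_init__(self):
        if self.preferred not in ("A", "B"):
            raise ValueError("forced preference must be 'A' or 'B'")
        for slot in ("A", "B"):
            slot_scores = self.scores[slot]
            if set(slot_scores) != set(AXES):
                raise ValueError(f"slot {slot} must score every axis exactly once")
            if any(not 1 <= s <= 5 for s in slot_scores.values()):
                raise ValueError("axis scores must lie in 1..5")


def treatment_axis_means(ratings, key):
    """Mean per-axis score of the treatment response after unblinding.

    `key` maps pair_id -> the slot ("A" or "B") that held the treatment
    arm. Raters never see this mapping.
    """
    return {
        axis: mean(r.scores[key[r.pair_id]][axis] for r in ratings)
        for axis in AXES
    }


def treatment_preferred(ratings, key):
    """One boolean per rating: did the rater prefer the treatment slot?"""
    return [r.preferred == key[r.pair_id] for r in ratings]
```

Keeping the arm assignment in a separate key, rather than inside the rating record, is what lets the same records be scored blind and unblinded only at analysis time.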
Rubric axes
Replicate it
The benchmark is open source. Pull the spec, run the rubric against any model you choose, and open an issue if your results materially differ. Treat this page as a moving target — it should be falsified by the next person who tries.
Honest caveats
Framework-vocabulary bonus
Judges may reward responses that sound "strategic." v0.4 documents this; v0.7 includes a vocabulary-scrubbed control arm to bound it. The falsifiable test: does the margin survive when framework vocabulary is stripped? Results pending.
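The page does not state how "survives" will be judged, so the following is only one plausible reading: compare each arm's margin over a 50% coin flip and check it with an exact one-sided binomial test. The function names and all counts are hypothetical, and treating pairs as independent is a simplification, since the same raters score many pairs.

```python
from math import comb


def binomial_p_at_least(wins, n, p=0.5):
    """Exact one-sided P(X >= wins) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(wins, n + 1)
    )


def margin(wins, n):
    """Win-rate margin over a coin flip, in percentage points."""
    return 100.0 * wins / n - 50.0


# Hypothetical counts for illustration only; these are NOT benchmark results.
n_pairs = 40
full_wins = 30      # full Hammerstein arm preferred over control
scrubbed_wins = 26  # vocabulary-scrubbed arm preferred over control

for label, wins in (("full", full_wins), ("scrubbed", scrubbed_wins)):
    p_value = binomial_p_at_least(wins, n_pairs)
    print(f"{label}: margin={margin(wins, n_pairs):+.1f} pts, p={p_value:.4f}")
```

Under this reading, the vocabulary bonus is bounded by the gap between the two margins, and the framework's claim is falsified if the scrubbed margin is indistinguishable from zero.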
Model-version drift
Every entry runs on the then-current frontier model. A win at v0 against Sonnet 4.0 and a win at v0.7 against Sonnet 4.6 are not the same evidence. Treat the trajectory as relative, not absolute.
Question-set ceiling
The current question set is small enough to detect a real margin, not large enough to claim coverage across the tabletop universe. The next benchmark generation will widen the set and add CDG and operational-scale systems.