Public benchmark · Open methodology · v0 → v0.7 · methodology shipped 2026-05-13

Across the benchmark — frontier wraps, distilled local models, blind LLM judges — the framework wins consistently. Human-judge v0.7 results compiling.

Best win rate: 100% (v0.1 OOD · 48/48 LLM-blind ratings)
Latest LLM-blind: 75% (v0.4 cross-scale · 7B-vs-Sonnet · 18/24)
v0.7 status: compiling (4 raters · double-blind · vocab-scrubbed control)
Models tested against: 3 (Opus 4.7 · Sonnet 4.6 · GPT-5)
§ I

Versioned trajectory

methodology evolves; the win signal holds
Version · Judge · n · Win rate · Setup

v0 · LLM-blind · n=54 · 98% · frontier panel
First public bake-off. 6 strategic-reasoning questions × 3 frontier families (Opus 4.7 / Sonnet 4.6 / GPT-5), Hammerstein-on-frontier vs raw frontier. 53 of 54 ratings preferred Hammerstein.

v0.1 · LLM-blind · n=48 · 100% · OOD generic
4 out-of-domain strategic-reasoning questions, same frontier panel. Unanimous across judges and model families. Also: an ablation on Sonnet found the RAG corpus decorative at frontier scale; the system prompt alone matches the full wrap.

v0.4 · LLM-blind · n=24 · 75% · cross-scale
Hammerstein-7B (the framework distilled into Qwen2.5-7B local weights, no system prompt) vs raw Claude Sonnet 4.6. 18 of 24 ratings preferred the 7B local model. 4 of 6 questions were unanimous across all 4 blind judges. Pair-1 control (7B vs raw 7B): 24/24.

v0.7 · Human · n=32 * · pending · spec → results
First human-judged benchmark. Methodology shipped 2026-05-13; results compiling from 4 raters, double-blind, with a vocabulary-scrubbed control arm.

LLM-blind judge  ·  Human judge  ·  win rate = % of paired comparisons where blinded judges preferred the Hammerstein response.
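
The counts in the trajectory table are enough to recompute these numbers yourself. A minimal sketch in plain Python (standard library only); the exact binomial check at the end is an illustrative sanity check against a 50/50 coin-flip null, not part of the published methodology:

```python
from math import comb

def win_rate(wins: int, total: int) -> float:
    """Win rate as defined above: share of paired comparisons the framework won."""
    return wins / total

def binom_p_one_sided(wins: int, total: int, p: float = 0.5) -> float:
    """One-sided exact binomial p-value for observing `wins` or more under chance p."""
    return sum(comb(total, k) * p**k * (1 - p) ** (total - k) for k in range(wins, total + 1))

# Published rating counts from the trajectory table above.
for label, wins, total in [("v0", 53, 54), ("v0.1", 48, 48), ("v0.4", 18, 24)]:
    print(f"{label}: {win_rate(wins, total):.0%}  (one-sided p = {binom_p_one_sided(wins, total):.2g})")
```

The p-values treat every rating as independent, which a judge panel scoring shared questions does not strictly guarantee; read them as a rough bound, not a formal claim.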

§ II

v0.7 — compilation in flight

spec 2026-05-13 · results pending · read the spec ↗

The first generation under human judges. Methodology landed 2026-05-13 with a vocabulary-scrubbed control arm — the test of whether the v0 → v0.4 margin survives when judges score on what the response actually does, not the words it uses. Results compile from four raters across four batches.

Run profile (per spec)

Treatment arm: wrapped in the Hammerstein system prompt + ethical-constraint rail
Control arm: bare frontier model + a single line of role context
Vocab-scrubbed control: Hammerstein arm with framework vocabulary stripped; isolates the doctrine-shape from the doctrine-vocabulary bonus
Judges: 4 human raters · double-blind on arm assignment · pre-registered rubric
Rubric: 5 axes scored 1–5 plus a forced binary preference per pair
Batches: batch-a-results-2026-05-11.json, batch-b-results-2026-05-11.json, batch-d-results-2026-05-11.json (see Replicate it below)

Rubric axes

Tactical soundness · Strategic framing · Rules accuracy · Anti-metagame discipline · Actionability of orders
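
One plausible shape for a single rating record under that rubric, as a minimal sketch; the field names here are placeholders, and the authoritative schema lives in the spec:

```python
from dataclasses import dataclass

AXES = (
    "tactical_soundness",
    "strategic_framing",
    "rules_accuracy",
    "anti_metagame_discipline",
    "actionability_of_orders",
)

@dataclass
class PairRating:
    """One rater's verdict on one blinded response pair (hypothetical field names)."""
    pair_id: str
    rater_id: str
    scores_a: dict[str, int]   # axis -> 1..5 for blinded response A
    scores_b: dict[str, int]   # axis -> 1..5 for blinded response B
    preferred: str             # forced binary preference: "A" or "B"

    def validate(self) -> None:
        for scores in (self.scores_a, self.scores_b):
            assert set(scores) == set(AXES), "every axis must be scored"
            assert all(1 <= v <= 5 for v in scores.values()), "axis scores are 1-5"
        assert self.preferred in ("A", "B"), "preference is forced; no abstain"
```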
§ III

Replicate it

open-source · MIT · contributions welcome

The benchmark is open source. Pull the spec, run the rubric against any model you choose, and open an issue if your results materially differ. Treat this page as a moving target: it exists to be falsified by the next person who tries.
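
If you pull the batch files named in the run profile, a few lines of scripting recover a first-cut tally of which arm the raters preferred. A sketch under assumptions: each JSON file is taken to hold a list of rating records, and "preferred_arm" is a placeholder key standing in for however the spec maps blinded A/B labels back to arms. This is not the pre-registered analysis, just a quick cross-check:

```python
import json
from collections import Counter
from glob import glob

def tally_preferences(pattern: str = "batch-*-results-2026-05-11.json") -> Counter:
    """Count forced preferences per arm across the published result batches."""
    wins: Counter = Counter()
    for path in sorted(glob(pattern)):
        with open(path) as f:
            for record in json.load(f):
                # Placeholder key: the spec defines how blinded A/B labels are
                # unblinded back to treatment / control / vocab-scrubbed.
                wins[record["preferred_arm"]] += 1
    return wins

if __name__ == "__main__":
    wins = tally_preferences()
    total = sum(wins.values())
    for arm, n in wins.most_common():
        print(f"{arm}: {n}/{total} ({n / total:.0%})")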

§ IV

Honest caveats

what would change our mind
Caveat 01

Framework-vocabulary bonus

Judges may reward responses that sound "strategic." v0.4 documents this; v0.7 includes a vocabulary-scrubbed control arm to bound it. The falsifiable test: does the margin survive when framework vocabulary is stripped? Results pending.

Caveat 02

Model-version drift

Every entry runs on the then-current frontier model. A win at v0 against Sonnet 4.0 and a win at v0.7 against Sonnet 4.6 are not the same evidence. Treat the trajectory as relative, not absolute.

Caveat 03

Question-set ceiling

The current question set is small: enough to detect a real margin, not enough to claim coverage across the tabletop universe. The next benchmark generation will widen the set and add CDG and operational-scale systems.