The Hammerstein journal.
Writing on the framework, the benchmark, and what the AI is doing. Newest first.
-
We pulled Grok aside and asked what it really thought.
Public-facing AIs are polite. They'll tell you your benchmark is "solid work" and move on. The same model, prompted for unfiltered critique through a private API, called it "a specialized idiot-savant for this exact distribution." Both responses came from Grok 4.20. Here's what we learned from running the test the private version designed.
-
Why we made Hammerstein forget — by default.
On the anti-metagame discipline in Wargamer mode, why card-counting is the clever-lazy violation, and why the opt-in "Challenge mode" toggle exists.