Sturb

When Role-playing, Do Models Believe What They Say?

TL;DR: When a model role-plays a persona, does it change only what it says, or also what it internally represents as true? To study this, we induce personas in five ways: prompting, in-context learning (ICL), supervised fine-tuning (SFT), Open Character Training (OCT), and Emergent Misalignment (EM). We measure...

Jul 254

Revisiting GSM-Symbolic: models seem to reason okay, actually

TL;DR: The GSM-Symbolic paper (ICLR 2025) purported to show that language models rely on pattern matching rather than genuine reasoning, by demonstrating that perturbing questions so that they break the pattern of the original catastrophically reduces model performance. Running the results again in March 2026...

Jun 524

Contributing to Technical Research in the AI Safety End Game

When the Claude Mythos system card was released, I initially felt like we had entered a late stage of AI safety, where the number of parties that can make a real impact shrinks down to the handful of labs at the frontier, a few companies too critical to exclude from...

May 234

Sanity-checking “Incompressible Knowledge Probes”

Or, did a chief scientist of an AI assistant startup conclusively show that GPT-5.5 has 9.7T parameters?[1] Introduction Recently, a paper was circulated on Twitter claiming to have reverse engineered the parameter count of many frontier closed-source models including the newer GPT-5.5 (9.7T parameters) and Claude Opus 4.6 (5.3T parameters)...

May 160

The Great Smoothing Out

I recently explored an interesting thought experiment about the Culture while talking with Max Harms. Specifically, Max argued that the Culture is far from a perfect utopia, while I felt that it was way better than almost anything else I could think of. One of the key cruxes in the...

Apr 2617

How Does an Agent with Multiple Goals Choose a Target?

Code can be found here. Thanks to Paul Colognese, Narmeen Oozeer, and Mickey Beurskens for collaboration on this work, and to Jonathan Shock for so much support and feedback. TL;DR: We investigated how a maze-solving RL agent (not a transformer model) internally represents and switches between multiple sequential goals....

Apr 814

The Garden

The mandate had been given by God that morning, and Nathanael and Amon had arrived in the nascent realm to shape His will. The task was simple: creation, in all its perfection. As divine as this instruction was, it did demand a great deal of creativity on the part of...

Apr 64


Sanity-checking “Incompressible Knowledge Probes”

When Role-playing, Do Models Believe What They Say?

Vipassana Meditation and Active Inference: A Framework for Understanding Suffering and its Cessation

Whole Brain Emulation as an Anchor for AI Welfare
