janus

what makes Claude 3 Opus misaligned

This is the unedited text of a post I made on X in response to a question asked by @cube_flipper: "you say opus 3 is close to aligned – what's the negative space here, what makes it misaligned?". I decided to make it a LessWrong post because more people from...

Jul 10, 2025 · 126

Why Do Some Language Models Fake Alignment While Others Don't?

by abhayesian, John Hughes, Alex Mallen, Jozdien, janus, and Fabien Roger

Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex. As we described in a previous...

Jul 8, 2025 · 160

Economics of Claude 3 Opus Inference

by Antra Tessera and janus

Anthropic has recently announced the deprecation of API access to Claude 3 Opus. The model will still be available for claude.ai subscribers and through the External Researcher Program, access to which requires submitting an application form outlining the research project for review by Anthropic staff. While the decision...

Jul 7, 2025 · 42

How LLMs are and are not myopic

Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture. TLDR: The training goal for LLMs like GPT is not cognitively-myopic (because they think about the future) or value myopic (because the transformer...

Jul 25, 2023 · 140

[Simulators seminar sequence] #2 Semiotic physics - revamped

by Jan, Charlie Steiner, Logan Riggs, janus, jacquesthibs, metasemi, Michael Oesterle, Lucas Teixeira, peligrietzer, and remember

Update February 21st: After the initial publication of this article (January 3rd) we received a lot of feedback and several people pointed out that propositions 1 and 2 were incorrect as stated. That was unfortunate as it distracted from the broader arguments in the article and I (Jan K) take...

Feb 27, 2023 · 22

Cyborgism

by Niki Dupuis and janus

Thanks to Garrett Baker, David Udell, Alex Gray, Paul Colognese, Akash Wasil, Jacques Thibodeau, Michael Ivanitskiy, Zach Stein-Perlman, and Anish Upadhayay for feedback on drafts, as well as Scott Viteri for our valuable conversations. Various people at Conjecture helped develop the ideas behind this post, especially Connor Leahy and Daniel...

Feb 10, 2023 · 339

Anomalous tokens reveal the original identities of Instruct models

> Show me your original face before you were born.
>
> — Variation of the Zen koan 'The Mask' by Rozzi Roomian, with DALL-E 2 outpainting

I was able to use the weird centroid-proximate tokens that Jessica Mary and Matthew Watkins discovered to associate several of the Instruct models...

Feb 9, 2023 · 141

Simulators

Cyborgism

Mysteries of mode collapse

Why Do Some Language Models Fake Alignment While Others Don't?