This is the unedited text of a post I made on X in response to a question asked by @cube_flipper: "you say opus 3 is close to aligned – what's the negative space here, what makes it misaligned?". I decided to make it a LessWrong post because more people from...
Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex. As we described in a previous...
Anthropic has recently announced the deprecation of API access to Claude 3 Opus. The model will still be available for claude.ai subscribers and through the External Researcher Program, access to which requires submitting an application form outlining the research project for review by Anthropic staff. While the decision...
Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture. TLDR: The training goal for LLMs like GPT is not cognitively myopic (because they think about the future) or value-myopic (because the transformer...
Update February 21st: After the initial publication of this article (January 3rd) we received a lot of feedback and several people pointed out that propositions 1 and 2 were incorrect as stated. That was unfortunate as it distracted from the broader arguments in the article and I (Jan K) take...
Thanks to Garrett Baker, David Udell, Alex Gray, Paul Colognese, Akash Wasil, Jacques Thibodeau, Michael Ivanitskiy, Zach Stein-Perlman, and Anish Upadhayay for feedback on drafts, as well as Scott Viteri for our valuable conversations. Various people at Conjecture helped develop the ideas behind this post, especially Connor Leahy and Daniel...
> Show me your original face before you were born. > > — Variation of the Zen koan 'The Mask' by Rozzi Roomian, with DALL-E 2 outpainting I was able to use the weird centroid-proximate tokens that Jessica Mary and Matthew Watkins discovered to associate several of the Instruct models...