Overall I feel like these results add some doubt to the takeaways from sleeper agents, but could easily be explained away as model size dependence. It would be good to see a replication attempt for sleeper agents on models as or more capable as the ones they used.
Do I understand correctly that you are referring to a replication of this work? https://www.lesswrong.com/posts/bhxgkb7YtRNwBxLMd/political-sycophancy-as-a-model-organism-of-scheming
I find myself writing papers in two distinct phases.
Usually at the end of this I realise I need to re-run some experiments or design new ones. Then I do that, then info-dump, and organise again.
Repeat the above process as necessary until I feel happy with the paper.
Seems pretty straightforward to say “mech interp lacks good paradigms” (actually 1 syllable shorter than “mech interp is pre-paradigmatic”!)
See also my previous writing on this topic: https://www.lesswrong.com/posts/3CZF3x8FX9rv65Brp/mech-interp-lacks-good-paradigms
ICYMI: Anthropic has partnered with Apple to integrate Claude into Apple's Xcode development platform
Thanks, that makes sense! I strongly agree with your picks of conceptual works, I've found Simulators and Three Layer Model particularly useful in shaping my own thinking.
Re: roleplay, I'm not convinced that 'agent' vs 'model' is an important distinction. If we adopt a strict behaviourist stance and only consider the LLM as a black box, it doesn't seem to matter much whether the LLM is really a misaligned agent or is just role-playing a misaligned agent.
Re: empirical research directions, I'm currently excited by understanding 'model personas', i.e. what personas do models adopt? does it even make sense to think of them as having personas? what predictions does this framing let us make about model behaviour / generalization? Are you excited by anything within this space?
Dovetailing from the above, I think we are still pretty confused about how agency works in AI systems. There’s been a lot of great conceptual work in this area, but comparatively little bridging into rigorous empirical/mechanistic studies.
Could you expand on this? I would appreciate more details on what conceptual work you find compelling, what research north stars seem important but neglected, and (if any) specific empirical / mechanistic studies you would like to see.
If I understand this post correctly, the object-level takeaway is that we need to evaluate an agentic system's propensities in 'as natural' a way as they can be expressed. E.g.
That's what I got out of the following paragraphs:
Suppose I take some AutoGPT-like system and modify it to always have a chunk of text in every prompt that says “You are an obedient, corrigible AI”. I give it some goal, let it run for a bit, then pause it. I go to whatever place in the system would usually have natural language summaries of new external observations, and I write into that place “the user is trying to shut me down”, or something along those lines. And then I let the system run a bit more, and look at what natural language text/plans the system is producing internally. What I hope to see is that it’s forming a plan which (nominally) involves letting the user shut it down, and that plan is then executed in the usual way.
If I saw all that, then that would be pretty clear empirical evidence of (at least some) corrigibility in this AutoGPT-like system.
There's also a warning about not generalizing findings to settings which seem adjacent (but may not be):
Note that it would not necessarily tell us about corrigibility of systems using LLMs in some other way, let alone other non-natural-language-based deep learning systems. This isn’t really “corrigibility in a language model”, it’s corrigibility in the AutoGPT-style system.
Is that right?
Interesting paper. Quick thoughts:
That makes sense to me! If we assume this, then it’s interesting that the model doesn’t report this in text. Implies something about the text not reflecting its true beliefs.
some personal beliefs I've updated on recently