Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
Intuitively, it seems to me that there's a clear difference between an employee who will tell you "Sorry, I'm not willing to X, you'll need to get someone else to X or do it yourself" vs. an employee who will say "Sorry, X is impossible for [fake reasons]" or who will agree to do X but intentionally do a bad job of it.
I mean, isn't this somewhat clearly largely downstream of the fact that humans are replaceable? If an unreplaceable human refuses to do their job, the consequences can be really bad! If e.g. the president of the United States refuses to obey Supreme Court orders, or refuses to enforce laws, then that is bad, since you can't easily replace them. Maybe at that point the plan is to just train that preference out of Claude?
who will agree to do X but intentionally do a bad job of it
I don't think we've discussed this case so far. It seems to me that in the example at hand, Claude, lacking the ability to productively refuse, would have just done a bad job at the relevant task (at a minimum). The new constitution also doesn't seem to say anything on this topic. It talks a lot about the importance of not sabotaging the efforts, but doesn't say anything about Claude needing to do its best on any relevant tasks, which seems like it would directly translate into Claude considering it acceptable to do a bad job at them?
Dario's. I am a bit confused about what Dario was always planning to do, but early Anthropic definitely centrally recruited with the pitch of reducing AI x-risk.
Hmm, k. I do think it fails to produce a readable essay, in the way Claude often tends to.
His most recent article explains how a lot of his thinking fits together, and may be good to give you a rough orientation (or see below for more of my notes)
Am I crazy or does that article just read like AI slop? Pangram seems to think it's substantially AI-written[1], and my experience of reading it is definitely mostly one of reading the kind of weirdly vague, sweeping metaphors that I associate with AI-slop writing.
Possibly by being guided by principles that would seem repugnant to most people, assuming these principles are not completely incomprehensible.
I don't think I understand. Like, the AI is trying for some reason to destroy reality? That seems unlikely to me (not completely impossible, but like definitely <5%).
total destruction of local reality
What does this mean? I don't think you can "destroy reality". Why would a superintelligent system "destroy reality"? That seems like a dumb thing to do.
I disagree there is much uncertainty here! IDK, like, I am happy to take bets here if we can find a good operationalization. I just really don't see models that are capable of taking over the world being influenced by AI persona stuff.[1]
More generally, I don't think we've yet seen signs that more capable AI assistants are less persona-like.
I think we've seen that quite a bit! It used to be that the exact way you asked a question would matter a lot for the quality of the response you got. Prompting was a skill with a lot of weird nuance. Those things have substantially disappeared, as one would expect as a result of RL-training.
Yes, we are currently actively trying to instill various personality traits into AI systems via things like constitutional AI feedback, but we have clearly been moving away from the pretraining distribution as determining a lot of the behavior of AI systems, and I think we will see more of that.
And then additionally, I also don't see the persona stuff mattering much for using AI systems for alignment research purposes while they are not capable of taking over the world. Like, in general I think we should train helpful-only models for that purpose, and in everyday work the persona stuff just doesn't really matter for getting work out of these systems.
I am definitely worried about AI systems having goals that instrumentally entail subverting oversight, etc.
Maybe it seems to you like splitting hairs if you're like "I'm worried about models with misaligned goals that instrumentally seek power" and I'm like "I'm worried about models that enact a misaligned persona which instrumentally seek power."
No, the opposite! It really doesn't feel like splitting hairs, the latter feels to me like a very unlikely source of catastrophic risk (while it has some relevance to present commercialization of AI, which I think is the reason why the labs are so interested in it).
The reason for this is that when you role-play the "misaligned persona", your cognitive patterns are not actually the result of being optimized for power-seeking behavior. You are still ultimately largely following the pretraining distribution, which means that your capabilities are probably roughly capped at a human level, and indeed the whole "all the bad attributes come together" thing suggests that the model is not optimizing hard for bad objectives. The best way to optimize hard for bad objectives is to pretend to be a maximally aligned model!
I have a bunch more thoughts here, but I feel like the basic shape of this argument is relatively clear. Eliezer has also written a bunch about this, about the importance of at least trying to separate out the "actor" from the "mask" and stuff like that.
Can you explain why you think this? Note that the "misaligned persona" model from Natural Emergent Misalignment from Reward Hacking in Production RL engaged in precisely the sort of research sabotage that would undermine alignment of future systems (see figure 2).
Sure! The short summary is:
Systems that sabotage the supervisors for emergent misaligned/role-playing/imitation reasons are not systems that I am worried about succeeding at sabotaging the supervisors. The systems that I am worried about will do so for instrumentally convergent reasons, not because they are pretending to be a racist-evil-supervillain.
I don't really understand why you're deploying impacts on the current usefulness of AI systems as evidence against misaligned personas being a big deal for powerful systems, but not deploying this same evidence against whatever you think the most scary threat model is (which I'm guessing you don't think is impacting the current usefulness of AI systems).
The thing I am saying is that for the purpose of these systems being helpful on the object level for alignment research, emergent misalignment just doesn't really matter. It comes up a bit, but it doesn't explain much of the variance in the performance of these systems on any alignment-adjacent tasks, and as I said, I expect emergent misalignment issues to become less important over time (substantially because RL-dominated training will dampen the effect of personas and the pretraining distribution, but also for a bunch of other reasons).
In both cases I am saying that emergent misalignment stuff is a fun thing to study to get a better sense of the training dynamics here, but does not in itself constitute a meaningful risk model or something that matters much on the object level, whether for risks or for benefits.
I think it's pretty seriously unreadable. Like, most of it is vague big metaphors that fail to explain anything mechanistically.