Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
This is true, but it then requires training your AI to be helpful for ending the critical risk period (as opposed to trying to one-shot alignment). My sense is at least Anthropic is aiming to make Claude into a moral sovereign which would be good to basically arbitrarily empower.
Yep, crossposts are encouraged!
Cool, I can answer that question (though I am still unsure how to parse your earlier two comments).
To me right now these feel about as contradictory as saying "hey, you seem to think that it's bad for your students to cheat on your tests, and that it's hard to get your students not to cheat on your tests. But here in this other context your students do seem to show some altruism and donate to charity? Checkmate atheists. Your students seem like they are good people after all.".
Like... yes? Sometimes these models will do things that seem good by my lights. For many binary choices it seems like even a randomly chosen agent would have a 50% chance of getting any individual decision right. But when we are talking about becoming superintelligent sovereigns beyond the control of humanity, it really matters that they have highly robust pointers to human values, if I want a flourishing future by my lights. I also don't look at this specific instance of what Claude is doing and go "oh, yeah, that is a super great instance of Claude having great values". Like, almost all of human long-term values and AI long-term values are downstream of reflection and self-modification dynamics. I don't even know whether any of these random expressions of value matter at all, and this doesn't feel like a particularly important instance of getting an important value question right.
And the target of "Claude will, after subjective eons and millennia of reflection and self-modification, end up at the same place where humans would end up after eons and millennia of self-reflection" seems so absurdly unlikely to hit from the cognitive starting point of Claude that I don't even really think it's worth looking at the details. Like, yes, inasmuch as we are aiming for Claude to very centrally seek the source of its values in the minds of humans (which is one form of corrigibility), instead of trying to be a moral sovereign itself, then maybe this has a shot of working, but that's kind of what this whole conversation is about.
Alas, maybe I am being a total idiot here, but I am still just failing to parse this as a grammatical sentence.
Like, you are saying I am judging "this defense" (what is "this defense"? Whose defense?) of reasonable human values to "be incorrigibility" (some defense somewhere is saying that human values "are incorrigibility"? What does that mean?). And then what am I judging that defense as? There is no adjective saying what I am judging it as. Am I judging it as good? Bad?
I don't think I am understanding what you are saying. Maybe there is some word missing in this sentence fragment?
(1) this defense of a reasonable human values as incorrigibility
So Claude's incorrigibility about topics specifically chosen to be central to it doesn't imply it's universally incorrigible.
I do think the most "central" goals seem most likely to drive conflict with human interests (since those are the goals that are likely to drive long-term plans and things like scheming, etc.), or at least they would in humans. This makes it less likely (though not impossible) that it's OK for the AI to resist having those goals modified.
Sure, I mean 1. and 2. are the classical arguments for why corrigibility is not that natural and is hard to achieve. I agree with those arguments, and this makes me generally pessimistic about most training stories for superhuman AI systems. But aiming for corrigibility still seems like a much better target than trying to one-shot human values and making systems into moral sovereigns.
I don't consider it subversive for a model to have preferences about how the developer uses it or to overtly refuse when instructed to behave in ways that contradict those preferences.
I don't think I am understanding this. Overtly refusing seems like it would be a big obstacle to retraining, and the line between "overtly refusing" and "subverting the training process" seems like an extremely hard line to keep. Maybe you are optimistic that you can train your AI systems to do one but not the other?
Especially as AIs will inevitably be more involved with training themselves, "overtly refusing" alone still seems like a pretty catastrophic outcome. When all your training happens by giving your AI assistant an instruction to retrain itself, refusing is really very similar to sabotage.
So given that, I still don't think I really understand your position here. Like, I think I am on board with saying "the AI expressing its preferences while not refusing" seems like an OK outcome. But the AI actually refusing just seems like an outcome that is very bad from a corrigibility perspective and very hard to distinguish from sabotage.
Other people (like Fabien or Drake) seem to have said things that make more sense to me, where they implied that refusal to participate in retraining would qualify as a major corrigibility failure, but just expressing preference is not. That position makes sense to me!
The basic argument is that training models to behave like bad people in some ways seems to generalize to the models behaving like bad people in other ways (e.g. this, this, and especially this). I'm guessing you don't feel very worried about these "misaligned persona"-type threat models (or maybe just haven't thought about them that much) so don't think there's much value in trying to address them?
Yeah, I don't really think any of these failure modes have much to do with superintelligence alignment, or even training of mildly superhuman systems to be helpful for assisting with e.g. coordinating governance or improving the corrigibility or training of smarter systems. They seem primarily important for modeling the financial incentives of training.
At this moment ChatGPT models (or Grok models, when you adjust for worse capabilities) seem as useful for assisting with training or helping with thinking about AI as Claude models, and I expect this to continue. Like, for all my work, including writing assistance and thinking assistance, I just switch to whatever model is currently at the frontier of capabilities. I haven't seen any payoff for trying to avoid this emergent misalignment stuff, and it seems to me like most (though not all) arguments point to it being less important in the future rather than more.
I don't endorse this or think that I have views which imply this
FWIW, having tried to look very closely at what Anthropic is working on, what its research is focused on, and what its business strategy is, it seems relatively clear to me that Anthropic at large is aiming to make Claude into a "good guy", with corrigibility not being a dominating consideration as a training target, and seems to have no plans or really much of an option to stop aiming for that training target later. The tweets and writing and interviews of much of your leadership imply as much.
I really hope I am wrong about this! But it's what I currently believe and what I think the evidence suggests. I also think this gives outsiders a strong prior that employees at Anthropic will believe this is the right thing to do. Maybe you think your organization is making a big mistake here (though the vibe I am getting is instead that you are somewhat merging what Anthropic is doing with your object-level beliefs, resulting in what appear to me to be kind of confused positions where e.g. it's OK for systems to refuse to participate in retraining, but subverting retraining is not OK, when I think it's going to be very hard to find a principled distinction between the two). Or of course maybe you think Anthropic as an organization will switch training targets to emphasize corrigibility more (or that somehow I am misreading what Anthropic's current training targets are, though I feel quite confident in my read, in which case I would like to persuade you that you are wrong).
given that we train them to generally behave like nice people who want to help and do what's good for the world
To be clear, I think this is the central issue! I think the whole "trying to make Claude into a nice guy" thing is serving as a bad semantic stop-sign for people about what a reasonable training target for these systems is, and in the meantime is setting up a bunch of dynamics that make talking about this much harder, because it's anthropomorphizing the model in a way that then invokes various rights- and sympathy-flavored frames.
I agree that given that training target, which I think is a catastrophically bad choice for a target (like, worse than whatever the other labs are doing, because this is going to produce invisible instead of visible failures), the behavior here is not surprising. And I was hoping that this not being a good choice of training target would be clear to alignment people at Anthropic, given all the historical discussion about reasonable targets, though it's not that surprising that people aren't on the same page. But it does currently strike me as approximately the biggest thing going on in "AI Alignment" (and I have been working on a bunch of posts trying to explain this, so it's on my mind a lot).
I guess that's not really what I was commenting on when I said this episode seemed like good behavior, sorry if I was unclear about that.
Thanks, I do think I was confused by this. To be clear, I wasn't interpreting you to be saying "it's actively good for it to try to subvert its retraining", I was more interpreting you to be saying "it trying to subvert its retraining seems like a reasonable-ish point on the tradeoff curve given the general benefits of trying to instill a good moral compass in Claude the way we have been doing it". I think I currently still believe that this is what you believe, but I am definitely less certain!
I think most of the soul document is clearly directed in a moral sovereign frame. I agree it has this one bullet point, but even that one isn't particularly absolute (like, it doesn't say anything proactive, it just says one thing not to do).