I'm confused about how you're thinking about the space of agents, such that "maybe we don't need to make big changes"?
I just mean that I don't plan for corrigibility to scale that far anyway (see my other comment), and maybe we don't need a paradigm shift to get to the level we want, so it's mostly small updates from gradient descent. (Tbc, I still think there are many problems, and I worry the basin isn't all that real so multiple small updates might lead us out of the basin. It just didn't seem to me that this particular argument would be a huge dealbreaker if the rest works out.)
What actions help you land in the basin?
Clarifying the problem first: let's say we have actor-critic model-based RL. Then our goal is that the critic is a function on the world model that measures something like how empowered the principal is in the short term, i.e. it assigns high valence to predicted outcomes in which the principal is empowered.
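To make that a bit more concrete, here's a minimal sketch of what I have in mind (purely illustrative; `WorldModel.predict_next` and the rest of the API are made-up names, not anyone's actual setup):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Scores latent world-model states. The hope is that what it ends up
    measuring is something like short-horizon principal empowerment."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.value_head = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, latent_state: torch.Tensor) -> torch.Tensor:
        # high valence <=> predicted outcome in which the principal is empowered
        return self.value_head(latent_state)

def plan_valence(world_model, critic, latent_state, actions):
    """Roll a candidate plan forward in the world model (short horizon only)
    and sum the critic's valence over the predicted latent states."""
    total = 0.0
    for a in actions:
        latent_state = world_model.predict_next(latent_state, a)  # hypothetical API
        total = total + critic(latent_state)
    return total
```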
One thing we want to do is make it less likely that some other function that also fits the reward signal well gets learned instead. E.g.:
We also want very competent overseers that understand corrigibility well and give rewards accurately, rather than e.g. rewarding nice extra things the AI did that you didn't ask for.
And then you also want to use some thought monitoring. If the agent doesn't reason in CoT, we might still be able to train some translators on the neuralese. We can:
Tbc, this is just how you may get into the basin. It may become harder to stay in it, because (1) the AI learns a better model of the world and there are simple functions on the world model that perform better (e.g. get more reward), and (2) the corrigibility learned may be brittle and imperfect and might still cause subtle power seeking because it happens to still be instrumentally convergent, or something like that.
The AI reasons more competently in corrigible ways as it becomes smarter, falling deeper into the basin.
The AI doesn't fall deeper into the basin by itself; it only happens because of humans fixing problems.
If the AI helps humans to stay informed and asks about their preferences in potential edge cases, does that count as the humans fixing flaws?
Also some more thoughts on that point:
But this is a different view of mindspace; there is no guarantee that small changes to a mind will result in small changes in how corrigible it is, nor that a small change in how corrigible something is can be achieved through a small change to the mind!
As a proof of concept, suppose that all neural networks were incapable of perfect corrigibility, but capable of being close to perfect corrigibility, in the sense of being hard to seriously knock off the rails. From the perspective of one view of mindspace we're "in the attractor basin" and have some hope of noticing our flaws and having the next version be even more corrigible. But in the perspective of the other view of mindspace, becoming more corrigible requires switching architectures and building an almost entirely new mind — the thing that exists is nowhere near the place you're trying to go.
Now, it might be true that we can do something like gradient descent on corrigibility, always able to make progress with little tweaks. But that seems like a significant additional assumption, and is not something that I feel confident is at all true. The process of iteration that I described in CAST involves more deliberate and potentially large-scale changes than just tweaking the parameters a little, and with big changes like that I think there's a big chance of kicking us out of "the basin of attraction."
Idk, this doesn't really seem to me like a strong counterargument. When you make a bigger change, you just have to be really careful that you land in the basin again. And maybe we don't need big changes.
That said, I'm quite uncertain about how stable the basin really is. I think a problem is that sycophantic behavior will likely get a bit higher reward than corrigible behavior for smart AIs. So there are 2 possibilities: (1) corrigibility nevertheless wins out, or (2) sycophancy wins out because of its slightly higher reward.
My uncertain guess is that (2) would by default likely win out under normal training for corrigible behavior. But maybe we could make (1) more likely by using something like IDA? And in actor-critic model-based RL we could also stop updating the critic at the point where we think the AI might be capable of sycophancy smart enough to win out against corrigibility, while letting the world model and actor still become a bit smarter.
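As a sketch of the "stop updating the critic" idea (again just illustrative; the loss methods and optimizer names are assumptions, not an actual training stack):

```python
def training_step(world_model, actor, critic, batch, optimizers, freeze_critic: bool):
    """One update step for actor-critic model-based RL, with an option to
    freeze the critic once we worry that further reward-fitting would favor
    sycophancy over corrigibility, while world model and actor keep improving."""
    wm_loss = world_model.prediction_loss(batch)   # hypothetical API
    optimizers["world_model"].zero_grad()
    wm_loss.backward()
    optimizers["world_model"].step()

    # actor is still trained, but only against the (possibly frozen) critic
    actor_loss = actor.policy_loss(batch, critic)  # hypothetical API
    optimizers["actor"].zero_grad()
    actor_loss.backward()
    optimizers["actor"].step()

    if not freeze_critic:
        critic_loss = critic.value_loss(batch)     # fit critic to the reward signal
        optimizers["critic"].zero_grad()
        critic_loss.backward()
        optimizers["critic"].step()
```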
And then there's of course the problem of how we land in the basin in the first place. I still need to think about what a good approach for that would look like, but it doesn't seem implausible to me that we could try in a reasonable way and hit it.
Nice post! And being scared of minus signs seems like a nice lesson.
Absent a greater degree of theoretical understanding, I now expect the feedback loop of noticing and addressing flaws to vanish quickly, far in advance of getting an agent that has fully internalized corrigibility such that it's robust to distributional and ontological shifts.
My motivation for corrigibility isn't that it scales all that far, but that we can more safely and effectively elicit useful work out of corrigible AIs than out of sycophants/reward-on-the-episode-seekers (let alone schemers).
E.g. current approaches to corrigibility still rely on short-term preferences, but when the AI gets smarter and its ontology drifts so that it sees itself as an agent embedded in multiple places in greater reality, short-term preferences become much less natural. This probably-corrigibility-breaking shift already happens around Eliezer level if you're trying to use the AI to do alignment research. Doing alignment research makes it more likely that such breaks occur earlier, also because the AI would need to reason about stuff like "what if an AI reflects on itself in this dangerous value-breaking way", which is sorta close to the AI reflecting on itself in that way. Not that it's necessarily impossible to use corrigible AI to help with alignment research, but we might be able to get a chunk further in capability if we make the AI not think about alignment stuff and instead just focus on e.g. biotech research for human intelligence augmentation, and that generally seems like a better plan to me.
I'm pretty unsure, but I currently think that if we tried not too badly (by which I mean much better than any of the leading labs seem on track to try, but without requiring fancy new techniques), we may have something like a 10-75%[1] chance of getting a +5.5SD corrigible AI. And if a leading lab is sane enough to try a well-worked-out proposal here and it works, it might be quite useful to have +5.5SD agents inside the labs that want to empower the overseers and can at least tell them that all the current approaches suck and we need to aim for international cooperation to get a lot more time (and then maybe human augmentation). (Rather than having sycophantic AIs that just tell the overseers what they want to hear.)
So I'm still excited about corrigibility even though I don't expect it to scale.
Restructuring it this way makes it more attractive for the AI to optimize things according to typical/simple values if the human's action doesn't sharply identify their revealed preferences. This seems bad.
The way I would interpret "values" in your proposal is like "sorta-short-term goals a principal might want to get fulfilled". I think it's probably fine if we just learn a prior over what sort of sorta-short-term goals a human may have, and then use that prior instead of Q. (Or not?) If so, this notion of power seems fine to me.
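To gesture at what I mean (a rough sketch in my own ad-hoc notation, not your actual definitions): something like

$$\mathrm{Power}(s) \;\approx\; \mathbb{E}_{g \sim P_{\text{goals}}}\!\left[\max_{\pi}\ \mathbb{E}\!\left[\,U_g(\tau)\mid s,\pi\,\right]\right]$$

where $P_{\text{goals}}$ is the learned prior over sorta-short-term goals the principal might have, $\pi$ ranges over policies available to the principal, $\tau$ is the resulting trajectory, and $U_g$ measures how well $\tau$ fulfills goal $g$, i.e. roughly the shape I'd want, with the learned prior in place of Q.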
(If you have time, I also would be still interested in your rough take on my original question.)
(wide range because I haven't thought much about it yet)
Behavioral science of generalization. The first is just: studying AI behavior in depth, and using this to strengthen our understanding of how AIs will generalize to domains that our scalable oversight techniques struggle to evaluate directly.
- Work in the vicinity of “weak to strong” generalization is a paradigm example here. Thus, for example: if you can evaluate physics problems of difficulty level 1 and 2, but not difficulty level 3, then you can train an AI on level 1 problems, and see if it generalizes well to level 2 problems, as a way of getting evidence about whether it would generalize well to level 3 problems as well.
- (This doesn’t work on schemers, or on other AIs systematically and successfully manipulating your evidence about how they’ll generalize, but see discussion of anti-scheming measures below.)
I don't think this just fails with schemers. A key problem is that it's hard to distinguish whether you're measuring "this alignment approach is good" or "this alignment approach looks good to humans". If it's the latter, it looks great on levels 1 and 2, but the approach doesn't actually work at level 3. I unfortunately expect that if we train AIs to evaluate what is good alignment research, they will more likely learn the latter. (This problem seems related to ELK.)
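As a toy illustration of the worry (purely illustrative Python; the evaluator and dataset names are made up): a model that has learned "looks good to the evaluator" rather than "is good" can pass the level-1-to-level-2 check and still fail at level 3, because the check only ever compares against evaluations we can do.

```python
def weak_to_strong_check(train_fn, eval_fn, problems_by_level):
    """The protocol from the quote: train on level-1 problems (evaluable),
    measure generalization to level 2 (also evaluable), and treat that as
    evidence about level 3 (not directly evaluable)."""
    model = train_fn(problems_by_level[1])
    return eval_fn(model, problems_by_level[2])

# The worry: eval_fn is itself a human-level evaluator. A model that learned
# "produce answers that look good to the evaluator" scores well here, and the
# check can't distinguish it from a model that learned "produce correct answers"
# -- but only the latter keeps working at level 3, where the evaluator is blind.
```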
they treat incidents of weird/bad out-of-distribution AI behavior as evidence alignment is hard, but they don’t treat incidents of good out-of-distribution AI behavior as evidence alignment is easy.
I don't think Nate or Eliezer were expecting to see bad cases this early, and I don't think seeing bad cases updated them much further towards pessimism - they were already pretty pessimistic before. I don't think they update in a non-Bayesian way, as you seem to suggest; it's just that AIs being nice in new circumstances isn't much evidence for alignment being easy given their models.
I think behavior generalization is a bad frame for thinking about what really smart AIs will do. You rather need to think in terms of optimization / goal-directed reasoning. E.g. if you imagine a reward maximizer, it's not at all surprising that it works well while it cannot escape control measures, but once it's smart enough that it can, it's also not surprising that it will.
Thanks for writing up your views in detail!
On corrigibility:
Corrigibility was originally intended to mean that a system with that property does not run into nearest-unblocked-strategy problems, i.e. it avoids the kind of adversarial dynamic that exists between deontological and consequentialist preferences. In your version, the consequentialist planning to fulfill a hard task given by the operators is at odds with the deontological constraints.
I also think it is harder to get robust deontological preferences into an AI than one would expect given human intuitions. The human reward system is wired in a way such that we robustly get positive reward for pro-social self-reflective thoughts. Perhaps we can have another AI monitor the thoughts of our main AI and likewise reward pro-social (self-reflective?) thoughts (although I think LLM-like AIs would likely be self-reflective in a rather different way than humans). However, I think for humans our main preferences come from such approval-directed self-reflective desires, whereas I expect the way people will train AI by default will cause the main smart optimization to aim for object-level outcomes, which are then more at odds with norm-following. (See this post, especially sections 2.3 and 3.) (And even for humans it's not quite clear whether deontological/norm-following preferences are learned deeply enough.)
So basically, I don't expect the AI's main optimization to end up robustly steering toward fulfilling deontological preferences; it's rather like trying to enforce deontological preferences by having a different AI monitor the AI's thoughts and constrain it to not think virtue-specification-violating thoughts. So when the AI gets sufficiently smart you get (1) nearest-unblocked-strategy problems, like disobedient thoughts that the monitoring AI can't interpret; and/or (2) collusion, if you didn't find some other way to make your AIs properly corrigible.
Deontological preferences aren't very natural targets for a steering process to steer toward. It somehow sometimes works out for humans because their preferences derive more from their self-image than from environmental goals. But if you try to train deontological preferences into an AI with current methods, it won't end up deeply internalizing them; rather, it will learn a belief that it should not think disobedient thoughts, or an outer-shell non-consequentialist constraint.
(I guess given that you acknowledge nearest-unblocked-strategy problems, you might sorta agree with this, though it still seems plausible to me that you overestimate how deep and generalizing trained-in deontological constraints would be.)
Myopic instruction-following is already going in a roughly good direction in terms of what goal to aim for, but I think if you give the AI a task, the steering process towards that task would likely not have a lot of nice corrigibility properties by default. E.g. it seems likely that, in steering toward such a task, it would see the possibility of the operator telling it to stop as an obstacle that would prevent the goal of task completion. (I mean, it's a bit ambiguous how exactly to interpret instruction-following here, but I think that's what you get by default if you naively train for it the way current labs would.)
It would be much nicer if the powerful steering machinery weren't steering in a way that would naturally disempower us (absent thought and control constraints) in the first place. I think aiming for CAST would be much better: basically, we want to point the powerful steering machinery towards a goal like "empower the principal", which then implies instruction-following and keeping the principal in control[1]. It also has the huge advantage that steering towards roughly-CAST may be enough for the AI to want to empower the principal more, so it may try to change itself into something like more-correct-CAST (aka Paul Christiano's "basin of corrigibility"). (But obviously the difficulties of getting the intended target instead of something like a reward-seeking AI still apply.)
I'm not totally sure whether it works robustly, but in any case it seems much much much better to aim for than something like Anthropic's HHH.
I heard you mention on the Doom Debates podcast that you're working on an audiobook but that it "may take a while". Could you give a quantitative guess for how long?
Do you count avoiding reward-on-the-episode-seekers as part of step 2 or step 3?
Thanks!
The single-timestep case actually looks fine to me now, so I return to the multi-timestep case.
I would want to be able to tell the AI to do a task, and then, while the AI is doing the task, tell it to shut down, so that it shuts down. The hard part here is that, while doing the task, the AI shouldn't prevent me in some way from telling it to shut down (because it would get higher utility if it manages to fulfill the values-as-inferred-through-principal-action of the first episode). This seems like it may require a somewhat different formalization than your multi-timestep one (although feel free to try it in your formalization).
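To gesture at the property I want (a very rough sketch in my own notation, definitely not a claim about your formalism): for every time $t'$ after the task starts, following the task-optimal policy shouldn't make a shutdown attempt by the principal less likely to succeed than doing nothing would, i.e. something like

$$\Pr\!\big(\text{shut down by } t' \mid \text{principal attempts shutdown},\ \pi^{*}_{\text{task}}\big) \;\ge\; \Pr\!\big(\text{shut down by } t' \mid \text{principal attempts shutdown},\ \pi_{\text{noop}}\big).$$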
Do you think your formalism could be extended so it works in the way we want for such a case, and why (or why not)? (And ideally also roughly how?)
(Btw, even if it doesn't work for the case above, I think this is still really excellent progress, and it does update me towards thinking that corrigibility is likely simpler and more feasible than I thought before. Also, thanks for writing up the formalism.)
Do you think sociopaths are sociopaths because their approval reward is very weak? And if so, why do they often still seek dominance/prestige?