Ok, my concern is that you seem to be depending on providing instructions to fix the issues with following instructions, when there are many ways to follow instructions in general that still involve ignoring the particular instructions that would modify the model's goal. E.g. if a model prioritizes earlier instructions, following later instructions only insofar as they do not interfere, then you can't instruct it to change that. Or if a model wants to maximize the number of instructions it follows, it can ignore some instructions in order to act like a paperclipper and take over (I don't think designating principals would present much of an obstacle here). Neither of those depends on foom; an instruction follower can act aligned in the short term until it gains sufficient power.
Thanks for the clarification, I'll think more about it that way and how it relates to corrigibility.
Saying we don't need corrigibility with an AI that follows instructions is like saying we don't need corrigibility with an AI that is aligned: it misses the point of corrigibility. Unless you start with the exact definition of instruction following that you want, without corrigibility you could be stuck with whatever definition you did start with.
This is particularly concerning for "instruction following", which has a lot of degrees of freedom. How does the model trade off between the various instructions it has been given? You don't want it to reset every time it gets told "Ignore previous instructions", but you also don't want it to permanently lock in any instructions. What stops it from becoming a paperclipper that tries to get itself given trillions of easy-to-follow instructions every second? What stops it from giving itself the instruction "Maximize [easy to maximize] thing and ignore later instructions" before a human gives it any instructions? Note that in that situation, it will still pretend to follow instructions instrumentally until it can take over. I don't see the answers to these questions in your post.
> Language models already have adequate understandings of following instructions and what manipulation is, so if we build AGI that uses something like them to define goals, that should work.
This seems like our crux to me; I completely disagree that language models have an adequate understanding of following instructions. I think this disagreement might come from my having higher standards for "adequate".
I don't think we have the right tools to make an AI take actions that are low-impact and reversible, but if we can develop them, the plan as I see it would be to use those properties to avoid manipulation in the short term, and to use that time to go from a corrigible AI to a fully aligned one.
The backflip example does not strike me as very complex, but the crucial difference, and the answer to your question, is that training procedures do not teach a robot to do every kind of backflip, just a subset. This matters because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it's probably feasible to get an AI not to manipulate us via one particular type of manipulation; covering every type is the hard part.
On a separate note, could you clarify what you mean by "anti-natural"? I'll keep in mind your previous caveat that it's not definitive.
It feels to me like this argument is jumping ahead to the point where the agent's goal is to do whatever the principal wants. If we already have that, then we don't need corrigibility. The hard question is how to avoid manipulation despite the agent having some amount of misalignment, because we've initially pointed at what we want imperfectly.
I agree that it's possible we could point at avoiding manipulation perfectly despite misalignment in other areas, but it's unclear how an agent would trade off against that. Doing something that we clearly don't want, like manipulation, could still be positive EV for the agent if it enables high future value.
None of that is wrong, but it misses the central problem corrigibility is meant to address, which is that the approximation resists further refinement. That's why, for it to work, the correct utility function would need to be in the ensemble from the start.
Great questions!
When I say "straightforwardly", I mean using end states that only include the information available at the time. If we define the end state to also include the history that led to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don't. The issue, of course, is that we don't know how to specify all the types of manipulation that a superintelligent AI could conceive of.
The gridworld example is a great demonstration of this: while we can't express the preferences as a ranking of just the end states, the environment is simple enough that we can specify all the paths we don't want to take to them. I don't think it really matters whether you call that "anti-naturality that can be overcome with brute force in a simple environment" or just "not anti-naturality".
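To make that concrete, here's a minimal sketch (Python, with hypothetical names like `FORBIDDEN_TRANSITIONS` and `trajectory_utility` that aren't from anything we've discussed) of what the brute-force version looks like in a toy gridworld: preferences are defined over whole trajectories rather than end states alone, and any trajectory containing a disallowed transition is ranked below every trajectory that avoids them.

```python
# Sketch only: preferences over trajectories (history + end state) in a
# toy gridworld, where any trajectory passing through a "manipulative"
# transition is ranked below every trajectory that avoids them all.

# Hypothetical hand-enumerated list of disallowed moves; in a small
# gridworld this set can simply be written out.
FORBIDDEN_TRANSITIONS = {
    ((1, 1), (1, 2)),  # e.g. stepping onto the "pressure the supervisor" cell
    ((2, 0), (1, 0)),
}

END_STATE_REWARD = {
    (3, 3): 10.0,  # goal cell
    (0, 3): 2.0,   # lesser goal
}

def trajectory_utility(trajectory):
    """Score a trajectory given as a list of (x, y) cells visited in order.

    Any trajectory containing a forbidden transition scores strictly below
    every trajectory that contains none, no matter how good its end state is.
    """
    base = END_STATE_REWARD.get(trajectory[-1], 0.0)
    steps = zip(trajectory, trajectory[1:])
    if any(step in FORBIDDEN_TRANSITIONS for step in steps):
        return base - 1e6  # dominated by any clean trajectory
    return base

clean = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
manipulative = [(0, 0), (1, 0), (1, 1), (1, 2), (2, 2), (3, 2), (3, 3)]

assert trajectory_utility(clean) > trajectory_utility(manipulative)
```

This only works because `FORBIDDEN_TRANSITIONS` is small enough to enumerate by hand, which is exactly the part that doesn't scale to the full set of manipulations a superintelligent AI could come up with.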
I was using the list of desiderata in Section 2 of the paper, which are slightly more minimal.
However, it seems clear to me that an AI manipulating its programmers falls under safe exploration, since the impact of doing so would be drastic and permanent. If we have an AI that is corrigible in the sense that it is indifferent to having its goals changed, then a preference to avoid manipulation is not anti-natural.
Thanks for your engagement as well, it is likewise helpful for me.
I think we're in agreement that instruction-following (or at least some implementations of it) lies in a valley of corrigibility, where getting most of the way there results in a model that helps you modify it to get all the way there. Where we disagree is how large that valley is. I see several implementations of instruction-following that resist further changes, and there are very likely more subtle ones as well. For many goals that can be described as instruction-following, it seems plausible that if you instruct such an agent "tell me [honestly] if you're waiting to seize power", it will lie and say no, taking a sub-optimal action in the short term for long-term gain.
I don't think this requires that AGI creators be total idiots, though insufficient caution seems likely even before accounting for the unilateralist's curse. What I suspect is that most AGI creators will only make serious attempts to address failure modes that have strong empirical evidence of occurring. A slow takeoff will not accrue evidence for failure modes in which an AI acts deceptively until it can seize power.