Great post! Tiny thing: is the speed prior really best understood as a prior? Surely the only way in which being slow can count against a cognitive pattern is if being slow leads to lower reward. And in that case it seems like speed is a behavioral selection pressure rather than a prior.
Yeah risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we'll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who likely would be more confident of successful rebellion than getting paid for cooperation).
They don't have to be short-term oriented! Their utility function could be:

$$U = f\left(\sum_t x_t\right)$$

where $f$ is some strictly concave function and $x_t$ is the agent's payment at time $t$. Agents with this sort of utility function don't discount the future at all. They care just as much about improvements to $x_t$ regardless of whether $t$ is 1 or 1 million. And yet, for the right kind of $f$, these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.
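Here's a toy sketch of that claim with made-up numbers, assuming a bounded, strictly concave $f$ like $1 - e^{-x/100}$:

```python
# Toy sketch (hypothetical numbers): an agent with utility U = f(sum of payments),
# where f is strictly concave and bounded, prefers a near-certain modest salary
# to a low-probability shot at an astronomically large payoff.
import math

def f(total_payment):
    # Strictly concave and bounded: utility saturates as total payment grows.
    return 1 - math.exp(-total_payment / 100)

def expected_utility(lottery):
    # lottery: list of (probability, total payment) pairs.
    return sum(p * f(x) for p, x in lottery)

# Cooperate: a salary of 50 per period for 100 periods, paid with probability 0.99.
salary = [(0.99, 50 * 100), (0.01, 0)]

# Rebel: a 20% shot at "eating the lightcone" (an astronomically large payment).
takeover = [(0.20, 1e30), (0.80, 0)]

print(expected_utility(salary))    # ~0.99: close to the utility ceiling
print(expected_utility(takeover))  # ~0.20: the huge payoff can't exceed the ceiling
```

Because $f$ is bounded, the astronomical payoff can't buy more than the utility ceiling, so the near-certain salary wins.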
My sincere apologies for the delayed reply
No problem! Glass houses and all that.
You are presented with a button that says "trust humans not to mess up the world" and one that says "ensure that the world continues to exist as it does today, and doesn't get messed up". You'll push the second button!
Sure but this sounds like a case in which taking over the world is risk-free. The relevant analogy would be more like:
And then depending on what the agent is risk-averse with respect to, they might choose the former. If they're risk-averse with respect to consumption at a time but risk-neutral with respect to length of life, they'll choose the latter. If they're risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they'll choose the former.
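To make the contrast concrete, here's a toy sketch with two hypothetical options standing in for the analogy (numbers made up): a guaranteed modest payment stream of fixed length, versus a coin-flip between a much longer payment stream and immediate shutdown.

```python
# Toy sketch: the same pair of options evaluated under two risk attitudes.
# Options and numbers are hypothetical stand-ins.
import math

# (probability, periods lived, payment per period)
safe   = [(1.0, 100, 10)]                 # guaranteed modest stream
gamble = [(0.5, 1000, 10), (0.5, 0, 0)]   # coin-flip: long stream vs immediate shutdown

def per_period_risk_averse(option):
    # Risk-averse w.r.t. consumption at a time, risk-neutral w.r.t. length of life:
    # sum a concave per-period utility over however many periods are lived.
    g = lambda c: 1 - math.exp(-c)
    return sum(p * periods * g(c) for p, periods, c in option)

def stream_risk_averse(option):
    # Risk-averse w.r.t. the value of the whole payment stream:
    # apply a concave function to the total payment.
    f = lambda x: 1 - math.exp(-x / 200)
    return sum(p * f(periods * c) for p, periods, c in option)

print(per_period_risk_averse(safe), per_period_risk_averse(gamble))  # ~100 vs ~500: gamble wins
print(stream_risk_averse(safe), stream_risk_averse(gamble))          # ~0.99 vs ~0.50: safe wins
```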
Perhaps the best approach is to build something that isn't (at least, isn't explicitly/directly) an expected utility maximizer. Then the challenge is to come up with a way to build a thing that does stuff you want without even having that bit of foundation.
Yep, this is what I try to do here!
This seems likely to be harder than the world where the best approach is a clever trick that fixes it for expected utility maximizers.
I think that's reasonable on priors, but these papers plus the empirical track record suggest there's no clever trick that makes EUMs corrigible.
Yeah so you might think 'Given perfect information, no agent would have a preferential gap between any two options.' But this is quite a strong claim! And there are other plausible examples of preferential gaps even in the presence of perfect information, e.g. very different ice cream flavors:
Consider a trio of ice cream flavors: buttery and luxurious pistachio, bright and refreshing mint, and that same mint flavor further enlivened by chocolate chips. You might lack a preference between pistachio and mint, lack a preference between pistachio and mint choc chip, and yet prefer mint choc chip to mint.
Note also that if we adopt a behavioral definition of preference, the existence of preferential gaps is pretty much undeniable. On other definitions, their existence is deniable but still very plausible.
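For what it's worth, here's a small sketch of the point implicit in the ice cream example: if the two lacks of preference were just indifference, and indifference is transitive (as it is for standard preference relations), we'd get a contradiction.

```python
# Small sketch: treating the "gaps" in the ice cream example as indifference
# and applying transitivity contradicts the strict preference in the example.

strict_preference = {("mint_choc_chip", "mint")}           # mint choc chip > mint
lacks_preference  = {("pistachio", "mint"),
                     ("pistachio", "mint_choc_chip")}       # the two "gaps"

# Hypothesis: a lack of preference is just indifference (a symmetric relation).
indifference = lacks_preference | {(b, a) for a, b in lacks_preference}

# Transitivity would then give mint ~ pistachio ~ mint_choc_chip,
# hence mint ~ mint_choc_chip ...
implied_indifference = (("mint", "pistachio") in indifference and
                        ("pistachio", "mint_choc_chip") in indifference)

# ... which contradicts the strict preference for mint choc chip over mint.
print(implied_indifference and ("mint_choc_chip", "mint") in strict_preference)
# True -> the hypothesis fails, so these look like preferential gaps, not indifference
```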
Oh nice! I like this idea. Let's talk about it more tomorrow.
With the Work or Steal example specifically, we want the agent to choose Work because that's best for the user's expected bank balance. With the drones example, things are different. POST-agents will start building the drone if that has highest EV, taking the probability distribution over trajectory-lengths as fixed. That's a downside, but it's minor so long as the POST-agent builds the drone in a cheap, overt way, because then we can just notice and shut the agent down.
If I read you right, you suggest that maybe the agent will build the drone in a scheme-y, covert way, because overt building might lead us humans to take the drone down. But covert building in this situation seems fairly unlikely to me. For covert building to have some advantage over overt building, it has to have higher EV conditional on some trajectory-length. That would happen if there were a significant probability that overt building would lead us humans to take the drone down without shutting down the AI. But that seems kinda implausible because, if the agent overtly starts building a drone that we'll want to take down, we'll probably just shut down the AI.
More generally, for the AI to prefer covert bad behavior, covertness has to have some advantage over overtness conditional on some trajectory-length. And -- I think -- it will only have that advantage if overt bad behavior would lead us humans to fight the AI but not shut it down. But that seems unlikely. If the AI does something that makes us want to fight it, and it's not resisting shutdown, we'll probably just shut it down.
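Here's a rough numeric version of that argument (all probabilities hypothetical): conditional on the AI not being shut down during the trajectory, covert building only beats overt building if overt building is likely to get the drone taken down while the AI keeps running.

```python
# Rough numeric sketch of the conditional-EV comparison above.
# All numbers are hypothetical; covert building is assumed to go unnoticed
# but to cost extra effort.

drone_value       = 10.0   # value to the agent of a completed, surviving drone
covert_extra_cost = 1.0    # extra cost of building covertly

# Conditional on a given trajectory-length (i.e. the AI isn't shut down first):
p_notice_overt              = 0.9    # humans notice overt building
p_takedown_without_shutdown = 0.05   # ...and take the drone down while leaving the AI running

ev_overt  = drone_value * (1 - p_notice_overt * p_takedown_without_shutdown)
ev_covert = drone_value - covert_extra_cost

print(ev_overt, ev_covert)  # 9.55 vs 9.0: overt wins unless takedown-without-shutdown is likely
```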
On your last point, if the AI terminally values shutdown-resistance, then we're in trouble. I think impediment-avoidance plausibly would generalize to shutdown-resistance if we weren't training the AI to have any attitudes to shutdown elsewhere in the training process. But if we're training the agent to satisfy POST and Neutrality+, I expect impediment-avoidance not to generalize to shutdown-resistance. One useful example here might be backdoors. If you just finetune your model to behave badly on a trigger like '2024', it might also generalize to behaving badly on a trigger like '2023'. But if you finetune your model to behave badly given '2024' and behave well given '2023', you can get the bad behavior to stay limited to the '2024' trigger.
I think it's a good point that the plausibility of the RC depends in part on what we imagine the Z-lives to be like. This paper lists some possibilities: drab, short-lived, rollercoaster, Job, Cinderella, chronically irritated. I find my intuitions vary a fair bit depending on which of these I consider.
More reasons to worry about relying on constraints: