Elliott Thornley (EJT)

More reasons to worry about relying on constraints:

As you say, your constraints might be insufficiently general ('nearest unblocked strategy,' etc. This seems like a big issue to me. People like Jesus and the Buddha seem to have gained huge amounts of influence without needing to violate any obvious deontological constraints.)
Your constraints might be insufficiently strong (e.g. maybe the constraints are strong enough to keep the AI compliant all throughout training but then the AI gets a really great opportunity in deployment...).
Your constraints might be just 'outer shell,' like humans' instinctual fear of heights (Barnett and Gillen). The AI might see them as an obstacle to overcome, rather than as a part of its terminal values.
Your constraints might actually be false beliefs that later get revised (e.g. that lying never pays)(Barnett and Gillen).
Your constraints might cause theoretical problems that motivate the AI to revise them away (e.g. money pumps, intransitivities, violations of the Independence of Irrelevant Alternatives, implausible dependence on which outcome is designated as the status quo, paralysis, trouble dealing with risk, arbitrariness of constraints' exact boundaries).
Your constraints might cause other misalignments (e.g. the AI wants to take extreme measures to prevent other agents from lying too).
Your constraints might make the AI incapable (e.g. they might falsify the strategy-stealing assumption, or make AIs too timid [e.g. maybe the AI will be extremely reluctant to say anything it's not absolutely certain of]).
Your constraints might fail to motivate the AI to do good alignment work (e.g. the AI produces alignment slop).
Your constraints might make the AI bad at moral philosophy (and we might need AI-powered moral philosophy to get a really good future).

Great post! Tiny thing: is the speed prior really best understood as a prior? Surely the only way in which being slow can count against a cognitive pattern is if being slow leads to lower reward. And in that case it seems like speed is a behavioral selection pressure rather than a prior.

Yeah risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we'll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who likely would be more confident of successful rebellion than getting paid for cooperation).

They don't have to be short-term oriented! Their utility function could be:

Where $f$ is some strictly concave function and $p_{i}$ is the agent's payment at time $i$ . Agents with this sort of utility function don't discount the future at all. They care just as much about improvements to $p_{i}$ regardless of whether $i$ is 1 or 1 million. And yet, for the right kind of $f$ , these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.

My sincere apologies for the delayed reply

No problem! Glass houses and all that.

You are presented with a button that says "trust humans not to mess up the world" and one that says "ensure that the world continues to exist as it does today, and doesn't get messed up". You'll push the second button!

Sure but this sounds like a case in which taking over the world is risk-free. The relevant analogy would be more like:

Choose between 'Trust humans not to mess up the world' and '50% chance of immediate death, 50% chance you ensure the world continues to exist as it does today and doesn't get messed up.'

And then depending on what the agent is risk-averse with respect to, they might choose the former. If they're risk-averse with respect to consumption at a time but risk-neutral with respect to length of life, they'll choose the latter. If they're risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they'll choose the former.

Perhaps the best approach is to build something that isn't (at least, isn't explicitly/directly) an expected utility maximizer. Then the challenge is to come up with a way to build a thing that does stuff you want without even having that bit of foundation.

Yep, this is what I try to do here!

This seems likely harder than the world where the best approach is a clever trick that fixes it for expected utility maximizers.

I think that's reasonable on priors, but these papers plus the empirical track record suggests there's no clever trick that makes EUMs corrigible.

Yeah so you might think 'Given perfect information, no agent would have a preferential gap between any two options.' But this is quite a strong claim! And there are other plausible examples of preferential gaps even in the presence of perfect information, e.g. very different ice cream flavors:

Consider a trio of ice cream flavors: buttery and luxurious pistachio, bright and refreshing mint, and that same mint flavor further enlivened by chocolate chips. You might lack a preference between pistachio and mint, lack a preference between pistachio and mint choc chip, and yet prefer mint choc chip to mint.

Note also that if we adopt a behavioral definition of preference, the existence of preferential gaps is pretty much undeniable. On other definitions, their existence is deniable but still very plausible.

Oh nice! I like this idea. Let's talk about it more tomorrow.

With the Work or Steal example specifically, we want the agent to choose Work because that's best for the user's expected bank balance. With the drones example, things are different. POST-agents will start building the drone if that has highest EV, taking the probability distribution over trajectory-lengths as fixed. That's a downside, but it's minor so long as the POST-agent builds the drone in a cheap, overt way, because then we can just notice and shut the agent down.

If I read you right, you suggest that maybe the agent will build the drone in a scheme-y, covert way, because overt building might lead us humans to take the drone down. But covert building in this situation seems fairly unlikely to me. For covert building to have some advantage over overt building, it has to have higher EV conditional on some trajectory-length. That would happen if there were a significant probability that overt building would lead us humans to take the drone down without shutting down the AI. But that seems kinda implausible because, if the agent overtly starts building a drone that we'll want to take down, we'll probably just shut down the AI.

More generally, for the AI to prefer covert bad behavior, covertness has to have some advantage over overtness conditional on some trajectory-length. And -- I think -- it will only have that advantage if overt bad behavior would lead us humans to fight the AI but not it shut it down. But that seems unlikely. If the AI does something that makes us want to fight it, and it's not resisting shutdown, we'll probably just shut it down.

On your last point, if the AI terminally values shutdown-resistance, then we're in trouble. I think plausibly impediment-avoidance would generalize to shutdown-resistance if we weren't training the AI to have any attitudes to shutdown elsewhere in the training process. But I think if we're training the agent to satisfy POST and Neutrality+ then I expect impediment-avoidance not to generalize to shutdown-resistance. One useful example here might be backdoors. If you just finetune your model to behave badly on a trigger like '2024', it might also generalize to behaving badly on a trigger like '2023'. But if you finetune your model to behave badly given '2024' and behave well given '2023', you can get the bad behavior to stay limited to the '2024' trigger.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments