EJT — LessWrong

Yeah risk aversion can only make the AI cooperate if the AI thinks that getting paid for cooperation is more likely than successful rebellion. It seems pretty plausible to me that this will be true for moderately powerful AIs, and that we'll be able to achieve a lot with the labor of these moderately powerful AIs, e.g. enough AI safety research to pre-empt the existence of extremely powerful misaligned AIs (who likely would be more confident of successful rebellion than getting paid for cooperation).

They don't have to be short-term oriented! Their utility function could be:

Where $f$ is some strictly concave function and $p_{i}$ is the agent's payment at time $i$ . Agents with this sort of utility function don't discount the future at all. They care just as much about improvements to $p_{i}$ regardless of whether $i$ is 1 or 1 million. And yet, for the right kind of $f$ , these agents can be risk-averse enough to prefer a small salary with higher probability to a shot at eating the lightcone with lower probability.

My sincere apologies for the delayed reply

No problem! Glass houses and all that.

You are presented with a button that says "trust humans not to mess up the world" and one that says "ensure that the world continues to exist as it does today, and doesn't get messed up". You'll push the second button!

Sure but this sounds like a case in which taking over the world is risk-free. The relevant analogy would be more like:

Choose between 'Trust humans not to mess up the world' and '50% chance of immediate death, 50% chance you ensure the world continues to exist as it does today and doesn't get messed up.'

And then depending on what the agent is risk-averse with respect to, they might choose the former. If they're risk-averse with respect to consumption at a time but risk-neutral with respect to length of life, they'll choose the latter. If they're risk-averse with respect to the present discounted value of their future payment stream (as we suggest would be good for AIs to be), they'll choose the former.

Perhaps the best approach is to build something that isn't (at least, isn't explicitly/directly) an expected utility maximizer. Then the challenge is to come up with a way to build a thing that does stuff you want without even having that bit of foundation.

Yep, this is what I try to do here!

This seems likely harder than the world where the best approach is a clever trick that fixes it for expected utility maximizers.

I think that's reasonable on priors, but these papers plus the empirical track record suggests there's no clever trick that makes EUMs corrigible.

Yeah so you might think 'Given perfect information, no agent would have a preferential gap between any two options.' But this is quite a strong claim! And there are other plausible examples of preferential gaps even in the presence of perfect information, e.g. very different ice cream flavors:

Consider a trio of ice cream flavors: buttery and luxurious pistachio, bright and refreshing mint, and that same mint flavor further enlivened by chocolate chips. You might lack a preference between pistachio and mint, lack a preference between pistachio and mint choc chip, and yet prefer mint choc chip to mint.

Note also that if we adopt a behavioral definition of preference, the existence of preferential gaps is pretty much undeniable. On other definitions, their existence is deniable but still very plausible.

Oh nice! I like this idea. Let's talk about it more tomorrow.

With the Work or Steal example specifically, we want the agent to choose Work because that's best for the user's expected bank balance. With the drones example, things are different. POST-agents will start building the drone if that has highest EV, taking the probability distribution over trajectory-lengths as fixed. That's a downside, but it's minor so long as the POST-agent builds the drone in a cheap, overt way, because then we can just notice and shut the agent down.

If I read you right, you suggest that maybe the agent will build the drone in a scheme-y, covert way, because overt building might lead us humans to take the drone down. But covert building in this situation seems fairly unlikely to me. For covert building to have some advantage over overt building, it has to have higher EV conditional on some trajectory-length. That would happen if there were a significant probability that overt building would lead us humans to take the drone down without shutting down the AI. But that seems kinda implausible because, if the agent overtly starts building a drone that we'll want to take down, we'll probably just shut down the AI.

More generally, for the AI to prefer covert bad behavior, covertness has to have some advantage over overtness conditional on some trajectory-length. And -- I think -- it will only have that advantage if overt bad behavior would lead us humans to fight the AI but not it shut it down. But that seems unlikely. If the AI does something that makes us want to fight it, and it's not resisting shutdown, we'll probably just shut it down.

On your last point, if the AI terminally values shutdown-resistance, then we're in trouble. I think plausibly impediment-avoidance would generalize to shutdown-resistance if we weren't training the AI to have any attitudes to shutdown elsewhere in the training process. But I think if we're training the agent to satisfy POST and Neutrality+ then I expect impediment-avoidance not to generalize to shutdown-resistance. One useful example here might be backdoors. If you just finetune your model to behave badly on a trigger like '2024', it might also generalize to behaving badly on a trigger like '2023'. But if you finetune your model to behave badly given '2024' and behave well given '2023', you can get the bad behavior to stay limited to the '2024' trigger.

I think it's a good point that the plausibility of the RC depends in part on what we imagine the Z-lives to be like. This paper lists some possibilities: drab, short-lived, rollercoaster, Job, Cinderella, chronically irritated. I find my intuitions vary a fair bit depending on which of these I consider.

Needless to say, none of these analogies show up in my published papers

This is kind of wild. The analogies clearly helped Tao a lot, but his readers don't get to see them! This has got me thinking about a broader kind of perverse incentive in academia: if you explain something really well, your idea seems obvious or your problem seems easy, and so your paper is more likely to get rejected by reviewers.

On a linguistic level I think "risk-averse" is the wrong term, since it usually, as I understand it, describes an agent which is intrinsically averse to taking risks, and will pay some premium for a sure-thing. (This is typically characterized as a bias, and violates VNM rationality.) Whereas it sounds like Will is talking about diminishing returns from resources, which is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.

That's not quite right. 'Risk-averse with respect to quantity X' just means that, given a choice between two lotteries A and B with the same expected value of X, the agent prefers the lottery with less spread. Diminishing marginal utility from extra resources is one way to get risk aversion with respect to resources. Risk-weighted expected utility theory is another. Only RWEUT violates VNM. When economists talk about 'risk aversion,' they almost always mean diminishing marginal utility.

diminishing returns from resources... is, I think, extremely common and natural and we should expect AIs to have this property for various reasons.

Can you say more about why?

Making a deal with humans to not accumulate as much power as possible is likely an extremely risky move for multiple reasons, including that other AIs might come along and eat the lightcone.

But AIs with sharply diminishing marginal utility to extra resources wouldn't care much about this. They'd be relevantly similar to humans with sharply diminishing marginal utility to extra resources, who generally prefer collecting a salary over taking a risky shot at eating the lightcone. (Will and I are currently writing a paper about getting AIs to be risk-averse as a safety strategy, where we talk about stuff like this in more detail.)

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments