AI will change the world, but won’t take it over by playing “3-dimensional chess”.
By Boaz Barak and Ben Edelman

[Cross-posted on the Windows on Theory blog; see also Boaz's posts on longtermism and AGI via scaling, as well as other "philosophizing" posts.]

[Disclaimer: Predictions are very hard, especially about the future. In fact, this is one of the points of this essay. Hence, while...
Ah, I think I understand what you meant now. The reward for this agent is not determined by the actual long-term consequences of its action, but by the predicted long-term consequences. In that case, yes, this seems like it might be an interesting middle ground between what we are calling short-term and long-term AIs. It still feels closer to a long-term agent to me, though, and I'm confused about why you think it would both (a) not plan ahead of time to disempower humans, and (b) disempower humans when it has the chance. If the predictive model is accurate enough that disempowering humans is predictably instrumentally useful, wouldn't the agent incorporate that into its earlier plans?
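To make the contrast concrete, here is a minimal toy sketch in Python of the two reward schemes under discussion: scoring an action by its actual long-run consequences, versus scoring it immediately by a fixed predictive model's forecast of those consequences. All names here (ToyEnv, Predictor, and the reward functions) are hypothetical stand-ins invented for illustration, not anything defined in the essay.

```python
import random

# Toy sketch only: the environment, predictor, and reward functions below are
# hypothetical stand-ins for illustration; none of them come from the essay.

class ToyEnv:
    """Trivial dynamics: each step, the action nudges a scalar state."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def rollout(self, state, action, horizon=100):
        """Simulate what *actually* happens over `horizon` steps and return
        the cumulative outcome."""
        total = 0.0
        for _ in range(horizon):
            state = state + action + self.rng.gauss(0.0, 0.1)
            total += state
        return total


class Predictor:
    """A fixed predictive model that forecasts the long-run outcome up front."""
    def forecast(self, state, action, horizon=100):
        # Closed-form guess at the rollout's return, ignoring the noise.
        return sum(state + action * t for t in range(1, horizon + 1))


def long_term_reward(env, state, action):
    # "Long-term" scheme: the agent is scored on the real downstream outcome,
    # which is only known after the full horizon has played out.
    return env.rollout(state, action)


def predicted_reward(predictor, state, action):
    # Middle-ground scheme discussed above: the score is assigned immediately,
    # from the predictor's forecast, before the real outcome ever unfolds.
    return predictor.forecast(state, action)


if __name__ == "__main__":
    env, predictor, state = ToyEnv(), Predictor(), 0.0
    for action in (-1.0, 0.0, 1.0):
        print(f"action={action:+.1f}  "
              f"actual={long_term_reward(env, state, action):8.1f}  "
              f"predicted={predicted_reward(predictor, state, action):8.1f}")
```

In this toy the two scores nearly agree because the predictor happens to match the noiseless dynamics; the question raised above concerns exactly that regime, where a sufficiently accurate predictor would presumably foresee, and therefore reward in advance, any instrumentally useful move.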