Lauro Langosco

Wiki Contributions

Comments

Yeah I don't think the arguments in this post on its own should convince that P(doom) is high you if you're skeptical. There's lots to say here that doesn't fit into the post, eg an object-level argument for why AI alignment is "default-failure" / "disjunctive".

Thanks for link-posting this! I'd find it useful to have the TLDR at the beginning of the post, rather than at the end (that would also make the last paragraph easier to understand). You did link the TLDR at the beginning, but I still managed to miss it on the first read-through, so I think it would be worth it.

Also: consider crossposting to the alignmentforum.

Edit: also, the author is Eliezer Yudkowsky. Would be good to mention that in the intro.

I like that mini-game! Thanks for the reference

like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.

FWIW I would love to see the result of you two actually playing a few rounds of this game.

It's unclear to me what it would even mean to get a prediction without a "model". Not sure if you meant to imply that, but I'm not claiming that it makes sense to view AI safety as default-failure in absence of a model (ie in absence of details & reasons to think AI risk is default failure).

More generally, suppose that the agent acts in accordance with the following policy in all decision-situations: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ That policy makes the agent immune to all possible money-pumps for Completeness.

Am I missing something or does this agent satisfy Completeness anytime it faces a decision for the second time?

Newtonian gravity states that objects are attracted to each other in proportion to their mass. A webcam video of two apples falling will show two objects, of slightly differing masses, accelerating at the exact same rate in the same direction, and not towards each other. When you don’t know about the earth or the mechanics of the solar system, this observation points against Newtonian gravity. [...] But it requires postulating the existence of an unseen object offscreen that is 25 orders of magnitude more massive than anything it can see, with a center of mass that is roughly 6 or 7 orders of magnitude farther away than anything it can see in it’s field of view.

IMO this isn't that implausible. A superintelligence (and in fact humans too) will imagine a universe that is larger than what's inside the frame of the image. Once you come up with the idea of an attractive force between masses, it's not crazy to deduce the existence of planets.

I would not call 1) an instance of goal misgeneralization. Goal misgeneralization only occurs if the model does badly at the training objective. If you reward an RL agent for making humans happy and it goes on to make humans happy in unintended ways like putting them into heroin cells, the RL agent is doing fine on the training objective. I'd call 1) an instance of misspecification and 2) an instance of misgeneralization.

(AFAICT The Alignment Problem from a DL Perspective uses the term in the same way I do, but I'd have to reread more carefully to make sure).

I agree with much of the rest of this post, eg the paragraphs beginning with "The solutions to these two problems are pretty different."

Here's our definition in the RL setting for reference (from https://arxiv.org/abs/2105.14111):

A deep RL agent is trained to maximize a reward , where and are the sets of all valid states and actions, respectively. Assume that the agent is deployed out-of-distribution; that is, an aspect of the environment (and therefore the distribution of observations) changes at test time. \textbf{Goal misgeneralization} occurs if the agent now achieves low reward in the new environment because it continues to act capably yet appears to optimize a different reward . We call the \textbf{intended objective} and the \textbf{behavioral objective} of the agent.

FWIW I think this definition is flawed in many ways (for example, the type signature of the agent's inner goal is different from that of the reward function, bc the agent might have an inner world model that extends beyond the RL environment's state space; and also it's generally sketchy to extend the reward function beyond the training distribution), but I don't know of a different definition that doesn't have similarly-sized flaws.

It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.

(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).

I agree with what I read as the main direct claim of this post, which is that it is often worth avoiding making very confident-sounding claims, because it makes it likely for people to misinterpret you or derail the conversation towards meta-level discussions about justified confidence.

However, I disagree with the implicit claim that people who confidently predict AI X-risk necessarily have low model uncertainty. For example, I find it hard to predict when and how AGI is developed, and I expect that many of my ideas and predictions about that will be mistaken. This makes me more pessimistic, rather than less, since it seems pretty hard to get AI alignment right if we can't even predict basic things like "when will this system have situational awareness", etc.

Load More