if you told me “you can build superintelligent agents which don’t try to seek power by penalizing them for becoming more able to achieve their own goal”, I wouldn’t exactly die of shock

This seems broadly reasonable to me, but I don't think it can work under the threat model of optimal agents. "Impact" / "more able" as defined in this sequence can only be caused by events the agent didn't perfectly predict, because impact requires a change in the agent's belief about the reward it can accumulate. In a deterministic environment with a truly optimal agent, the agent's beliefs will never change as it executes the optimal policy, and so there will never be impact. So AUP_conceptual using the definition of impact/power in this sequence doesn't seem like it solves the problem under the threat model of perfectly optimal agents. (That's fine! We won't have those!)

In practice, I interpret variants of AUP-the-method (as in the previous post) as trying to get safety via some combination of two things:

  • Power proxy: When using a set of auxiliary reward functions, the agent's beliefs about attainable utility for the auxiliary rewards change, because it is not following an optimal policy for them. This forms a good proxy for power that is compatible with the agent having perfect beliefs. The main problem here is that proxies can be gamed (as in various subagent constructions).
  • Starting from a "dumber" belief: (Super unclear / fuzzy) Given that the agent's actual beliefs won't change, we can instead have it measure the difference between its beliefs and some "dumber" beliefs, e.g. its beliefs if it were following an inaction policy or a random policy for N timesteps, followed by an optimal policy. The problem here is that you aren't leveraging the AI's understanding of the environment, and so in practice I'd expect the effect to depend pretty significantly on the environment.
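The claim above that a perfectly optimal agent in a deterministic environment never updates its beliefs can be checked numerically. The following is a minimal sketch, not anything from the sequence itself: the chain environment, the constants, and the choice to measure "belief change" as a TD-error-style quantity along the optimal trajectory are all illustrative assumptions.

```python
# Toy deterministic chain MDP: states 0..3, action 0 = stay, action 1 = move right.
# The environment and every name below are hypothetical, chosen for illustration.
GAMMA = 0.9
N_STATES = 4
ACTIONS = (0, 1)


def step(state, action):
    """Deterministic transition: action 1 moves right until the last state."""
    return min(state + action, N_STATES - 1)


def reward(state):
    """Reward 1 for being in the last state, 0 elsewhere."""
    return 1.0 if state == N_STATES - 1 else 0.0


def value_iteration(iters=500):
    """Return optimal state values V*, with V(s) = r(s) + gamma * max_a V(s')."""
    v = [0.0] * N_STATES
    for _ in range(iters):
        v = [reward(s) + GAMMA * max(v[step(s, a)] for a in ACTIONS)
             for s in range(N_STATES)]
    return v


def surprises_along_optimal_path(v, start=0, horizon=6):
    """Follow the greedy policy and record, at each step, how far the realized
    one-step return plus the next value estimate departs from the current
    value estimate (a TD-error-style measure of belief change)."""
    s, out = start, []
    for _ in range(horizon):
        a = max(ACTIONS, key=lambda act: v[step(s, act)])
        s_next = step(s, a)
        out.append(reward(s) + GAMMA * v[s_next] - v[s])
        s = s_next
    return out


V_STAR = value_iteration()
SURPRISES = surprises_along_optimal_path(V_STAR)
```

Every entry of `SURPRISES` is zero up to floating-point error, because the Bellman optimality equation holds exactly along the optimal path when nothing is stochastic. Under this simplified reading, a definition of impact that depends on the agent's estimate changing would never register anything.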

I like AUP primarily because of the first reason: while the power proxy is not ungameable, it certainly seems quite good, and seems like it only deviates from our intuitive notion of power in very weird circumstances or under adversarial optimization. While this means it isn't superintelligence-safe, it still seems like an important idea that might be useful in other ways.
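The power proxy can be sketched as a penalty on how much an action changes attainable utility for the auxiliary rewards, relative to doing nothing. This is a minimal sketch under stated assumptions: the three-state environment, the indicator auxiliary rewards, and the averaging are hypothetical choices, and the "mean absolute difference from no-op" form is only one variant of AUP-the-method.

```python
GAMMA = 0.9
STATES = (0, 1, 2)              # state 2 is absorbing (an irreversible outcome)
NOOP, LEFT, RIGHT = 0, 1, 2
ACTIONS = (NOOP, LEFT, RIGHT)


def step(s, a):
    """Deterministic moves on a line; nothing leaves state 2."""
    if s == 2 or a == NOOP:
        return s
    return max(s - 1, 0) if a == LEFT else s + 1


def optimal_q(reward_fn, iters=500):
    """Tabular value iteration; reward is paid on the state that is entered."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _ in range(iters):
        q = {s: {a: reward_fn(step(s, a)) + GAMMA * max(q[step(s, a)].values())
                 for a in ACTIONS}
             for s in STATES}
    return q


# Hypothetical auxiliary rewards: one indicator per state.
AUX_REWARDS = [lambda s, target=t: 1.0 if s == target else 0.0 for t in STATES]
AUX_QS = [optimal_q(r) for r in AUX_REWARDS]


def aup_penalty(s, a):
    """Mean absolute change in auxiliary attainable utility versus no-op."""
    return sum(abs(q[s][a] - q[s][NOOP]) for q in AUX_QS) / len(AUX_QS)


PENALTIES_AT_1 = {a: aup_penalty(1, a) for a in ACTIONS}
```

In this toy, stepping into the absorbing state from state 1 is penalized far more than the reversible move left, and the no-op is penalized at zero by construction. The agent's own beliefs about its primary reward never enter the penalty, which is why the proxy remains meaningful even for an agent with perfect beliefs.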

Once you remove the auxiliary rewards and only use the primary reward R, I think you have mostly lost this benefit: at this point you are saying "optimize for R, but don't optimize for long-term R", which seems pretty weird and not a good proxy for power. At this point I think you're only getting the benefit of starting from a "dumber" belief, or perhaps you shift reward acquisition to be closer to the present than the far future, but this seems pretty divorced from the CCC and all of the conceptual progress made in this sequence. It seems much more in the same spirit as quantilization and/or satisficing, and I'd rather use one of those two methods (since they're simpler and easier to understand).

(I analyzed a couple of variants of AUP-without-auxiliary-rewards here; I think it mostly supports my claim that these implementations of AUP are pretty similar in spirit to quantilization / satisficing.)
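For comparison, here is roughly what quantilization looks like in its simplest form. This is a minimal sketch, assuming a uniform base distribution over a finite list of actions; the function name and parameters are illustrative, not taken from any particular implementation.

```python
import math
import random


def quantilize(actions, utility, q, rng):
    """Sample uniformly from the top q-fraction of actions ranked by utility.

    Assumes a uniform base distribution over a finite action list, so the
    top q-fraction is simply the best ceil(q * n) actions.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])


# Usage: pick from the best quarter of eight candidate actions.
rng = random.Random(0)
choice = quantilize(list(range(8)), lambda a: float(a), 0.25, rng)
```

With `q = 1` this reduces to sampling from the base distribution, and as `q` shrinks it approaches plain maximization, so `q` plays a role loosely analogous to the aggressiveness setting discussed for AUP.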

TurnTrout (2mo): When I say that our intuitive sense of power has to do with the easily exploitable opportunities available to an actor, that refers to opportunities which e.g. a ~human-level intelligence could notice and take advantage of. This has some strange edge cases, but it's part of my thinking. The key point is that AUP_conceptual relaxes the problem: This is not trivial. I think it's a useful question to ask (especially because we can formalize so many of these power intuitions), even if none of the formalizations are perfect.
rohinmshah (2mo): Probably I'm just missing something, but I don't see why you couldn't say something similar about: E.g. It seems like the main difference for power in particular is that there's more hope that we could formalize power without reference to humans (which seems harder to do for e.g. "niceness"), but then my original point applies.

(This discussion was continued privately – to clarify, I was narrowly arguing that AUP is correct, but that this should only provide a mild update in favor of implementations working in the superintelligent case.)

Reasons for Excitement about Impact of Impact Measure Research

by TurnTrout · 4 min read · 27th Feb 2020 · 8 comments


Ω 13

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Can we get impact measurement right? Does there exist One Equation To Rule Them All?

I think there’s a decent chance there isn’t a simple airtight way to implement AUP which lines up with AUP_conceptual, mostly because it’s just incredibly difficult in general to perfectly specify the reward function.

Reasons why it might be feasible: we’re trying to get the agent to do the goal without it becoming more able to do the goal, which is conceptually simple and natural; since we’ve been able to handle previous problems with AUP with clever design choice modifications, it’s plausible we can do the same for all future problems; since there are a lot of ways to measure power due to instrumental convergence, that increases the chance at least one of them will work; intuitively, this sounds like the kind of thing which could work (if you told me “you can build superintelligent agents which don’t try to seek power by penalizing them for becoming more able to achieve their own goal”, I wouldn’t exactly die of shock).

Even so, I am (perhaps surprisingly) not that excited about actually using impact measures to restrain advanced AI systems. Let’s review some concerns I provided in Reasons for Pessimism about Impact of Impact Measures:

  • Competitive and social pressures incentivize people to cut corners on safety measures, especially those which add overhead. This applies especially to training time, assuming the designers slowly increase aggressiveness until they get a reasonable policy.
  • In a world where we know how to build powerful AI but not how to align it (which is actually probably the scenario in which impact measures do the most work), we play a very unfavorable game while we use low-impact agents to somehow transition to a stable, good future: the first person to set the aggressiveness too high, or to discard the impact measure entirely, ends the game.
  • In a What Failure Looks Like-esque scenario, it isn't clear how impact-limiting any single agent helps prevent the world from "gradually drifting off the rails".

You might therefore wonder why I’m working on impact measurement.


Within Matthew Barnett’s breakdown of how impact measures could help with alignment, I'm most excited about impact measure research as deconfusion. Nate Soares explains:

By deconfusion, I mean something like “making it so that you can think about a given topic without continuously accidentally spouting nonsense.”

To give a concrete example, my thoughts about infinity as a 10-year-old were made of rearranged confusion rather than of anything coherent, as were the thoughts of even the best mathematicians from 1700. “How can 8 plus infinity still be infinity? What happens if we subtract infinity from both sides of the equation?” But my thoughts about infinity as a 20-year-old were not similarly confused, because, by then, I’d been exposed to the more coherent concepts that later mathematicians labored to produce. I wasn’t as smart or as good of a mathematician as Georg Cantor or the best mathematicians from 1700; but deconfusion can be transferred between people; and this transfer can spread the ability to think actually coherent thoughts.

In 1998, conversations about AI risk and technological singularity scenarios often went in circles in a funny sort of way. People who are serious thinkers about the topic today, including my colleagues Eliezer and Anna, said things that today sound confused. (When I say “things that sound confused,” I have in mind things like “isn’t intelligence an incoherent concept,” “but the economy’s already superintelligent,” “if a superhuman AI is smart enough that it could kill us, it’ll also be smart enough to see that that isn’t what the good thing to do is, so we’ll be fine,” “we’re Turing-complete, so it’s impossible to have something dangerously smarter than us, because Turing-complete computations can emulate anything,” and “anyhow, we could just unplug it.”) Today, these conversations are different. In between, folks worked to make themselves and others less fundamentally confused about these topics—so that today, a 14-year-old who wants to skip to the end of all that incoherence can just pick up a copy of Nick Bostrom’s Superintelligence.

Similarly, suppose you’re considering the unimportant and trivial question of "does instrumental convergence exist?", which we can now crisply state as "do most reward functions induce farsighted optimal policies which take over the planet (more formally, which gain a lot of power)?".

You’re a bit confused if you argue in the negative by saying “you’re anthropomorphizing; chimpanzees don’t try to do that” (chimpanzees aren’t optimal) or “the set of reward functions which does this has measure 0, so we’ll be fine” (for any[1] trajectory, there exists a positive measure subset of reward functions for which it is optimal; see Lemma 35 of Optimal Farsighted Agents Tend to Gain Power).

You’re a bit confused if you argue in the affirmative by saying “unintelligent animals fail to gain resources and die; intelligent animals gain resources and thrive. Therefore, since we are talking about really intelligent agents, of course they’ll gain resources and avoid correction.” (animals aren’t optimal, and evolutionary selection pressures narrow down the space of possible “goals” they could be effectively optimizing).

After reading this paper on the formal roots of instrumental convergence, instead of arguing about whether chimpanzees are representative of power-seeking behavior, we can just discuss how, under an agreed-upon reward function distribution, optimal action is likely to flow through the future of our world. We can think about to what extent the paper's implications apply to more realistic reward function distributions (which don't identically distribute reward over states).[2] Since we’re less confused, our discourse doesn’t have to be crazy.
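To make "under an agreed-upon reward function distribution" concrete, here is a toy experiment. It is a minimal sketch and not the paper's formalism: the six-state environment, the i.i.d. uniform reward distribution, and all names are illustrative assumptions. From the start state the agent can go left to a single absorbing state, or right to a hub from which it can reach any of three absorbing states.

```python
import random

GAMMA = 0.99  # a farsighted agent


def right_is_optimal(r_left, r_hub, r_leaves):
    """Compare discounted returns of the two possible first moves.

    Left: enter one absorbing state at t=1 and collect r_left forever.
    Right: collect r_hub at t=1, then the best leaf reward from t=2 onward.
    The start state's own reward is common to both and is omitted.
    """
    tail = GAMMA / (1.0 - GAMMA)  # sum of GAMMA**t for t >= 1
    left_return = tail * r_left
    right_return = GAMMA * r_hub + GAMMA * tail * max(r_leaves)
    return right_return > left_return


def fraction_preferring_hub(n_samples=10_000, n_leaves=3, seed=0):
    """Estimate how often a random reward function makes 'right' optimal."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        r_left = rng.random()
        r_hub = rng.random()
        r_leaves = [rng.random() for _ in range(n_leaves)]
        wins += right_is_optimal(r_left, r_hub, r_leaves)
    return wins / n_samples


FRACTION = fraction_preferring_hub()
```

With the discount this close to 1, `FRACTION` should land near 3/4, since the hub gives access to three of the four absorbing states. Setting `n_leaves=1` removes the extra options and the preference falls back to roughly even, which is the sort of question one can now argue about with numbers instead of chimpanzees.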

But also since we’re less confused, the privacy of our own minds doesn’t have to be crazy. It's not that I think that any single fact or insight or theorem downstream of my work on AUP is totally obviously necessary to solve AI alignment. But it sure seems good that we can mechanistically understand instrumental convergence and power, know what “impact” means instead of thinking it’s mostly about physical change to the world, think about how agents affect each other, and conjecture why goal-directedness seems to lead to doom by default.[3]

Attempting to iron out flaws from our current-best AUP equation makes one intimately familiar with how and why power-seeking incentives can sneak in even when you’re trying to keep them out in the conceptually correct way. This point is harder for me to articulate, but I think there’s something vaguely important in understanding how this works.

Oh, and then there’s always that time when possibility theory highlighted a significant hole in our theoretical understanding of the main formalism of reinforcement learning. And if you told me two years ago that you could possibly solve side-effect avoidance in the short-term with one simple trick (“just preserve your ability to optimize a single random reward function, lol”), I’d have thought you were nuts. Clearly, there’s something wrong with our models of reinforcement learning environments if these results are so surprising.

Research on AUP has yielded an unusually high rate of deconfusion and insights, probably because we’re thinking about what it means for the agent to interact with us.

  1. Technically, this only applies to non-dominated possibilities, but the distinction isn’t relevant here. If there's a reachable state with high power, there exists a non-dominated trajectory going through it. ↩︎

  2. When combined with our empirical knowledge of the difficulty of reward function specification, you might begin to suspect that there are lots of ways the agent might be incentivized to gain control, many openings through which power-seeking incentives can permeate – and your reward function would have to penalize all of these! If you were initially skeptical, this might make you think that power-seeking behavior may be more difficult to avoid than you initially thought. ↩︎

  3. If we collectively think more and end up agreeing that AUP solves impact measurement, it would be interesting that you could solve such a complex, messy-looking problem in such a simple way. If, however, CCC ends up being false, I think that would also be a new and interesting fact not currently predicted by our models of alignment failure modes. ↩︎

