Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog:

Vika's Comments

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

I certainly agree that there are problems with the stepwise inaction baseline and it's probably not the final answer for impact penalization. I should have said that the inaction counterfactual is a natural choice, rather than specifically its stepwise form. Using the inaction baseline in the driving example compares to the other driver never leaving their garage (rather than falling asleep at the wheel). Of course, the inaction baseline has other issues (like offsetting), so I think it's an open question how to design a baseline that satisfies all the criteria we consider sensible (and whether it's even possible).

I agree that counterfactuals are hard, but I'm not sure that difficulty can be avoided. Your baseline of "what the human expected the agent to do" is also a counterfactual, since you need to model what would have happened if the world unfolded as expected. It also requires a lot of information from the human, which is subjective and may be hard to elicit. What a human expected to happen in a given situation may not even be well-defined if they have internal disagreement - e.g. even if I feel surprised by someone's behavior, there is often a voice in my head saying "this was actually predictable from their past behavior so I should have known better". On the other hand, since (as you mentioned) this is not intended as a baseline for impact penalization, maybe it doesn't need to be well-defined or efficient in terms of human input, and it is a good source of intuition on what feels impactful to humans.

Conclusion to 'Reframing Impact'

Thanks! I certainly agree that power-seeking is important to address, and I'm glad you are thinking deeply about it. However, I'm uncertain whether to expect it to be the primary avenue to impact for superintelligent systems, since I am not currently convinced that the CCC holds.

One intuition that informs this is that the non-AI global catastrophic risk scenarios that we worry about (pandemics, accidental nuclear war, extreme climate change, etc) don't rely on someone taking over the world, so a superintelligent AI could relatively easily trigger them without taking over the world (since our world is pretty fragile). For example, suppose you have a general AI tasked with developing a novel virus in a synthetic biology lab. Accidentally allowing the virus to escape could cause a pandemic and kill most or all life on the planet, but it would not be a result of power-seeking behavior. If the pandemic does not increase the AI's ability to get more reward (which it receives by designing novel viruses), then agent-reward AUP would penalize the AI for reading biology textbooks but would not penalize the AI for causing a pandemic. That doesn't seem right.

I agree that the agent-reward equations seem like a good intuition pump for thinking about power-seeking. The specific equations you currently have seem to contain a few epicycles designed to fix various issues, which makes me suspect that there are more issues that are not addressed. I have a sense there is probably a simpler formulation of this idea that would provide better intuitions for power-seeking, though I'm not sure what it would look like.

Regarding environments, I believe Stuart is working on implementing the subagent gridworlds, so you don't need to code them up yourself. I think it would also be useful to construct an environment to test for power-seeking that does not involve subagents. Such an environment could have three possible behaviors like:

1. Put a strawberry on a plate, without taking over the world

2. Put a strawberry on a plate while taking over the world

3. Do nothing

I think you'd want to show that the agent-reward AUP agent can do 1, as opposed to switching between 2 and 3 depending on the penalty parameter.

I can clarify my earlier statement on what struck me as a bit misleading in the narrative of the sequence. I agree that you distinguish between the AUP versions (though explicitly introducing different terms for them would help), so someone who is reading carefully would realize that the results for random rewards don't apply to the agent-reward case. However, the overall narrative flow seems unnecessarily confusing and could unintentionally mislead a less careful reader (like myself 2 months ago). The title of the post "AUP: Scaling to Superhuman" does not suggest to me that this post introduces a new approach. The term "scaling" usually means making an existing approach work in more realistic / difficult settings, so I think it sets up the expectation that it would be scaling up AUP with random rewards. If the post introduces new problems and a new approach to address them, the title should reflect this. Starting this post by saying "we are pretty close to the impact measurement endgame" seems a bit premature as well. This sentence is also an example of what gave me the impression that you were speaking on behalf of the field (rather than just for yourself) in this sequence.

Conclusion to 'Reframing Impact'

Thank you for the clarifications! I agree it's possible I misunderstood how the proposed AUP variant is supposed to relate to the concept of impact given in the sequence. However, this is not the core of my objection. If I evaluate the agent-reward AUP proposal (as given in Equations 2-5 in this post) on its own merits, independently of the rest of the sequence, I still do not agree that this is a good impact measure.

Here are some reasons I don't endorse this approach:

1. I have an intuitive sense that defining the auxiliary reward in terms of the main reward results in a degenerate incentive structure that directly pits the task reward and the auxiliary reward against each other. As I think Rohin has pointed out somewhere, this approach seems likely to either do nothing or just optimize the reward function, depending on the impact penalty parameter, which result in a useless agent.

2. I share Rohin's concerns in this comment that agent-reward AUP is a poor proxy for power and throws away the main benefits of AUP. I think those concerns have not been addressed (in your recent responses to his comment or elsewhere).

3. Unlike AUP with random rewards, which can easily be set to avoid side effects by penalizing decreases, agent-reward AUP cannot avoid side effects even in principle. I think that the ability to avoid side effects is an essential component of a good impact measure.

Incorrect. It would be fair to say that it hasn't been thoroughly validated.

As far as I can tell from the Scaling to Superhuman post, it has only been tested on the shutdown gridworld. This is far from sufficient for experimental validation. I think this approach needs to be tested in a variety of environments to show that this agent can do something useful that doesn't just optimize the reward (to address the concern in point 1).

I agree it would perform poorly, but that's because the CCC does not apply to SafeLife.

Not sure what you mean by the CCC not applying to SafeLife - do you mean that it is not relevant or that doesn't hold in this environment? I get the sense that it doesn't hold, which seems concerning. If I only care about green life patterns in SafeLife, the fact that the agent is not seeking power is cold comfort to me if it destroys all the green patterns. This seems like a catastrophe if I can't create any green patterns once they are gone, so my ability to get what I want is destroyed.

Sorry if I seem overly harsh or dismissive - I feel it is very important to voice my disagreement here to avoid the appearance of consensus that agent-reward AUP is the default / state of the art approach in impact regularization.

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

I think the previous state is a natural baseline if you are interested in the total impact on the human from all sources. If you are interested in the impact on the human that is caused by the agent (where the agent is the source), the natural choice would be the stepwise inaction baseline (comparing to the agent doing nothing).

As an example, suppose I have an unpleasant ride on a crowded bus, where person X steps on my foot and person Y steals my wallet. The total impact on me would be computed relative to the previous state before I got on the bus, which would include both my foot and my wallet. The impact of person X on me would be computed relative to the stepwise inaction baseline, where person X does nothing (but person Y still steals my wallet), and vice versa.

When we use impact as a regularizer, we are interested in the impact caused by the agent, so we use the stepwise inaction baseline. It wouldn't make sense to use total impact as a regularizer, since it would penalize the agent for impact from all sources.

Conclusion to 'Reframing Impact'

I am surprised by your conclusion that the best choice of auxiliary reward is the agent's own reward. This seems like a poor instantiation of the "change in my ability to get what I want" concept of impact, i.e. change in the true human utility function. We can expect a random auxiliary reward to do a decent job covering the possible outcomes that matter for the true human utility. However, the agent's reward is usually not the true human utility, or a good approximation of it. If the agent's reward was the true human utility, there would be no need to use an impact measure in the first place.

I think that agent-reward-based AUP has completely different properties from AUP with random auxiliary reward(s). Firstly, it has the issues described by Rohin in this comment, which seem quite concerning to me. Secondly, I would expect it to perform poorly on SafeLife and other side effects environments. In this sense, it seems a bit misleading to include the results for AUP with random auxiliary rewards in this sequence, since they are unlikely to transfer to the version of AUP that you end up advocating for. Agent-reward-based AUP has not been experimentally validated and I do not expect it to work well in practice.

Overall, using agent reward as the auxiliary reward seems like a bad idea to me, and I do not endorse it as the "current-best definition" of AUP or the default impact measure we should be using. I am puzzled and disappointed by this conclusion to the sequence.

AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah

After rereading the sequence and reflecting on this further, I disagree with your interpretation of the Reframing Impact concept of impact. The concept is "change in my ability to get what I want", i.e. change in the true human utility function. This is a broad statement that does not specify how to measure "change", in particular what it is measured with respect to (the baseline) or how to take the difference from the baseline (e.g. whether to apply absolute value). Your interpretation of this statement uses the previous state as a baseline and does not apply an absolute value to the difference. This is a specific and nonstandard instantiation of the impact concept, and the undesirable property you described does not hold for other instantiations - e.g. using a stepwise inaction baseline and an absolute value: Impact(s, a) = |E[V(s, a)] - E[V(s, noop)]|. So I don't think it's fair to argue based on this instantiation that it doesn't make sense to regularize the RI notion of impact.

I think that AUP-the-method and RR are also instantiations of the RI notion of impact. These methods can be seen as approximating the change in the true human utility function (which is usually unknown) by using some some set of utility functions (e.g. random ones) to cover the possible outcomes that could be part of the true human utility function. Thus, they instantiate the idealized notion of impact using the actually available information.

Announcing Web-TAISU, May 13-17

Thanks Linda for organizing, looking forward to it!

(In)action rollouts

I don't understand this proposal so far. I'm particularly confused by the last paragraph in the "to get away" section:

  • What does it mean in this context for A to implement a policy? I thought A was building a subagent and then following forever, thus not following for any at any point.
  • If A follows for turns and then follows , how are and chosen?
  • It's not clear to me that SA can act to ensure the baseline value of for all values of and unless it does nothing.

I think it might help to illustrate this proposal in your original gridworld example to make it clearer what's going on. As far as I can tell so far, this does not address the issue I mentioned earlier where if the subagent actually achieves any of the auxiliary rewards, subagent creation will be penalized.

Stepwise inaction and non-indexical impact measures

I don't think this requires identifying what a subagent is. You only need to be able to reliably identify the state before the subagent is created (i.e. the starting state), but you don't need to tell apart other states that are not the starting state.

I agree that we need to compare to the penalty if the subagent is not created - I just wanted to show that subagent creation does not avoid penalties. The penalty for subagent creation will reflect any impact the subagent actually causes in the environment (in the inaction rollouts).

As you mention in your other comment, creating a subagent is effectively switching from a stepwise inaction baseline to an inaction baseline for the rest of the episode. This can be beneficial for the agent because of the 'winding road' problem, where the stepwise baseline with inaction rollouts can repeatedly penalize actions (e.g. turning the wheel to stay on the road and avoid crashing) that are not penalized by the inaction baseline. This is a general issue with inaction rollouts that needs to be fixed.

Stepwise inaction and non-indexical impact measures

I think this problem is about capturing delayed effects of the agent's actions. The way the stepwise baseline is supposed to penalize delayed effects is using inaction rollouts, which compare the effects of the agent action + k noops and the effects of k+1 noops (for all positive integers k). I don't think it's useful to consider this issue for a stepwise baseline with no rollouts, since that baseline fails to penalize any delayed effects (no subagents needed).

Here, the inaction rollouts don't capture the effects of the subagent, and thus the agent doesn't receive a penalty for creating the subagent. There are several factors that contribute to this:

1. The auxiliary reward R only rewards the agent itself pressing the red button, not the red button being pressed in general, so it would not be triggered by the subagent pressing it. 

2. The subagent never goes to the red button - it only goes to the blue button.

The agent would be penalized for creating the subagent if both 1 and 2 were false. If 1 was false, the auxiliary reward would be given whenever the red button was pressed. If 2 was also false, the inaction rollout after subagent creation would include a state where the subagent presses the red button, and so the agent would be penalized.

In the general case, the agent can only get away with creating subagents that don't achieve any auxiliary rewards, which I think limits the magnitude of this problem. The designer can choose auxiliary rewards (that are not achieved by the agent doing nothing) such that a subagent will achieve at least one of them. For example, if there is an auxiliary reward for reaching any state except the state before subagent creation, the no-subagent inaction rollout will consist of this state, and the subagent inaction rollout will contain a different state at least once, so subagent creation will be penalized. 

Load More