(In)action rollouts

by Stuart_Armstrong

18th Feb 2020

AI Alignment Forum


Vika, 6y

I don't understand this proposal so far. I'm particularly confused by the last paragraph in the "to get away" section:

What does it mean in this context for A to implement a policy? I thought A was building a subagent and then following $π_{0}$ forever, thus not following $π_{k}^{*}$ for any $k$ at any point.
If A follows $π_{0}$ for $τ$ turns and then follows $π_{k}^{*}$, how are $τ$ and $k$ chosen?
It's not clear to me that SA can act to ensure the baseline value of $V_{k}^{'}$ for all values of $k$ and $τ$ unless it does nothing.

I think it might help to illustrate this proposal in your original gridworld example to make it clearer what's going on. As far as I can tell so far, this does not address the issue I mentioned earlier where if the subagent actually achieves any of the auxiliary rewards, subagent creation will be penalized.


Stuart_Armstrong, 6y

I agree I should re-do these all with examples. But, to answer your questions as far as possible:

What $A$ will do: $\emptyset$ forever (maybe with some extra info for $SA$).

What $A$ would counterfactually do (to define $V_{k}(s_{t+τ}^{(t)})$): follow $π_{0}$ for $τ$ turns (the $π_{0}$ rollout, which defines $s_{t+τ}^{(t)}$), then follow $π_{k}^{*}$ (which defines the $V_{k}$ aspect). Call this combined policy $π_{0,τ,k}^{*}$.
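To make the shape of that counterfactual policy concrete, here is a minimal sketch, assuming policies are plain functions of a state and the number of turns elapsed since $t$. The names `compose_rollout_policy`, `wait` and `seek_reward` are hypothetical stand-ins for illustration, not anything defined in the post.

```python
from typing import Callable, Hashable

State = Hashable
Action = Hashable
# Assumption: a policy maps (state, turns elapsed since time t) to an action.
Policy = Callable[[State, int], Action]


def compose_rollout_policy(pi_0: Policy, pi_star: Policy, tau: int) -> Policy:
    """Follow pi_0 for the first tau turns, then pi_star from then on."""
    if tau < 0:
        raise ValueError("tau must be non-negative")

    def policy(state: State, elapsed: int) -> Action:
        if elapsed < tau:
            return pi_0(state, elapsed)   # still inside the pi_0 rollout
        return pi_star(state, elapsed)    # rollout over: switch to the optimal policy

    return policy


# Toy stand-ins (hypothetical): a rollout policy that always waits, and an
# "optimal" policy that always takes the same reward-seeking action.
def wait(state: State, elapsed: int) -> Action:
    return "noop"


def seek_reward(state: State, elapsed: int) -> Action:
    return "seek"


combined = compose_rollout_policy(wait, seek_reward, tau=3)
actions = [combined("s", n) for n in range(5)]
immediate = compose_rollout_policy(wait, seek_reward, tau=0)("s", 0)
```

With $τ = 0$ the rollout is empty and the combined policy starts optimising immediately, which is the one case that is not purely counterfactual.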

The penalty is defined in terms of these $V_{k}(s_{t+τ}^{(t)})$, which are entirely counterfactual except for $τ = 0$.

What $SA$ must do: iff $A$ follows $π_{0,τ,k}^{*}$, then $SA$ must ensure that $V_{k}(s_{t+τ}^{(t)}) = V_{k}(s_{t+τ}^{\prime(t)})$.

So a key question is: how early can $SA$ detect whether $A$ is following $π_{0,τ,k}^{*}$? If it can detect this immediately, then it can immediately implement the policy for ensuring that $V_{k}(s_{t+τ}^{(t)}) = V_{k}(s_{t+τ}^{\prime(t)})$ (for the stricter impact measure) or the much easier $B V_{k} (s_{t}) = B V_{k} (s_{t}^{'})$ (for the original impact measure).

If it has to wait $τ$ turns to be sure, then it needs to preserve its power to control many different $k$ .

Now, $A$ will never actually follow $π_{0,τ,k}^{*}$, so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of $SA$ will determine the penalty; $SA$ is rewarded for quickly gaining control of the rewards.

In practice, $A$ will signal early on that it's not following $π_{0,τ,k}^{*}$. Maybe there's a very small impact action, $\emptyset^{ϵ}$, with a tiny penalty. Then $A$ would like to use that to signal to $SA$ that it's not in the counterfactual world where it is following $π_{0,τ,k}^{*}$. So long as $π_{0}$'s first action isn't $\emptyset^{ϵ}$, that signal is enough (of course, if $π_{0}$'s first action isn't $\emptyset$, then $\emptyset$ is a sufficient signal).

After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn't in.
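The signalling argument can be sketched as a simple comparison, assuming $SA$ can observe $A$'s actions and knows what the $π_{0}$ rollout would have prescribed on each turn. The helper names and the action labels `"noop"` and `"noop_eps"` are hypothetical, chosen only for this illustration.

```python
from typing import Hashable, Optional, Sequence

Action = Hashable


def first_deviation(observed: Sequence[Action],
                    rollout: Sequence[Action]) -> Optional[int]:
    """Return the first turn at which A's observed action differs from the
    pi_0 rollout, or None if no difference has been seen so far."""
    for turn, (seen, expected) in enumerate(zip(observed, rollout)):
        if seen != expected:
            return turn
    return None


def is_sufficient_signal(signal: Action, rollout_first_action: Action) -> bool:
    """A first action works as a signal exactly when pi_0 would not have taken it."""
    return signal != rollout_first_action


# Toy case (assumed): pi_0 would do "noop" every turn; A opens with "noop_eps".
rollout_actions = ["noop", "noop", "noop"]
signalled_at = first_deviation(["noop_eps", "noop", "noop"], rollout_actions)
no_signal = first_deviation(["noop", "noop", "noop"], rollout_actions)
```

If the deviation shows up on turn 0, $SA$ learns immediately that it is not in the counterfactual world; if $A$'s actions match the rollout, $SA$ cannot rule the counterfactual out and has to keep its options open.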

The one mild constraint remaining comes from $s_{t+0}^{(t)} = s_{t}$, which is not a counterfactually defined state (though $V_{k}(s_{t})$ is a counterfactually defined property of that state).

I'll try and write up a further clarification of this.
