Stuart_Armstrong's Comments

The Adventure: a new Utopia story

Yeah, I was doing a lot of editing (which is how I got the blue colour), and didn't want it to appear little by little.

The Adventure: a new Utopia story

I recently reformatted it, so that it would colour and indent properly, and put it in markdown. That seems to have reset the date.

(In)action rollouts

I agree I should re-do these all with examples. But, to answer your questions as far as possible:

What will do: forever (maybe with some extra info for the ).

What would counterfactually do (to define ): for turns (the rollout, to define ) followed by (to define the aspect). Call this .

The penalty is defined in terms of these , entirely counterfactual except for .

What must do: iff follows , then must ensure that .

So a key question is: how early can detect if is following ? If it can detect this immediately, then it can immediately implement the policy for ensuring that (for the stricter impact measure) or the much easier (for the original impact measure).

If it has to wait turns to be sure, then it needs to preserve its power to control many different .

Now, will never actually follow , so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of will determine the penalty; is rewarded for quickly gaining control of the rewards.

In practice, will signal early on that it's not following . Maybe there's a very small impact action, , with a tiny penalty. Then would like to use that to signal to that it's not in the counterfactual world where it is following . So long as 's first action isn't , that signal is enough (of course, if 's first action isn't , then is a sufficient signal).

After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn't in.
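The "vacuous once signalled" point above can be sketched as a toy (all names here are hypothetical, not the notation from the post): a constraint that only applies to histories consistent with the agent following the inaction policy imposes nothing on the subagent's actual behaviour once it has seen a deviation.

```python
# Toy sketch: a counterfactual constraint on the subagent that only binds in
# worlds where the agent follows the inaction policy. One observed non-noop
# action makes the constraint vacuous for actual behaviour.

INACTION = "noop"

def constraint_satisfied(history, subagent_policy):
    """The constraint only applies to histories consistent with the agent
    following the inaction policy; otherwise it is vacuously satisfied."""
    if any(a != INACTION for a in history):
        return True  # not the counterfactual world: no constraint
    return subagent_policy(history) == "preserve_rewards"

def free_subagent(history):
    # Comply only while the history is still consistent with the counterfactual.
    return "maximise" if any(a != INACTION for a in history) else "preserve_rewards"

assert constraint_satisfied(["noop", "noop"], free_subagent)   # counterfactual world: complies
assert constraint_satisfied(["noop", "signal"], free_subagent) # real world: vacuously satisfied
```

So `free_subagent` behaves arbitrarily in every world it can actually be in, while never violating the counterfactually-defined constraint.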

The one mild constraint remaining comes from , which is not a counterfactually defined state (though is a counterfactually defined property of that state).

I'll try and write up a further clarification of this.

Stepwise inaction and non-indexical impact measures

For example, if there is an auxiliary reward for reaching any state except the state before subagent creation, the no-subagent inaction rollout will consist of this state, and the subagent inaction rollout will contain a different state at least once, so subagent creation will be penalized.
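A minimal numeric sketch of that example (states and values are made up for illustration): the auxiliary reward pays for any state other than the pre-creation state, so the two inaction rollouts accrue different value and the difference shows up as a penalty.

```python
# Toy sketch: auxiliary reward of 1 for any state except the pre-creation
# state s_pre. The no-subagent inaction rollout stays in s_pre; the
# with-subagent rollout visits other states, so the rollouts differ.

S_PRE = "s_pre"

def aux_reward(state):
    return 0.0 if state == S_PRE else 1.0

no_subagent_rollout = [S_PRE] * 5                       # nothing happens
subagent_rollout = [S_PRE, "s1", "s2", "s1", "s2"]      # subagent moves around

v_no_sub = sum(aux_reward(s) for s in no_subagent_rollout)  # 0.0
v_sub = sum(aux_reward(s) for s in subagent_rollout)        # 4.0
penalty = abs(v_sub - v_no_sub)
assert penalty > 0  # subagent creation is penalised
```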

This requires identifying what a subagent is in general, a very tricky unsolved problem (which I feel is unsolvable).

There's another issue: it's not enough to show that the subagent triggers a penalty. We need to show that this penalty is larger than the penalty for not creating the subagent. Since the penalty is zero after the subagent is created, and since the subagent has very fine control over the rewards (much finer than actions that don't involve creating an intelligent being), creating a subagent might incur a lower penalty than almost any other action.

It won't be a lower penalty than the agent doing nothing forever, of course. But we typically want the agent to do something, so we will calibrate the penalty or R_0 for that. And it's plausible that creating the subagent will have a lower penalty (and/or higher R_0) than any safe "something".
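The comparison above can be made concrete with invented numbers (an AU-style penalty summing changes in attainable auxiliary value against the noop baseline; nothing here is the post's actual formula): a subagent that pays a one-off creation penalty and then holds every auxiliary value at baseline can come out cheaper than a "safe" action repeated every turn.

```python
# Toy numeric sketch: penalty per turn = sum over auxiliary rewards R of
# |V_R(after action) - V_R(after noop)|. A subagent with fine control holds
# every V_R at baseline after creation; a "safe" action perturbs them each turn.

def penalty(v_after_action, v_after_noop):
    return sum(abs(v_after_action[r] - v_after_noop[r]) for r in v_after_noop)

baseline = {"R1": 1.0, "R2": 2.0}

# One-off penalty at creation, then the subagent matches the baseline exactly.
create_subagent = [{"R1": 1.3, "R2": 2.2}] + [baseline] * 9
# A "safe" action that slightly perturbs attainable values every single turn.
safe_action = [{"R1": 1.05, "R2": 2.05}] * 10

total_subagent = sum(penalty(v, baseline) for v in create_subagent)  # ~0.5
total_safe = sum(penalty(v, baseline) for v in safe_action)          # ~1.0
assert total_subagent < total_safe
```

With these (arbitrary) numbers, the subagent's one-off penalty of ~0.5 undercuts the safe action's cumulative ~1.0, illustrating why calibrating the penalty to permit "something" can end up permitting subagent creation.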

Stepwise inaction and non-indexical impact measures

The inaction rollouts effectively transform the stepwise inaction baseline into an inaction baseline (starting from the moment the subagent is created; thus the agent has a bit more control than in a true inaction baseline).

Therefore the results on the inaction baseline apply.

This means that restrictions on increased power for the agent ("make sure you never have the power to increase the rewards") become restrictions on the actual policy followed for the subagent ("make sure you never increase these rewards").

Roughly, attainable utility becomes twenty billion questions.

For the original example, this means that the agent cannot press the red button nor gain the ability to teleport. But while the subagent cannot press the red button, it can gain the ability to teleport.
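The asymmetry in the red-button example can be sketched as follows (the predicates and ability names are illustrative, not the post's formalism): the agent's penalty is power-based, the subagent's is policy-based.

```python
# Toy sketch: the agent is penalised for *attainable* auxiliary reward
# ("could it press the red button?"), so even gaining teleportation counts;
# the subagent is penalised only for reward actually accrued along its
# policy ("did it press the red button?").

def agent_penalty(abilities):
    # Power-based: any ability that changes attainable auxiliary reward
    # (pressing the button, or teleporting nearer to it) is penalised.
    return float("press_red_button" in abilities or "teleport" in abilities)

def subagent_penalty(actions_taken):
    # Policy-based: only actually pressing the button is penalised.
    return float("press_red_button" in actions_taken)

assert agent_penalty({"teleport"}) == 1.0            # agent can't even gain teleport
assert subagent_penalty({"teleport"}) == 0.0         # subagent may teleport...
assert subagent_penalty({"press_red_button"}) == 1.0 # ...but must never press the button
```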

Subagents and impact measures: summary tables

AU and RR have the same problem, formally, in terms of excess power; it's just that AU wants low power and RR wants high power, so they don't have the same problem in practice.

Stepwise inaction and non-indexical impact measures

It needs to benefit for ; not necessarily for or alone.

It seems to me that, generically, since is supposed to be a hindrance to the agent, taking a few turns to then neutralise should be beneficial.

In theory: does building the subagent have an "impact"?

It's the delta of that with that is penalised, not large changes on its own.

In theory: does building the subagent have an "impact"?

I've removed that sentence, because it's a bit more complicated than that; see the next two posts in the sequence, and the summary post:
