<@Previously@>(@Seeking Power is Provably Instrumentally Convergent in MDPs@) we’ve seen that if we take an MDP and a distribution over state-based reward functions under which the rewards for different states are iid, then farsighted (i.e. undiscounted) optimal agents tend to seek “power”. This post relaxes some of these requirements, giving sufficient (but not necessary) criteria for determining instrumental convergence.

Some of these use a new kind of argument. Suppose that action A leads you to a part of the MDP modeled by a graph G1, and action B leads you to a part modeled by a graph G2. If G2 has a subgraph that is isomorphic to G1, then whatever choices the agent would have after taking action A, it would also have after taking action B, so B is at least as likely to be optimal as A. This matches our intuitive reasoning -- collecting resources is instrumentally convergent because you can do the same things that you could if you didn’t collect resources, as well as some additional things enabled by your new resources.
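
To make the subgraph argument concrete, here is a minimal sketch on a toy MDP. The graph, the uniform iid reward distribution, and the names `SUCCESSORS`, `optimal_values`, and `estimate_optimality_probability` are all assumptions made for illustration, not the post's construction. Action B leads to a region that contains a copy of A's region plus one extra option, and the code estimates how often each action is optimal across sampled reward functions.

```python
import random

# Hypothetical toy MDP (not from the post): deterministic transitions,
# each state maps to the list of states reachable in one step.
SUCCESSORS = {
    0: [1, 2],  # start: action A leads to state 1, action B leads to state 2
    1: [3],     # G1: a single path into an absorbing state
    2: [4, 5],  # G2: contains a copy of G1 (2 -> 4) plus one extra option (2 -> 5)
    3: [3],
    4: [4],
    5: [5],
}


def optimal_values(reward, gamma, sweeps=200):
    """Value iteration with state-based rewards on the deterministic MDP above."""
    values = {s: 0.0 for s in SUCCESSORS}
    for _ in range(sweeps):
        values = {
            s: reward[s] + gamma * max(values[t] for t in SUCCESSORS[s])
            for s in SUCCESSORS
        }
    return values


def estimate_optimality_probability(n_samples=1000, gamma=0.9, seed=0):
    """Estimate how often A (resp. B) is the optimal first action."""
    rng = random.Random(seed)
    b_wins = 0
    for _ in range(n_samples):
        # Assumption: rewards are iid uniform on [0, 1] for every state.
        reward = {s: rng.random() for s in SUCCESSORS}
        values = optimal_values(reward, gamma)
        # B is optimal at the start when state 2 is worth more than state 1
        # (exact ties have probability zero and are counted for A).
        if values[2] > values[1]:
            b_wins += 1
    frac_b = b_wins / n_samples
    return 1.0 - frac_b, frac_b


if __name__ == "__main__":
    frac_a, frac_b = estimate_optimality_probability()
    print(f"A optimal: {frac_a:.2f}, B optimal: {frac_b:.2f}")
```

In this sketch B should come out optimal for more of the sampled reward functions than A, since B's region offers everything A's does plus one more absorbing state to choose from.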

[-]FactorialCode5y30

One hypothesis I have is that even in the situation where there is no goal distribution and the agent has a single goal, subjective uncertainty makes powerful states instrumentally convergent. The motivating real world analogy being that you are better able to deal with unforeseen circumstances when you have more money.

[-]ryan_b5yΩ120

I have a question about this conclusion:

When $γ ≈ 1$, you're strictly more likely to navigate to parts of the future which give you strictly more options (in a graph-theoretic sense). Plus, these parts of the future give you strictly more power.

What about the case where agents have different time horizons? My question is inspired by one of the details of an alternative theory of markets, the Fractal Market Hypothesis. The relevant detail is an investment horizon, which is how long an investor keeps the asset. To oversimplify, the theory argues that markets work normally with a lot of investors with different investment horizons; when uncertainty increases, investors shorten their horizons, and then when everyone's horizons get very short we have a panic.

I thought this might be represented by a step function in the discount rate, but reviewing the paper it looks like $γ$ is continuous. It also occurs to me that this should be computationally similar to setting $γ = 1$ and running it over fewer turns, but that doesn't seem like it would work as well for modelling different discount rates on the same MDP.
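
As a rough illustration of how the two framings relate, the sketch below compares a discounted return with an undiscounted return cut off after a fixed number of turns. The function names and the reward sequence are hypothetical; the horizon $1/(1-γ)$ is the usual rule-of-thumb "effective horizon", used here only as an assumption for picking a comparable cutoff.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the whole reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


def truncated_return(rewards, horizon):
    """Undiscounted (gamma = 1) sum over only the first `horizon` steps."""
    return sum(rewards[:horizon])


rewards = [1.0, 1.0, 1.0, 1.0]    # hypothetical reward sequence
gamma = 0.5
horizon = round(1 / (1 - gamma))  # rule-of-thumb effective horizon: 2 steps here

print(discounted_return(rewards, gamma))   # geometric weights 1, 0.5, 0.25, 0.125
print(truncated_return(rewards, horizon))  # weights 1, 1, then a hard cutoff
```

The two agree only loosely: discounting shrinks every later reward geometrically, while truncation weights early rewards fully and ignores the rest.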

[-]TurnTrout5yΩ240

What do you mean by "agents have different time horizons"?

To answer my best guess of what you meant: this post used "most agents do X" as shorthand for "action X is optimal with respect to a large-measure set of reward functions", but the analysis only considers the single-agent MDP setting, and how, for a fixed reward function or reward function distribution, the optimal action tends to vary with the discount rate. There aren't multiple formal agents acting in the same environment.
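
A minimal sketch of "the optimal action varies with the discount rate" for one fixed reward function, using a hypothetical two-branch MDP that is not taken from the post or the paper: `left` pays a small reward at every step starting immediately, while `right` pays a larger reward at every step after a one-step delay.

```python
def first_action_returns(gamma):
    """Discounted return of each first action in a hypothetical two-branch MDP.

    left:  enter an absorbing state paying reward 1 at every step.
    right: pass through one reward-0 state, then an absorbing state paying 2.
    Requires 0 <= gamma < 1 so the geometric series converge.
    """
    left = 1.0 / (1.0 - gamma)
    right = gamma * 2.0 / (1.0 - gamma)
    return {"left": left, "right": right}


def optimal_first_action(gamma):
    """Return the first action with the larger discounted return."""
    returns = first_action_returns(gamma)
    return max(returns, key=returns.get)


for gamma in (0.1, 0.3, 0.7, 0.9):
    print(gamma, optimal_first_action(gamma))
```

In this toy case the optimal first action switches from `left` to `right` once $γ$ exceeds $1/2$, even though the reward function never changes.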

[-]ryan_b5yΩ240

The single-agent MDP setting resolves my confusion; now it is just a curiosity about directions future work might go. The result that the optimal action varies with the discount rate is essentially what interests me, so refocusing on the single-agent case: what do you think of the discount rate being discontinuous?

To be clear, there isn't an obvious motivation for this, so my guess for the answer is something like "Don't know and didn't check because it cannot change the underlying intuition."

[-]TurnTrout5yΩ120

Discontinuous with respect to what? The discount rate just is, and there just is an optimal policy set for each reward function at a given discount rate, and so it doesn't make sense to talk about discontinuity without having something to govern what it's discontinuous with respect to. Like, teleportation would be positionally discontinuous with respect to time.

You can talk about other quantities being continuous with respect to change in the discount rate, however, and the paper proves the continuity of e.g. POWER and optimality probability with respect to $γ$.