My second mistake was thinking that danger was related to the quantity of RL finetuning. I muddled up agency/goal-directedness with danger, and was also wrong that RL is more likely to produce agency/goal-directedness, conditioned on high capability. It's a natural mistake, since stereotypical RL training is designed to incentivize goal-directedness. But if we condition on high capability, it wipes out that connection, because we already know the algorithm has to contain some goal-directedness.
Distinguish two notions of "goal-directedness":
The system has a fixed goal that it capably works towards across all contexts.
The system is able to capably work towards goals, but which it does, if any, may depend on the context.
My sense is that a high level of capability implies (2) but not (1). And that (1) is way more obviously dangerous. Do you disagree?
Thanks for the feedback!
... except, going through the proof one finds that the latter property heavily relies on the "uniqueness" of the policy. My policy can get the maximum goal-directedness measure if it is the only policy of its competence level while being very deterministic. It isn't clear that this always holds for the optimal/anti-optimal policies or always relaxes smoothly to epsilon-optimal/anti-optimal policies.
Yeah, uniqueness definitely doesn't always hold for the optimal/anti-optimal policy. I think the way MEG works here makes sense: if you're following the unique optimal policy for some utility function, that's a lot of evidence for goal-directedness. If you're following one of many optimal policies, that's a bit less evidence -- there's a greater chance that it's an accident. In the most extreme case (for the constant utility function) every policy is optimal -- and we definitely don't want to ascribe maximum goal-directedness to optimal policies there.
With regard to relaxing smoothly to epsilon-optimal/anti-optimal policies, from memory I think we do have the property that MEG is increasing in the utility of the policy for policies with greater than the utility of the uniform policy, and decreasing for policies with less than the utility of the uniform policy. I think you can prove this via the property that the set of maxent policies is (very nearly) just Boltzman policies with varying temperature. But I would have to sit down and think about it properly. I should probably add that to the paper if it's the case.
minimum for uniformly random policy (this would've been a good property, but unless I'm mistaken I think the proof for the lower bound is incorrect, because negative cross entropy is not bounded below.)
Thanks for this. The proof is indeed nonsense, but I think the proposition is still true. I've corrected it to this.
Instead of tracking who is in debt to who, I think you should just track the extent to which you’re in a favouring-exchanging relationship with a given person. Less to remember and runs natively on your brain.
- ...if the malign superintelligence knows what observations we would condition on, it can likely arrange to make the world match those observations, making the probability of our observations given a malign superintelligence roughly one
The probability of any observation given the existence of a malign superintelligence is 1? So P(observation | malign superintelligence) adds up to like a gajillion?
5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated "status quo". Infrabayesian uncertainty about the dynamics is the final component that removes this incentive.
If you know which variables you want to remove the incentive to control, an alternative to penalising divergence is path-specific objectives, i.e. you compute the score function under an intervention on the model that sets the irrelevant variables to their status quo values. Then the AI has no incentive to control the variables, but no incentive to keep them the same either.
theorem is limited. only applies to cases where the decision node is not upstream of the environment nodes
I think you can drop this premise and modify the conclusion to “you can find a causal model for all variables upstream of the utility and not downstream of the decision.”
Lucius-Alexander SLT dialogue?
Yep, exactly.
But superhuman capabilities doesn’t seem to imply “applies all the optimisation pressure it can towards a goal”.
Like, being crazily good at research projects may require the ability to do goal-directed cognition. It doesn’t seem to require the habit of monomaniacally optimising the universe towards a goal.
I think whether or not a crazy good research AI is a monomaniacal universe optimiser probably depends on what kind of AI it is.