Epistemic status: either I’m confused, or non-obstruction isn’t what I want.
This is a response to Alex Turner’s Non-Obstruction: A Simple Concept Motivating Corrigibility. Please read that first, and at least skim Reframing Impact where relevant.
It’s all good stuff.
I may very well be missing something: if not, it strikes me as odd that many smart people seem to have overlooked the below. From an outside view, the smart money says I'm confused.
Feel free to mentally add “according to my current understanding”, “unless I’m missing something”, “it seems to me” as appropriate.
I’m writing this because:
- Non-obstruction seems like an important idea, but I don’t think it works.
- I’d like to find out whether/where I’m confused, why the below can't happen, etc.
- I want to encourage the norm of thoughtful public criticism of important ideas - even where the critic is uncertain, and may have overlooked technical details.
- I stubbornly believe I'm correct :)
[ETA: after talking to Alex, we agreed that this is a problem. He suggested that allowing a small margin would clear up the likely cases, and I agree. I.e. you'd have a condition more like: the case with the AI activated must give at least 98% of the non-activated attainable utility for all counterfactual payoff functions.
So, for example, this should allow you the chance to do a long reflection, even if you're comparing to a baseline which hastily optimizes hard for a narrow target (which may mistakenly exclude the true payoff function). Under likely conditions, thinking for a few thousand years may lose quite a bit of attainable utility, but not 2%.
Ideally, I'd like a function which tells us the required margin on a set S for some notion of "well-behaved" non-obstruction, but I don't yet see how to do this. (not too important whether it'd return 0.01% or 0.001% margin, but where it'd return 80%, something strange would be happening)]
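To make the margin condition concrete, here's a minimal sketch over a finite set of payoff functions. Everything in it is an assumption for illustration: the function name `is_non_obstructive`, the dictionary representation of attainable utility, and the numbers are mine, not Alex's. It also assumes attainable utilities are non-negative, since a multiplicative margin behaves oddly otherwise.

```python
def is_non_obstructive(au_on, au_off, margin=0.02):
    """Check the margin condition over a finite set of payoff functions.

    au_on / au_off map each payoff function (hypothetical labels) to the
    attainable utility with the AI activated / not activated.
    Assumes non-negative utilities.
    """
    return all(au_on[p] >= (1.0 - margin) * au_off[p] for p in au_off)


# Made-up numbers: the baseline optimises hard for a narrow target,
# while a long reflection gives up a little on that target.
au_off = {"narrow_target": 100.0, "true_payoff": 10.0}
au_long_reflection = {"narrow_target": 99.0, "true_payoff": 60.0}

strict = is_non_obstructive(au_long_reflection, au_off, margin=0.0)
with_margin = is_non_obstructive(au_long_reflection, au_off, margin=0.02)
print(strict, with_margin)
```

With these numbers the long reflection fails the strict condition (it loses 1% on the narrow target) but passes once a 2% margin is allowed.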
Summary of my claims:
1. Non-obstruction seems to be useful where our AU landscape is pretty flat by default.
2. Our AU landscape is probably spikey by default.
3. Non-obstruction locks in default spike-tops in S, since it can only make Pareto improvements. (modulo an epsilon here or there)
4. Locking in spike-tops is better than nothing, but we can, and should, do better.
Alex makes a good case for claim 1.
My argument for claim 2 is that if our AU landscape were flat, we wouldn’t be worried:
Our assumption isn’t that, by default, we remain equally poor at optimising for any goal. Rather it’s that we expect to get very powerful at optimising, but may catastrophically fail to aim at the right target, or to hit our intended target. So we may optimise extremely well for some narrow set of goals, but not elsewhere.
I.e. we’re worried that our AU landscape is spikey by default.
Perhaps there’s some flat baseline we could use for comparison, but “the outcome when system X is off” is not it. Almost all paths lead to high optimisation capability (if they don't lead to disaster first); most optimisation leads to spikes.
If all the default spikes fall outside the scope of human goals (e.g. paperclip spikes), then non-obstruction is probably only useful in ruling out hell-worlds - which is good, as far as it goes.
If at least one default spike falls within the scope of human goals (i.e. within Alex’s set S), then any “non-obstructive” AI must adopt a policy which will preserve that spike. The AI must predict that if it follows its policy, and we ours, we wouldn't ever sacrifice the spike-top option for a broader benefit over many other potential goals (if we ever would, then it’d have to let us, or fail to satisfy non-obstruction at that moment).
To be clear, this doesn’t look like a policy which shows us our mistake and allows us to avoid it with a broader, smoother optimisation (that would count as obstructive with respect to the spike-top goal). It looks like a policy that reliably pushes us towards making the same mistake of extreme, narrow optimisation we’d make by default, only with a somewhat more broadly desirable outcome.
This may include taking intentionally irreversible-for-us actions to lock in outcomes that will look undesirable to us in retrospect. These would need to be taken early enough that we’d still be on course to make our default mistake: this way there’s no moment when the AI’s being off actually helps us for any goal.
Here’s a picture of what happens when our default don’t-activate-the-AI outcome is that we (perhaps mistakenly) optimise the world hard for some overly narrow goal P within S. Let’s assume that P here is: “The 80s hairstyle enthusiast’s utopia” (causes undesirable spikiness, gets locked in place, highly regrettable in retrospect). [ETA: this graph doesn't show the more probable case, where there are much narrower spikes over small subsets of S; a single spike over all of S isn't at all likely]
Here our "non-obstructive" AI robustly improves upon default outcomes, but this is neither what we want, nor is it non-obstructive in everyday terms. While we're still unknowingly heading for the green line, it'll take action to lock in the red line and prevent our getting the blue line. It'll do this precisely because it knows we'd prefer the blue line, and it's constrained never to allow a default spike-top to be sacrificed.
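A toy version of the lock-in argument, as a sketch only: the goal labels, the three outcomes and all the numbers are made up, and I'm treating S as a finite set so that "Pareto improvement over the default" is a simple per-goal comparison.

```python
def pareto_improves(candidate, baseline):
    """True if candidate is at least as good as baseline for every goal."""
    return all(candidate[g] >= baseline[g] for g in baseline)


# Hypothetical AU landscapes over four goals in S.
# default:   what happens with the AI off (a spike at one narrow goal).
# locked_in: keeps the spike-top, slightly better everywhere else.
# broad:     the outcome we'd prefer in retrospect, but lower at the spike.
default = {"hairstyle_utopia": 100, "goal_a": 5, "goal_b": 5, "goal_c": 5}
locked_in = {"hairstyle_utopia": 100, "goal_a": 15, "goal_b": 15, "goal_c": 15}
broad = {"hairstyle_utopia": 60, "goal_a": 80, "goal_b": 80, "goal_c": 80}

allowed = {
    "locked_in": pareto_improves(locked_in, default),
    "broad": pareto_improves(broad, default),
}
print(allowed)
```

Under these assumptions only the outcome that preserves the spike-top counts as non-obstructive; the broad outcome is ruled out purely because it gives up some attainable utility for the spike goal.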
A few thoughts:
- Perhaps if the system is capable of identifying and adapting its policy to pre-existing AU spikes, it can simply tell us about the spikes and what we’d need to do to avoid them.
  - This amounts to “We could ask our system how to solve the alignment problem”. If it can, that's great - but this is not news.
- Perhaps we can use some baseline other than “what would happen without this system”.
  - Nothing natural springs to mind, but maybe.
- Perhaps it’s better to use some kind of weighted integral over (attainable) utility difference on S, rather than requiring only Pareto improvement.
  - This seems more promising to me. Pareto improvement just isn’t what I want, given that we may be starting from a spikey baseline. [ETA: I no longer think this makes sense as a non-obstruction alternative; it's no longer aiming at something like corrigibility]
Our default AU landscape is likely to be spikey.
If our goal is to fight spikiness, it’s not enough for a system to avoid creating spikes - it needs to deal well with any spikes that happen by default.
Pareto improvement doesn’t deal well with existing spikes within S: it can lead to an AI that intentionally takes irreversible-by-us actions it knows we’ll wish it hadn’t.
Non-obstruction does not capture all the important benefits of corrigibility in such cases. Robust improvement on default outcomes is not sufficient to get what we want.
If our S is well chosen, maximising an integral seems preferable.
[ETA: I now think this misses the point - it looks nothing like corrigibility]
The most straightforward approach seems to be:
Integrate AU impact of our AI over all P in S, weighted by probability density at P
Presumably this isn't a revelation to anyone, and I'm not clear that it's always desirable.
I do think it beats non-obstruction.
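For a finite S, the integral becomes a weighted sum. Here's a rough sketch with made-up numbers; the function name `weighted_au_impact`, the uniform weights and the outcomes are all assumptions of mine, and (per the ETA above) this is an illustration of the calculation rather than a proposal for corrigibility.

```python
def weighted_au_impact(au_on, au_off, weights):
    """Discrete stand-in for the integral: sum over P in S of
    weight(P) * (AU with the AI on - AU with the AI off).

    Assumes weights are non-negative and sum to 1.
    """
    return sum(weights[p] * (au_on[p] - au_off[p]) for p in weights)


# Hypothetical AU landscapes over four goals in S, uniform weights.
weights = {"hairstyle_utopia": 0.25, "goal_a": 0.25, "goal_b": 0.25, "goal_c": 0.25}
default = {"hairstyle_utopia": 100, "goal_a": 5, "goal_b": 5, "goal_c": 5}
locked_in = {"hairstyle_utopia": 100, "goal_a": 15, "goal_b": 15, "goal_c": 15}
broad = {"hairstyle_utopia": 60, "goal_a": 80, "goal_b": 80, "goal_c": 80}

impact_locked_in = weighted_au_impact(locked_in, default, weights)
impact_broad = weighted_au_impact(broad, default, weights)
print(impact_locked_in, impact_broad)  # 7.5 46.25
```

With these numbers the broad outcome scores much higher than the spike-preserving one, even though it isn't a Pareto improvement on the default. How the result changes with the choice of weights is exactly the part I'm unsure about.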
Do please enlighten me if I'm confused, stating-the-obvious, missing something...