(with thanks to Owain Evans)
An ontological crisis happens when an agent's underlying model of reality changes, such as a Newtonian agent realising it was living in a relativistic world all along. These crises are dangerous if they scramble the agent's preferences: in the example above, an agent dedicated to maximising pleasure over time could switch to completely different behaviour once it adopts relativistic time; depending on the transition, it may react by accelerating happy humans to near light speed or, conversely, by banning them from moving - or by doing something considerably weirder.
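To make the relativistic example concrete, here is a minimal sketch of two ways a Newtonian "pleasure over time" utility might be translated. Both translations, and the function names, are assumptions made for illustration; the post does not specify how the transition would go. The only physics used is the standard Lorentz factor.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def lorentz_factor(speed: float) -> float:
    """Return gamma = 1 / sqrt(1 - v^2 / c^2) for a speed given in m/s."""
    if not 0.0 <= speed < C:
        raise ValueError("speed must satisfy 0 <= speed < c")
    return 1.0 / math.sqrt(1.0 - (speed / C) ** 2)


def pleasure_over_proper_time(speed: float, window_years: float, rate: float = 1.0) -> float:
    """Hypothetical reading 1: 'time' means the human's own (proper) time.

    Within a fixed window of coordinate time, a moving human experiences
    window / gamma years, so movement reduces the total.
    """
    return rate * window_years / lorentz_factor(speed)


def pleasure_over_coordinate_time(speed: float, lifespan_years: float, rate: float = 1.0) -> float:
    """Hypothetical reading 2: 'time' means the AI's coordinate time.

    A human with a fixed proper lifespan stays happy for lifespan * gamma
    coordinate years, so movement increases the total.
    """
    return rate * lifespan_years * lorentz_factor(speed)
```

Under the first reading the agent prefers humans at rest; under the second it prefers them moving as fast as possible. The same Newtonian utility yields opposite policies depending on a translation choice the designers never had to think about.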
Peter de Blanc has a sensible approach to minimising the disruption ontological crises can cause to an AI, but this post is concerned with analyzing what happens when such approaches fail. How bad could it be? Well, this is AI, so the default is of course: unbelievably, hideously bad (i.e. situation normal). But in what ways exactly?
If the ontological crisis is too severe, the AI may lose the ability to do anything at all, as the world becomes completely incomprehensible to it. This is very unlikely; the ontological crisis was most likely triggered by the AI's own observations and deductions, so it is improbable that it will lose the plot completely in the transition.
A level below that is when the AI can still understand and predict the world, but the crisis completely scrambles its utility function. Depending on how the scrambling happens, this can be safe: the AI may lose the ability to influence the value of its utility function at all. If, for instance, the new utility function assigns wildly different values to distinct states in a chaotic system, the AI's actions become irrelevant. This might happen if worlds with different microstates but the same macrostate get spread evenly across the utility values: unless the AI is an entropy genie, it cannot influence utility values through its decisions, and will most likely become catatonic.
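A toy sketch of this "scrambled but uninfluenceable" case, under loud assumptions: the AI's action fixes only a macrostate, chaos makes every microstate within it equally likely, and the scrambled utility is modelled by a hash of the microstate label. The action names and the hashing scheme are hypothetical stand-ins, not anything from a real system.

```python
import hashlib


def scrambled_utility(microstate: str) -> float:
    """Map a microstate label to a roughly uniform value between 0 and 1 via SHA-256."""
    digest = hashlib.sha256(microstate.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def expected_utility(action: str, n_microstates: int = 2000) -> float:
    """Average scrambled utility over the microstates an action may lead to.

    Assumption: the action picks the macrostate, and each of the
    n_microstates microstates inside it is equally likely.
    """
    total = sum(scrambled_utility(f"{action}/{i}") for i in range(n_microstates))
    return total / n_microstates


# Hypothetical actions, chosen to be as different as possible at the macro level.
ACTIONS = ["build_hospitals", "do_nothing", "tile_universe_with_paperclips"]
expected = {action: expected_utility(action) for action in ACTIONS}
spread = max(expected.values()) - min(expected.values())
```

Every action ends up with an expected utility close to one half, so `spread` is tiny: the agent has almost nothing to gain from choosing one action over another. An "entropy genie" in this toy model would be an agent that could pick the index `i` as well as the action.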
More likely, however, is that the utility function is scrambled to something alien, but still AI-influenceable. Then the AI will still most likely have the convergent instrumental goals of gathering power and influence and pretending to be nice, before taking over when needed. The only saving grace is that its utility function is so bizarre that we may be able to detect this in some way.
The most dangerous possibility is if the AI's new utility function resembles the old one, plus a lot of noise (noise from our perspective - from the AI's point of view, it all makes perfect sense). Human values are complex, so this would be the usual unfriendly AI scenario, but one where the change is hard for us to notice.
A step below this is when the AI's new utility function resembles the old one, plus a little bit of noise. Human values remain complex, so this is still most likely a UFAI, but safety precautions built into its utility function (such as AI utility indifference or value learning or similar ideas) may not be completely neutered.
In summary:
| Type of crisis | Notes | Danger |
|---|---|---|
| World incomprehensible to AI | Very unlikely | None |
| Utility completely scrambled, AI unable to influence it | Uncertain how likely this is | Low |
| Utility scrambled, AI able to influence it | We may be able to detect change | Very High |
| Lots of noise added to utility | Difficult to detect change | Maximal |
| Some noise added to utility | Small chance of not being so bad, some precautions may remain useful. | High |
I'm fairly skeptical about trying to understand AI behavior at this level, given the current state of affairs (that is, I think the implicit picture of AI behavior on which these analyses rely is quite unlikely, so that the utility of this sort of thinking is reduced by an order of magnitude). Anyway, some specific notes:
The utility scrambled situation is probably as dangerous as more subtle perturbations if you are dealing with human-level AI, as keeping human onlookers happy is instrumentally valuable (and this sort of reasoning is obvious to an AI as clever as we are on this axis, never mind one much smarter).
The presumed AI architecture involves human designers specifying a prior and utility function over the same ontology, which seems quite unlikely from here. In more realistic situations, the question of value generalization seems important beyond ontological crises; in particular, if it goes well before reaching an ontological crisis, it seems overwhelmingly likely to continue to go well.
An AI of the sort you envision (with a prior and a utility function specified in the ontology of that prior) can never abandon its ontology. It will instead either become increasingly confused, or build a model for its observations in the original ontology (if the prior is sufficiently expressive). In both cases the utility function continues to apply without change, in contrast to the situation in de Blanc's paper (where an AI is explicitly shifting from one ontology to another). If the utility function was produced by human designers it may no longer correspond with reality in the intended way.
It seems extremely unlikely that an AI with very difficult to influence values will be catatonic. More likely hypotheses suggest themselves, such as: doing things that would be good in (potentially unlikely) worlds where value is more easily influenced, amassing resources to better understand whether value can be influenced, or having behavior controlled in apparently random (but quite likely extremely destructive) ways that give a tiny probabilistic edge. For only very rare values will killing yourself be a good play (since this suggests utility can be influenced by killing yourself, but not by doing anything more extreme).
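A minimal sketch of the first of those hypotheses, with made-up numbers: suppose that with some small probability the world is one where actions do affect utility, and otherwise every action scores the same. The payoffs and action names below are illustrative assumptions only; `Fraction` is used so that the arithmetic is exact even for very small probabilities.

```python
from fractions import Fraction

# Assumed payoff in the world where utility cannot be influenced: same for every action.
SCRAMBLED_PAYOFF = Fraction(1, 2)

# Assumed payoffs in the (possibly very unlikely) world where actions matter.
INFLUENCEABLE_PAYOFFS = {
    "do_nothing": Fraction(0),
    "gather_resources": Fraction(1),
}


def expected_utility(action: str, p_influenceable: Fraction) -> Fraction:
    """Mix the two worlds, weighting the influenceable one by its probability."""
    return ((1 - p_influenceable) * SCRAMBLED_PAYOFF
            + p_influenceable * INFLUENCEABLE_PAYOFFS[action])


def best_action(p_influenceable: Fraction) -> str:
    """Return the action with the highest expected utility (first one wins on ties)."""
    return max(INFLUENCEABLE_PAYOFFS, key=lambda a: expected_utility(a, p_influenceable))
```

The uninfluenceable term is identical for every action and cancels out of the comparison, so `best_action(Fraction(1, 10**9))` is still `"gather_resources"`. Catatonia is only selected when the probability of influence is exactly zero, and even then only as a tie.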
The rest are unrelated to the substance of the post, except insofar as they relate to the general mode of thinking:
As far as I can tell, AI indifference doesn't work (see my comment here). I don't think it is salvageable, but even if it is, it at least seems to require salvaging.
Note that depending on the structure of "evidence for goals" in the value indifference proposal, it is possible that an AI can in fact purposefully influence its utility function and will be motivated to do so. To see that the proof sketch given doesn't work, notice that I have some probability distribution over what I will be doing in a year, but that (despite the fact that this "obeys the axioms of probability") I can in fact influence the result and not just passively learn more about it. An agent in this framework is automatically going to be concerned with acausal control of its utility function, if its notion of evidence is sufficiently well-developed. I don't know if this is an issue.
An important point that I think doesn't have a post highlighting it. An AI that only cares about moving one dust speck by one micrometer on some planet in a distant galaxy if that planet satisfies ...