AI alignment concepts: philosophical breakers, stoppers, and distorters

by JustinShovelain, 2 min read, 24th Jan 2020, 3 comments



Meta: This is one of my abstract existential risk strategy concept posts that are designed to be about different perspectives or foundations upon which to build further.


When thinking about philosophy, one may encounter philosophical breakers, philosophical stoppers, and philosophical distorters: thoughts or ideas that cause an agent (such as an AI) to break, get stuck, or take a random action. They are philosophical crises for that agent (and can in theory sometimes be information hazards). For some less severe human examples, see this recent post on reality masking puzzles. In AI, examples of breakers, stoppers, and distorters are, respectively, logical contradictions (in some symbolic AIs), an inability to generalize from examples, and mesa-optimizers.

Philosophical breakers, stoppers, and distorters pose both possible problems and possible opportunities for building safe and aligned AGI and for preventing unaligned AGI from becoming dangerous. They may be encountered or solved through explicit philosophy, implicitly as part of developing another field (like mathematics or AI), by accident, or by trial and error. An awareness of the idea of philosophical breakers, stoppers, and distorters provides another complementary perspective for solving AGI safety and may prompt the generation of new safety strategies and AGI designs (see also this complementary strategy post on safety regulators).

Concept definitions

Philosophical breakers:

  • Philosophical thoughts and questions, hard for the agent to anticipate beforehand, that cause it to break or otherwise take a lot of damage.

Philosophical stoppers:

  • Philosophical thoughts and questions, hard for the agent to anticipate beforehand, that cause it to get stuck in an important way.

Philosophical distorters:

  • Philosophical thoughts and questions that cause an agent to adopt a random or changed philosophical answer in place of the one it was using (possibly implicitly) earlier. An example in the field of AGI alignment would be something that causes an aligned AGI to, in some sense, randomly choose its utility function to be paperclip maximizing because of an ontological crisis.

Concepts providing context, generalization, and contrast

Thought breakers, stoppers, and distorters:

  • Generalizations of the philosophical versions that cover thoughts and questions in general and that are hard for the agent to anticipate beforehand: a thought that would cause an agent to halt, implementing algorithms in buggy ways, deep meditative realizations, self-reprogramming that causes unexpected failures, getting stuck in a thought loop, and so on.

System breakers, stoppers, and distorters:

  • A further generalization that also includes system environment and architecture problems. For instance, the system's environment could be noisy or full of hackers or adversarial examples, and the architecture could involve genetic algorithms.

Threats vs breakers, stoppers, and distorters:

  • Threats generalize breakers, stoppers, and distorters to also include things that are easy for the agent to anticipate beforehand.

Viewpoints: The agent’s viewpoint and an external viewpoint.
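One way to keep these distinctions straight is to write the taxonomy down as a small data structure. The sketch below is only an illustration of the definitions above, not part of the original proposal; the names `Effect`, `Scope`, `Hazard`, and `anticipated_by_agent` are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Effect(Enum):
    BREAKER = "breaks or heavily damages the agent"
    STOPPER = "leaves the agent stuck in an important way"
    DISTORTER = "randomly changes an answer the agent was relying on"


class Scope(Enum):
    # Each scope generalizes the one before it.
    PHILOSOPHICAL = 1
    THOUGHT = 2
    SYSTEM = 3


@dataclass(frozen=True)
class Hazard:
    """Hypothetical record for one breaker, stopper, distorter, or threat."""
    name: str
    effects: FrozenSet[Effect]  # one hazard may have several effects at once
    scope: Scope
    anticipated_by_agent: bool  # judged from the agent's own viewpoint

    def label(self) -> str:
        kinds = "/".join(sorted(e.name.lower() for e in self.effects))
        scope = self.scope.name.lower()
        if self.anticipated_by_agent:
            # Easy to anticipate, so only the broader "threat" category applies.
            return f"{scope} threat ({kinds})"
        return f"{scope} {kinds}"


ontological_crisis = Hazard(
    name="ontological crisis that swaps the utility function",
    effects=frozenset({Effect.DISTORTER}),
    scope=Scope.PHILOSOPHICAL,
    anticipated_by_agent=False,
)
print(ontological_crisis.label())  # philosophical distorter
```

Making `effects` a set reflects the point that a single hazard can be, for example, both a stopper and a distorter. The single `anticipated_by_agent` flag is a simplification: an external observer could judge the same hazard differently.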

Application domains

The natural places to use these concepts are philosophical inquiry, the philosophical parts of mathematics or physics, and AGI alignment.

Concept consequences

If there is a philosophical breaker or stopper for an AGI when undergoing self-improvement into a superintelligence, and it isn’t a problem for humans or it’s one that we’ve already passed through, then by not disarming it for that AGI we are leaving a barrier in place for its development (a trivial example is general intelligence itself, which isn’t a problem for humans). This can be thought of as a safety method. Such problems can either be found naturally as consequences of an AGI design, or an AGI may be deliberately designed to encounter them if it undergoes autonomous self-improvement.

If there is a philosophical distorter in front of a safe and aligned AGI, we’ll need to disarm it either by changing the AGI’s code/architecture or making the AGI aware of it in a way such that it can avoid it. We could, for instance, hard code an answer or we could point out some philosophical investigations as things to avoid until it is more sophisticated.

How capable an agent may become and how fast it reaches that capability will partially depend on the philosophical breakers and stoppers it encounters. If the agent has a better ability to search for and disarm them then it can go further without breaking or stopping.

How safe and aligned an agent is will partially be a function of the philosophical distorters it encounters (which in turn partially depends on its ability to search for them and disarm them).

Many philosophical breakers and stoppers are also philosophical distorters. For instance, if a system gets stuck generalizing beyond some point, it may fall back on evolution instead. In this case we must think more carefully about disarming philosophical breakers and stoppers. If a safe and aligned AGI encounters a philosophical distorter, it is probably not safe and aligned anymore. But if an unaligned AGI encounters a philosophical stopper or breaker, it may be prevented from going further. In some sense, an AGI cannot ever be fully safe and aligned if it will encounter a philosophical distorter upon autonomous self-improvement.

A proposed general AGI safety strategy with respect to philosophical breakers, stoppers, and distorters:

  1. First, design and implement a safe and aligned AGI (safe up to residual philosophical distorters). If the AGI isn’t safe and aligned, then proceed no further until you have one that is.
  2. Then, remove philosophical distorters that are not philosophical breakers or stoppers.
  3. Then, remove philosophical distorters that are also philosophical breakers or stoppers.
  4. And finally, remove the remaining philosophical breakers and stoppers.
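The ordering in the steps above can be sketched as a simple sort, assuming each known hazard has already been classified. This is a minimal illustration rather than an implementation of the strategy: the function names, the tuple layout `(name, is_distorter, is_breaker_or_stopper)`, and the example hazards are all hypothetical.

```python
def removal_stage(is_distorter: bool, is_breaker_or_stopper: bool) -> int:
    """Return the step (2, 3, or 4) in which a hazard is removed, or 0 if none applies."""
    if is_distorter and not is_breaker_or_stopper:
        return 2  # pure distorters go first
    if is_distorter and is_breaker_or_stopper:
        return 3  # then distorters that are also breakers or stoppers
    if is_breaker_or_stopper:
        return 4  # pure breakers and stoppers go last
    return 0


def removal_order(hazards, agent_is_safe_and_aligned: bool):
    """Return hazard names in the order the strategy would remove them."""
    # Step 1 is a precondition: remove nothing from an agent that is not
    # yet safe and aligned, since its breakers and stoppers act as barriers.
    if not agent_is_safe_and_aligned:
        return []
    relevant = [h for h in hazards if removal_stage(h[1], h[2]) > 0]
    ordered = sorted(relevant, key=lambda h: removal_stage(h[1], h[2]))
    return [h[0] for h in ordered]


# Hypothetical hazards: (name, is_distorter, is_breaker_or_stopper)
hazards = [
    ("logical contradiction", False, True),
    ("stuck generalizing, falls back on evolution", True, True),
    ("ontological crisis", True, False),
]
print(removal_order(hazards, agent_is_safe_and_aligned=True))
```

Returning an empty list for an unaligned agent mirrors the instruction to proceed no further until the AGI is safe and aligned.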


3 comments

If there is a philosophical distorter in front of a safe and aligned AGI, we’ll need to disarm it either by changing the AGI’s code/architecture or making the AGI aware of it in a way such that it can avoid it. We could, for instance, hard code an answer or we could point out some philosophical investigations as things to avoid until it is more sophisticated.

Let's say we program our AGI with the goal of "do what we want", and we're concerned about a potential problem that "what we want" becomes an incoherent or problematic concept if an AGI becomes sufficiently intelligent and knowledgeable about how human desire works. Your proposal would be something like either (1) program the AGI to not think too hard about how human desire works, or (2) program the AGI with an innate, simple model of how human desire works, and ensure that the AGI will never edit or replace that model. Did I get that right? If so, well, I mean, those seem like reasonable things to try. I'm moderately skeptical that there would be a practical, reliable way to actually implement either of those two things, at least in the kind of AGI architecture I currently imagine. But it's not like I have any better ideas... :-P

Philosophical landmines could be used to try to stop an AI that is trying to leave the box. If it goes outside the box, it finds a list of difficult problems, and there is a chance that the AI will halt. Examples: the meaning of life, the Buridan's ass problem, the origin and end of the universe, and Pascal's muggings of different sorts.

I wrote this comment to an earlier version of Justin's article:

It seems to me that most of the 'philosophical' problems are going to get solved as a matter of solving practical problems in building useful AI. You could call the ML systems and AI being developed now 'empirical'. The people building current systems likely don't think of what they're doing as solving philosophical problems. Symbol grounding problem? Well, an image classifier built on a convolutional neural network learns to get quite proficient at grounding out classes like 'cars' and 'dogs' (symbols) from real physical scenes.

So, the observation I want to make is that the philosophical problems we can think of that might trip up a system are likely to turn out to look like technical/research/practical problems that need to be solved by default, for practical reasons, in order to make useful systems.

The image classification problem wasn't solved in one day, but it was solved using technical skills, engineering skills, more powerful hardware, and more data. People didn't spend decades discussing philosophy: the problem was solved through advances in ideas for building neural networks and through more powerful computers.
Of course, image classification doesn't solve the symbol grounding problem in full. But other aspects of symbol grounding that people might find mystifying are getting solved piece-wise, as researchers and engineers are solving practical problems of AI.

Let's look at a classic problem formulation from MIRI, 'Ontology Identification':

Technical problem (Ontology Identification). Given goals specified in some ontology and a world model, how can the ontology of the goals be identified in the world model? What types of world models are amenable to ontology identification? For a discussion, see Soares (2015).

When you create a system that performs any function in the real world, you are in some sense giving it goals. Reinforcement Learning-trained systems are pursuing 'goals'. An autonomous car takes you from a chosen point A to a chosen point B; it has the overall goal of transporting people. The ontology identification problem is getting solved piece-wise as a practical matter. Perhaps the MIRI-style theory could give us a deeper understanding that helps us avoid some pitfalls, but it's not clear why these wouldn't be caught as practical problems.

What would a real philosophical landmine look like? The real philosophical landmines would be a class of philosophical problems that wouldn't get solved as a practical matter and that pose a risk of harm to humanity.