The term "simulation" usually describes a computer program that imitates a physical process. In this post, I use the term "reverse simulation" for the opposite scenario, in which a computer program is simulated using physics.

The approach I am suggesting here is meant to address one problem of AI alignment that belongs to the philosophy of mind:

How can we prevent AGI from accidentally causing suffering of minds, when we do not yet know an objective theory of minds?

It is not evident to me that suffering of minds can be assigned a utility. Of course, assuming one could assign a utility, there might be a correct decision theory. However, if suffering of minds cannot be accurately described, then any decision theory built on an inaccurate description inherits some philosophical unsoundness.

The basic idea is to use two steps for aligning AGI:

  1. Make the AGI produce a computer simulation of an environment with strong safety properties, whose side effects carry a low risk of producing suffering of minds
  2. Make the AGI run a reverse simulation of that computer program, in a context where interference with other side effects is low-risk

In this approach, I assume AGI can be used safely to carry out steps 1 and 2.

However, due to the lack of an objective theory of minds, I have no proof that this approach is safe. The strongest safety property I believe to be provable, assuming a very strong theorem prover is built in the future, is that the AGI is aligned with a theory of an environment that is relatively safe with regard to unintentional suffering of minds.
