The term "simulation" often describes a computer program imitating a physical process. In this post, I use the term "reverse simulation" for the opposite scenario: one in which a computer program is simulated using physics.
The approach I am suggesting here is meant to address one problem of AI alignment that belongs to the philosophy of mind:
How can we prevent AGI from accidentally causing the suffering of minds when we do not yet have an objective theory of minds?
It is not evident to me that the suffering of minds can be assigned a utility. Of course, assuming one could assign such a utility, there might be a correct decision theory built on top of it. However, if the suffering of minds cannot be accurately described, then any decision theory inherits a philosophical unsoundness from that inaccurate description.
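To make this concrete, here is a toy sketch of my own (not from any established theory): two hypothetical, equally plausible "descriptions" assign different suffering utilities to the same actions, and the same decision procedure then recommends opposite actions. The names and numbers are invented purely for illustration.

```python
# Two hypothetical descriptions of how much suffering (negative utility)
# each action causes. Both are assumptions for this toy example.
description_a = {"deploy": -10.0, "wait": -2.0}
description_b = {"deploy": -1.0, "wait": -5.0}

def best_action(utilities):
    """Pick the action with the highest (least negative) utility."""
    return max(utilities, key=utilities.get)

# The same decision procedure recommends opposite actions under the two
# descriptions, so its soundness depends entirely on which description
# of suffering is accurate.
print(best_action(description_a))  # "wait" under description A
print(best_action(description_b))  # "deploy" under description B
```

The point of the sketch is only that the optimal action is not robust to the choice of description: if we cannot say which description is accurate, the decision theory cannot tell us which recommendation is sound.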
The basic idea is to use two steps for aligning AGI:
In this approach, I assume AGI can be used safely to carry out steps 1 and 2.
However, due to the lack of an objective theory of minds, I have no proof that this approach is safe. The strongest safety property I believe is provable, assuming a sufficiently strong theorem prover is built in the future, is that the AGI is aligned with a theory of an environment that is relatively safe with respect to unintentional suffering of minds.