I'm writing a paper about an idea I previously outlined for addressing false positives in AI alignment research. This is the first completed draft of one of the subsections, which argues for adopting a particular, necessary hinge proposition for reasoning about aligned AGI. I'd appreciate feedback on this subsection, especially on whether you agree with the line of reasoning and whether you think I've ignored anything important that should be addressed here. Thanks!
AGI alignment is typically phrased in terms of aligning AGI with human interests, but this hides some of the complexity of the problem behind determining what "human interests" are. Taking "interests" as a synonym for "values", we can begin to make some progress by treating alignment as at least partially the problem of teaching AGI human values (Soares, 2016). Unfortunately, what constitutes human values is currently unknown since humans may not be aware of the extent of their own values or may not hold reflexively consistent values (Scanlon, 2003). Further complicating matters, humans are not rational, so their values cannot be deduced from their behavior unless some normative assumptions are made (Tversky, 1969), (Armstrong and Mindermann, 2017). This is a special case of Hume's is-ought problem—that axiology cannot be inferred from ontology alone—and it complicates the problem of training AGI on human values (Hume, 1739).
Perhaps some of the difficulty could be circumvented if a few normative assumptions were made, like assuming that rational preferences are always better than irrational preferences or assuming that suffering supervenes on preference satisfaction. This poses an immediate problem for our false positive reduction strategy by introducing additional variables that will necessarily increase the chance of a false positive. Maybe we could avoid making any specific normative assumptions prior to the creation of aligned AGI by expecting the AGI to discover them via a process like Yudkowsky's coherent extrapolated volition (Yudkowsky, 2004). This may avoid the need to make as many assumptions, but still requires making at least one—that moral facts exist to permit the correct choice of normative assumptions—and reveals a deep philosophical problem at the heart of AGI alignment—meta-ethical uncertainty.
Meta-ethical uncertainty stems from epistemic circularity and the problem of the criterion: it is not possible to know the criteria by which to assess which moral facts are true, or even whether any moral facts exist, without first assuming to know what is good and true (Chisholm, 1982). We cannot hope to resolve meta-ethical uncertainty here, but we can at least decide what impact particular assumptions about the existence of moral facts have upon false positives in AGI alignment. Specifically, we can ask whether or not moral facts should be assumed to exist and, if they are, which moral facts should be assumed to be true.
On the one hand, suppose we assume that moral facts exist. Then we could build aligned AGI on the presupposition that it could at least discover moral facts, even if none were specified in advance, and then use knowledge of these facts to constrain its values such that they align with humanity's values. Now suppose this assumption is false and moral facts do not exist. Then our moral-facts-assuming AGI would either never discover any moral facts with which to constrain its values, or would constrain itself with arbitrary "moral facts" that are not guaranteed to produce value alignment with humanity.
On the other hand, suppose we assume that moral facts do not exist. Then we must build aligned AGI to reason about and align itself with the axiology of humanity in the absence of any normative assumptions, likely on a non-cognitivist basis like emotivism. Now suppose this assumption is false and moral facts do exist. Then our moral-facts-denying AGI would discover the existence of moral facts, at least implicitly, through their influence on the axiology of humanity, and would align itself with humanity as if it had started out assuming moral facts existed, but at the cost of solving the much harder problem of learning axiology without the use of normative assumptions.
Based on this analysis, it seems that assuming the existence of moral facts, let alone assuming any particular moral facts, is more likely to produce false positives than assuming moral facts do not exist, because denying the existence of moral facts gives up the pursuit of a class of alignment schemes that may fail, namely those that depend on the existence of moral facts. Doing so likely makes finding and implementing a successful alignment scheme harder, but it does this by replacing difficulty tied to uncertainty around a metaphysical question that may not be resolved in favor of alignment with uncertainty around implementation issues that, through sufficient effort, may be made to work. Barring a result showing that moral nihilism (the assumption that no moral facts exist) implies the impossibility of building aligned AGI, moral nihilism seems the best hinge proposition to hold in order to reduce false positives in AGI alignment due to meta-ethical uncertainty.
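The four cases in the argument above can be laid out as a small table. The following is only a sketch restating the draft's case analysis, not a model of it: the labels, the dictionary, and the function name are hypothetical, and the entries record only whether a scheme fails because its metaphysical assumption is false, setting aside how hard the scheme is to implement.

```python
# Hypothetical labels for the two hinge propositions / possible worlds.
FACTS_EXIST = "moral facts exist"
NO_FACTS = "no moral facts exist"

# (assumption built into the scheme, actual world) -> does the scheme fail
# *because* the assumption is false, regardless of implementation effort?
# Each entry restates one case from the text; none is an empirical estimate.
FAILS_ON_METAPHYSICS = {
    (FACTS_EXIST, FACTS_EXIST): False,  # facts are there to be discovered
    (FACTS_EXIST, NO_FACTS): True,      # nothing to discover, or arbitrary "facts"
    (NO_FACTS, FACTS_EXIST): False,     # facts picked up implicitly via human axiology
    (NO_FACTS, NO_FACTS): False,        # scheme never relied on moral facts
}


def metaphysical_failure_worlds(assumption):
    """Return the worlds in which a scheme built on `assumption` fails
    for metaphysical (not implementation) reasons."""
    return [
        world
        for world in (FACTS_EXIST, NO_FACTS)
        if FAILS_ON_METAPHYSICS[(assumption, world)]
    ]


for assumption in (FACTS_EXIST, NO_FACTS):
    print(assumption, "->", metaphysical_failure_worlds(assumption))
```

On this encoding, only the scheme that assumes moral facts exist has a world in which it fails for reasons that no amount of implementation effort can fix, which is the asymmetry the paragraph above relies on.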
Why "moral nihilism—the assumption that no moral facts exist—implies the impossibility of building aligned AGI"?
I personally think that alignment is impossible because, in short, "humans don't have values": they have complex, unstable behaviour, which can be simplified by using the idea of preferences, but there are no actual preferences.
These two claims seem similar: the non-existence of values and the non-existence of moral facts. But are they actually the same claim?
This is an interesting point to bring up, and maybe one I'll find a way to explore more.
You're right that there is a sense in which humans don't seem to have preferences or values because both are fuzzy categories that draw arbitrary distinctions. The best way to rescue them seems to be to think of values as thoughts that meet at least the following criteria:
There might be others but I think that's enough to let us carve up the space of things that are candidate behaviors such that we exclude the things that are not values and include the things that are. Also this gives a rather odd version of what preferences/values are because it eliminates the normative assumptions normally made when talking about preferences in order to better reflect the "unstable behavior" of humans. I still find value in talking about values, though, because it gives a way to distinguish between the interior experience of value and the exterior experience of acting.
On your second point I disagree for the reason just given: there is a sense in which we can talk about values that are non-normative and so avoid the need for moral facts, making the two claims not equivalent. I have, though, neglected in this section (and elsewhere in the paper) to address this take on what we mean by values and to make clear that I'm holding it separate from other discussions about values/preferences where normativity is sought or assumed.
I have started writing a longer text in which I will explore the idea of the non-existence of values. I expect to have a version ready to present in 2-3 months and will share it on LW, so we can then compare our views on this topic again. In it I will address 3 different ways in which we could learn a person's values: 1) their actions, 2) their emotions, 3) their external claims, which could differ from their thoughts about their values, so your idea may be a 4th level.
Great, I look forward to it! That's a topic I've been somewhat unwilling to tackle just now because I've identified what I'm working on now as crucial to other lines of thinking I want to explore but consider it crucial to how we're going to address problems we face in designing alignment schemes.
As you mentioned, no axiology can be inferred from ontology alone.
Even with meta-ethical uncertainty, if we want to build an agent that makes decisions and takes actions, it needs some initial axiology. If you include (P) "never consider anything as a moral fact" as part of your axiology, then two things might happen:
I see a problem here. If 1) holds, then the agent has considered (P) as a moral fact, which is absurd. If 2) holds, then your agent has lost the meta-ethical principle you wanted it to keep.
So maybe you wanted to put the meta-ethical uncertainty inside the ontology? If this is what you meant, that doesn't seem to solve the axiology problem.
It's not necessarily that we want an agent that never considers something to be a moral fact, but that we are safer if we design AGI under the assumption that moral facts don't exist. This does not necessarily mean precluding an AGI from coming to believe in the existence of moral facts, and it almost certainly doesn't mean giving the AGI a preference to not believe in moral facts. It means that alignment schemes should not depend on moral facts existing, and thus should not depend on the ability to make normative assumptions that will be compatible with alignment.