Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Thanks to feedback on my recent post I've been reconsidering some of my thinking about how to address meta-ethical uncertainty as it affects alignment. I've run into something unexpected, and I'm curious to see what others think. To quote the key paragraph from my current draft:

A further issue with assuming moral facts do not exist is that it's unclear how to resolve value conflicts in the absence of assumptions about value norms. Certainly any policy can be used to resolve value conflicts for any reason, but the choice of policy by an AGI we consider aligned would still count as a normative assumption about values in practice if not in principle to the extent the AGI is considered to be aligning its behavior with human interests. Thus we are forced to assume moral facts exist in some form if alignment is to be solvable because otherwise we would have no way of deciding how to resolve value conflicts and could not construct an AGI we would consider aligned.

This is not to say we must assume moral realism, cognitivism, or any other particular position about the nature of moral facts, merely that we must accept that moral facts exist: if they don't, alignment is impossible in principle, since an AGI will have no way to choose what to do when human values conflict. We could perhaps in practice build aligned AGI without doing this by just picking norms based on what someone preferred, but this seems to violate the spirit of the intention behind "alignment", which is meant to be for all moral agents and not just one particular human (to be fair, addressing just one human is hard enough!), and not just for that particular human's current preferences, which would not necessarily be endorsed upon reflective equilibrium (but the idea that preferences under reflective equilibrium are better is itself a normative assumption of the sort we would be hard pressed to make if we did not assume the existence of moral facts!).

Also note that all of this is made harder by not having a precise definition of alignment I can lean on. Yes, I wrote one, but it depends in part on results I'm considering formally in this paper so I can't use it.

I'm still thinking this through and updating on feedback I received on my first draft, but this seems like a significant enough departure from my earlier line of thinking to solicit additional feedback to consider issues I may be ignoring. Thanks!

Why do we need moral facts to tell us how our preferences should be extrapolated? Can’t we want our preferences to be extrapolated in a certain way?

I guess the relevant difference between “I want my preferences to be extrapolated in a certain way” and something more mundane, like “I want to drive to the store”, is that the latter is a more informed preference. But neither is perfectly informed, because driving to the store can have unintended effects too. So ideally we want to get to the point where launching an AI feels like a reasonably informed decision, like driving to the store. That will take a lot of research, but I’m pretty sure it doesn’t require the existence of moral facts, any more than driving to the store requires moral facts.

The trouble is that moral facts do not just mean realist notions of moral facts, but include anti-realist and non-cognitivist notions of moral propositions like "do what happens when you extrapolate my preferences in a certain way". In fact the closer I look the more it seems like moral facts are a different sort of confusion than the one I originally thought they were.

And yes, we may be able to build AGI we consider aligned without fully resolving issues of meta-ethical uncertainty (in fact, I'm counting on it, because meta-ethical uncertainty cannot be fully resolved!), but I'd just like to address your claim that driving to the store does not require moral facts. There is a weak sense in which it doesn't, because you can just drive to the store and do what you want, but likewise we can build AGI that just does what it wants without need of moral facts. I doubt we would call such an AGI aligned, though, because its behavior is not constrained by the wants of others.

We might say "well, let's just build an AGI that does what we want not what it wants", but then we have to decide what we want and, importantly, figure out what to do about places where what we want comes into conflict with itself. Because we're not trying to build AGI that is aligned with just a single human but with all of humanity (and maybe with all moral agents) we're forced to figure out how to resolve those conflicts in a way independent of any individual, and this is solidly in the realm of what people consider ethics.

Perhaps we can approach this in a new way, but we'd be foolish to fail to notice and appreciate the many ways people have already tried to address this, and these approaches are grounded in the notion of moral facts even if moral facts are a polymorphic category that is metaphysically quite different depending on your ethics.

Yeah, that makes sense. I think resolving any conflict is a judgment call one way or the other, and our judgment calls when launching the AI should be as informed as possible. Maybe some of them can be delegated to humans after the AI is launched, though preventing the AI from influencing them is a problem of its own.

>unclear how to resolve value conflicts in the absence of assumptions about value norms.

This post seems to resolve value conflicts, without assuming moral facts:

Would that be ok for what you have in mind?

Maybe. My guess is not, because you probably make a choice somewhere where the AI would resolve conflicts using its own norm confirming values, but thanks for the suggestion. I'm going to read through it carefully, both because my initial impression may be wrong and because, if I turn out to be right, identifying these sorts of things is exactly the sort of task I want to enable people to do with my present work.

So, having now reread Stuart's linked post, I'll say that for me the issue is not well resolved. It still assumes the presence of normative assumptions and even chooses some specific normative assumptions to make about how to resolve conflicts (although we could argue that the source of these is vague enough that they aren't fully the sort of normative assumptions we'd ultimately make and are more like proto-assumptions).

I'll also say I got a bit mixed up when I wrote this post and would now back off the argument that alignment seems to require moral facts. I think instead it only requires normative assumptions, which is still a problem for the same reason but at least separates it from the philosophical problem of moral facts, even if it leaves most of the practical problem of picking norms.

What exactly is "moral facts exist" supposed to mean? This whole approach smells off to me -- it looks like you're trying to manipulate your confusion as if it were a known quantity. What metaethical baggage is brought to the table by supposing that "do moral facts exist" is a coherent question at all (assuming you do mean something specific by it)?

Well if I'm totally honest I agree with you, but I'm using analytic methods here since it's what my intended audience prefers. I'm working around this by engaging deeply with epistemic uncertainty as it's approached within the analytic tradition, but I also think it smells right now and I haven't totally figured out how to make my intended cuts using the method.

What do you mean by moral facts? It sounds in context like "ways to determine which values to give precedence to in the event of a conflict." But such orders of precedence wouldn't be facts, they'd be preferences. And if they're preferences, why are you concerned that they might not exist?

Moral facts are true moral propositions, where we'd consider a proposition to be a moral proposition if it makes a claim about something that normally falls under the field of ethics. It's hard to be more precise without speculating about the nature of moral facts and picking up a particular system of ethics. For example, moral facts in an anti-realist, non-cognitivist sense may be something like reflexive relationships between moral agents (something like the golden rule) or even values an individual agent holds as true.

I agree we can talk about, say, an AGI having preferences for how it will resolve value conflicts, but I don't think we can leave the issue there because the AGI's preferences now take on the functional role of moral facts in its reasoning about conflicting human values. This does not mean the AGI is necessarily certain about these moral facts; it may change its preferences, so that the moral facts effectively change as it learns more.

Perhaps you've revealed some of the trouble, though, in that "moral facts" are too fuzzy a category for the sort of analysis I was attempting. My confusion results from a willingness to consider some things moral facts that are not, and I would be better served by more carefully engaging with the terminology.

I'm wondering what your argument is that insisting on the existence of moral facts is *not* a (self-)deceptive way of "picking norms based on what someone prefer[s]" in such a way as to make them appear objective, rather than arbitrary.

Even supposing moral facts do exist, it does not follow that humans can, would, or could know them, correct? Therefore, the actual implementation would still fall back on preferences.