Previous Intro: Formal Metaethics and Metasemantics for AI Alignment

I’m nearing the completion of a hopefully much more readable version of the ideas previously released as set-theoretic code. This takes the form of a detailed outline, currently in WorkFlowy, in which you can easily expand/collapse subsections which elaborate on their parents’ content. You can find the current draft here.

Although it’s not polished, I’m releasing it in preparation for a Q&A I’ll be holding at the University of California Berkeley AI and Philosophy working group, which I hope you will attend. I’ll likely make some brief introductory remarks but reserve most of the time for answering questions. The working group is part of the UC Berkeley Social Science Matrix and will be held at:

Barrows Hall, 8th Floor, Mezzanine Level
Wed, Dec 4th 12:30-2:30pm
(only the first hour is reserved for this Q&A)


Here I’ve reproduced just the first few levels of the outline. Click here to see their elaboration (currently ~4,800 words).

  • Given mathematical models of the world and the adult human brains in it, an ethical goal function for AI can be constructed by applying a social welfare function to the set of extensional rational utility functions of the brains.
    • The mathematical model of a world or brain is to be given as a causal Markov model.
      • A causal Markov model is a convenient model for generating a causal model.
        • The notion of a causal model is taken directly from Judea Pearl.
          • A causal model is composed of:
        • A causal Markov model is composed of:
        • A causal Markov model (cmm) generates a causal model (cm) as follows:
    • A brain’s rational utility function is the utility function that would be arrived at by the brain’s decision algorithm if it were to make more optimal decisions while avoiding unrelated distortions of value.
      • A brain’s decision algorithm is the one that best satisfies these desiderata:
        • First, it must take the mathematical form of a decision algorithm, which is a tuple composed of:
        • Next, there must be an implementation function which maps brain states to decision states such that these two routes from a brain state to a decision event always arrive at the same result:
        • It achieves a high rate of compression of the brain’s causal transition function.
        • It is probabilistically coherent, including with its represented causal models.
        • It is instrumentally rational in both its first-order and higher-order utility functions.
        • It is ambitious, trying to explain as much as possible with the decision algorithm.
      • The final formulation specifying the rational utility function gets rather complicated but we can build up to it with a couple initial approximations:
      • Final specification: Simulate all possible continuations of an agent and apply a social welfare function to their utility functions while weighting them by optimality of prescriptions, agential identity and likelihood.
      • The advantages of this metaethics include:
    • Extension: The rational utility function of a brain above is couched in terms of the brain’s own represented expressions, but for interpersonal comparisons, we first cash them out extensionally in terms of their referents in the world.
    • The social welfare function might be thought of as choosing a center of gravity between the extensional rational utility functions.
    • The details above form an initial prototype.

Read the full version here.

New to LessWrong?

New Comment
6 comments, sorted by Click to highlight new comments since: Today at 7:09 PM
A brain’s rational utility function is the utility function that would be arrived at by the brain’s decision algorithm if it were to make more optimal decisions while avoiding unrelated distortions of value.

By what mechanism do you think we can assess how unrelated and how much distortion of value is happening? Put another way, what are "values" in this model such that they are are separate from the utility function and how could you measure whether or not the utility function is better optimizing for those values?

Here, the optimal decisions would be the higher-order outputs which maximize higher-order utility. They are decisions about what to value or how to decide rather than about what to do.

To capture rational values, we are trying to focus on the changes to values that flow out of satisfying one’s higher-order decision criteria. By unrelated distortions of value, I pretty much mean changes in value from any other causes, e.g. from noise, biases, or mere associations.

In the code and outline I call the lack of distortion Agential Identity (similar to personal identity). I had previously tried to just extract the criteria out of the brain and directly operate on them. But now, I think the brain is sufficiently messy that we can only simulate many continuations and aggregate them. That opens up a lot of potential to stray far from the original state. This Agential Identity helps ensure we’re uncovering your dispositions rather than that of a stranger or a funhouse mirror distortion.

Here, the optimal decisions would be the higher-order outputs which maximize higher-order utility. They are decisions about what to value or how to decide rather than about what to do.

What constitutes utility here, then? For example, some might say utility is grounded in happiness or meaning, in economics we often measure utility in money, and I've been thinking along the lines of grounding utility (through value) in minimization of prediction error. It's fine that you are concerned with higher-order processes (I'm assuming by that you mean processes about processes, like higher-order outputs is outputs about outputs, higher-order utility is utility about utility), and maybe you are primarily concerned with abstractions that let you ignore these details, but then it must still be that those abstractions can be embodied in specifics at some point or else they are abstractions that don't describe reality well. After all, meta-values/preferences/utility functions are still values/preferences/utility functions.

To capture rational values, we are trying to focus on the changes to values that flow out of satisfying one’s higher-order decision criteria. By unrelated distortions of value, I pretty much mean changes in value from any other causes, e.g. from noise, biases, or mere associations.

How do you distinguish whether something is a distortion or not? You point to some things that you consider distortions, but I'm still unclear on the criteria by which you know distortions from the rational values you are looking for. One person's bias may be another person's taste. I realize some of this may depend on how you identify higher-order processes, but even if that's the case we're still left with the question as it applies to those directly, i.e. is some particular higher-order decision criterion a distortion or rational?

In the code and outline I call the lack of distortion Agential Identity (similar to personal identity). I had previously tried to just extract the criteria out of the brain and directly operate on them. But now, I think the brain is sufficiently messy that we can only simulate many continuations and aggregate them. That opens up a lot of potential to stray far from the original state. This Agential Identity helps ensure we’re uncovering your dispositions rather than that of a stranger or a funhouse mirror distortion.

This seems strange to me, because much of what makes a person unique lies in their distortions (speaking loosely here), not in their lack. Normally when we think of distortions they are taking an agent away from a universal perfected norm, and that universal norm would ideally be the same for all agents if it weren't for distortions. What leads you to think there are some personal dispositions that are not distortions and not universal because they are caused by the shared rationality norm?

Officially, my research is metaethical. I tell the AI how to identify someone’s higher-order utility functions but remain neutral on what those actually are in humans. Unofficially, I suspect they amount to some specification of reflective equilibrium and prescribe changing one’s values to be more in line with that equilibrium.

On distortion, I’m not sure what else to say but repeat myself. Distortions are just changes in value not governed by satisfying higher-order decision criteria. The examples I gave are not part of the specification, they’re just things I expect to be included.

Distortion is also not meant to specify all irrationality or nonoptimality. It’s just a corrective to a necessary part of the parliamentary procedure. We must simulate the brain’s continuation in some specific circumstance or other and that brings its own influences. So, I wouldn’t call a higher-order criterion a distortion even if it gets rejected. It’s more like a prima facie reason that gets overruled. In any case, we can evaluate such criteria as rational or not but we’d be doing so by invoking some (other unless reflective) higher-order criteria.

For the most part, I don’t believe in norms universal to all agents. Given our shared evolutionary history, I expect significant overlap among humans but that there’d also be some subtle differences from development and the environment. It may also be worth mentioning that even with the same norm, we can preserve uniqueness if for instance, it takes one’s current state into consideration.

Your social welfare function produces a total preference ordering over outcomes, but not a mapping to real-numbered utilities. How do you decide between the median outcome and a 50% chance of either the best or worst? Do you plan to have the preference ordering be over probability distributions over outcomes?

The social welfare compromise varies with our ontology. If everyone agrees that two outcomes are the same, and therefore we represent them as the same outcome, it changes relative distances between preference orderings.

Nice catch. Yes, I think I’ll have to change the ordinal utility functions to range over lotteries rather than simply outcomes.

In this initial version, I am just assuming the ontology of the world is given, perhaps from just an oracle or the world model the AI has inferred.