epistemic status: Working notes of three different people on the same question, likely useless/incomprehensible to anyone else

The question

How to find the right abstraction level of human values

Problems in over- or underfitting human values:

We can learn human values by observing human actions and distilling them into a preference relation. This learned preference relation can overfit human values (e.g.: humans want to raise their left arm by 2 cm on 2022-05-07 if they're in some specific place) or it can underfit human values (e.g.: humans care only about maximizing money). If our preference relation overfits, we expect not to find some known biases in it, e.g. the Allais Paradox. There are also inconsistencies that are "too abstract" and ones that are "too concrete":

  • Too abstract: If I have three cities A, B, and C, and I traveled A→B, B→C, and C→A, then one might conclude that I have an inconsistency (a preference cycle A ≻ B ≻ C ≻ A), but in reality I made the travels A→B→C and C→A.
  • Too concrete: (?) If I hugged my friend at location l₁ and time t₁, but not at l₁ and t₂, and the information about time was disregarded, we might conclude that I both prefer hugging to not hugging and not hugging to hugging, which is inconsistent, but in reality I pretty much always want to hug my friend, regardless of time and place.

For a set of worlds W, the learned preference relation represents which world is preferred to another (this generates a graph which can be any graph, with cycles and disconnected components). If we overfit human values, we assume they're way more rational than they actually are; if we underfit them, we assume they're way less rational than they actually are. So there is a spectrum over the complexity of the learned preference relation: from overfitting/complexity/rationality/concreteness to underfitting/simplicity/irrationality/abstraction.
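The graph view can be made concrete. A minimal sketch (all names and helper functions here are illustrative, not anything defined in the post): a learned preference relation represented as a directed graph, checked for cycles, which are the "shape" of one kind of inconsistency.

```python
# A learned preference relation over a set of worlds, represented as a
# directed graph: an edge (u, v) means "u is preferred to v". Nothing
# constrains this graph -- it may contain cycles (inconsistencies) or
# disconnected components (incomparable worlds).

def has_cycle(prefs):
    """Detect a preference cycle via depth-first search over the edge list."""
    graph = {}
    for better, worse in prefs:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on stack / done
    color = {node: WHITE for node in graph}

    def dfs(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:        # back edge: preference cycle found
                return True
            if color[nxt] == WHITE and dfs(nxt):
                return True
        color[node] = BLACK
        return False

    return any(dfs(n) for n in graph if color[n] == WHITE)

# Reading separate trips as pairwise preferences at too abstract a level
# produces an apparent cycle:
trips = [("A", "B"), ("B", "C"), ("C", "A")]
print(has_cycle(trips))                            # True: looks irrational
print(has_cycle([("A", "B"), ("B", "C")]))         # False: consistent
```

The point of the sketch is only that a learned relation need not be an ordering at all; deciding which cycles are "real" irrationality is exactly the abstraction-level question.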

Cognitive biases as a lower bound for finding the right abstraction level of human preferences

Behavior commonly considered irrational can give pointers towards which abstraction level to use. The literature on cognitive biases gives a lower bound on the abstraction level of human values. For example, scope neglect is only a bias if we consider the preference "Save the most birds possible" to be our actual preference, and not "Get the most pleasure possible out of saving birds for the least cost of saving them". A learner having inferred the first preference will judge humans as exhibiting irrational behavior when scope neglect applies, whereas another learner having inferred the second preference will judge humans as perfectly rational in this respect. Therefore, the fact that scope neglect is widely recognized as a cognitive bias means our true preference is the first one, even though it fits the observations of humans less well. We think of scope neglect as a bug of the hardware we run on, not an intrinsic part of our preferences.
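This can be made quantitative with the often-quoted willingness-to-pay figures for saving migratory birds (roughly $80, $78, and $88 for 2,000, 20,000, and 200,000 birds); everything else below is an assumption for illustration. A learner assuming the preference "save the most birds" (value linear in birds saved) finds the behavior wildly irrational, while a learner assuming a scope-insensitive "warm glow" preference finds it nearly rational:

```python
# Observed willingness to pay (in dollars) for saving N birds; the numbers
# are the classic reported figures, used here purely as an illustration.
birds = [2_000, 20_000, 200_000]
wtp = [80.0, 78.0, 88.0]

def fit_error(predict):
    """Least-squares error of the one-parameter model  wtp ≈ k * predict(birds),
    with the scale k chosen optimally for the given model."""
    xs = [predict(b) for b in birds]
    k = sum(x * y for x, y in zip(xs, wtp)) / sum(x * x for x in xs)
    return sum((k * x - y) ** 2 for x, y in zip(xs, wtp))

linear_err = fit_error(lambda b: b)        # "save the most birds possible"
warm_glow_err = fit_error(lambda b: 1.0)   # scope-insensitive "warm glow"

print(f"linear model error:    {linear_err:.1f}")
print(f"warm-glow model error: {warm_glow_err:.1f}")
# The warm-glow model fits the observed behavior far better, yet we call the
# *linear* preference our real value and label the behavior a bias.
```

This is the sense in which recognized biases push the right abstraction level upward: the preference we endorse fits the data worse than the one we disown.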

The intuition is that we don’t think of our values as containing the cognitive biases, because we model our values at a higher abstraction level than the actual brain implementation.

Take the HTTP protocol as an analogy. What we would call the actual protocol is the one described in detail in the RFC, even though it may contain some underspecified parts, or even some internal inconsistencies (i.e. the protocol is irrational). We would not call an HTTP client the HTTP protocol, only a mere implementation of it, even though the client is fully deterministic and specifies far more precisely the actual communication taking place. Any discrepancy between the specification and the implementation is called a bug. In the same way, human values are more of a specification, and the brain an implementation. Human values are sometimes inconsistent and underspecified, but the implementation in the brain is completely deterministic and specifies far more of our actual behavior. Therefore, cognitive biases are bugs in the brain's implementation of our actual values, not part of the values themselves.
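The spec-vs-implementation distinction can be rendered as a toy (everything here is a made-up illustration): a specification is a property that may leave behavior underspecified, while an implementation is fully deterministic yet can still be wrong relative to the spec. The discrepancy is a bug, not part of the spec.

```python
def spec_abs(x, result):
    """Specification of absolute value: a property the result should satisfy.
    It says nothing about *how* the result is computed."""
    return result >= 0 and result in (x, -x)

def impl_abs(x):
    """Implementation: deterministic and fully precise, but buggy --
    it mishandles x == -1 (a deliberate, illustrative bug)."""
    return x if x >= -1 else -x

# Discrepancies between implementation and specification are bugs:
bugs = [x for x in range(-3, 4) if not spec_abs(x, impl_abs(x))]
print(bugs)  # [-1]
```

On this analogy, pointing at the implementation's behavior at x == -1 tells us nothing about what absolute value "really is"; likewise pointing at a cognitive bias tells us about the brain, not the values.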

Therefore, we postulate that cognitive biases, as examples of irrational human behavior, appear only when the estimated abstraction level of human behavior is above a certain threshold. Using this knowledge, we can devise an algorithm to find a lower bound on the abstraction level of human values.

If we have a set B of known biases/inconsistent preferences, and a learner L that learns preferences from looking at the data at a given level of abstraction, we can have some confidence that we won't underfit by selecting only preference relations that contain the known biases (where we potentially don't need to know the things the bias is about, just the "shape" of the inconsistency it creates). While learning, we might want to execute the following pseudocode (given a set Q of state-action pairs):

def lcpf(Q):
    # scan from the most abstract level (1) down to the most concrete (0)
    for abstraction in levels(1, 0):
        P = L(Q, abstraction)
        if B ⊆ P:
            return P

def hcpf(Q):
    # scan from the most concrete level (0) up to the most abstract (1)
    for abstraction in levels(0, 1):
        P = L(Q, abstraction)
        if B ⊆ P:
            return P

The function lcpf returns preferences that are least complex, but still fit the known irrationalities perfectly. (One would expect that as we increase the size of Q, the abstraction at which P is returned converges towards 0, if just for quantum noise.) We can call this level of abstraction the "lowest complexity perfect fit". We probably don't want to stop at the LCPF, but somewhere to the simpler side of it, maybe using known human values as an indicator. The function hcpf returns preferences that are as complex as possible, but still fit the known biases. The actual human inconsistent preferences lie somewhere between the two.
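To make the search concrete, here is a runnable toy version under heavy, explicitly artificial assumptions: the learner L simply reveals more of a fixed ground-truth preference set as abstraction decreases (more concrete = more pairs), the abstraction grid is discrete, and the functions return the abstraction level alongside P for illustration. None of these choices come from the post.

```python
# Known bias: an intransitive cycle among worlds a, b, c -- the "shape"
# of an inconsistency we expect a non-underfit relation to contain.
B = {("a", "b"), ("b", "c"), ("c", "a")}

# Toy ground truth at full concreteness: the bias cycle plus extra detail.
GROUND_TRUTH = B | {("a", "d"), ("d", "e"), ("e", "f")}

def L(Q, abstraction):
    """Toy learner: higher abstraction -> simpler (smaller) relation."""
    n_pairs = round((1 - abstraction) * len(GROUND_TRUTH))
    return set(sorted(GROUND_TRUTH)[:n_pairs])

LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]   # 0 = most concrete, 1 = most abstract

def lcpf(Q):
    # least complex relation that still contains the known biases
    for abstraction in reversed(LEVELS):     # most abstract first
        P = L(Q, abstraction)
        if B <= P:
            return abstraction, P
    return None

def hcpf(Q):
    # most complex relation that contains the known biases
    for abstraction in LEVELS:               # most concrete first
        P = L(Q, abstraction)
        if B <= P:
            return abstraction, P
    return None

a_lo, P_lo = lcpf(None)   # Q is unused by this toy learner
a_hi, P_hi = hcpf(None)
print(a_lo, len(P_lo))    # 0.25 4
print(a_hi, len(P_hi))    # 0.0 6
```

In this toy, the true preferences sit somewhere in the band between the 4-pair relation found by lcpf and the 6-pair relation found by hcpf, which is the bracketing the text describes.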

Open Questions

  • Are there incomparable values?
  • After inferring some values, does observing a seemingly irrational behavior mean humans are irrational in this respect, or that the values were incorrectly inferred?
  • When should irrationality be explained away or resolved?
  • There are some truly irrational behaviors that we want to ignore (fear of heights?), and there is some randomness in action, e.g. at one time I hug a friend, at another I don't, but in both cases I was very close to my hugging threshold and there is no "real preference" that needs to be explained. How do we distinguish between those?
  • We can guarantee that we will at most overfit the preferences; can we also guarantee to at most underfit them by filtering for known consistent preferences? (Here we might run into the problem of how to identify that they are present: do they have an "identifying shape"?)
  • How can we do this for not just preferences over pure options, but over lotteries as well?
