Epistemic status: Working notes by three different people on the same question; likely useless or incomprehensible to anyone else
How to find the right abstraction level of human values
We can learn human values by observing human actions and distilling them into a preference relation. This learned preference relation can overfit human values (e.g. humans want to raise their left arm by 2 cm on 2022-05-07 if they’re in some specific place) or it can underfit human values (e.g. humans care only about maximizing money). If our preference relation overfits, we expect it to miss some known biases, e.g. the Allais paradox. There are also inconsistencies that are “too abstract” and inconsistencies that are “too concrete”:
For a set of worlds W, the learned preference relation P⊆W×W represents which world is preferred to which other world. This generates a directed graph G=(W,P), which can be any graph, including ones with cycles and disconnected components. If we overfit human values, we assume humans are far more rational than they actually are; if we underfit them, we assume humans are far less rational than they actually are. So there is a spectrum over the complexity of the learned preference relation: from overfitting/complexity/rationality/concreteness to underfitting/simplicity/irrationality/abstraction.
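The graph generated by P can be inspected directly. The following is a minimal sketch, assuming worlds are plain strings and P is a set of ordered pairs; the helper names `has_cycle` and `count_components` and the example worlds are made up for illustration.

```python
from typing import Dict, List, Set, Tuple

World = str
Edge = Tuple[World, World]  # (a, b): world a is preferred to world b


def has_cycle(worlds: Set[World], prefs: Set[Edge]) -> bool:
    """Detect a preference cycle with a depth-first search."""
    succ: Dict[World, List[World]] = {w: [] for w in worlds}
    for a, b in prefs:
        succ[a].append(b)
    state: Dict[World, int] = {w: 0 for w in worlds}  # 0 new, 1 on stack, 2 done

    def visit(w: World) -> bool:
        state[w] = 1
        for nxt in succ[w]:
            # An edge back to a world still on the stack closes a cycle.
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[w] = 2
        return False

    return any(state[w] == 0 and visit(w) for w in worlds)


def count_components(worlds: Set[World], prefs: Set[Edge]) -> int:
    """Count weakly connected components (edge direction ignored)."""
    neighbours: Dict[World, Set[World]] = {w: set() for w in worlds}
    for a, b in prefs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen: Set[World] = set()
    count = 0
    for w in worlds:
        if w in seen:
            continue
        count += 1
        stack = [w]
        while stack:
            cur = stack.pop()
            if cur not in seen:
                seen.add(cur)
                stack.extend(neighbours[cur] - seen)
    return count


# Toy relation: a > b > c > a is an inconsistency; d > e is unrelated to it.
W = {"a", "b", "c", "d", "e"}
P = {("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")}
print(has_cycle(W, P), count_components(W, P))
```

A cycle here corresponds to an inconsistent preference, and a separate component to worlds that were never compared with the rest.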
Behavior commonly considered irrational can give pointers towards which abstraction level to use. The literature on cognitive biases gives a lower bound on the abstraction level of human values. For example, scope neglect is only a bias if we consider the preference “Save the most birds possible” to be our actual preference, and not “Get the most pleasure possible out of saving birds for the least cost of saving them”. A learner that has inferred the first preference will judge humans as exhibiting irrational behavior when scope neglect applies, whereas a learner that has inferred the second preference will judge humans as perfectly rational in this respect. Therefore, the fact that scope neglect is widely recognized as a cognitive bias means that our true preference is the first one, even though it fits observed human behavior less well. We think of scope neglect as a bug of the hardware we run on, not an intrinsic part of our preferences.
The intuition is that we don’t think of our values as containing the cognitive biases, because we model our values at a higher abstraction level than the actual brain implementation.
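To make the scope neglect example concrete, here is a small sketch of how two learners could judge the same behavior differently. The numbers are made up for illustration, and both consistency checks are assumed readings of the two candidate preferences, not definitions taken from the bias literature.

```python
from typing import List, Tuple

# Made-up observations: (birds saved, dollars a person offers to pay).
Observation = Tuple[int, float]
observations: List[Observation] = [(2_000, 80.0), (20_000, 80.0), (200_000, 80.0)]


def rational_under_save_most_birds(obs: List[Observation]) -> bool:
    """Assumed reading of preference 1: pay strictly more whenever more birds are saved."""
    ordered = sorted(obs)
    return all(
        pay_small < pay_large
        for (_, pay_small), (_, pay_large) in zip(ordered, ordered[1:])
    )


def rational_under_pleasure_per_cost(obs: List[Observation], tolerance: float = 5.0) -> bool:
    """Assumed reading of preference 2: the pleasure is about the same whatever
    the number of birds, so the payment stays roughly flat."""
    payments = [pay for _, pay in obs]
    return max(payments) - min(payments) <= tolerance


# The first learner sees a bias; the second sees perfectly rational behavior.
print(rational_under_save_most_birds(observations))
print(rational_under_pleasure_per_cost(observations))
```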
Take the HTTP protocol as an analogy. What we would call the actual protocol is the one described in detail in the RFC, even though it may contain some underspecified parts, or even some internal inconsistencies (i.e. the protocol is irrational). We would not call an HTTP client the HTTP protocol, merely an implementation of it, even though the client is fully deterministic and specifies the actual communication taking place far more precisely. Any discrepancy between the specification and the implementation is called a bug. In the same way, human values are more like a specification, and the brain is an implementation. Human values are sometimes inconsistent and underspecified, but the implementation in the brain is completely deterministic and specifies far more of our actual behavior. Therefore, cognitive biases are bugs in the brain’s implementation of our actual values, not the values themselves.
Therefore, we postulate that cognitive biases, as examples of irrational human behavior, exist only when the estimated abstraction level of human behavior is above a certain threshold. Using this knowledge, we can devise an algorithm to find a lower bound on the abstraction level of human values.
If we have known biases/inconsistent preferences, and we have a learner L that learns inconsistent preferences from looking at the data at a given level of abstraction, we can have some confidence that we won’t underfit by selecting only preference relations P that contain the known biases B (where we potentially don’t need to know what the bias is about, just the “shape” of the inconsistency it creates). While learning, we might want to execute the following pseudocode (given a set of state-action pairs Q⊆S×A):
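for abstraction in range(1, 0):  # lcpf: sweep from most abstract (1) towards most concrete (0)
for abstraction in range(0, 1):  # hcpf: sweep from most concrete (0) towards most abstract (1)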
The function lcpf returns preferences P that are least complex, but still fit the known irrationalities perfectly. (One would expect that as we increase the size of Q, the abstraction at which P is returned converges towards 0, if only because of quantum noise.) We can call this level of abstraction the “lowest complexity perfect fit” (LCPF). We probably don’t want to stop at the LCPF, but somewhere on the simpler side of it, maybe using known human values as an indicator. The function hcpf returns preferences that are as complex as possible, but still fit the known biases. The actual inconsistent human preferences t lie somewhere in between: hcpf≤t≤lcpf.
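One possible reading of the two sweeps, as a runnable sketch. Everything beyond the names lcpf and hcpf is an assumption made for illustration: abstraction is discretized into `steps` levels between 0 and 1, the learner L is passed in as a plain function, "contains the known biases" is simplified to a subset check on preference pairs, and `toy_learner` is a hand-written stand-in rather than a real learner.

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

World = str
Relation = FrozenSet[Tuple[World, World]]  # (a, b): a is preferred to b
Learner = Callable[[Relation, float], Relation]  # hypothetical stand-in for L


def first_fit(learner: Learner, data: Relation, biases: Relation,
              levels: Iterable[float]) -> Optional[Tuple[float, Relation]]:
    """Return the first (level, P) along `levels` whose P contains every known bias."""
    for level in levels:
        learned = learner(data, level)
        if biases <= learned:  # assumption: "contains B" is a subset check
            return level, learned
    return None


def lcpf(learner: Learner, data: Relation, biases: Relation,
         steps: int = 10) -> Optional[Tuple[float, Relation]]:
    """Sweep from most abstract (1) to most concrete (0): least complex fit."""
    levels = [i / steps for i in range(steps, -1, -1)]
    return first_fit(learner, data, biases, levels)


def hcpf(learner: Learner, data: Relation, biases: Relation,
         steps: int = 10) -> Optional[Tuple[float, Relation]]:
    """Sweep from most concrete (0) to most abstract (1): most complex fit."""
    levels = [i / steps for i in range(steps + 1)]
    return first_fit(learner, data, biases, levels)


def toy_learner(data: Relation, level: float) -> Relation:
    """Hand-written stand-in: only middle levels reproduce the observed cycle."""
    if level > 0.7:
        # Too abstract: a single coarse preference survives.
        return frozenset({("a", "b")})
    if level < 0.3:
        # Too concrete: every world is tagged with context, so the cycle vanishes.
        return frozenset((a + "@t1", b + "@t2") for a, b in data)
    return data


observed: Relation = frozenset({("a", "b"), ("b", "c"), ("c", "a")})
known_bias: Relation = observed  # the cycle a > b > c > a
lc = lcpf(toy_learner, observed, known_bias)
hc = hcpf(toy_learner, observed, known_bias)
print(lc[0], hc[0])
```

In this toy setup the two sweeps stop at different abstraction levels, and the levels in between are the candidates for where the actual preferences sit.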