Now that I've written Learning Normativity, I have some more clarity around the concept of "normativity" I was trying to get at, and want to write about it more directly. Whereas that post was more oriented toward the machine learning side of things, this post is more oriented toward the philosophical side. However, it is still relevant to the research direction, and I'll mention some issues relevant to value learning and other alignment approaches.
How can we talk about what you "should" do?
A Highly Dependent Concept
Now, obviously, what you should do depends on your goals. We can (at least as a rough first model) encode this as a utility function (but see my objection).
What you should do also depends on what's the case. Or, really, it depends on what you believe is the case, since that's what you have to go on.
Since we also have uncertainty about values (and we're interested in building machines which should have value uncertainty as well, in order to do value learning), we have to talk about beliefs-about-goals, too. (Or beliefs about utility functions, or however it ends up getting formalized.) This includes moral uncertainty.
Even worse, we have a lot of uncertainty about decision theory -- that is, we have uncertainty about how to take all of this uncertainty we have, and make it into decisions. Now, ideally, decision theory is not something the normatively correct thing depends on, like all the previous points, but rather is a framework for finding the normatively correct thing given all of those things. However, as long as we're uncertain about decision theory, we have to take that uncertainty as input too -- so, if decision theory is to give advice to realistic agents who are themselves uncertain about decision theory, decision theory also takes decision-theoretic uncertainty as an input. (In the best case, this makes bad decision theories capable of self-improvement.)
Clearly, we can be uncertain about how that is supposed to work.
By now you might get the idea. "Should" depends on some necessary information (let's call them the "givens"). But for each set of givens you claim is complete, there can be reasonable doubt about how to use those givens to determine the output. So we can create meta-level givens about how to use those givens.
Rather than stopping at some finite level, such as learning the human utility function, I'm claiming that we should learn all the levels. This is what I mean by "normativity" -- the information at all the meta-levels, which we would get if we were to unpack "should" forever. I'm putting this out there as my guess at the right type signature for human values.
I'm not mainly excited about this because I especially want to include moral uncertainty or uncertainty about the correct decision theory in a friendly AI -- or because I think those are going to be particularly huge failure modes which we need to avert. Rather, I'm excited because this is the first time I've felt like I've had any handles at all for getting basic alignment problems right (wireheading, human manipulation, Goodharting, ontological crisis) without a feeling that things are obviously going to blow up in some other way.
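To make the "unpack 'should' forever" picture from this section a little more concrete, here is a minimal sketch of one possible shape for it. Everything in it is an assumption made for illustration: the `Level` class, the `make_level` and `unpack` helpers, and the labels are hypothetical, and the hierarchy is drawn as a simple linear chain even though a real proposal might not be linear. The only point is that each set of givens carries a pointer to further givens about how to use it, and that pointer can always be followed one more step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Level:
    """One set of 'givens', plus a lazy pointer to the givens about how to use them."""
    description: str
    meta: Callable[[], "Level"]  # only evaluated when we choose to unpack further


def make_level(depth: int) -> Level:
    # Hypothetical labels; depth 0 stands for object-level goals and beliefs.
    if depth == 0:
        label = "object-level givens"
    else:
        label = f"givens about how to use level {depth - 1}"
    return Level(label, lambda: make_level(depth + 1))


def unpack(level: Level, steps: int) -> list[str]:
    """Unpack a finite number of steps; the structure itself never bottoms out."""
    out = []
    for _ in range(steps):
        out.append(level.description)
        level = level.meta()
    return out


levels = unpack(make_level(0), 3)
for line in levels:
    print(line)
```

Any actual unpacking stops after finitely many steps, but no particular step is the designated last one, which is the property the section is pointing at.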
Normative vs Descriptive Reasoning
At this stage you might accuse me of committing the "turtles all the way down" fallacy. In Passing The Recursive Buck, Eliezer describes the error of accidentally positing an infinite hierarchy of explanations:
The general antipattern at work might be called "Passing the Recursive Buck".
How do you stop a recursive buck from passing?
You use the counter-pattern: The Recursive Buck Stops Here.
But how do you apply this counter-pattern?
You use the recursive buck-stopping trick.
And what does it take to execute this trick?
Recursive buck stopping talent.
And how do you develop this talent?
Get a lot of practice stopping recursive bucks.
However, in Where Recursive Justification Hits Rock Bottom, Eliezer discusses a kind of infinite-recursion reasoning applied to normative matters. He says:
But I would nonetheless emphasize the difference between saying:
"Here is this assumption I cannot justify, which must be simply taken, and not further examined."
"Here the inquiry continues to examine this assumption, with the full force of my present intelligence—as opposed to the full force of something else, like a random number generator or a magic 8-ball—even though my present intelligence happens to be founded on this assumption."
Still... wouldn't it be nice if we could examine the problem of how much to trust our brains without using our current intelligence? Wouldn't it be nice if we could examine the problem of how to think, without using our current grasp of rationality?
When you phrase it that way, it starts looking like the answer might be "No".
So, with respect to normative questions, such as what to believe, or how to reason, we can and (to some extent) should keep unpacking reasons forever -- every assumption is subject to further scrutiny, and as a practical matter we have quite a bit of uncertainty about meta-level things such as our values, how to think about our values, etc.
This is true despite the fact that with respect to the descriptive questions the recursive buck must stop somewhere. Taking a descriptive stance, my values and beliefs live in my neurons. From this perspective, "human logic" is not some advanced logic which logicians may discover some day, but rather, just the set of arguments humans actually respond to. Again quoting another Eliezer article,
The phrase that once came into my mind to describe this requirement, is that a mind must be created already in motion. There is no argument so compelling that it will give dynamics to a static thing. There is no computer program so persuasive that you can run it on a rock.
So in a descriptive sense the ground truth about your values is just what you would actually do in situations, or some information about the reward systems in your brain, or something resembling that. In a descriptive sense the ground truth about human logic is just the sum total of facts about which arguments humans will accept.
But in a normative sense, there is no ground truth for human values; instead, we have an updating process which can change its mind about any particular thing; and that updating process itself is not the ground truth, but rather has beliefs (which can change) about what makes an updating process legitimate. Quoting from the relevant section of Radical Probabilism:
The radical probabilist does not trust whatever they believe next. Rather, the radical probabilist has a concept of virtuous epistemic process, and is willing to believe the next output of such a process. Disruptions to the epistemic process do not get this sort of trust without reason.
I worry that many approaches to value learning attempt to learn a descriptive notion of human values, rather than the normative notion. This means stopping at some specific proxy, such as what humans say their values are, or what humans reveal their preferences to be through action, rather than leaving the proxy flexible and trying to learn it as well, while also maintaining uncertainty about how to learn, and so on.
I've mentioned "uncertainty" a lot while trying to unpack my hierarchical notion of normativity. This is partly because I want to insist that we have "uncertainty at every level of the hierarchy", but also because uncertainty is itself a notion to which normativity applies, and thus, generates new levels of the hierarchy.
Just as one might argue that logic should be based on a specific set of axioms, with specific deduction rules (and a specific sequent calculus, etc), one might similarly argue that uncertainty should be managed by a specific probability theory (such as the Kolmogorov axioms), with a specific kind of prior (such as a description-length prior), and specific update rules (such as Bayes' Rule), etc.
This general approach -- that we set up our bedrock assumptions from which to proceed -- is called "foundationalism".
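As a concrete picture of the foundationalist recipe for uncertainty described above, here is a minimal sketch: a fixed description-length prior combined with a fixed update rule (Bayes' Rule). The three hypotheses, their description lengths in bits, and the likelihoods are all made up for illustration, and a finite normalized table is only a toy stand-in for a real description-length prior.

```python
from fractions import Fraction


def description_length_prior(lengths):
    """Weight each hypothesis by 2^-length (in bits), then normalize."""
    weights = {h: Fraction(1, 2 ** n) for h, n in lengths.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def bayes_update(prior, likelihood):
    """Bayes' Rule: posterior(h) is proportional to prior(h) * P(observation | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("observation has probability zero under every hypothesis")
    return {h: p / evidence for h, p in unnormalized.items()}


# Hypothetical hypotheses about a binary sequence, with made-up lengths in bits.
lengths = {"all_zeros": 2, "alternating": 3, "fair_coin": 1}
prior = description_length_prior(lengths)

# Probability each hypothesis assigns to the first observed bit being 0.
likelihood_of_zero = {
    "all_zeros": Fraction(1),
    "alternating": Fraction(1, 2),  # assumed equally likely to start with 0 or 1
    "fair_coin": Fraction(1, 2),
}
posterior = bayes_update(prior, likelihood_of_zero)
# posterior: all_zeros 4/9, alternating 1/9, fair_coin 4/9
```

Every choice here (the hypothesis class, the prior, the update rule) is fixed in advance and never revisited by the system itself, which is what makes the recipe foundationalist.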
I claim that we can't keep strictly to Bayes' Rule -- not if we want to model highly-capable systems in general, not if we want to describe human reasoning, and not if we want to capture (the normative) human values. Instead, how to update in a specific instance is a more complex matter which agents must figure out.
I claim that the Kolmogorov axioms don't tell us how to reason -- we need more than an uncomputable ideal; we also need advice about what to do in our boundedly-rational situation.
And, finally, I claim that length-based priors such as the Solomonoff prior are malign -- description length seems to be a really important heuristic, but there are other criteria which we want to judge hypotheses by.
So, overall, I'm claiming that a normative theory of belief is a lot more complex than Solomonoff would have you believe. Things that once seemed objectively true now look like rules of thumb. This means the question of normatively correct behavior is wide open even in the simple case of trying to predict what comes next in a sequence.
Now, Logical Induction addresses all three of these points (at least, giving us progress on all three fronts). We could take the lesson to be: we just had to go "one level higher", setting up a system like logical induction which learns how to probabilistically reason. Now we are at the right level for foundationalism. Logical induction, not classical probability theory, is the right principle for codifying correct reasoning.
Or, if not logical induction, perhaps the next meta-level will turn out to be the right one?
But what if we don't have to find a foundational level?
I've updated to a kind of quasi-anti-foundationalist position. I'm not against finding a strong foundation in principle (and indeed, I think it's a useful project!), but I'm saying that as a matter of fact, we have a lot of uncertainty, and it sure would be nice to have a normative theory which allowed us to account for that (a kind of afoundationalist normative theory -- not anti-foundationalist, but not strictly foundationalist, either). This should still be a strong formal theory, but one which requires weaker assumptions than usual (in much the same way reasoning about the world via probability theory requires weaker assumptions than reasoning about the world via pure logic).
My main objection to anti-foundationalist positions is that they're just giving up; they don't answer questions or offer insight. Perhaps that's a lack of understanding on my part. (I haven't tried that hard to understand anti-foundationalist positions.) But I still feel that way.
So, rather than give up, I want to provide a framework which holds across meta-levels (as I discussed in Learning Normativity).
This would be a framework in which an agent can balance uncertainty at all the levels, without dogmatic foundational beliefs at any level.
Doesn't this just create a new infinite meta-level, above all of the finite meta-levels?
A mathematical analogy would be to say that I'm going for "cardinal infinity" rather than "ordinal infinity". The first ordinal infinity is $\omega$, which is greater than all finite numbers. But $\omega$ is less than $\omega + 1$. So building something at "level $\omega$" would indeed be "just another meta-level" which could be surpassed by level $\omega + 1$, which could be surpassed by $\omega + 2$, and so on.
Cardinal infinities, on the other hand, don't work like that. The first infinite cardinal is $\aleph_0$, but $\aleph_0 + 1 = \aleph_0$ -- we can't get bigger by adding one. This is the sort of meta-level I want: a meta-level which also oversees itself in some sense, so that we aren't just creating a new level at which problems can arise.
This is what I meant by "collapsing the meta-levels" in Learning Normativity. The finite levels might still exist, but there's a level at which everything can be put together.
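The ordinal/cardinal contrast can be written out as a short worked comparison. This is standard set-theoretic arithmetic rather than anything specific to the proposal; here $\omega$ is the first infinite ordinal and $\aleph_0$ is the first infinite cardinal.

```latex
\begin{align*}
\text{Ordinals:}\quad & 0 < 1 < 2 < \cdots < \omega < \omega + 1 < \omega + 2 < \cdots \\
\text{Cardinals:}\quad & \aleph_0 + 1 = \aleph_0, \qquad
  \aleph_0 + n = \aleph_0 \ \text{for every finite } n, \qquad
  \aleph_0 + \aleph_0 = \aleph_0
\end{align*}
```

On the ordinal side every successor is a strictly new level. On the cardinal side, stacking one more level on top lands back at the same size, which is the property the analogy borrows.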
Even so, isn't this still a "foundation" at some level?
Well, yes and no. It should be a framework in which a very broad range of reasoning could be supported, while also making some rationality assumptions. In this sense it would be a theory of rationality purporting to "explain" (i.e., categorize/organize) all rational reasoning (with a particular, but broad, notion of rational). To that extent it seems not so different from other foundational theories.
On the other hand, this would be something more provisional by design -- something which would "get out of the way" of a real foundation if one arrived. It would seek to make far fewer claims overall than is usual for a foundationalist theory.
What's the hierarchy?
So far, I've been pretty vague about the actual hierarchy, aside from giving examples and talking about "meta-levels".
The analogy brings to mind a linear hierarchy, with a first level and a series of higher and higher levels. Each next level does something like "handling uncertainty about the previous level".
However, my recursive quantilization proposal created a branching hierarchy. This is because the building block for that hierarchy required several inputs.
I think the exact form of the hierarchy is a matter for specific proposals. But I do think some specific levels ought to exist:
- Object-level values.
- Information about value-learning, which helps update the object-level values.
- Object-level beliefs.
- Generic information about what distinguishes a good hypothesis. This includes Occam's razor as well as information about what makes a hypothesis malign.
It's difficult to believe humans have a utility function.
It's easier to believe humans have expectations on propositions, but this still falls apart at the seams (e.g., not all propositions are explicitly represented in my head at a given moment, it'll be difficult to define exactly which neural signals are the expectations, etc.).
We can try to define values as what we would think if we had a really long time to consider the question; but this has its own problems, such as humans going crazy or experiencing value drift if they think for too long.
We can try to define values as what a human would think after an hour, if that human had access to HCH; but this relies on the limited ability of a human to use HCH to accelerate philosophical progress.
Imagine a value-learning system where you don't have to give any solid definition of what it is for humans to have values, but rather, can give a number of proxies, point to flaws in the proxies, give feedback on how to reason about those flaws, and so on. The system would try to generalize all of this reasoning, to figure out what the thing being pointed at could be.
We could describe humans deliberating under ideal conditions, point out issues with humans getting old, discuss what it might mean for those humans to go crazy or experience value drift, examine how the system is reasoning about all of this and give feedback, discuss what it would mean for those humans to reason well or poorly, ...
We could never entirely pin down the concept of human values, but at some point, the system would be reasoning so much like us (or rather, so much like we would want to reason) that this wouldn't be a concern.
Comparison to Other Approaches
This is most directly an approach for solving meta-philosophy.
Obviously, the direction indicated in this post has a lot in common with Paul-style approaches. My outside view is that this is me reasoning my way around to a Paul-ish position. However, my inside view still has significant differences, which I haven't fully articulated for myself yet.