Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Now that I've written Learning Normativity, I have some more clarity around the concept of "normativity" I was trying to get at, and want to write about it more directly. Whereas that post was more oriented toward the machine learning side of things, this post is more oriented toward the philosophical side. However, it is still relevant to the research direction, and I'll mention some issues relevant to value learning and other alignment approaches.

How can we talk about what you "should" do?

A Highly Dependent Concept

Now, obviously, what you should do depends on your goals. We can (at least as a rough first model) encode this as a utility function (but see my objection).

What you should do also depends on what's the case. Or, really, it depends on what you believe is the case, since that's what you have to go on.

Since we also have uncertainty about values (and we're interested in building machines which should have value uncertainty as well, in order to do value learning), we have to talk about beliefs-about-goals, too. (Or beliefs about utility functions, or however it ends up getting formalized.) This includes moral uncertainty.

Even worse, we have a lot of uncertainty about decision theory -- that is, we have uncertainty about how to take all of this uncertainty we have, and make it into decisions. Now, ideally, decision theory is not something the normatively correct thing depends on, like all the previous points, but rather is a framework for finding the normatively correct thing given all of those things. However, as long as we're uncertain about decision theory, we have to take that uncertainty as input too -- so, if decision theory is to give advice to realistic agents who are themselves uncertain about decision theory, decision theory also takes decision-theoretic uncertainty as an input. (In the best case, this makes bad decision theories capable of self-improvement.)

Clearly, we can be uncertain about how that is supposed to work.

By now you might get the idea. "Should" depends on some necessary information (let's call them the "givens"). But for each set of givens you claim is complete, there can be reasonable doubt about how to use those givens to determine the output. So we can create meta-level givens about how to use those givens.

Rather than stopping at some finite level, such as learning the human utility function, I'm claiming that we should learn all the levels. This is what I mean by "normativity" -- the information at all the meta-levels, which we would get if we were to unpack "should" forever. I'm putting this out there as my guess at the right type signature for human values.

I'm not mainly excited about this because I particularly want to include moral uncertainty or uncertainty about the correct decision theory in a friendly AI -- or because I think those are going to be particularly huge failure modes which we need to avert. Rather, I'm excited about this because it is the first time I've felt like I've had any handles at all for getting basic alignment problems right (wireheading, human manipulation, Goodharting, ontological crisis) without a feeling that things are obviously going to blow up in some other way. 

Normative vs Descriptive Reasoning

At this stage you might accuse me of committing the "turtles all the way down" fallacy. In Passing The Recursive Buck, Eliezer describes the error of accidentally positing an infinite hierarchy of explanations:

The general antipattern at work might be called "Passing the Recursive Buck". 

[...]

How do you stop a recursive buck from passing?

You use the counter-pattern:  The Recursive Buck Stops Here.

But how do you apply this counter-pattern?

You use the recursive buck-stopping trick.

And what does it take to execute this trick?

Recursive buck stopping talent.

And how do you develop this talent?

Get a lot of practice stopping recursive bucks.

Ahem.

However, in Where Recursive Justification Hits Rock Bottom, Eliezer discusses a kind of infinite-recursion reasoning applied to normative matters. He says:

But I would nonetheless emphasize the difference between saying:

"Here is this assumption I cannot justify, which must be simply taken, and not further examined."

Versus saying:

"Here the inquiry continues to examine this assumption, with the full force of my present intelligence—as opposed to the full force of something else, like a random number generator or a magic 8-ball—even though my present intelligence happens to be founded on this assumption."

Still... wouldn't it be nice if we could examine the problem of how much to trust our brains without using our current intelligence?  Wouldn't it be nice if we could examine the problem of how to think, without using our current grasp of rationality?

When you phrase it that way, it starts looking like the answer might be "No".

So, with respect to normative questions, such as what to believe, or how to reason, we can and (to some extent) should keep unpacking reasons forever -- every assumption is subject to further scrutiny, and as a practical matter we have quite a bit of uncertainty about meta-level things such as our values, how to think about our values, etc.

This is true despite the fact that with respect to the descriptive questions the recursive buck must stop somewhere. Taking a descriptive stance, my values and beliefs live in my neurons. From this perspective, "human logic" is not some advanced logic which logicians may discover some day, but rather, just the set of arguments humans actually respond to. Again quoting another Eliezer article:

The phrase that once came into my mind to describe this requirement, is that a mind must be created already in motion.  There is no argument so compelling that it will give dynamics to a static thing.  There is no computer program so persuasive that you can run it on a rock.

So in a descriptive sense the ground truth about your values is just what you would actually do in situations, or some information about the reward systems in your brain, or something resembling that. In a descriptive sense the ground truth about human logic is just the sum total of facts about which arguments humans will accept.

But in a normative sense, there is no ground truth for human values; instead, we have an updating process which can change its mind about any particular thing; and that updating process itself is not the ground truth, but rather has beliefs (which can change) about what makes an updating process legitimate. Quoting from the relevant section of Radical Probabilism:

The radical probabilist does not trust whatever they believe next. Rather, the radical probabilist has a concept of virtuous epistemic process, and is willing to believe the next output of such a process. Disruptions to the epistemic process do not get this sort of trust without reason.

I worry that many approaches to value learning attempt to learn a descriptive notion of human values, rather than the normative notion. This means stopping at some specific proxy, such as what humans say their values are, or what humans reveal their preferences to be through action, rather than leaving the proxy flexible and trying to learn it as well, while also maintaining uncertainty about how to learn, and so on.

I've mentioned "uncertainty" a lot while trying to unpack my hierarchical notion of normativity. This is partly because I want to insist that we have "uncertainty at every level of the hierarchy", but also because uncertainty is itself a notion to which normativity applies, and thus, generates new levels of the hierarchy.

Normative Beliefs

Just as one might argue that logic should be based on a specific set of axioms, with specific deduction rules (and a specific sequent calculus, etc), one might similarly argue that uncertainty should be managed by a specific probability theory (such as the Kolmogorov axioms), with a specific kind of prior (such as a description-length prior), and specific update rules (such as Bayes' Rule), etc.
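To make that package concrete, here is a minimal sketch of the "fixed prior plus fixed update rule" recipe, in Python; the hypotheses, description strings, and likelihoods are invented purely for illustration.

```python
# Toy sketch of the foundationalist package named above: a fixed
# description-length prior plus a fixed update rule (Bayes' Rule).
# Hypotheses, "description" strings, and likelihoods are all made up.

def length_prior(hypotheses):
    """Assign prior mass proportional to 2^(-description length)."""
    weights = {h: 2.0 ** -len(code) for h, code in hypotheses.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def bayes_update(prior, likelihood, observation):
    """One application of Bayes' Rule: P(h | obs) is proportional to P(obs | h) P(h)."""
    unnorm = {h: p * likelihood(h, observation) for h, p in prior.items()}
    total = sum(unnorm.values())
    return {h: u / total for h, u in unnorm.items()}

# Two hypothetical "programs" with descriptions of different lengths.
hypotheses = {"always_heads": "H*", "fair_coin": "random(H,T)"}

def likelihood(h, obs):
    if h == "always_heads":
        return 1.0 if obs == "H" else 0.0
    return 0.5  # fair coin

posterior = bayes_update(length_prior(hypotheses), likelihood, "H")
print(posterior)  # the shorter description starts ahead; the data then re-weights it
```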

This general approach -- that we set up our bedrock assumptions from which to proceed -- is called "foundationalism".

I claim that we can't keep strictly to Bayes' Rule -- not if we want to model highly-capable systems in general, not if we want to describe human reasoning, and not if we want to capture (the normative) human values. Instead, how to update in a specific instance is a more complex matter which agents must figure out.

I claim that the Kolmogorov axioms don't tell us how to reason -- we need more than an uncomputable ideal; we also need advice about what to do in our boundedly-rational situation.

And, finally, I claim that length-based priors such as the Solomonoff prior are malign -- description length seems to be a really important heuristic, but there are other criteria which we want to judge hypotheses by.

So, overall, I'm claiming that a normative theory of belief is a lot more complex than Solomonoff would have you believe. Things that once seemed objectively true now look like rules of thumb. This means the question of normatively correct behavior is wide open even in the simple case of trying to predict what comes next in a sequence.
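To illustrate the first of these claims: Jeffrey conditioning, central to the radical-probabilism picture quoted earlier, is a standard example of an update which can be perfectly reasonable yet is not an application of Bayes' Rule to any observed proposition. A toy sketch, with made-up numbers:

```python
# Jeffrey conditioning: conditional probabilities within each partition cell
# stay fixed, but the probabilities of the cells themselves are shifted by
# experience rather than by conditioning on an observed proposition.

def jeffrey_update(joint, new_partition_probs):
    """joint: dict mapping (partition_cell, proposition) -> probability.
    new_partition_probs: new probabilities over the partition cells."""
    # Current marginal mass of each partition cell.
    cell_mass = {}
    for (cell, _), p in joint.items():
        cell_mass[cell] = cell_mass.get(cell, 0.0) + p
    # Rescale each cell to its new mass, preserving within-cell conditionals.
    return {
        (cell, prop): p / cell_mass[cell] * new_partition_probs[cell]
        for (cell, prop), p in joint.items()
    }

# A toy version of Jeffrey's cloth-by-candlelight example: a dim glimpse
# shifts confidence in the cloth's colour without any proposition being
# learned with certainty.
joint = {
    ("blue", "matches sofa"): 0.4, ("blue", "clashes"): 0.1,
    ("green", "matches sofa"): 0.1, ("green", "clashes"): 0.4,
}
updated = jeffrey_update(joint, {"blue": 0.7, "green": 0.3})
print(updated)  # beliefs changed, yet nothing was conditioned on
```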

Now, Logical Induction addresses all three of these points (at least, giving us progress on all three fronts). We could take the lesson to be: we just had to go "one level higher", setting up a system like logical induction which learns how to probabilistically reason. Now we are at the right level for foundationalism. Logical induction, not classical probability theory, is the right principle for codifying correct reasoning.

Or, if not logical induction, perhaps the next meta-level will turn out to be the right one?

But what if we don't have to find a foundational level?

I've updated to a kind of quasi-anti-foundationalist position. I'm not against finding a strong foundation in principle (and indeed, I think it's a useful project!), but I'm saying that as a matter of fact, we have a lot of uncertainty, and it sure would be nice to have a normative theory which allowed us to account for that (a kind of afoundationalist normative theory -- not anti-foundationalist, but not strictly foundationalist, either). This should still be a strong formal theory, but one which requires weaker assumptions than usual (in much the same way reasoning about the world via probability theory requires weaker assumptions than reasoning about the world via pure logic).

Stopping at ω

My main objection to anti-foundationalist positions is that they're just giving up; they don't answer questions or offer insight. Perhaps that's a lack of understanding on my part. (I haven't tried that hard to understand anti-foundationalist positions.) But I still feel that way.

So, rather than give up, I want to provide a framework which holds across meta-levels (as I discussed in Learning Normativity).

This would be a framework in which an agent can balance uncertainty at all the levels, without dogmatic foundational beliefs at any level.

Doesn't this just create a new infinite meta-level, above all of the finite meta-levels?

A mathematical analogy would be to say that I'm going for "cardinal infinity" rather than "ordinal infinity". The first ordinal infinity is ω, which is greater than all finite numbers. But ω is less than ω+1. So building something at "level ω" would indeed be "just another meta-level" which could be surpassed by level ω+1, which could be surpassed by ω+2, and so on.

Cardinal infinities, on the other hand, don't work like that. The first infinite cardinal is ℵ₀, but ℵ₀ + 1 = ℵ₀ -- we can't get bigger by adding one. This is the sort of meta-level I want: a meta-level which also oversees itself in some sense, so that we aren't just creating a new level at which problems can arise.
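In standard notation, the contrast is:

```latex
% Ordinal arithmetic: adding one always yields something strictly larger.
\[
\omega \;<\; \omega + 1 \;<\; \omega + 2 \;<\; \cdots
\]
% Cardinal arithmetic: the first infinite cardinal absorbs finite additions.
\[
\aleph_0 + 1 \;=\; \aleph_0, \qquad \aleph_0 + \aleph_0 \;=\; \aleph_0
\]
```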

This is what I meant by "collapsing the meta-levels" in Learning Normativity. The finite levels might still exist, but there's a level at which everything can be put together.

Still, even so, isn't this still a "foundation" at some level?

Well, yes and no. It should be a framework in which a very broad range of reasoning could be supported, while also making some rationality assumptions. In this sense it would be a theory of rationality purporting to "explain" (ie categorize/organize) all rational reasoning (with a particular, but broad, notion of rational). In this sense it seems not so different from other foundational theories.

On the other hand, this would be something more provisional by design -- something which would "get out of the way" of a real foundation if one arrived. It would seek to make far fewer claims overall than is usual for a foundationalist theory.

What's the hierarchy?

So far, I've been pretty vague about the actual hierarchy, aside from giving examples and talking about "meta-levels".

The ω analogy brings to mind a linear hierarchy, with a first level and a series of higher and higher levels. Each next level does something like "handling uncertainty about the previous level".

However, my recursive quantilization proposal created a branching hierarchy. This is because the building block for that hierarchy required several inputs.

I think the exact form of the hierarchy is a matter for specific proposals. But I do think some specific levels ought to exist (see the sketch after this list):

  • Object-level values.
  • Information about value-learning, which helps update the object-level values.
  • Object-level beliefs.
  • Generic information about what distinguishes a good hypothesis. This includes Occam's razor as well as information about what makes a hypothesis malign.
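Purely as an illustration of the branching shape -- none of these class names, fields, or numbers are part of any concrete proposal -- such a hierarchy might be wired up like this:

```python
# Hypothetical sketch of a branching hierarchy of normative levels.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Level:
    """One node in the hierarchy: some content, plus the meta-levels that revise it."""
    name: str
    content: object          # e.g. a value model, a world model, a heuristic
    uncertainty: float       # how much we distrust this node (made-up numbers)
    overseers: List["Level"] = field(default_factory=list)

values = Level("object-level values", content="proxy utility model", uncertainty=0.6)
value_learning = Level("value-learning info", content="how to update the proxy", uncertainty=0.7)
beliefs = Level("object-level beliefs", content="world model", uncertainty=0.4)
hypothesis_quality = Level("what makes a hypothesis good",
                           content=["Occam's razor", "checks for malign hypotheses"],
                           uncertainty=0.8)

# Meta-levels oversee object levels; nothing stops us from adding further
# overseers above these, or (in the "collapsed" picture) a node that in some
# sense oversees itself.
values.overseers.append(value_learning)
beliefs.overseers.append(hypothesis_quality)
```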

Normative Values

It's difficult to believe humans have a utility function.

It's easier to believe humans have expectations on propositions, but this still falls apart at the seams (EG, not all propositions are explicitly represented in my head at a given moment; it'll be difficult to define exactly which neural signals are the expectations; etc).

We can try to define values as what we would think if we had a really long time to consider the question; but this has its own problems, such as humans going crazy or experiencing value drift if they think for too long.

We can try to define values as what a human would think after an hour, if that human had access to HCH; but this relies on the limited ability of a human to use HCH to accelerate philosophical progress.

Imagine a value-learning system where you don't have to give any solid definition of what it is for humans to have values, but rather, can give a number of proxies, point to flaws in the proxies, give feedback on how to reason about those flaws, and so on. The system would try to generalize all of this reasoning, to figure out what the thing being pointed at could be.

We could describe humans deliberating under ideal conditions, point out issues with humans getting old, discuss what it might mean for those humans to go crazy or experience value drift, examine how the system is reasoning about all of this and give feedback, discuss what it would mean for those humans to reason well or poorly, ...
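As a sketch of what the interface to such a system might look like (all record types and example strings below are invented; the point is only that no single proxy is ever treated as the final definition of value):

```python
# Hypothetical kinds of feedback such a value-learning system might accept.
from dataclasses import dataclass

@dataclass
class ProxyDefinition:
    description: str      # a candidate stand-in for "human values"

@dataclass
class FlawReport:
    proxy: ProxyDefinition
    problem: str          # why the proxy comes apart from what we meant

@dataclass
class ReasoningFeedback:
    about: object         # can target the system's handling of proxies or of flaws
    comment: str

feedback_log = [
    ProxyDefinition("what a human would conclude after deliberating for a very long time"),
    FlawReport(
        proxy=ProxyDefinition("long deliberation"),
        problem="humans may go crazy or experience value drift if they deliberate too long",
    ),
    ReasoningFeedback(
        about="how the system weighed the value-drift objection",
        comment="don't treat drift as fatal; learn what counts as healthy deliberation",
    ),
]
# The system's job is to generalize from a log like this, rather than to
# optimize any single entry as if it were the definition of value.
```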

We could never entirely pin down the concept of human values, but at some point, the system would be reasoning so much like us (or rather, so much like we would want to reason) that this wouldn't be a concern.

Comparison to Other Approaches

This is most directly an approach for solving meta-philosophy.

Obviously, the direction indicated in this post has a lot in common with Paul-style approaches. My outside view is that this is me reasoning my way around to a Paul-ish position. However, my inside view still has significant differences, which I haven't fully articulated for myself yet.

Comments

we have an updating process which can change its mind about any particular thing; and that updating process itself is not the ground truth, but rather has beliefs (which can change) about what makes an updating process legitimate.

This should still be a strong formal theory, but one which requires weaker assumptions than usual

There seems to be a bit of a tension here. What you're outlining for most of the post still requires a formal system with assumptions within which to take the fixed point, but then that would mean that it can't change its mind about any particular thing. Indeed it's not clear how such a totally self-revising system could ever be a fixed point of constraints of rationality: since it can revise anything, it could only be limited by the physically possible.

This is most directly an approach for solving meta-philosophy.

Related to the last point, your project would seem to imply some interesting semantics. Usually our semantics are based on the correspondence theory: we start by thinking about what the world can be like, and then expressions get their meaning from their relations to these ways the world can be like, particularly through our use. (We can already see how this would take us down a descriptive path.) This leads to problems where you can't explain the content of confused beliefs. For example, most children believe for some time that their language's words for things are the real words for them. If you're human you probably understand what I was talking about, but if some alien was puzzled about what the "real" there was supposed to mean, I don't think I could explain it. Basically, once you've unconfused yourself, you become unable to say what you believed.

Now if we're foundationalists, we say that that's because you didn't actually believe anything, and that that was just a linguistic token passed around your head but failing to be meaningful, because you didn't implement The Laws correctly. But if we want to have a theory like yours, it treats this cognitively, and so such beliefs must be meaningful in some sense. I'm very curious what this would look like.

More generally this is a massively ambitious undertaking. If you succeeded it would solve a bunch of other issues, not even from running it but just from the conceptual understanding of how it would work. For example in your last post on signalling you mentioned:

First of all, it might be difficult to define the hypothetical scenario in which all interests are aligned, so that communication is honest. Taking an extreme example, how would we then assign meaning to statements such as "our interests are not aligned"?

I think a large part of embedded agency has a similar problem, where we try to build our semantics on "If I was there, I would think", and apply this to scenarios where we are essentially not there, because we're thinking about our non-existence, or about bugs that would make us not think that way, or some such. So if you solved this it would probably just solve anthropics as well. On the one hand this is exciting; on the other, it's a reason to be sceptical. And all of this eerily reminds me of German Idealism. In any case I think this is very good as a post.

There seems to be a bit of a tension here. What you're outlining for most of the post still requires a formal system with assumptions within which to take the fixed point, but then that would mean that it can't change its mind about any particular thing. Indeed it's not clear how such a totally self-revising system could ever be a fixed point of constraints of rationality: since it can revise anything, it could only be limited by the physically possible.

It's sort of like the difference between a programmable computer vs an arbitrary blob of matter. A programmable computer provides a rigid structure which can't be changed, but the set of assumptions imposed really is quite light. When programming language designers aim for "totally self-revising systems" (languages with more flexibility in their assumptions, such as Lisp), they don't generally attack the assumption that the hardware should be fixed. (Although occasionally they do go as far as asking for FPGAs.)

(A finite approximation of) Solomonoff Induction can be said to make "very few assumptions", because it can learn a wide variety of programs. Certainly it makes fewer assumptions than more special-case machine learning systems. But it also makes a lot more assumptions than the raw computer. In particular, it has no allowance for updating against the use of Bayes' Rule for evaluating which program is best.

I'm aiming for something between the Solomonoff induction and the programmable computer. It can still have a rigid learning system underlying it, but in some sense it can learn any particular way of selecting hypotheses, rather than being stuck with one.

Now if we're foundationalists, we say that that's because you didn't actually believe anything, and that that was just a linguistic token passed around your head but failing to be meaningful, because you didn't implement The Laws correctly. But if we want to have a theory like yours, it treats this cognitively, and so such beliefs must be meaningful in some sense. I'm very curious what this would look like.

This seems like a rather excellent question which demonstrates a high degree of understanding of the proposal.

I think the answer from my not-necessarily-foundationalist but not-quite-pluralist perspective (a pluralist being someone who points to the alternative foundations proposed by different people and says "these are all tools in a well-equipped toolbox") is:

The meaning of a confused concept such as "the real word for X" is not ultimately given by any rigid formula, but rather, established by long deliberation on what it can be understood to mean. However, we can understand a lot of meaning through use. Pragmatically, what "the real word for X" seems to express is that there is a correct thing to call something, usually uniquely determined, which can be discovered through investigation (EG by asking parents). This implies that other terms are incorrect (EG other languages, or made-up terms). "Incorrect" here means normatively incorrect, which is still part of our map; but to cash out what that means to a greater degree, it means you can EG scold people who use wrong terms, and you should teach them better terms, etc.

To sum up, meaning in this view is broadly more inferentialist and less correspondence-based: the meaning of a thing is more closely tied with the inferences around that thing, than with how that thing corresponds to a territory. 

So if you solved this it would propably just solve anthropics as well.

I'm not seeing that implication at all! The way I see it, the framework "stands over" decision-theoretic issues such as anthropics, providing no answers (only providing an epistemic arena in which uncertainty about them can be validly expressed, rather than requiring some solution in order to define correct reasoning in the first place).

I now think you were right about it not solving anthropics. I interpreted afoundationalism insufficiently ambitiously; now that I have a proof-of-concept for normative semantics, I indeed can't find it to presuppose an anthropics.

It's sort of like the difference between a programmable computer vs an arbitrary blob of matter. 

This is close to what I meant: my neurons keep doing something-like reinforcement learning, whether or not I theoretically believe that's valid. "I in fact cannot think outside this" does address the worry about a merely rational constraint.

On the other hand, we do want AI to eventually consider other hardware, and that might even be necessary for normal embedded agency, since we don't fully trust our hardware even when we don't want to normal-sense-change it. 

To sum up, meaning in this view is broadly more inferentialist and less correspondence-based: the meaning of a thing is more closely tied with the inferences around that thing, than with how that thing corresponds to a territory. 

I broadly agree with inferentialism, but I don't think that entirely addresses it. The mark of confused, rather than merely wrong, beliefs is that they don't really have a coherent use. So for example it might be that there's a path through possible scenarios leading back to the starting point, where if at every step I adjust my reaction in a way that seems appropriate to me, I end up with a different reaction when I'm back at the start. If you tried to describe my practices here, you would just explicitly account for the framing dependence. But then it wouldn't be confused! That framing-dependent concept you described also exists, but it seems quite different from the confused one. For the confused concept it's essential that I consider it not dependent in this way. But if you try to include that in your description, by also describing the practices around my meta-beliefs about the concept, and the meta-meta beliefs, and so on, then you'd end up also describing the process with which I recognized it as confused and revised it. And then we're back in the position of already-having-recognized that it's bullshit.

When you were only going up individual meta-levels, the propositions logical induction worked with could be meaningful even if they were wrong, because they were part of processes outside the logical induction process, and those were sufficient to give them truth-conditions. Now you want to determine both what to believe and how those beliefs are to be used in one go, and it's undermining that, because the "how beliefs are to be used" is what foundationalism kept fixed, and which gave them their truth conditions.

I'm not seeing that implication at all!

Well, this is a bit analogy-y, but I'll try to explain. I think there's a semantic issue with anthropics (indeed, under inferentialism all confusion can be expressed as a semantic issue). Things like "the probability that I will have existed if I do X now" are unclear. For example, a descriptivist way of understanding conditional probabilities is something like "The token C means conditional probability iff whenever you believe xCy = p, then you would believe P(y) = p if you came to believe x". But that assumes not only that you are logically perfect but that you are there to have beliefs and answer for them. Now most of the time it's not a problem if you're not actually there, because we can just ask what you would believe if you were there (and you somehow got oxygen and glucose despite not touching anything, and you could see without blocking photons, etc, but let's ignore that for now), even if you aren't actually. But this can be a problem in anthropic situations. Normally when a hypothetical involves you, you can just imagine it from your perspective, and when it doesn't involve you, you can imagine you were there. But if you're trying to imagine a scenario that involves you but you can't imagine it from your perspective, because you come into existence in it, or you have a mental defect in it, or something, then you have to imagine it from the third person. So you're not really thinking about yourself, you're thinking about a copy, which may be in quite a different epistemic situation. So if you can conceptually explain how to have semantics that accounts for my making mistakes, then I think that would probably be able to account for my not being there as well (in both cases, it's just the virtuous epistemic process missing). And that would tell us how to have anthropic beliefs, and that would unknot the area.

Really enjoying your posts on normativity! The way I summarize it internally is "Thinking about fixed-points for the meta aspect of human reasoning". How fixed-point-y do you think solutions are likely to be?

We could never entirely pin down the concept of human values, but at some point, the system would be reasoning so much like us (or rather, so much like we would want to reason) that this wouldn't be a concern.

I'm confused about this sentence, because it seems to promote an idea in contradiction with your other writing on normativity (and even earlier sections in this post). Because the quote says that at some level you could stop caring (which means we can keep going meta until there's no significant improvement, and stop there), while the rest of your writing says that we should deal with the whole hierarchy at once.

Because the quote says that at some level you could stop caring (which means we can keep going meta until there's no significant improvement, and stop there)

Hmmm, that's not quite what I meant. It's not about stopping at some meta-level, but rather, stopping at some amount of learning in the system. The system should learn not just level-specific information, but also cross-level information (like overall philosophical heuristics), which means that even if you stop teaching the machine at some point, it can still produce new reasoning at higher levels which should be similar to feedback you might have given.

The point is that human philosophical taste isn't perfectly defined, and even if we also teach the machine everything we can about how to interpret human philosophical taste, that'll still be true. However, at some point our uncertainty and the machine's uncertainty will be close enough to the same that we don't care. (Note: what it even means for them to be closely matched depends on the question of what it means for humans to have specific philosophical taste, which, if we could answer, we would have perfectly defined human philosophical taste -- the thing we can't do. Yet, in some good-enough sense, our own uncertainty eventually becomes well-represented by the machine's uncertainty. That's the stopping point at which we no longer need to provide additional explicit feedback to the machine.)

Hmmm, that's not quite what I meant. It's not about stopping at some meta-level, but rather, stopping at some amount of learning in the system. The system should learn not just level-specific information, but also cross-level information (like overall philosophical heuristics), which means that even if you stop teaching the machine at some point, it can still produce new reasoning at higher levels which should be similar to feedback you might have given.

Interesting. So the point is to learn how to move up the hierarchy? I mean, that makes a lot of sense. It is a sort of fixed point description, because then the AI can keep moving up the hierarchy as far as it wants, which means the whole hierarchy is encoded by its behavior. It's just a question of how far up it needs to go to get satisfying answers.
 

Is that correct?

Right. I mean, I would clarify that the whole point isn't to learn to go up the hierarchy; in some sense, most of the point is learning at a few levels. But yeah.

This is biting the bullet on the infinite regress horn of the Munchhausen trilemma, but given the finitude of human brain architecture I prefer biting the bullet on circular reasoning. We have a variety of overlays, like values, beliefs, goals, actions, etc. There is no canonical way they are wired together. We can hold some fixed as a basis while we modify others. We are a Ship of Neurath. Some parts of the ship feel more is-like (like the waterproofness of the hull) and some feel more ought-like (like the steering wheel).

Why not both? ;3

I have nothing against justifications being circular (IE the same ideas recurring on many levels), just as I have nothing in principle against finding a foundationalist explanation. A circular argument is just a particularly simple form of infinite regress.

But my main argument against only the circular reasoning explanation is that attempted versions of it ("coherentist" positions) don't seem very good when you get down to details.

Pure coherentist positions tend to rely on a stipulated notion of coherence (such as probabilistic coherence, or weighted constraint satisfaction, or something along those lines). These notions are themselves fixed. This could be fine if the coherence notions were sufficiently "assumption-lite" so as to not be necessarily Goodhart-prone etc, but so far it doesn't seem that way to me.

I'm predicting that you'll agree with me on that, and grant that the notion of coherence should itself be up for grabs. I don't actually think the coherentist/foundationalist/infinitist trilemma is that good a characterization of our disagreement here. My claim isn't so much the classical claim that there's an infinite regress of justification, as much as a claim that there's an infinite regress of uncertainty -- that we're uncertain at all the levels, and need to somehow manage that. This fits the ship-of-Neurath picture just fine.

In other words, one can unroll a ship of Neurath into an infinite hierarchy where each level says something about how the next level down gets re-adjusted over time. The reason for doing this is to achieve the foundationalist goal of understanding the system better, without the foundationalist method of fixing foundational assumptions. The main motive here is amplification. Taking just a ship of Neurath, it's not obvious how to make it better besides running it forward faster (and even this has its risks, since the ship may become worse). If we unroll the hierarchy of wanting-to-become better, we can EG see what is good and bad about merely running it forward faster, and try to run it forward in good ways rather than bad ways (as well as other, more radical departures from simple fast-forward amplification).

One disagreement I have with your story is the argument "given the finitude of human brain architecture". The justification of a belief/norm/algorithm needn't be something already present in the head. A lot of what we do is given to us by evolution. We can notice those things and question whether they make sense by our current standards. Calling this process finite is kind of like calling a Turing machine finite. There's a finite core to it, but we can be surprised by what this core does given more working tape.

This is clarifying, thanks.

WRT the last paragraph, I'm thinking in terms of convergent vs divergent processes. So, fixed points I guess.