Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I have been pretty satisfied with my desiderata for learning normativity, but I haven't been very satisfied with my explanation of why exactly these desiderata are important. I have a sense that it's not just a grab-bag of cool stuff; something about trying to do all those things at once points at something important.

What follows are four different elevator pitches, which tell different stories about how it all hangs together. Desiderata are bolded.

Conceptual Difficulties with Outer Alignment

The classic problem of outer alignment is that we have no perfect loss function, so we can't just go optimize. The problem can be understood by thinking about Goodhart and how optimization amplifies. The classic response to this is value uncertainty and value learning, but wireheading, human manipulation, and no-free-lunch results make it seem plausible that we have the same problem one level up: we still don't know how to specify a perfect loss function for what we care about, and imperfect loss functions can still create big problems.

So, just like value-learning tackles the initial problem head-on by suggesting we manage our uncertainty about values and gain knowledge over time, learning at all levels suggests that we tackle the meta-problem directly, explicitly representing the fact that we don't have a perfectly good loss function at any level, but can manage that uncertainty and learn-to-learn over time.

Humans can only give explicit feedback at so many meta-levels, so between-level sharing is critical for any meaningful learning to take place at higher meta-levels. Otherwise, higher meta-levels remain highly uncertain, which itself makes learning at lower levels almost impossible (since you can't learn if you have high uncertainty about learning-to-learn).

A consequence of having no perfect loss function is no perfect feedback; no evidence about what the system should do can be considered absolute. A helpful measure for coping with this is to support uncertain feedback, so that humans can represent their uncertainty when they provide feedback. Ultimately, though, humans can have systematic biases which require reinterpretable feedback to untangle.
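As a minimal sketch of what uncertain feedback could look like, here is Jeffrey conditioning, one standard way to update on evidence that is itself uncertain. Using it here is my assumption, not something the desiderata commit to. The names `jeffrey_update`, `in_event`, and `confidence` are hypothetical.

```python
def jeffrey_update(prior, in_event, confidence):
    """Update a distribution over hypotheses on uncertain feedback.

    prior:      dict mapping hypothesis name -> prior probability
    in_event:   dict mapping hypothesis name -> True if the hypothesis
                agrees with the feedback, False otherwise
    confidence: probability the human assigns to the feedback being right
    """
    p_event = sum(p for h, p in prior.items() if in_event[h])
    if not 0.0 < p_event < 1.0:
        raise ValueError("feedback must split the hypotheses into two non-empty sides")

    posterior = {}
    for h, p in prior.items():
        if in_event[h]:
            # Rescale the agreeing side so that it sums to `confidence`.
            posterior[h] = confidence * p / p_event
        else:
            # Rescale the disagreeing side so that it sums to 1 - confidence.
            posterior[h] = (1.0 - confidence) * p / (1.0 - p_event)
    return posterior


prior = {"a": 0.5, "b": 0.3, "c": 0.2}
agrees = {"a": True, "b": True, "c": False}
posterior = jeffrey_update(prior, agrees, confidence=0.9)
```

Setting `confidence=1.0` recovers ordinary conditioning. Note that this sketch still takes the stated confidence at face value, so it does nothing about systematic biases; that is the gap reinterpretable feedback is meant to fill.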

Even with all these tools, some forms of feedback would be difficult or impossible to articulate without process-level feedback: the ability to tell the system that specific patterns of thinking are good or bad, without needing to unpack those judgements in terms of consequences. To be meaningful, this requires whole-process feedback: we need to judge thoughts by their entire chain of origination. (This is technically challenging, because the easiest way to implement process-level feedback is to create a separate meta-level which oversees the rest of the system; but then this meta-level would not itself be subject to oversight.)

Finally, because it's not feasible for humans to approve every thought process by hand, it's critical to have learned generalization of process-level feedback. This doesn't sound like a big request, but is technically challenging when coupled with the other desiderata.

Recovering from Human Error

A different place to start is to motivate everything from the goal of designing a system that can recover from errors which humans introduce.

The ability to learn when there's no perfect feedback represents a desire to recover from input errors. Uncertain feedback and reinterpretable feedback follow from this as before.

We can't avoid all assumptions, but specifying a loss function is one area where we seem to assume much more than we bargain for; what we mean is something like "good things are roughly in this direction", but what we get is more like "good things are precisely this". We want to avoid making this type of mistake, hence no perfect loss function. Learning at all levels ensures that we can correct this type of mistake wherever it occurs. Between-level sharing is needed in order to get any traction with all-level learning.

Whole-process feedback can now be motivated by the desire to learn a whole new way of doing things, so that nothing is locked in by architectural mistakes. This of course implies process-level feedback.

Learned generalization of feedback can be seen as a desire to pre-empt human error correction; learning the patterns of errors humans correct, so as to systematically avoid those sorts of things in the future.

We Need a Theory of Process-Level Feedback

We could also motivate things primarily through the desire to facilitate process-level feedback. Process-level feedback is obviously critical for inner alignment; we want to be able to tell a system to avoid specific kinds of hypotheses (which contain inner optimizers). However, although we can apply penalties to neural nets or such things, we lack a general theory of process-level feedback that's as rigorous as theories we have for other forms of learning. I think it's probably a good idea to develop such a theory.

In addition to inner alignment, process-level feedback could be quite beneficial to outer-alignment problems such as corrigibility, non-manipulation, and non-wireheading. As I argued in another section, we can often point out that something is wrong without being able to give a utility function which represents what we want. So process-level feedback just seems like a good tool to have when training a system, and perhaps a necessary one.

You might think process-level feedback is easy to theoretically model. In a Bayesian setting, we can simply examine hypotheses and knock out the bad ones (updating on not-this-hypothesis). However, this is an incredibly weak model of process-level feedback, because there is no learned generalization of process-level feedback! Learned generalization is important, because humans can't be expected to give feedback on each individual hypothesis, telling the system whether it's OK or full of inner optimizers. (If we develop a technology that can automatically do this, great; but otherwise, we need to solve it as a learning problem.)
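The weak model above can be sketched in a few lines. This is an illustration under simple assumptions (a finite hypothesis set stored as a dictionary); `knock_out` is a hypothetical helper name.

```python
def knock_out(beliefs, rejected):
    """Condition on "none of the rejected hypotheses is true".

    beliefs:  dict mapping hypothesis name -> probability
    rejected: set of hypothesis names that a human has flagged as bad
    """
    surviving = {h: p for h, p in beliefs.items() if h not in rejected}
    total = sum(surviving.values())
    if total <= 0:
        raise ValueError("every hypothesis with nonzero weight was rejected")
    # Renormalize so the surviving hypotheses sum to 1.
    return {h: p / total for h, p in surviving.items()}


beliefs = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
updated = knock_out(beliefs, rejected={"h1"})
```

The relative weights of the surviving hypotheses are untouched. Even if `h2` is bad for exactly the same reason as `h1`, it is not penalized at all, which is the missing generalization.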

The next-most-naive model is a two-level system where you have object-level hypotheses which predict data, and meta-level hypotheses which predict which object-level hypotheses are benign/malign. Humans provide malign/benign feedback about first-level hypotheses, and second-level hypotheses generalize this information. This proposal is not very good, because now there's no way to provide process-level feedback about second-level hypotheses; but absent any justification to the contrary, these are just as often malign. This illustrates the need for whole-process feedback.
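A toy version of the two-level system might look like the following. Everything concrete here is a hypothetical assumption made for illustration: the `uses_search` feature, the two meta-level hypotheses, and all of the numbers.

```python
# Object-level hypotheses, each described by a single hypothetical feature.
object_hyps = {
    "h1": {"uses_search": True},
    "h2": {"uses_search": True},
    "h3": {"uses_search": False},
}
object_weights = {"h1": 0.4, "h2": 0.4, "h3": 0.2}

# Meta-level hypotheses: each maps an object-level hypothesis to P(malign).
meta_hyps = {
    "search_is_malign": lambda feats: 0.9 if feats["uses_search"] else 0.1,
    "nothing_is_malign": lambda feats: 0.1,
}
meta_weights = {"search_is_malign": 0.5, "nothing_is_malign": 0.5}


def update_meta(weights, labels):
    """Bayes-update meta-level weights on human labels (True = malign)."""
    new = {}
    for m, w in weights.items():
        for h, is_malign in labels.items():
            p = meta_hyps[m](object_hyps[h])
            w *= p if is_malign else 1.0 - p
        new[m] = w
    total = sum(new.values())
    return {m: w / total for m, w in new.items()}


def p_malign(h, weights):
    """Meta-level mixture estimate that object-level hypothesis h is malign."""
    return sum(w * meta_hyps[m](object_hyps[h]) for m, w in weights.items())


def reweight_objects(obj_weights, weights):
    """Down-weight object-level hypotheses by their estimated malignness."""
    scored = {h: w * (1.0 - p_malign(h, weights)) for h, w in obj_weights.items()}
    total = sum(scored.values())
    return {h: s / total for h, s in scored.items()}


# A human flags only h1; the meta-level generalizes the label to h2.
new_meta = update_meta(meta_weights, {"h1": True})
new_objects = reweight_objects(object_weights, new_meta)
```

Feedback on `h1` now lowers the weight of `h2` as well, so the generalization problem is addressed at the first level. Nothing in the sketch, though, scores the entries of `meta_hyps` for malignness, which is exactly the gap described above.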

This suggests a version of learning at all levels: if the process-level feedback at one level can be regarded as data for the next level, everything can be generalizable. However, between-level sharing is necessary, since patterns we want to avoid at one level will very often be patterns we want to avoid at all levels.

In this story, no perfect feedback and no perfect loss function are less important.

Generalizing Learning Theory

Another way to motivate things is through the purely theoretical desire to push learning theory as far as possible. Logical induction can be thought of as pure learning-theory progress, in which a very broad bounded-regret property was discovered which implied many other desirable properties. In particular, it generalized to non-sequential non-realizable settings, where Solomonoff induction only dealt with sequential prediction in realizable settings. Also, it dealt with a form of bounded rationality where Solomonoff induction only dealt with unbounded rationality.

So, why not try to push the boundaries of learning theory further?

As I discussed in the previous section, we can think of process-level feedback as feedback directly on hypotheses. Whole-process feedback ensures we focus on the interesting part of the problem, making hypotheses judge each other and themselves, rather than getting a boring partial solution by stratifying a system into separate levels.

The learned generalization problem can be understood better through a learning-theory lens: the problem is that Bayesian setups like Solomonoff induction offer no regret bounds for updates about hypotheses, due to focusing exclusively on predicting information about sense-data. So although Bayesianism supports updates on any proposition, this does not mean we get the nice learning-theoretic guarantees with respect to all such updates. This seems like a pretty big hole in Bayesian learning theory.
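For concreteness, here is the kind of guarantee I have in mind, written in notation introduced just for this illustration: $w(h)$ is the prior weight of hypothesis $h$, $x_{1:n}$ is the observed sense-data, and $\xi$ is the Bayesian mixture.

```latex
\xi(x_{1:n}) \;=\; \sum_{h} w(h)\, h(x_{1:n}) \;\ge\; w(h)\, h(x_{1:n})
\quad\Longrightarrow\quad
-\ln \xi(x_{1:n}) \;\le\; -\ln h(x_{1:n}) \;+\; \ln\frac{1}{w(h)}
\quad \text{for every } h .
```

The cumulative log loss of the mixture exceeds that of any single hypothesis by at most a constant depending on the prior. Both sides of the bound measure loss on predicting $x_{1:n}$ only, so it gives no comparable guarantee about how well the system generalizes from an update of the form "not this hypothesis".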

So we want learned generalization of as many update types as possible, by which we mean that we want loss bounds on as many different types of feedback as possible.

Uncertain feedback is just another generalized feedback type for us to explore.

Reinterpretable feedback is a more radical suggestion. We can motivate this through a desire for a theory of meta-learning: how do we learn to learn, in the broadest possible way? This motivates thinking about no perfect feedback and no perfect loss function scenarios.

Learning at all levels could be motivated from the needs of process-level feedback, as in the previous section, or from the nature of the no-perfect-loss-function scenario, as in the first section. Between-level sharing is then motivated by learning at all levels, as usual.

Planned summary for the Alignment Newsletter:

We’ve previously seen desiderata for agents that learn normativity from humans (see "Learning Normativity: A Research Agenda"). Specifically, we would like such agents to:

1. **Learn at all levels:** We don’t just learn about uncertain values, we also learn how to learn values, and how to learn to learn values, etc. There is **no perfect loss function** that works at any level; we assume conservatively that Goodhart’s Law will always apply. In order to not have to give infinite feedback for the infinite levels, we need to **share feedback between levels**.
2. **Learn to interpret feedback:** Similarly, we conservatively assume that there is **no perfect feedback**, and so rather than fixing a model for how to interpret feedback, we want feedback to be **uncertain** and **reinterpretable**.
3. **Process-level feedback:** Rather than having to justify all feedback in terms of the consequences of the agent’s actions, we should also be able to provide feedback on the way the agent is reasoning. Sometimes we’ll have to judge the entire chain of reasoning with **whole-process feedback**.

This post notes that we can motivate these desiderata from multiple different frames:

1. _Outer alignment:_ The core problem of outer alignment is that any specified objective tends to be wrong. This applies at all levels, suggesting that we need to **learn at all levels**, and also **learn to interpret feedback** for the same reason. **Process-level feedback** is then needed because not all decisions can be justified based on consequences of actions.
2. _Recovering from human error:_ Another view that we can take is that humans don’t always give the right feedback, and so we need to be robust to this. This motivates all the desiderata in the same way as for outer alignment.
3. _Process-level feedback:_ We can instead view process-level feedback as central, since having agents doing the right type of _reasoning_ (not just getting good outcomes) is crucial for inner alignment. In order to have something general (rather than identifying cases of bad reasoning one at a time), we could imagine learning a classifier that detects whether reasoning is good or not. However, then we don’t know whether the reasoning of the classifier is good or not. Once again, it seems we would like to **learn at all levels**.
4. _Generalizing learning theory:_ In learning theory, we have a distribution over a set of hypotheses, which we update based on how well the hypotheses predict observations. **Process-level feedback** would allow us to provide feedback on an individual hypothesis, and this feedback could be **uncertain**. **Reinterpretable feedback** on the other hand can be thought of as part of a (future) theory of meta-learning.

> To be meaningful, this requires whole-process feedback: we need to judge thoughts by their entire chain of origination. (This is technically challenging, because the easiest way to implement process-level feedback is to create a separate meta-level which oversees the rest of the system; but then this meta-level would not itself be subject to oversight.)

I thought you were going to say it's technically challenging because you need transparency / interpretability ... At least in human cognition (and logical induction too, right?) thoughts-about-stuff and thoughts-about-thoughts-about-stuff and thoughts-about-thoughts-about-thoughts-about-stuff and thoughts-about-all-levels and so on are all mixed together in a big pot, and they share the same data type, and they're all inside the learned black box.

Well, transparency is definitely a challenge. I'm mostly saying this is a technical challenge even if you have magical transparency tools, and I'm kind of trying to design the system you would want to use if you had magical transparency tools.

But I don't think it's difficult for the reason you say. I don't think multi-level feedback or whole-process feedback should be construed as requiring the levels to be sorted out nicely. Whole-process feedback in particular just means that you can give feedback on the whole chain of computation; it's basically against sorting into levels.

Multi-level feedback means, to me, that if we have an insight about, EG, how to think about value uncertainty (which is something like a 3rd-level thought: 1st level is information about object level; 2nd level is information about the value function; 3rd level is information about how to learn the value function), we can give the system feedback about that. So the system doesn't need to sort things out into levels; it just needs to be capable of accepting feedback of each type.

> To be meaningful, this requires whole-process feedback: we need to judge thoughts by their entire chain of origination. (This is technically challenging, because the easiest way to implement process-level feedback is to create a separate meta-level which oversees the rest of the system; but then this meta-level would not itself be subject to oversight.)

I'd be interested to hear this elaborated further. It seems to me to be technically challenging but not very; it feels like the sort of thing that we could probably solve with a couple people working on it part-time for a few years. I'm wondering if I'm underestimating the difficulties. At any rate it's fun to think about architectures. Maybe the system keeps a log of its thoughts, and has a process or subcomponent that reads the log, judges it, and then modifies the system accordingly. This component or process is not exempt from all this and occasionally ends up modifying itself. What would go wrong with this? Well, on a practical level maybe it would be too computationally expensive and/or be vulnerable to accidentally neutering itself or otherwise getting stuck in attractors where it self-modifies away the ability to make further good self-modifications. But both of those problems seem surmountable to me.

> I'd be interested to hear this elaborated further. It seems to me to be technically challenging but not very;

  1. I agree; I'm not claiming this is a multi-year obstacle even. Mainly I included this line because I thought "add a meta-level" would be what some readers would think, so, I wanted to emphasize that that's not a solution.
  2. To elaborate on the difficulty: this is challenging because of the recursive nature of the request. Roughly, you need hypotheses which not only claim things at the object level but also hypothesize a method of hypothesis evaluation, i.e., make claims about process-level feedback. Your belief distribution then needs to incorporate these beliefs. (So how much you endorse a hypothesis can depend on how much you endorse that very hypothesis!) And, on top of that, you need to know how to update that big mess when you get more information. This seems sort of like it has to violate Bayes' Law, because when you make an observation, it will not only shift hypotheses via their likelihood ratios for that observation, but also produce secondary effects where hypotheses get shifted around because other hypotheses which like/dislike them got shifted around. How all of this should work seems quite unclear.
  3. Part of the difficulty is doing this in conjunction with everything else, though. Asking for one thing that's impossible in the standard paradigm might have an easy answer. When asking for several, each might individually have an easy answer, but it might not be possible to combine those easy answers.

Thanks! Well, I for one am feeling myself get nerd-sniped by this agenda. I'm resisting so far (so much else to do! Besides, this isn't my comparative advantage) but I'll definitely be reading your posts going forward and if you ever want to bounce ideas off me in a call I'd be down. :)

A different perspective, perhaps not motivating quite the same things as yours:

Embedded Reflective Consistency

A theory needs to be able to talk about itself and its position in and effect on the world. So in particular it will have beliefs about how the application of just this theory in just this position will influence whatever it is that we want the theory to do. Then reflective consistency demands that the theory rates itself well on its objective: If I have a belief, and also a belief that the first belief is most likely the result of deception, then clearly I have to change one of these.

Now, if there was a process that could get us there, then it would have to integrate process-level feedback, because the goal is a certain consistency with it. It would also give a result at all levels and sharing between them, because there is only one theory as output, which must judge its own activity (of which judging its own activity is part).

So far this looks like the theory of process-level feedback, but I think it also has some things to say about corrigibility. For one, we would put in a theory that's possibly not reflectively consistent, and get one out that is. The output space is much smaller than the input space, and probably discrete. If we reasonably suppose that the mapping is continuous, then that means it's resistant to small changes in input. Also, an embedded agent must see itself as made of potentially broken parts, so it must be able to revise any one thing.