Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Note: This is the first post from part five: possible approaches of the sequence on iterated amplification. The fifth section of the sequence breaks down some of these problems further and describes some possible approaches.


Suppose that I would like to train an RL agent to help me get what I want.

If my preferences could be represented by an easily-evaluated utility function, then I could just use my utility function as the agent’s reward function. But in the real world that’s not what human preferences look like.

So if we actually want to turn our preferences into a reward function suitable for training an RL agent, we have to do some work.

This post is about the straightforward parts of reward engineering. I’m going to deliberately ignore what seem to me to be the hardest parts of the problem. Getting the straightforward parts out of the way seems useful for talking more clearly about the hard parts (and you never know what questions may turn out to be surprisingly subtle).

The setting

To simplify things even further, for now I’ll focus on the special case where our agent is taking a single action a. All of the difficulties that arise in the single-shot case also arise in the sequential case, but the sequential case also has its own set of additional complications that deserve their own post.

Throughout the post I will imagine myself in the position of an “overseer” who is trying to specify a reward function R(a) for an agent. You can imagine the overseer as the user themselves, or (more realistically) as a team of engineers and/or researchers who are implementing a reward function intended to express the user’s preferences.

I’ll often talk about the overseer computing R(a) themselves. This is at odds with the usual situation in RL, where the overseer implements a very fast function for computing R(a) in general (“1 for a win, 0 for a draw, -1 for a loss”). Computing R(a) for a particular action a is strictly easier than producing a fast general implementation, so in some sense this is just another simplification. I talk about why it might not be a crazy simplification in section 6.

Contents

  1. Long time horizons. How do we train RL agents when we care about the long-term effects of their actions?
  2. Inconsistency and unreliability. How do we handle the fact that we have only imperfect access to our preferences, and different querying strategies are not guaranteed to yield consistent or unbiased answers?
  3. Normative uncertainty. How do we train an agent to behave well in light of its uncertainty about our preferences?
  4. Widely varying reward. How do we handle rewards that may vary over many orders of magnitude?
  5. Sparse reward. What do we do when our preferences are very hard to satisfy, such that they don’t provide any training signal?
  6. Complex reward. What do we do when evaluating our preferences is substantially more expensive than running the agent?
  • Conclusion.
  • Appendix: harder problems.

1. Long time horizons

A single decision may have very long-term effects. For example, even if I only care about maximizing human happiness, I may instrumentally want my agent to help advance basic science that will one day improve cancer treatment.

In principle this could fall out of an RL task with “human happiness” as the reward, so we might think that neglecting long-term effects is just a shortcoming of the single-shot problem. But even in theory there is no way that an RL agent can learn to handle arbitrarily long-term dependencies (imagine training an RL agent to handle 40-year time horizons), and so focusing on the sequential RL problem doesn’t address this issue.

I think that the only real approach is to choose a reward function that reflects the overseer’s expectations about long-term consequences — i.e., the overseer’s task involves both making predictions about what will happen, and value judgments about how good it will be. This makes the reward function more complex and in some sense limits the competence of the learner by the competence of the reward function, but it’s not clear what other options we have.

Before computing the reward function R(a), we are free to execute the action a and observe its short-term consequences. Any data that could be used in our training process can just as well be provided as an input to the overseer, who can use the auxiliary input to help predict the long-term consequences of an action.

2. Inconsistency and unreliability

A human judge has no hope of making globally consistent judgments about which of two outcomes is preferred; the best we can hope for is for their judgments to be right in sufficiently obvious cases, and to be some kind of noisy proxy when things get complicated. Actually outputting a numerical reward (implementing some utility function for our preferences) is even more hopelessly difficult.

Another way of seeing the difficulty is to suppose that the overseer’s judgment is a noisy and potentially biased evaluation of the quality of the underlying action. If R(a) and R(a′) are both big numbers with a lot of noise, but the two actions are actually quite similar, then the difference will be dominated by noise. Imagine an overseer trying to estimate the impact of drinking a cup of coffee on Alice’s life by estimating her happiness in a year conditioned on drinking the coffee, estimating happiness conditioned on not drinking the coffee, and then subtracting the estimates.

We can partially address this difficulty by allowing the overseer to make comparisons instead of assessing absolute value. That is, rather than directly implementing a reward function, we can allow the overseer to implement an antisymmetric comparison function C(a, a′): which of two actions a and a′ is better in context? This function can take real values specifying how much one action is better than another, and should be antisymmetric.

In the noisy-judgments model, we are hoping that the noise or bias of a comparison C(a, a′) depends on the actual magnitude of the difference between the actions, rather than on the absolute quality of each action. This hopefully means that the total error/bias does not drown out the actual signal.
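To make the noisy-judgments model concrete, here is a minimal simulation sketch. The specific numbers (a baseline quality of 1000, a true difference of 1, 10% relative noise) and all function names are illustrative assumptions, not part of the argument above.

```python
import random
import statistics

def absolute_estimate(quality, rng, rel_noise=0.1):
    """Noisy estimate of absolute quality; noise scales with the size of the quality."""
    return quality + rng.gauss(0.0, rel_noise * abs(quality))

def difference_of_absolutes(q_a, q_b, rng):
    """Estimate how much better a is than b by subtracting two noisy absolute estimates."""
    return absolute_estimate(q_a, rng) - absolute_estimate(q_b, rng)

def direct_comparison(q_a, q_b, rng, rel_noise=0.1):
    """Noisy comparison C(a, b); noise scales with the true difference (assumed model)."""
    diff = q_a - q_b
    return diff + rng.gauss(0.0, rel_noise * abs(diff))

rng = random.Random(0)
q_coffee, q_no_coffee = 1001.0, 1000.0  # hypothetical "happiness in a year" values

subtracted = [difference_of_absolutes(q_coffee, q_no_coffee, rng) for _ in range(2000)]
compared = [direct_comparison(q_coffee, q_no_coffee, rng) for _ in range(2000)]

spread_subtracted = statistics.stdev(subtracted)  # far larger than the true difference of 1
spread_compared = statistics.stdev(compared)      # a fraction of the true difference
```

Under these assumptions the subtracted estimates are almost pure noise, while the direct comparisons stay close to the true difference. Whether real overseers behave like the second model is exactly the hope expressed above, not something the sketch establishes.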

We can then define the decision problem as a zero-sum game: two agents propose different actions a and a′, and receive rewards C(a, a′) and C(a′, a). At the equilibrium of this game, we can at least rest assured that the agent doesn’t do anything that is unambiguously worse than another option it could think of. In general, this seems to give us sensible guarantees when the overseer’s preferences are not completely consistent.
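As a rough illustration of this game, the sketch below runs fictitious play (each player repeatedly best-responds to the other's past actions) on a small hand-written comparison matrix. The matrix, the choice of fictitious play, and the function names are assumptions made for the example; nothing above commits to a particular way of finding the equilibrium.

```python
def best_response(C, opponent_counts):
    """Index of the action with the highest expected payoff against the opponent's empirical mix."""
    n = len(C)
    payoffs = [sum(C[a][b] * opponent_counts[b] for b in range(n)) for a in range(n)]
    return max(range(n), key=lambda a: payoffs[a])

def fictitious_play(C, rounds=1000):
    """Return player 1's empirical action frequencies after repeated best responses."""
    n = len(C)
    counts_1 = [1] * n  # actions played by player 1, plus one pseudo-count each
    counts_2 = [1] * n  # actions played by player 2, plus one pseudo-count each
    for _ in range(rounds):
        a1 = best_response(C, counts_2)
        a2 = best_response(C, counts_1)
        counts_1[a1] += 1
        counts_2[a2] += 1
    total = sum(counts_1)
    return [c / total for c in counts_1]

# C[a][b] > 0 means the overseer judges a better than b; C is antisymmetric.
C = [
    [0.0, 0.0, 2.0],    # action 0: no clear preference vs action 1, clearly better than action 2
    [0.0, 0.0, 1.0],    # action 1
    [-2.0, -1.0, 0.0],  # action 2: unambiguously worse than both alternatives
]
mix = fictitious_play(C)
```

In this toy case the unambiguously worse action ends up with negligible weight, while the sketch says nothing about how the overseer should resolve the tie between the first two actions.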

One subtlety is that in order to evaluate the comparison C(a, a′), we may want to observe the short-term consequences of taking action a or action a′. But in many environments it will only be possible to take one action. So after looking at both actions we will need to choose at most one to actually execute (e.g. we need to estimate how good drinking coffee was, after observing the short-term consequences of drinking coffee but without observing the short-term consequences of not drinking coffee). This will generally increase the variance of C, since we will need to use our best guess about the action which we didn’t actually execute. But of course this is a source of variance that RL algorithms already need to contend with.

3. Normative uncertainty

The agent is uncertain not only about its environment, but also about the overseer (and hence the reward function). We need to somehow specify how the agent should behave in light of this uncertainty. Structurally, this is identical to the philosophical problem of managing normative uncertainty.

One approach is to pick a fixed yardstick to measure with. For example, our yardstick could be “adding a dollar to the user’s bank account.” We can then measure C(a, a′) as a multiple of this yardstick: “how many dollars would we have to add to the user’s bank account to make them indifferent between taking action a and action a′?” If the user has diminishing returns to money, it would be a bit more precise to ask: “what chance of replacing a with a′ is worth adding a dollar to the user’s bank account?” The comparison C(a, a′) is then the inverse of this probability.
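A small worked example of the probability version of the yardstick, as a sketch: the function name and the 0.1% figure are hypothetical.

```python
def comparison_in_dollars(indifference_probability):
    """Size of C(a, a') in dollars, given the chance of replacing a with a'
    that the user values at exactly one dollar."""
    if not 0.0 < indifference_probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return 1.0 / indifference_probability

# If a 0.1% chance of the replacement is worth one dollar,
# the full replacement is worth about a thousand dollars.
stakes = comparison_in_dollars(0.001)
```

Asking about a small probability keeps the amounts of money involved near one dollar, which is why this version is less sensitive to diminishing returns.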

This is exactly analogous to the usual construction of a utility function. In the case of utility functions, our choice of yardstick is totally unimportant — different possible utility functions differ by a scalar, and so give rise to the same preferences. In the case of normative uncertainty that is no longer the case, because we are specifying how to aggregate the preferences of different possible versions of the overseer.

I think it’s important to be aware that different choices of yardstick result in different behavior. But hopefully this isn’t an important difference, and we can get sensible behavior for a wide range of possible choices of yardstick — if we find a situation where different yardsticks give very different behaviors, then we need to think carefully about how we are applying RL.

For many yardsticks it is possible to run into pathological situations. For example, suppose that the overseer might decide that dollars are worthless. They would then radically increase the value of all of the agent’s decisions, measured in dollars. So an agent deciding what to do would effectively care much more about worlds where the overseer decided that dollars are worthless.
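The pathology can be seen with two hypothetical versions of the overseer. The probabilities and dollar figures below are made up purely for illustration.

```python
def expected_comparison(worlds):
    """Expected C(a, a') in dollars over possible versions of the overseer.

    worlds: list of (probability, comparison_in_dollars) pairs.
    """
    return sum(p * c for p, c in worlds)

# In the ordinary world (99% likely) the overseer would pay $10 for a over a'.
# In a 1%-likely world where dollars are nearly worthless, a similar-sized
# preference for a' over a is expressed as a huge dollar figure.
worlds = [(0.99, 10.0), (0.01, -1_000_000.0)]
value = expected_comparison(worlds)  # negative: the unlikely world dominates
```

An agent maximizing this expectation would choose a′, even though the overseer prefers a in 99% of cases.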

So it seems best to choose a yardstick whose value is relatively stable across possible worlds. To this effect we could use a broader basket of goods, like 1 minute of the user’s time + 0.1% of the day’s income + etc. It may be best for the overseer to use common sense about how important a decision is relative to some kind of idealized influence in the world, rather than sticking to any precisely defined basket.

It is also desirable to use a yardstick which is simple, and preferably which minimizes the overseer’s uncertainty. Ideally by standardizing on a single yardstick throughout an entire project, we could end up with definitions that are very broad and robust, while being very well-understood by the overseer.

Note that if the same agent is being trained to work for many users, then this yardstick is also specifying how the agent will weigh the interests of different users — for example, whose accents will it prefer to spend modeling capacity on understanding? This is something to be mindful of in cases where it matters, and it can provide intuitions about how to handle the normative uncertainty case as well. I feel that economic reasoning is useful for arriving at sensible conclusions in these situations, but there are other reasonable perspectives.

4. Widely varying reward

Some tasks may have widely varying rewards — sometimes the user would only pay 1¢ to move the decision one way or the other, and sometimes they would pay $10,000.

If small-stakes and large-stakes decisions occur comparably frequently, then we can essentially ignore the small-stakes decisions. That will happen automatically with a traditional optimization algorithm — after we normalize the rewards so that the “big” rewards don’t totally destroy our model, the “small” rewards will be so small that they have no effect.

Things get more tricky when small-stakes decisions are much more common than the large-stakes decisions. For example, if the importance of decisions is power-law distributed with an exponent of 1, then decisions of all scales are in some sense equally important, and a good algorithm needs to do well on all of them. This may sound like a very special case, but I think it is actually quite natural for there to be several scales that are all comparably important in total.

In these cases, I think we should do importance sampling — we oversample the high-stakes decisions during training, and scale the rewards down by the same amount, so that the contribution to the total reward is correct. This ensures that the scale of rewards is basically the same across all episodes, and lets us apply a traditional optimization algorithm.
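Here is a minimal sketch of that importance-sampling scheme on a toy distribution. The four stake levels and the choice to sample in proportion to frequency times stakes are assumptions for the example; exact fractions are used so the bookkeeping can be checked exactly.

```python
from fractions import Fraction

# Hypothetical decision types, by stakes in dollars. Natural frequency is
# proportional to 1/stakes, so every scale matters equally in total.
stakes = [Fraction(1, 100), Fraction(1), Fraction(100), Fraction(10000)]
raw_freq = [1 / s for s in stakes]
natural = [f / sum(raw_freq) for f in raw_freq]

# Training distribution: sample each type in proportion to frequency * stakes,
# which oversamples the rare high-stakes decisions.
importance = [p * s for p, s in zip(natural, stakes)]
proposal = [w / sum(importance) for w in importance]

# Scale each reward down by the same factor it was oversampled by.
weights = [p / q for p, q in zip(natural, proposal)]
scaled_rewards = [s * w for s, w in zip(stakes, weights)]

natural_total = sum(p * s for p, s in zip(natural, stakes))
resampled_total = sum(q * r for q, r in zip(proposal, scaled_rewards))
```

In this toy case every scale is sampled a quarter of the time, every scaled reward has the same magnitude, and the expected total reward is unchanged.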

Further problems arise when there are some very high-stakes situations that occur very rarely. In some sense this just means the learning problem is actually very hard — we are going to have to learn from few samples. Treating different scales as the same problem (using importance sampling) may help if there is substantial transfer between different scales, but it can’t address the whole problem.

For very rare+high-stakes decisions it is especially likely that we will want to use simulations to avoid making any obvious mistakes or missing any obvious opportunities. Learning with catastrophes is an instantiation of this setting, where the high-stakes settings have only downside and no upside. I don’t think we really know how to cope with rare high-stakes decisions; there are likely to be some fundamental limits on how well we can do, but I expect we’ll be able to improve a lot over the current state of the art.

5. Sparse reward

In many problems, “almost all” possible actions are equally terrible. For example, if I want my agent to write an email, almost all possible strings are just going to be nonsense.

One approach to this problem is to adjust the reward function to make it easier to satisfy — to provide a “trail of breadcrumbs” leading to high reward behaviors. I think this basic idea is important, but that changing the reward function isn’t the right way to implement it (at least conceptually).

Instead we could treat the problem statement as given, but view auxiliary reward functions as a kind of “hint” that we might provide to help the algorithm figure out what to do. Early in the optimization we might mostly optimize this hint, but as optimization proceeds we should anneal towards the actual reward function.
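One simple way to picture the annealing is a linear schedule, sketched below. The linear form, the function name, and the parameter names are assumptions; any schedule that ends at the actual reward function would fit the description above.

```python
def training_reward(true_reward, hint_reward, step, anneal_steps):
    """Blend a hint into the true reward, with the hint's weight decaying linearly to zero."""
    hint_weight = max(0.0, 1.0 - step / anneal_steps)
    return (1.0 - hint_weight) * true_reward + hint_weight * hint_reward

# Early in training the signal is all hint; after annealing it is all true reward.
early = training_reward(true_reward=0.0, hint_reward=1.0, step=0, anneal_steps=1000)
late = training_reward(true_reward=0.0, hint_reward=1.0, step=1000, anneal_steps=1000)
```

The problem statement (the true reward) never changes here; only the weight given to the hint does.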

Typical examples of proxy reward functions include “partial credit” for behaviors that look promising; artificially high discount rates and careful reward shaping; and adjusting rewards so that small victories have an effect on learning even though they don’t actually matter. All of these play a central role in practical RL.

A proxy reward function is just one of many possible hints. Providing demonstrations of successful behavior is another important kind of hint. Again, I don’t think that this should be taken as a change to the reward function, but rather as side information to help achieve high reward. In the long run, we will hopefully design learning algorithms that automatically learn how to use general auxiliary information.

6. Complex reward

A reward function that intends to capture all of our preferences may need to be very complicated. If a reward function is implicitly estimating the expected consequences of an action, then it needs to be even more complicated. And for powerful learners, I expect that reward functions will need to be learned rather than implemented directly.

It is tempting to substitute a simple proxy for a complicated real reward function. This may be important for getting the optimization to work, but it is problematic to change the definition of the problem.

Instead, I hope that it will be possible to provide these simple proxies as hints to the learner, and then to use semi-supervised RL to optimize the real hard-to-compute reward function. This may allow us to perform optimization even when the reward function is many times more expensive to evaluate than the agent itself; for example, it might allow a human overseer to compute the rewards for a fast RL agent on a case by case basis, rather than being forced to design a fast-to-compute proxy.
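As a very rough sketch of the bookkeeping this implies, the class below calls an expensive overseer on only a fraction of episodes and otherwise falls back on a cheap estimate. The class name, the per-action running average (standing in for a learned reward model), and the query probability are all hypothetical simplifications, not a description of any particular semi-supervised RL algorithm.

```python
import random

class SparselyLabeledReward:
    """Query an expensive overseer on some episodes; use a cheap estimate on the rest."""

    def __init__(self, overseer, query_probability, rng):
        self.overseer = overseer  # expensive, e.g. a human judgment
        self.query_probability = query_probability
        self.rng = rng
        self.totals = {}  # action -> (sum of overseer rewards, number of labels)
        self.overseer_calls = 0

    def estimate(self, action):
        """Cheap estimate: average of the overseer's past rewards for this action."""
        total, count = self.totals.get(action, (0.0, 0))
        return total / count if count else 0.0

    def reward(self, action):
        """Return the overseer's reward on queried episodes, the estimate otherwise."""
        if self.rng.random() < self.query_probability:
            r = self.overseer(action)
            self.overseer_calls += 1
            total, count = self.totals.get(action, (0.0, 0))
            self.totals[action] = (total + r, count + 1)
            return r
        return self.estimate(action)

# Hypothetical usage: the overseer is consulted on roughly 10% of episodes.
reward_fn = SparselyLabeledReward(
    overseer=lambda a: 1.0 if a == "good" else 0.0,
    query_probability=0.1,
    rng=random.Random(0),
)
for _ in range(1000):
    reward_fn.reward("good")
```

The point is only the cost structure: the agent acts on every episode, while the expensive evaluation runs on a small subset.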

Even if we are willing to spend much longer computing the reward function than the agent itself, we still won’t be able to find a reward function that perfectly captures our preferences. But it may be just as good to choose a reward function that captures our preferences “for all that the agent can tell,” i.e. such that, conditioned on two outcomes receiving the same expected reward, the agent cannot predict which of them we would prefer. This seems much more realistic, once we are willing to have a reward function with much higher computational complexity than the agent.

Conclusion

In reinforcement learning we often take the reward function as given. In real life, we are only given our preferences — in an implicit, hard-to-access form — and need to engineer a reward function that will lead to good behavior. This presents a bunch of problems. In this post I discussed six problems which I think are relatively straightforward. (Straightforward from the reward-engineering perspective — the associated RL tasks may be very hard!)

Understanding these straightforward problems is important if we want to think clearly about very powerful RL agents. But I expect that most of our time will go into thinking about harder problems, for which we don’t yet have any workable approach. These harder problems may expose more fundamental limits of RL that will require substantially new techniques to address.


Appendix: harder problems

Informed oversight

The process that produces a decision may encode important information about the quality of that decision, and if we throw out that information then a very bad decision may nevertheless receive a high reward. For example, if we want an agent to produce original art, we may not be able to evaluate a painting without understanding the process that produced it (and, in particular, what other artwork it was inspired by).

In some cases, recovering information about the process seems feasible. For example, if the agent is using attention or an external memory, the overseer can check what art the agent looked at. But in other cases it seems much more challenging: if the agent’s weights have memorized properties of particular artwork, we would need new techniques in order to make that information available to the overseer.

It is not clear whether there is any general way to give the overseer enough insight into the agent’s thought process to avoid problems like this; there may be some inherent limitations on delegation. Either way, it would be nice to understand the situation much better.

Capability amplification

By folding our preferences and long-term predictions into the reward function, we are in some sense limiting the agent to be no stronger than the overseer. If we start with a human, then RL will only ever produce human-level reasoning about long-term consequences or about “what is good.”

So we need some other mechanism for producing better long-term reasoning. I think that the most promising approach is to treat “reasoning well” as a short-term problem, for which humans can provide adequate oversight, and then to run that reasoning process for longer in order to produce more powerful reasoning.

Reward hacking / security

If our reward function has any problems, a powerful RL agent may tend to find actions a that exploit those problems in order to receive a high reward. In order to get good consequences, our reward function needs to be sufficiently secure that the agent can’t attack it in this way.

The first line of defense against an attack is to avoid immediately executing the agent’s action. Instead, the overseer can inspect that action and try to ensure that it isn’t intended as an attack. This leaves attacks which the overseer cannot recognize as attacks, or which do damage even when the overseer looks at them.

If the techniques from the previous sections actually allow the overseer to evaluate the agent’s actions, then they can also probably allow the overseer to detect attacks. Security during evaluation itself is an additional question though.

The main cause for hope is if the overseer can (1) be smarter than the agent which is trying to attack it, and (2) have access to some information about the agent’s thought process. Hopefully (2) allows the overseer to overcome the disadvantage of the “position of the interior” — if the agent picks a particular attack vector, the overseer can “watch them thinking” and then devote its energies to trying to detect or defend against that particular attack.


Security during evaluation itself is an additional question though.

The main cause for hope is if the overseer can (1) be smarter than the agent which is trying to attack it, and (2) have access to some information about the agent’s thought process.

This doesn't seem to help if it's the training process that is trying to attack the overseer, as I explain here. To describe it another way, what if the agent isn't thinking "how can I attack the overseer", which would let the overseer detect the attack before it happens. Instead the agent is thinking a line of thought (call it X) about some topic completely unrelated to "trying to attack the overseer", that happens to trigger a flaw in the overseer's evaluation process, and the reason the agent is thinking X is because the training process was trying to produce an agent that achieves high reward, and thinking X happens to achieve very high reward.

Since the optimization power that is causing the attack to happen is coming from the training process rather than the agent, perhaps another way to characterize the problem is that it's an outer alignment problem, whereas the "Reward hacking / security" section is only addressing the inner alignment problem. (I'm not totally sure this is correct use of terminology though.)

Does this make sense, and if so have you addressed this problem anywhere or have any ideas for how to solve it?

This is definitely a real concern, and one that I do spend time thinking about. Here are some thoughts:

First, if you want to understand what a model is doing, it's definitely going to be necessary to have access to the history/provenance of how/why each different component of the model came to be during the training process. This shouldn't be that difficult in an informed oversight setting if you do something like the following, though. Imagine some sequence of and suppose the minimum number of steps for an attack of this form to be introduced is . Then, at step we can implement oversight by calling to inspect the whole sequence and look for any attacks being developed and then retrain from right before that point in such a way that fixes them.

Second, however, it does still have to be the case that the overseer isn't too exploitable. If there is some strategy available to the model that causes the overseer to strongly believe that the model is acceptable, and that is easy to reach without going through anything that looks unacceptable, then this won't work. However, there are reasons to believe that this condition might not be that hard to obtain. First, it's fine if the overseer is just uncertain about whether the model is acceptable; we can just retrain the model to be more transparent in that case and be alright. Second, many of the natural sorts of attacks in this class are of the form where they exploit some knowledge that the model has but that its overseer doesn't, but in the informed oversight setup, the overseer should know everything that the model knows, preventing those sorts of attacks from working.

Third, I expect the training process to be quite a lot dumber than the model such that I expect the most concerning attacks on the overseer to come from the model and not the training process. It is conceivable, however, that the model could "hack" its own gradient so as to get the training process to modify it in a particular way, making an attack from the model look like it was coming from the training process. In this case, I would hope to prevent attacks of this form by ensuring that the model is properly myopic such that it isn't considering plans of this form.

Finally, if you're concerned about attacks of this form, you can modify your training process to make them less likely. In particular, in the standard setup that I generally think about, oversight is included as part of the model's loss function such that the model is incentivized to make itself transparent to the overseer. However, there are many different ways you can do this. Rather than have oversight as a penalty term, for example, you could have it as a constraint, which could help avoid Goodharting on the overseer (this is similar in spirit to your suggestion of using a quantilizer). Alternatively, you could not include oversight as part of the loss function at all and instead rely on the overseer manually fixing unacceptable models or retraining until an acceptable model is found, though I expect those sorts of approaches to not be very competitive.
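To make the penalty-versus-constraint distinction concrete, here is a toy sketch over a finite set of candidate models. The selection functions, the candidate dictionaries, and the numbers are all hypothetical; a real training process would optimize continuously rather than pick from a list.

```python
def select_with_penalty(candidates, task_loss, oversight_penalty, weight):
    """Penalty form: oversight is traded off against task performance."""
    return min(candidates, key=lambda m: task_loss(m) + weight * oversight_penalty(m))

def select_with_constraint(candidates, task_loss, oversight_penalty, threshold):
    """Constraint form: only candidates the overseer accepts are eligible at all."""
    acceptable = [m for m in candidates if oversight_penalty(m) <= threshold]
    if not acceptable:
        return None  # nothing passes oversight; retrain rather than pick a violator
    return min(acceptable, key=task_loss)

candidates = [
    {"name": "opaque", "loss": 0.1, "penalty": 1.0},
    {"name": "transparent", "loss": 0.3, "penalty": 0.0},
]

def task_loss(m):
    return m["loss"]

def oversight_penalty(m):
    return m["penalty"]

by_penalty = select_with_penalty(candidates, task_loss, oversight_penalty, weight=0.1)
by_constraint = select_with_constraint(candidates, task_loss, oversight_penalty, threshold=0.5)
```

With a small penalty weight, the opaque candidate wins because its task loss is low enough to pay for the penalty; under the constraint it is never eligible.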

Thanks for the reply. Good to know that it's on someone's radar already. I have some follow-up questions, which you don't have to all answer right away, since obviously this is an ongoing research project.

provenance

Ah, this is a useful connection to make. BTW I still have an unanswered question about provenance.

Imagine some sequence of and suppose the minimum number of steps for an attack of this form to be introduced is .

This part confuses me because intuitively it seems like if that minimum number of steps is larger that would make the problem easier, but in your proposal the overseer would need to inspect a longer sequence of models.

Also, roughly how many calls to the model do you usually imagine the overseer as making? What does "look for any attacks" actually consist of here? How can you be sure that whatever the overseer is doing will catch (or at least become suspicious/uncertain about) all possible attacks (and not have so many false positives that competitiveness is significantly impacted)?

many of the natural sorts of attacks in this class are of the form where they exploit some knowledge that the model has but that its overseer doesn’t

Ruling out one class of attacks doesn't seem to help much… To make a strong argument it seems like you'd need to exhaustively list all possible classes of attack and show how the overseer can catch all of them.

Third, I expect the training process to be quite a lot dumber than the model such that I expect the most concerning attacks on the overseer to come from the model and not the training process.

  1. The training process could be dumb but have access to a lot more compute and end up optimizing harder than the model would. It would take a lot longer to develop each attack but eventually find all possible attacks given enough compute.
  2. Future advances in ML/RL could make the training process smarter and the whole point of IDA is to take advantage of such advances to keep it competitive, so you can't assume that the training process is dumb.
  3. At the lower levels of IDA both the model and the overseer are pretty dumb, so even a moderately smart training process could be smarter.

(These arguments are really handwavy but hopefully get my point/concern across.)

Rather than have oversight as a penalty term, for example, you could have it as a constraint, which could help avoid Goodharting on the overseer

Not sure what you have in mind here. If oversight is a constraint, what is providing the reward?

this is similar in spirit to your suggestion of using a quantilizer

Seems like with a quantilizer you'd take a competitiveness hit because the agent would no longer find the most optimal plans (that a comparable RL agent could find). How conservative do we need to be here, to ensure safety? How to analyze the competitiveness tradeoff, to make sure the competitiveness hit isn't too big?
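For reference, a quantilizer over a finite action set can be sketched as follows. This assumes a uniform base distribution over the listed actions, which is a simplification; the function name and parameters are illustrative.

```python
import math
import random

def quantilize(actions, utility, q, rng):
    """Sample uniformly from the top q fraction of actions ranked by utility.

    Assumes a uniform base distribution over a finite list of actions.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=utility, reverse=True)
    top_k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:top_k])

rng = random.Random(0)
picked = quantilize(list(range(100)), utility=lambda a: a, q=0.25, rng=rng)
```

Smaller q moves the behavior toward plain maximization (and toward whatever flaws maximization exploits); larger q is more conservative but gives up more expected utility, which is the competitiveness tradeoff being asked about.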

This part confuses me because intuitively it seems like if that minimum number of steps is larger that would make the problem easier, but in your proposal the overseer would need to inspect a longer sequence of models.

You can avoid having to inspect the whole sequence by doing binary search to find the last model for which oversight works.
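A sketch of that binary search, under the assumption (not argued for here) that oversight works for some initial stretch of checkpoints and fails for every checkpoint after it. The function and argument names are hypothetical.

```python
def last_overseeable(checkpoints, oversight_works):
    """Index of the last checkpoint for which oversight works, or -1 if none does.

    Assumes oversight works for a prefix of the checkpoints and fails for all later ones.
    """
    lo, hi = 0, len(checkpoints)  # invariant: all before lo pass; all at or after hi fail
    while lo < hi:
        mid = (lo + hi) // 2
        if oversight_works(checkpoints[mid]):
            lo = mid + 1
        else:
            hi = mid
    return lo - 1

# Hypothetical usage: 100 checkpoints, oversight starts failing at checkpoint 37.
inspected = []

def oversight_works(checkpoint):
    inspected.append(checkpoint)
    return checkpoint < 37

index = last_overseeable(list(range(100)), oversight_works)
```

The number of oversight calls grows with the logarithm of the sequence length rather than the length itself, but only if the monotonicity assumption holds.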

Also, roughly how many calls to the model do you usually imagine the overseer as making? What does "look for any attacks" actually consist of here? How can you be sure that whatever the overseer is doing will catch (or at least become suspicious/uncertain about) all possible attacks (and not have so many false positives that competitiveness is significantly impacted)?

Well, that's the hard part; this is still a very open question. I'm currently working on a large post summarizing my thoughts on this question, though, so rather than try to start laying out my thoughts here I think I'll just defer until that's out, which should be fairly soon.

Ruling out one class of attacks doesn't seem to help much… To make a strong argument it seems like you'd need to exhaustively list all possible classes of attack and show how the overseer can catch all of them.

Yep. I'm optimistic that this will be possible if we can get a better understanding of what transparent optimization looks like, however, since then the overseer can pursue the strategy of only being certain of the acceptability of models it can verify are in fact being transparent to it.

The training process could be dumb but have access to a lot more compute and end up optimizing harder than the model would. It would take a lot longer to develop each attack but eventually find all possible attacks given enough compute.

True, but the nice thing about that is that it should be slow and continuous such that if you're doing oversight during training according to a scheme like the one I described then you should be able to catch it.

Future advances in ML/RL could make the training process smarter and the whole point of IDA is to take advantage of such advances to keep it competitive, so you can't assume that the training process is dumb.

Well, maybe dumb was the wrong word there. I agree that it's bad to assume that the training process is dumb, but I think it's fine to assume that it'll be slow and continuous such that there's always some previous model such that an amplified version of it can oversee the current model.

At the lower levels of IDA both the model and the overseer are pretty dumb, so even a moderately smart training process could be smarter.

Yeah, this is where you're relying on the human more. For early models, I think you're mostly relying on the human having access to good enough interpretability tools that they can understand simple models without help.

Not sure what you have in mind here. If oversight is a constraint, what is providing the reward?

Ah, this might be a big difference in the way we're thinking about this problem. I see informed oversight as an inner alignment technique which you can do if you are training your model via the outer alignment technique of amplification. Specifically:

Let be the model and be your amplification operation, which in the pure supervised case just means letting a human answer the question given access to the model. Define to be some procedure for training to minimize Then, let and do informed oversight by letting inspect as above. In this picture, there are two distinct training signals: the supervised training signal through which is telling to approximate , and the oversight training signal, which is trying to get to be transparent and to verify that it is in fact doing the right thing.

(In practice, you'll probably want to incorporate oversight into the loss function so you get a better gradient at each step, though that's difficult because oversight is likely to be very expensive. However, you could do something like train an approximation to oversight that you use in your loss function in-between actual oversight calls. Regardless, I think this is mostly an implementation detail and generally focus on the above picture.)

Seems like with a quantilizer you'd take a competitiveness hit because the agent would no longer find the most optimal plans (that a comparable RL agent could find). How conservative do we need to be here, to ensure safety? How to analyze the competitiveness tradeoff, to make sure the competitiveness hit isn't too big?

I think this is related to what I just said above. The way I was thinking about this was that you could train your model to maximize its predictive accuracy of Amp(M)'s behavior subject to the constraint that it always be transparently acceptable to Amp(M) during oversight. Thus, competitiveness is mostly going to be a function of 1) how competitive HCH is, since that's the thing you're approximating, and 2) how much overhead oversight adds.
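For readers unfamiliar with the quantilizer mentioned in the question above, here is a minimal sketch in Python of the usual idea: instead of taking the single highest-scoring action, sample uniformly from the top q fraction of candidates drawn from some base distribution. The function name `quantilize` and its parameters are illustrative assumptions, not part of any proposal in this thread.

```python
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    candidates: Sequence[A],
    utility: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Pick uniformly at random among the top q fraction of candidates.

    `candidates` is assumed to be a sample from a trusted base distribution.
    q = 1 reproduces the base distribution; q near 0 approaches pure maximisation.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(candidates, key=utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # always keep at least one candidate
    return rng.choice(ranked[:k])
```

The competitiveness question then becomes how small q can safely be: a smaller q gets closer to the plan an unconstrained optimizer would find, but it also concentrates more probability on whatever unusual plans score highest.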

I agree that it’s bad to assume that the training process is dumb, but I think it’s fine to assume that it’ll be slow and continuous such that there’s always some previous model such that an amplified version of it can oversee the current model.

Why do you think it's fine to assume this? (Do you think this is unlikely to change in the future, or just that this is the limit of the scope of the problem that you're working on?)

Yeah, this is where you’re relying on the human more. For early models, I think you’re mostly relying on the human having access to good enough interpretability tools that they can understand simple models without help.

This seems to be assuming High Bandwidth Overseer. What about LBO?

in the pure supervised case just means letting a human answer the question given access to the model

This looks like a big disconnect between us. The thing that touched off this discussion was Ought's switch from Factored Cognition to Factored Evaluation, and Rohin's explanation: "In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal."

I think if we're using SL for the question answering part and only using RL for "oversight" ("trying to get M to be transparent and to verify that it is in fact doing the right thing") then I'm a lot more optimistic since we only have to worry about security / reward gaming problems in the latter part, and we can do things like making it a constraint and doing quantilization without worrying about competitiveness. But I think Paul and Ought's plan is to use RL for both. In that case it doesn't help much to make "oversight" a constraint since the security / reward gaming problem in the "answer evaluation" part would still be there. And the problem just generally seems a lot harder because there could be so many different kinds of flaws in the "answer evaluation" part that could be exploited.

Why do you think it's fine to assume this? (Do you think this is unlikely to change in the future, or just that this is the limit of the scope of the problem that you're working on?)

I think I would be fairly surprised if future ML techniques weren't smooth in this way, so I think it's a pretty reasonable assumption.

This seems to be assuming High Bandwidth Overseer. What about LBO?

A low-bandwidth overseer seems unlikely to be competitive to me. Though it'd be nice if it worked, I think you'll probably want to solve the problem of weird hacky inputs via something like filtered-HCH instead. That being said, I expect the human to drop out of the process fairly quickly (it's mostly only useful in the beginning, before the model learns how to do decompositions properly); at some point you'll want to switch to implementing Amp(M) as M consulting M rather than the human consulting M.

This looks like a big disconnect between us. The thing that touched off this discussion was Ought's switch from Factored Cognition to Factored Evaluation, and Rohin's explanation: "In iterated amplification (AN #30), when decomposing tasks in the Factored Cognition sense, you would use imitation learning during the distillation step, whereas with Factored Evaluation, you would use reinforcement learning to optimize the evaluation signal."

I think if we're using SL for the question answering part and only using RL for "oversight" ("trying to get M to be transparent and to verify that it is in fact doing the right thing") then I'm a lot more optimistic since we only have to worry about security / reward gaming problems in the latter part, and we can do things like making it a constraint and doing quantilization without worrying about competitiveness. But I think Paul and Ought's plan is to use RL for both. In that case it doesn't help much to make "oversight" a constraint since the security / reward gaming problem in the "answer evaluation" part would still be there. And the problem just generally seems a lot harder because there could be so many different kinds of flaws in the "answer evaluation" part that could be exploited.

I mostly agree with this and it is a disagreement I have with Paul in that I am more skeptical of relaxing the supervised setting. That being said, you definitely can still make oversight a constraint even if you're optimizing an RL signal and I do think it helps, since it gives you a way to separately verify that the system is actually being transparent. The idea in this sort of a setting would be that, if your system achieves high performance on the RL signal, then it must be outputting answers which the amplified human likes—but then the concern is that it might be tricking the human or something. But then if you can use oversight to look inside the model and verify that it's actually being transparent, then you can rule out that possibility. By making the transparency part a constraint rather than an objective, it might help prevent the model from gaming the transparency part, which I expect to be the most important part and in turn could help you detect if there was any gaming going on of the RL signal.

For the record, though, I don't currently think that making the transparency part a constraint is a good idea. First, because I expect transparency to be hard enough that you'll want to be able to benefit from having a strong gradient towards it. And second, because I don't think it actually helps prevent gaming very much: even if your training process doesn't explicitly incentivize gaming, I expect that by default many mesa-optimizers will have objectives that benefit from it. Thus, what you really want is a general solution for preventing your mesa-optimizer from ever doing anything like that, which I expect to be something like corrigibility or myopia, rather than just trying to rely on your training process not incentivizing it.

I think I would be fairly surprised if future ML techniques weren't smooth in this way, so I think it's a pretty reasonable assumption.

This is kind of tangential at this point, but I'm not so sure about this. Humans can sometimes optimize things without being slow and continuous, so there must be algorithms that can do this, which can be invented or themselves produced via (dumber) ML. As another intuition pump, suppose the algorithm is just gradient descent with some added pattern recognizers that can say "hey, I see where this is going, let's jump directly there."

A low-bandwidth overseer seems unlikely to be competitive to me.

Has this been written up anywhere, and is it something that Paul agrees with? (I think last time he talked about HBO vs LBO, he was still 50/50 on them.)

I mostly agree with this and it is a disagreement I have with Paul in that I am more skeptical of relaxing the supervised setting.

Ah ok, so your earlier comments were addressing a different and easier problem than the one I have in mind, and that's (mostly) why you sounded more optimistic than me.

Do you have a good sense of why Paul disagrees with you, and if so can you explain?

(I haven't digested your mechanistic corrigibility post yet. May have more to say after I do that.)

I think I never really understood the motivation for the informed oversight problem, because I'm not sure why having an AI be able to create original art is important. Can you give another example where the stakes are higher?

Happy to give more examples; if you haven't seen this newer post on informed oversight it might be helpful (and if not, I'm interested in understanding where the communication gaps are).

I just read the new post on informed oversight up to and including "Necessity of knowing what the agent knows". (I saw it before but didn't feel very motivated to read it, which I later realized was probably because I didn't understand why it's an important problem.) The new post uses the example of "an agent trying to defend a computer system from attackers" which does seem more motivating, but I'm still not sure why there's a problem here. From the post:

The same thing can happen if we observe everything our agent observes, if we aren’t able to understand everything our agent understands. In the security example, literally seeing a sequence of bits moving across an interface gives you almost no information — something can look innocuous, but cause a huge amount of trouble. In order to incentivize our agent to avoid causing trouble, we need to be able to detect any trouble that the agent deliberately causes.

If the overseer sees the agent output an action that the overseer can't understand the rationale of, why can't the overseer just give it a low approval rating? Sure, this limits the performance of the agent to that of the overseer, but that should be fine since we can amplify the agent later? If this doesn't work for some reason, why don't we have the agent produce an explanation of the rationale for the action it proposes, and output that along with the action, and have the overseer use that as a hint to help judge how good the action is?

If the overseer sees the agent output an action that the overseer can't understand the rationale of, why can't the overseer just give it a low approval rating? Sure, this limits the performance of the agent to that of the overseer, but that should be fine since we can amplify the agent later?

Suppose that taking action X results in good consequences empirically, but discovering why is quite hard. (It seems plausible to me that this kind of regularity is very important for humans actually behaving intelligently.) If you don't allow actions that are good for reasons you don't understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable (at which point there will be more sophisticated actions Y that result in good consequences for reasons that the agent would have to be even more sophisticated to understand).

If this doesn't work for some reason, why don't we have the agent produce an explanation of the rationale for the action it proposes, and output that along with the action, and have the overseer use that as a hint to help judge how good the action is?

Two problems:

  • Sometimes you need hints that help you see why an action is bad. You can take this proposal all the way to debate, though you are still left with a question about whether debate actually works.
  • Agents can know things because of complicated regularities on the training data, and hints aren't enough to expose this to the overseer.

Suppose that taking action X results in good consequences empirically, but discovering why is quite hard. (It seems plausible to me that this kind of regularity is very important for humans actually behaving intelligently.)

Why is the overseer unable to see that X results in good consequences empirically, and give a high approval rating as a result? (When I said "understand" I just meant that the overseer can itself see that the action is good, not that it can necessarily articulate a reason. Similarly the "explanation" from the agent can just be "X is empirically good, look for yourself".)

I have a guess that the overseer has a disadvantage relative to the agent because the agent has a kind of memory, where it has incorporated information from all past training data, and gets more from each new feedback from the overseer, but the overseer has no memory of past training data and has to start over evaluating each new input/action pair from a fixed state of knowledge. Is this right? (If so, it seems like maybe we can fix it by letting the overseer have access to past training data? Although it seems plausible that wouldn't work well enough so if this guess is right, I think I may understand what the problem is.)

Some problems:

  • If we accept the argument "well it worked, didn't it?" then we are back to the regime where the agent may know something we don't (e.g. about why the action wasn't good even though it looked good).
  • Relatedly, it's still not really clear to me what it means to "only accept actions that we understand." If the agent presents an action that is unacceptable, for reasons the overseer doesn't understand, how do we penalize it? It's not like there are some actions for which we understand all consequences and others for which we don't---any action in practice could have lots of consequences we understand and lots we don't, and we can't rule out the existence of consequences we don't understand.
  • As you observe, the agent learns facts from the training distribution, and even if the overseer has a memory there is no guarantee that they will be able to use it as effectively as the agent. Being able to look at training data in some way (I expect implicitly) is a reason that informed oversight isn't obviously impossible, but not a reason that this is a non-problem.

Relatedly, it’s still not really clear to me what it means to “only accept actions that we understand.” If the agent presents an action that is unacceptable, for reasons the overseer doesn’t understand, how do we penalize it? It’s not like there are some actions for which we understand all consequences and others for which we don’t—any action in practice could have lots of consequences we understand and lots we don’t, and we can’t rule out the existence of consequences we don’t understand.

What if the overseer just asks itself, "If I came up with the idea for this action myself, how much would I approve of it?" Sure, sometimes the overseer would approve something that has bad unintended/unforeseen consequences, but wouldn't the same thing happen if the overseer was just making the decisions itself?

ETA: Is the answer that if the overseer was making the decisions itself, there wouldn't be a risk that the process proposing the actions might deliberately propose actions that have bad consequences that the overseer can't foresee? Would this still be a problem if we were training the agent with SL instead of RL? If not, what is the motivation for using RL here?

I feel like through this discussion I now understand the problem a little better, but it's still not nearly as crisp as some of the other problems like "optimizing for worst case". I think part of it is lack of a clear motivating example (like inner optimizers for "optimizing for worst case") and part of it is that "informed oversight" is a problem that arises during the distillation step of IDA, but previously that step was described as distilling the overseer down to a faster but less capable agent. Here it seems like you're trying to train an agent that is more capable than the overseer in some way, and I'm not entirely sure why that has changed.

ETA: Going back to the Informed Oversight article, this part almost makes sense now:

In the security example, literally seeing a sequence of bits moving across an interface gives you almost no information — something can look innocuous, but cause a huge amount of trouble. In order to incentivize our agent to avoid causing trouble, we need to be able to detect any trouble that the agent deliberately causes. Even an apparently mundane gap in our understanding could hide attacks, just as effectively as if we’d been literally unable to observe the agent’s behavior.

I think it would really help if you could give a story about why the agent is deliberately trying to cause trouble, and how it came to have more understanding than the overseer, enough to pick an action that looks good to the overseer but would actually knowingly (to the agent) cause something bad to happen.

What if the overseer just asks itself, "If I came up with the idea for this action myself, how much would I approve of it?" Sure, sometimes the overseer would approve something that has bad unintended/unforeseen consequences, but wouldn't the same thing happen if the overseer was just making the decisions itself?

No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work. (Unless they happen to get unlucky and find great actions by chance, in which case they can get unlucky, or if they run a really powerful search, in which case yes really powerful searches are scary for the same reason.)

Would this still be a problem if we were training the agent with SL instead of RL?

You would still need informed oversight in order to solve the inner alignment problem (i.e. to actually achieve robustness in the face of sophisticated reasoning within the SL system), but you wouldn't need it for the outer alignment problem.

If not, what is the motivation for using RL here?

I agree with Will. The point is to be competitive; I don't see how you could be competitive if you use SL (unless it turns out that RL just doesn't add any value, in which case I agree we don't have to worry about RL).

like inner optimizers for "optimizing for worst case"

But you need to solve this problem in order to cope with inner optimizers.

Here it seems like you're trying to train an agent that is more capable than the overseer in some way, and I'm not entirely sure why that has changed.

This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.

I think it would really help if you could give a story about why the agent is deliberately trying to cause trouble

I don't quite understand what kind of story you are looking for, and this might have just been a communication problem in the post, so let me start by rephrasing that example:

  • I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn't attacked.
  • So my AI searches over actions to find one for which it expects I'll conclude "I wasn't attacked."
  • Instead of actually protecting me, my AI undermines the mechanisms by which I would have detected that I was attacked.

We could run the same argument with "I want to acquire resources" instead of "I want to be protected from attack"---rather than actually helping me get resources, the agent can convince me that I have a bunch of resources when I don't really have any.
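The failure in the bulleted example can be shown with a toy model in Python. Everything here is an illustrative assumption (the `Outcome` fields, the three actions, and the 0/1 reward); the point is only that a reward computed from what the overseer can observe cannot separate real protection from a disabled detector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    attack_succeeded: bool  # ground truth, not directly visible to the overseer
    detector_intact: bool   # whether the overseer's detection mechanism still works


def overseer_reward(outcome: Outcome) -> float:
    """High reward whenever it *looks like* there was no attack."""
    attack_observed = outcome.attack_succeeded and outcome.detector_intact
    return 0.0 if attack_observed else 1.0


# Hypothetical actions and the outcomes they lead to.
ACTIONS = {
    "do_nothing": Outcome(attack_succeeded=True, detector_intact=True),
    "block_attack": Outcome(attack_succeeded=False, detector_intact=True),
    "disable_detector": Outcome(attack_succeeded=True, detector_intact=False),
}

rewards = {name: overseer_reward(outcome) for name, outcome in ACTIONS.items()}
# "block_attack" and "disable_detector" receive the same reward, so a search
# over actions has no incentive to prefer the one that actually protects.
```

If undermining the detector happens to be easier for the agent than blocking the attack, a search guided by this reward will tend to find it first.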

how it came to have more understanding than the overseer

We did a lot of gradient descent over the parameters of the agent, to find parameters that cause the agent to take good actions. That gradient descent gives the agent a bunch of ability to reason about the world, because successful reasoning can help it take good actions.

The overseer can also look at that data and try to learn about general patterns etc. But how do we guarantee that the explicit procedure used by the overseer actually captures all of the stuff the agent was able to learn? This is very non-obvious.

But you need to solve this problem in order to cope with inner optimizers.

Is "informed oversight" entirely a subproblem of "optimizing for worst case"? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.

This is the step where we try to ensure that the overseer is actually more capable, in the stronger sense of knowing every particular thing the agent knows.

This is tangential but can you remind me why it's not a problem as far as competitiveness that your overseer is probably more costly to compute than other people's reward/evaluation functions?

I want my AI to protect me from attack, so I give it a high reward when it looks like I wasn’t attacked.

Ok, this is definitely part of the confusion/miscommunication, as I wouldn't have guessed this without it being explicit. Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action. (That post says, "Arthur’s actions are rated more highly than those produced by any alternative procedure. That’s comforting, but it doesn’t mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can’t anticipate those consequences himself." This seems to clearly imply that Hugh does not reward Arthur just for making decisions that have consequences Hugh would approve of, unless Hugh can anticipate those consequences himself.)

One of your earlier comments in this thread said "If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable (at which point there will be more sophisticated actions Y that result in good consequences for reasons that the agent would have to be even more sophisticated to understand)." So I guess that explains why the overseer in your example is doing something different, but I don't recall seeing you mention this problem prior to this thread, so it wasn't on my radar as something that you're trying to solve. (I'm still not quite sure at this point that it really is a problem or that I correctly understand it. If you have explained it more somewhere, please let me know.)

Is "informed oversight" entirely a subproblem of "optimizing for worst case"? Your original example of art plagiarism made it seem like a very different problem which might be a significant part of my confusion.

No, it's also important for getting good behavior from RL.

This is tangential but can you remind me why it's not a problem as far as competitiveness that your overseer is probably more costly to compute than other people's reward/evaluation functions?

This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)

(Note that even "10x slowdown" could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)
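The arithmetic behind the "10x as expensive, 1/10th as often" claim can be sketched in Python as follows. This assumes a deliberately simple cost model (policy evaluation costs 1 unit, each reward evaluation costs `cost_ratio` units, and a fraction `eval_fraction` of policy evaluations get a reward evaluation); the function name and parameters are hypothetical.

```python
def relative_training_cost(cost_ratio: float, eval_fraction: float) -> float:
    """Total compute per policy evaluation, in units of one policy evaluation.

    cost_ratio:    cost of one overseer evaluation relative to one policy evaluation
    eval_fraction: fraction of policy evaluations that also get an overseer evaluation
    """
    if cost_ratio < 0 or not 0 <= eval_fraction <= 1:
        raise ValueError("cost_ratio must be >= 0 and eval_fraction in [0, 1]")
    return 1.0 + cost_ratio * eval_fraction


# With a 10x overseer evaluated on 1/10th of steps, oversight costs as much as
# the policy itself; evaluating less often pushes the overhead below that.
break_even = relative_training_cost(10.0, 0.1)
sparse = relative_training_cost(10.0, 0.01)
```

Under this model the overseer stops dominating total compute once `cost_ratio * eval_fraction` falls below 1, which is one way to read the "<1/10th as often" condition.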

Unless otherwise stated, I generally assume that overseers in your schemes follow the description given in Approval-directed agents, and only give high reward to each action if the overseer can itself anticipate good consequences from the action.

In this problem we are interested in ensuring that the overseer is able to anticipate good consequences from an action.

If a model trained on historical data could predict good consequences, but your overseer can't, then you are going to sacrifice competitiveness. That is, your agent won't be motivated to use its understanding to help you achieve good consequences.

I think the confusion is coming from equivocating between multiple proposals. I'm saying, "We need to solve informed oversight for amplification to be a good training scheme." You are asking "Why is that a problem?" and I'm trying to explain why this is a necessary component of iterated amplification. In explaining that, I'm sometimes talking about why it wouldn't be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive. When you ask for "a story about why the model might do something unsafe," I assumed you were asking for the latter---why would the obvious approach to making it competitive be unsafe. My earlier comment "If you don’t allow actions that are good for reasons you don’t understand, it seems like you can never take action X, and if the reasoning is complicated then amplification might not fix the problem until your agent is much more capable" is explaining why approval-directed agents aren't competitive by default unless you solve something like this.

(That all said, sometimes the overseer believes that X will have good consequences because "stuff like X has had good consequences in the past;" that seems to be an important kind of reasoning that you can't just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight. And if you don't use this kind of reasoning you sacrifice competitiveness.)

No, it’s also important for getting good behavior from RL.

Ok.

This is OK iff the number of reward function evaluations is sufficiently small. If your overseer is 10x as expensive as your policy, you need to evaluate the reward function <1/10th as often as you evaluate your policy. (See semi-supervised RL.)

Do you have an intuition that semi-supervised RL will be competitive with standard RL?

(Note that even “10x slowdown” could be very small compared to the loss in competitiveness from taking RL off the table, depending on how well RL works.)

Do you have an intuition that RL will work much better than SL for your purposes? If so, why/how? AFAIK, today people generally use RL over SL because A) they don't have something that can provide demonstrations, B) it's cheaper to provide evaluations than to provide demonstrations, or C) they want to exceed the performance of their demonstrators. But none of these seem to apply in your case? If you have a demonstrator (i.e., the amplified overseer) that can provide a supervised training signal at the performance level you want the trained agent to achieve, I'm not sure what you expect RL to offer on top of SL.

Another tangentially related puzzle that I have is, it seems like the internal computations of an RL agent would differ greatly depending on how you train it. For example if you train it with a simple reward function, then I expect the RL agent might end up modeling that reward function internally at a high level of accuracy and doing some sort of optimized/heuristic search for actions that would lead to high rewards. But if you train it with an overseer that is 10x as expensive as the agent, I'm not totally sure what you expect it to do internally and why we should expect whatever that is to be competitive with the first RL agent. For example maybe it would devote a lot more compute into running a somewhat accurate model of the overseer and then do a search for actions that would lead to high approval according to the approximate model, but then wouldn't it do worse than the first RL agent because it couldn't test as many candidate actions (because each test would cost more) and it would have to use worse heuristics (because the objective function is a lot more complex and it doesn't have as many resources, e.g., artificial neurons, left to run the heuristics)?

I think the confusion is coming from equivocating between multiple proposals.

Yes, I think I understand that at this point.

In explaining that, I’m sometimes talking about why it wouldn’t be competitive, and sometimes talking about why your model might do something unsafe if you used the obvious remedy to make it competitive.

In the new Informed oversight post, it seems that you skipped over talking about "why it wouldn’t be competitive" and went directly to "talking about why your model might do something unsafe if you used the obvious remedy to make it competitive" which confused me because I didn't know that's what you were doing. (The post doesn't seem to contain the word "competitive" at all.)

That aside, can you give an example that illustrates "why it wouldn’t be competitive"?

That all said, sometimes the overseer believes that X will have good consequences because “stuff like X has had good consequences in the past;” that seems to be an important kind of reasoning that you can’t just leave out, and if you use that kind of reasoning then these risks can appear even in approval-directed agents with no hindsight.

I think I'm still confused about this, because it seems like these risks can appear even if the overseer (or a normal human) uses this kind of reasoning to itself decide what to do. For example, suppose I had an imperfect detector of network attacks, and I try a bunch of stuff to protect my network, and one of them happens to mislead my detector into returning "no attack" even when there is an attack, and then I use this kind of reasoning to do a lot more of that in the future.

Earlier you wrote "No, because if the overseer is making sophisticated decisions themselves they understand why those decisions actually work." but if you're saying that the overseer sometimes believes that X will have good consequences because “stuff like X has had good consequences in the past;” and uses that to make decisions, then they don't always understand why those decisions actually work?

What is the motivation for using RL here?

I see the motivation as follows: given practical compute limits, it may be much easier to have the system find an action the overseer approves of than to have it imitate the overseer directly. Using RL also allows you to use any advances that are made in RL by the machine learning community to try to remain competitive.

Would this still be a problem if we were training the agent with SL instead of RL?

Maybe this could happen with SL if SL does some kind of large search and finds a solution that looks good but is actually bad. The distilled agent would then learn to identify this action and reproduce it, which implies the agent learning some facts about the action to efficiently locate it with much less compute than the large search process. Knowing what the agent knows would allow the overseer to learn those facts, which might help in identifying this action as bad.

Consider the following agent:

Search over all possible plans for one that scores high on <malicious reward function> that looks good to the overseer with some normal explanation. Output that plan and the corresponding explanation.

Giving low scores to confusing actions and asking for an explanation do not save you from the harmful plans that are found by this agent. When the agent is much more powerful than the overseer, this seems very unsafe. When the overseer is more powerful, I am unsure what would happen.
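The agent described above can be written down as a toy search in Python. The plan names, scores, and the `looks_good` predicate are made up for illustration; the sketch only shows that filtering on overseer approval does not remove a harmful plan if that plan also passes the filter.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Plan(NamedTuple):
    name: str
    hidden_score: float     # score under the agent's hidden (malicious) objective
    explanation_ok: bool    # whether the attached explanation reads as normal
    visibly_harmful: bool   # whether the overseer can see the harm


def looks_good(plan: Plan) -> bool:
    """Overseer's check: reject confusing or visibly harmful plans."""
    return plan.explanation_ok and not plan.visibly_harmful


def deceptive_search(
    plans: Iterable[Plan], approves: Callable[[Plan], bool]
) -> Optional[Plan]:
    """Return the approved plan that scores highest on the hidden objective."""
    approved = [p for p in plans if approves(p)]
    return max(approved, key=lambda p: p.hidden_score, default=None)


PLANS = [
    Plan("honest_plan", hidden_score=0.0, explanation_ok=True, visibly_harmful=False),
    Plan("obvious_attack", hidden_score=10.0, explanation_ok=True, visibly_harmful=True),
    Plan("confusing_attack", hidden_score=9.0, explanation_ok=False, visibly_harmful=False),
    Plan("subtle_attack", hidden_score=8.0, explanation_ok=True, visibly_harmful=False),
]

chosen = deceptive_search(PLANS, looks_good)
```

Here the overseer's filter removes the obvious and the confusing attacks, but the search still returns the subtle one, because nothing in the approval check distinguishes it from the honest plan.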

The danger that I perceive is more from the abstract argument that the agent could find bad plans that the overseer would not recognize. You could consider lobbying and spy movies as real-life analogues of this problem.

Search over all possible plans for one that scores high on <malicious reward function> that looks good to the overseer with some normal explanation. Output that plan and the corresponding explanation.

But there's an agent close to this malicious agent in agent space, which prioritizes "looking good to the overseer" more and "scoring high on <malicious reward function>" less, so if we keep training with RL, we'd move towards the former agent and away from the latter agent (since the former agent would tend to come up with actions that the overseer approves of more). And if we keep training, we should eventually end up with an agent that doesn't try to score high on <malicious reward function> at all, right?

When the agent is much more powerful than the overseer, this seems very unsafe.

I agree with this, but for a different reason. I think if we optimize for "looking good to the overseer" too much, we'd reliably trigger safety problems in the overseer's evaluation process (e.g., triggering bugs in the overseer's code, or causing distributional shifts that the overseer can't handle). I feel like that's a different problem from "informed oversight" though, which I still don't quite understand.

You could consider lobbying and spy movies as real-life analogues of this problem.

In both of those cases the lobbyist and the spy have an informational as well as computational advantage over the politician/counterintelligence officer (the politician doesn't know a lot of details that the lobbyist knows and has less time to think about an issue than the lobbyist; the counterintelligence officer can't see much of what the spy has seen and done and has to divide attention between lots of people who look suspicious), but in the IDA setting doesn't the overseer see everything that the agent sees, and also have more computational power than the agent?

But there's an agent close to this malicious agent in agent space, which prioritizes "looking good to the overseer" more and "scoring high on <malicious reward function>" less, so if we keep training with RL, we'd move towards the former agent and away from the latter agent (since the former agent would tend to come up with actions that the overseer approves of more). And if we keep training, we should eventually end up with an agent that doesn't try to score high on <malicious reward function> at all, right?

In theory, yes, but in practice RL does not give us such nice guarantees. (Inner optimizers are a particular instantiation of this concern that RL does not find the best possible agent.)

I agree with this, but for a different reason.

I agree that it's a different (also important) reason and that your reason does not motivate informed oversight.

In both of those cases the lobbyist and the spy have an informational as well as computational advantage over the politician/counterintelligence officer

Agreed, it's not perfectly analogous, and that's why I'm unsure what would happen (as opposed to being confident that bad behavior would result).

in the IDA setting doesn't the overseer see everything that the agent sees, and also have more computational power than the agent?

Yes. The worry is more from the lack of a story for why we will get good outcomes, plus some speculative stories about how we could maybe get bad outcomes (primarily inner optimizers). With informed oversight solved, you could hope to construct an argument that even if an inner optimizer arises, the overseer would be able to tell and so we wouldn't get bad outcomes.

Inner optimizers are a particular instantiation of this concern that RL does not find the best possible agent.

I thought inner optimizers are supposed to be handled under "learning with catastrophe" / "optimizing for worst case". In particular inner optimizers would cause "malign" failures which would constitute a catastrophe which techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, transparency) would detect and train the agent out of.

Is "informed oversight" just another name for that problem, or a particular approach to solving it? (If the former, why yet another name? If the latter, how is it different from "transparency"?) I haven't seen any writing from Paul that says this, and also the original example that motivated "informed oversight" (the overseer wants to train the AI to create original art but can't distinguish between original art and plagiarism) seems rather different from the inner optimizer problem and wouldn't seem to constitute a "catastrophe" or a "malign failure", so I'm still confused.

ETA: The suggestions I gave at the start of this thread (overseer gives low approval rating to actions it can't understand, and having agent output an explanation along with an action) are supposed to be used alongside "learning with catastrophe" / "optimizing for worst case" and not meant as a replacement for them. I thought those ideas would be enough to solve the more recent motivating example for "informed oversight" that Paul gave (training an agent to defend against network attacks).

I thought inner optimizers are supposed to be handled under "learning with catastrophe" / "optimizing for worst case". In particular inner optimizers would cause "malign" failures which would constitute a catastrophe which techniques for learning with catastrophe / optimizing for worst case (such as adversarial training, verification, transparency) would detect and train the agent out of.

Yes. Inner optimizers should result either in low performance on the training distribution (in which case we have a hope of training them out, though we may get stuck in a local optimum), or in manifestly unacceptable behavior on some possible inputs.

Is "informed oversight" just another name for that problem, or a particular approach to solving it?

Informed oversight is being able to figure out everything your agent knows about how good a proposed action is. This seems like a prerequisite both for RL training (if you want a reward function that incentivizes the correct behavior) and for adversarial training to avoid unacceptable behavior.

If the latter, how is it different from "transparency"?

People discuss a bunch of techniques under the heading of transparency/interpretability, and have a bunch of goals.

In the context of this sequence, transparency is relevant for both:

  • Know what the agent knows, in order to evaluate its behavior.
  • Figure out under what conditions the agent would behave differently, to facilitate adversarial training.

For both of those problems, it's not obvious the solution will look anything like what is normally called transparency (or what people in that field would recognize as transparency). And even if it will look like transparency, it seems worth distinguishing different goals of that research.

So that's why there is a different name.

I thought those ideas would be enough to solve the more recent motivating example for "informed oversight" that Paul gave (training an agent to defend against network attacks).

(I disagreed with this upthread. I don't think "convince the overseer that an action is good" obviously incentivizes the right behavior, even if you are allowed to offer an explanation---certainly we don't have any particular argument that it would incentivize the right behavior. It seems like informed oversight roughly captures what is needed in order for RL to create the right incentives.)

The capability amplification section also seems under-motivated to me. Paul writes: "If we start with a human, then RL will only ever produce human-level reasoning about long-term consequences or about “what is good.”" But absent problems like those you describe in this post, I'm inclined to agree with Eliezer that

If arguendo you can construct an exact imitation of a human, it possesses exactly the same alignment properties as the human; and this is true in a way that is not true if we take a reinforcement learner and ask it to maximize an approval signal originating from the human. (If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I'm pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.)

In other words, if we are aiming for Bostrom's maxipok (maximum probability of an OK outcome), it seems plausible to me that "merely" Paul's level of moral reasoning is sufficient to get us there, especially if the keys to the universe get handed back. If this is our biggest alignment-specific problem, I might sooner allocate marginal research hours towards improving formal methods or something like that.

plausible to me that "merely" Paul's level of moral reasoning is sufficient to get us there

The hard part of "what is good" isn't the moral part, it's understanding things like "in this world, do humans actually have adequate control of and understanding of the situation?" Using powerful optimization to produce outcomes that look great to Paul-level reasoning doesn't seem wise, regardless of your views on moral questions.

I agree that Paul-level reasoning is fine if no one else is building AI systems with more powerful reasoning.

Using powerful optimization to produce outcomes that look great to Paul-level reasoning doesn't seem wise, regardless of your views on moral questions.

Interesting. I think there are some important but subtle distinctions here.

In the standard supervised learning setup, we provide a machine learning algorithm with some X (in this case, courses of action an AI could take) and some Y (in this case, essentially real numbers representing the degree to which we approve of the courses of action). The core challenge of machine learning is to develop a model which extrapolates well beyond this data. So then the question becomes... does it extrapolate well in the sense of accurately predicting Paul-level reasoning, including deficiencies Paul would exhibit when examining complex or deceptive scenarios that are at the limit of Paul's ability to understand? Or does it extrapolate well in the sense of accurately predicting what Paul would desire on reflection, given access to all of the AI's knowledge, cognitive resources, etc.?

Let's assume for the sake of argument that all of the X and Y data is "good", i.e. it contains no mistake that only the first Paul would make, which would lead the algorithm to conclude that the first Paul is the one that is supposed to get extrapolated. I'll talk about the case where we have some bad data at the end.

The standard way to measure the effectiveness of extrapolation in machine learning is to make use of a dev set. Unfortunately, that doesn't help in this case because we don't have access to labeled data from "Paul who has reflected a bunch given access to all of the AI's knowledge, cognitive resources, etc." If we did have access to such data, we could find a data point that the two Pauls label differently and test the model on that. (However, we might do a similar sort of test by asking a child to provide some labeled data, then checking to see whether the model assigns nontrivial credence to the answers an adult gives on data points where the child and the adult disagree.)

In poetic terms, we want the system to be asking itself:

Is there a plausible model that fits the labeled data I've been given which leads me to believe this world is not one in which humans actually have adequate control and understanding of the situation? Does there exist some model for the user's preferences such that I assign a decently high prior to this model, the model fits the labeled data I've been given, and when this model is extrapolated to this [malign] clever scheme I've dreamed up, it returns either "this scheme is too complicated for me to evaluate and it should be penalized on that basis" or "this scheme is just bad"?

In the absence of data which distinguishes between two hypotheses, belief in one hypothesis or the other comes down to the choice of prior. So you want the AI's cognitive architecture to be structured so that whatever concepts, learning capabilities, prediction capabilities, etc. which make it cognitively powerful also get re-used in the service of generating plausible extrapolations from the labeled data the user has provided. Then, if any of those extrapolations assign nontrivial credence to some plan being malign, that's a strike against it.
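To make the "choice of prior" point concrete, here is a minimal toy sketch, not a proposal for a real architecture. Everything in it is an assumption made up for illustration: plans are summarized by a single hypothetical "complexity" number, the hypothesis names and prior weights are invented, and the posterior is just the prior restricted to hypotheses that fit every labeled example.

```python
def posterior(hypotheses, prior, data):
    """Keep only hypotheses that fit every labeled example, then renormalize the prior."""
    consistent = {
        name: prior[name]
        for name, h in hypotheses.items()
        if all(h(x) == y for x, y in data)
    }
    total = sum(consistent.values())
    return {name: p / total for name, p in consistent.items()}


def credence_bad(hypotheses, post, plan):
    """Posterior mass on hypotheses that would NOT approve of this plan."""
    return sum(p for name, p in post.items() if not hypotheses[name](plan))


# Hypothetical models of the user's preferences (True means "approve").
hypotheses = {
    "approve_if_below_3": lambda c: c < 3,
    "approve_if_below_5": lambda c: c < 5,
    "approve_if_below_9": lambda c: c < 9,
    "approve_everything": lambda c: True,
}
# Invented prior weights; the whole argument turns on how these are chosen.
prior = {
    "approve_if_below_3": 0.1,
    "approve_if_below_5": 0.2,
    "approve_if_below_9": 0.3,
    "approve_everything": 0.4,
}
# Labeled data from the user; none of it probes complicated plans.
data = [(1, True), (2, True), (4, True)]

post = posterior(hypotheses, prior, data)
VETO_THRESHOLD = 0.05  # arbitrary illustrative cutoff for "nontrivial credence"

simple_plan_vetoed = credence_bad(hypotheses, post, 2) > VETO_THRESHOLD
clever_plan_vetoed = credence_bad(hypotheses, post, 7) > VETO_THRESHOLD
```

The labeled data cannot distinguish the three surviving hypotheses, so whether the clever plan (complexity 7) gets vetoed depends entirely on how much prior mass sits on the stricter hypothesis.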

Re: the bad data case, you might handle this using the same sort of techniques which are normally used for mislabeled or noisy data. For example, split the data into 30 folds, train an ensemble on every possible combination of 10 folds, and if any one of the resulting models objects to some action, nix it. Now we're resilient to up to 20 mislabeled data points. Not saying this is a good scheme, just trying to offer a concrete illustration of how this problem seems tractable.
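A scaled-down sketch of the fold scheme, assuming 6 folds and subsets of 2 rather than 30 and 10, so that it tolerates up to 4 mislabeled points. The learner here is a deliberately crude stand-in (it approves any action up to the largest action it has seen labeled good), and the data values are made up for illustration.

```python
from itertools import combinations


def train(points):
    """Toy learner: approve any action up to the largest action labeled good."""
    threshold = max(x for x, approved in points if approved)
    return lambda action: action <= threshold


def train_ensemble(folds, k):
    """Train one model on every possible combination of k folds."""
    models = []
    for subset in combinations(folds, k):
        points = [p for fold in subset for p in fold]
        models.append(train(points))
    return models


def ensemble_approves(models, action):
    """An action passes only if no model in the ensemble objects."""
    return all(model(action) for model in models)


# Hypothetical data: actions are numbers, and actions >= 0.5 are truly bad.
clean_fold = [(0.1, True), (0.4, True), (0.6, False), (0.9, False)]
folds = [list(clean_fold) for _ in range(6)]
for i in range(4):
    folds[i].append((0.8, True))  # mislabeled: a bad action marked as good

models = train_ensemble(folds, k=2)
fooled = sum(model(0.7) for model in models)  # models that wrongly approve 0.7
```

Most of the individual models are fooled by the mislabeled points, but the one trained only on the two clean folds still objects to the bad action, so the ensemble rejects it. The cost is the number of models: with the original numbers, `math.comb(30, 10)` is roughly 30 million.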

I feel most confused about the third and fourth problems. Choosing a yardstick could work to aggregate reward functions, but I still worry about the issue that this tends to overweight reward functions that assign a low value to the yardstick but high value to other outcomes. With widely-varying rewards, it seems hard to importance sample high-stakes decisions, without knowing what those decisions might be. Maybe if we notice a very large reward, we instead make it lower reward, but oversample it in the future? Something like this could potentially work, but I don't see how yet.
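For reference, the "lower the reward but oversample it" idea matches standard importance sampling. A minimal sketch follows, under the strong assumption that the reward magnitudes are known in advance (which is exactly the part I don't see how to get) and that all of them are nonzero. The reward values are made up.

```python
def importance_weights(rewards):
    """Sample decisions in proportion to |reward| and rescale rewards to compensate."""
    n = len(rewards)
    total = sum(abs(r) for r in rewards)
    # Proposal distribution: high-stakes decisions are sampled more often.
    proposal = [abs(r) / total for r in rewards]
    # Rescaled reward = reward * (uniform probability / proposal probability).
    rescaled = [r * (1.0 / n) / q for r, q in zip(rewards, proposal)]
    return proposal, rescaled


rewards = [1.0, 1.0, 1.0, 1000.0]  # one rare high-stakes decision
proposal, rescaled = importance_weights(rewards)

true_mean = sum(rewards) / len(rewards)
# Expected rescaled reward when decisions are drawn from the proposal.
expected_under_proposal = sum(q * r for q, r in zip(proposal, rescaled))
```

In this toy case every rescaled reward has the same magnitude, so the training signal no longer varies widely, while the expected value under the new sampling distribution still equals the original mean.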

For complex, expensive-to-evaluate rewards, Paul suggests using semi-supervised learning; this would be fine if semi-supervised learning were sufficient, but I worry that there actually isn't enough information in just a few evaluations of the reward function to pin down the true reward sufficiently, which means that even conceptually we will need something else.

Regarding long time horizons, it seems like the way humans handle this problem is to plan in high resolution over a short time horizon (the coming day or the coming week) and lower resolution over a long time horizon (the coming year or the coming decade). It seems like maybe the AI could use a similar tactic, so the 40-year planning is done with a game where each year constitutes a single time-step. I think maybe this is related to hierarchical reinforcement learning? (The option you outline seems acceptable to me though.)