Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here I'll summarize the main abstraction I use for thinking about future AI systems. This is essentially the same model that Paul uses. I'm not actually introducing any new ideas in this post; mostly this is intended to summarize my current views.


If we are to think of highly advanced AI systems, it is useful to treat some AI capabilities as a kind of black box: we need a good understanding of where the optimization power made possible by future hardware and algorithms is going, so that we can think about what the optimized things look like without knowing the exact details of the optimization algorithm. I'm going to state a generalization over some of these capabilities, which is based on current ML practice (training policies to optimize training objectives). The current single best model I know of for reasoning about highly capable AI systems is to assume that they have this general capability and no other capabilities.

The general capability can be stated as: trained policies will receive a high average within-episode training score, compared to alternative policies with similar resource bounds. I'll call this capability "general episodic RL".

To clarify "within-episode", we could consider a single classification task (classify this picture) as a single episode in a supervised learning context, where the policy is the classifier; we could consider solving a single SAT problem to be an episode in a SAT solving context, where the policy is a SAT solver; and of course we have episodes in an episodic RL context. So systems with this general capability are, among other things, good supervised learners, SAT solvers, and episodic RL agents.
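
As a minimal sketch of what "episode" and "average within-episode score" mean here, the following toy code treats a classification example and a SAT instance through one interface. The class and function names (`ClassificationEpisode`, `SatEpisode`, `average_score`, and the two solvers) are hypothetical names chosen for illustration, and the string "images" are stand-ins for real data.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass
class ClassificationEpisode:
    """One supervised example; the policy plays the role of the classifier."""
    observation: str  # stand-in for an image
    label: str

    def score(self, action) -> float:
        return 1.0 if action == self.label else 0.0


@dataclass
class SatEpisode:
    """One CNF instance; the policy plays the role of the SAT solver."""
    # (num_vars, clauses); literal k means x_k, literal -k means not x_k
    observation: Tuple[int, List[List[int]]]

    def score(self, action) -> float:
        _, clauses = self.observation
        if action is None:
            return 0.0
        ok = all(any(action[abs(lit)] == (lit > 0) for lit in clause)
                 for clause in clauses)
        return 1.0 if ok else 0.0


def average_score(policy, episodes) -> float:
    """Average within-episode score: each episode is scored on its own."""
    return sum(ep.score(policy(ep.observation)) for ep in episodes) / len(episodes)


def brute_force_solver(observation):
    """Try every assignment; return the first satisfying one, else None."""
    num_vars, clauses = observation
    for bits in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None


def all_true_solver(observation):
    """A weak alternative policy: always guess that every variable is true."""
    num_vars, _ = observation
    return {v: True for v in range(1, num_vars + 1)}
```

Comparing `average_score(brute_force_solver, episodes)` with `average_score(all_true_solver, episodes)` is the kind of comparison the capability statement refers to, except that in the real statement the alternatives range over all policies with similar resource bounds.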


If advanced AI systems have this general capability (and no other capabilities), this implies that:

  1. We don't have good performance guarantees when test episodes are distinguishable from training episodes (i.e. when some distinguisher can look at an episode and tell whether it is a training episode or a test episode).

  2. When the test episode is similar to training episodes (e.g. in an online learning context), we should expect trained policies to act like a rational agent maximizing its expected score in this test episode; otherwise, the policy that acts as a rational agent would get a higher expected test score than this one, and would therefore receive a higher training score.

  3. In particular, what it means for a policy to "maximize its average test score" is "there is no alternative policy getting a higher expected test score on the same sequence of test episodes."

  4. (3) implies that there is no automatic incentive to influence future test episodes (e.g. cause humans to give the AI easier-to-label images in the future).

  5. (3) also implies that collusion does not happen in most natural setups (e.g. under this assumption, the policies in a generative adversarial setup will not collude). Instead, agents will play something like a correlated equilibrium.

  6. If there is some lowish-complexity fact that would help a policy gain a higher average training score, then we should assume that the trained policy will act as if it knows this fact. Otherwise, it would get a lower training score than an alternative policy that does know this fact and uses this to achieve a higher average training score. In particular, trained policies will know general features about the training setup.
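
One way to write (3) more explicitly, as a sketch only. The symbols are notation introduced here for illustration: $\Pi$ is the class of alternative policies with similar resource bounds, $e_1, \dots, e_n$ is the sequence of test episodes actually encountered, and $S(\pi', e)$ is the score policy $\pi'$ would obtain in episode $e$.

```latex
% Assumed notation (not from the post): \Pi, e_i, S as described in the lead-in.
\pi \text{ maximizes its average test score}
\iff
\forall \pi' \in \Pi:\quad
\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}\big[S(\pi', e_i)\big]
\;\le\;
\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}\big[S(\pi, e_i)\big]
```

Both sides are evaluated on the same episodes $e_1, \dots, e_n$. Roughly, an action whose only effect is to make later episodes easier also helps every alternative $\pi'$ scored on those episodes, so it earns no advantage in this comparison. This is the sense of (4).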

Why might episodic RL capabilities be available?

To develop reliable general episodic RL systems, we would need:

  1. Methods for ensuring that trained policies notice when they may fail to generalize from a training context to a test context, such as KWIK learning and other methods for identifying inductive ambiguities.
  2. Better policy classes (e.g. neural network architectures) and theoretical analyses of them. For example, perhaps we could show that "adversarial" hypotheses are not problematic, since there will exist non-adversarial variants of adversarial hypotheses.
  3. Better optimization algorithms and theoretical analyses of them (which could prove e.g. non-collusion).

The hope is most of these technologies will be developed in the course of AI capabilities research. I find some aspects of these problems compelling as AI alignment research, especially those in (1), both because it isn't guaranteed that AI capabilities researchers will automatically develop enough theory for understanding highly advanced episodic RL agents, and because it is useful to have concrete models of things like inductive ambiguity identification available ahead of time (so more AI alignment research can build on them).

Using general capabilities to reason about hypothetical aligned AGI designs

Ideally, a proposed aligned AGI design states: "if we had access to algorithms implementing general capabilities X, Y, and Z, then we could arrange these algorithms such that, if all algorithms actually deliver these capabilities, then the resulting system is pretty good for human values". In practice, it is usually difficult to state a proposal this clearly, so initial research will be aimed at getting to the point that a proposal of this form can be stated.

If we're assuming that future AI systems only have episodic RL capabilities, then the proposal will say something like: "if we arrange episodic RL systems in the right way, then if they play in an approximate correlated equilibrium (within each episode), then the system will do good things." I think this approach allows both "unitary" systems and "decomposed" systems, and both "tools" and "agents", while making everything precise enough that we don't need to rely on the intuitive meanings of these words to reason about advanced AI systems.
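
For reference, the standard definition of an approximate correlated equilibrium can be written as below. Applying it here assumes that the players are the individual episodic RL systems, that $A_i$ is the set of actions available to system $i$ within an episode, that $u_i$ is its within-episode score, and that $\mu$ is the joint distribution over actions induced by the trained policies. That mapping is an illustrative reading, not something the argument above depends on in this exact form.

```latex
% \mu is an \varepsilon-correlated equilibrium if no player gains more than
% \varepsilon by applying any fixed remapping \phi_i to its own recommended action.
\forall i,\ \forall \phi_i : A_i \to A_i:\quad
\mathbb{E}_{a \sim \mu}\big[u_i(\phi_i(a_i), a_{-i})\big]
\;\le\;
\mathbb{E}_{a \sim \mu}\big[u_i(a_i, a_{-i})\big] + \varepsilon
```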

What other capabilities could be available?

It's worth thinking about additional general capabilities that highly advanced AI systems might have.

  1. We could suppose that systems are good at transfer learning: they can generalize well from a training context to a test context, without the training context being similar enough to the test context that we'd expect good training performance to automatically imply good test performance. This is clearly possible in some cases, and impossible in other cases, but it's not clear where the boundary is.

  2. We could suppose that systems are good at learning "natural" structure in data (e.g. clusters and factors) using unsupervised learning.

  3. We could suppose that systems will be good at pursuing goals defined using logic, even when there aren't many training examples of correct logical inference to generalize from. Paul describes a version of this problem here. Additionally, much of MIRI's work in logical uncertainty and decision theory is relevant to designing agents that pursue goals defined using logic.

  4. We could suppose that systems will be good at pursuing environmental goals (such as manufacturing paperclips). At the moment, we have very little idea of how one might specify such a goal, but it is at least imaginable that some future theory would allow us to specify the goal of manufacturing paperclips. I expect any theory for how to do this to rely on (2) or (3), but I'm not sure.

This list isn't exhaustive; there are probably lots of other general capabilities we could assume.

Some useful AI alignment research may assume access to these alternative capabilities (for example, one may try to define conservative concepts using (2)), or attempt to build these capabilities out of other capabilities (for example, one may try to build capability (4) out of capabilities (1), (2), and (3)). This research is somewhat limited by the fact that we don't have good formal models for studying hypothetical systems having these properties. For example, at this time, it is difficult (though not impossible) to evaluate proposals for low-impact AI, since we don't understand environmental goals, and we don't know whether the AI will be able to find "natural" features of the world that include features that humans care about.

Unfortunately, there doesn't appear to be a very good argument that we will probably have any of these capabilities in the future. It seems more likely than not to me that some general capability that I listed will be highly relevant to AI alignment, but I'm highly uncertain.

What's the alternative?

If someone is attempting to design aligned AGI systems, what could they do other than reduce the problem to general capabilities like the ones I have talked about? Here I list some things that don't obviously fit into this model:

  1. We could test out AI systems and use systems that seem to have better behavior. But I see this as a form of training: we're creating policies and filtering out the ones that we don't like (i.e. giving them a higher training score if it looks like they are doing good things, and finding policies that have a high training score). If we filter AI systems aggressively, then we're basically just training things to maximize training score, and are back in an episodic RL context. If we don't filter them aggressively, then this limits how much useful work the filtering can do (so we'd need another argument for why things will go well).
  2. We could assume that AI systems will be transparent. I don't have a good model for what transparency for advanced AI systems would look like. At least some ways of getting transparency (e.g. anything that only uses the I/O behavior of the policy) reduce to episodic RL. Additionally, while it is easy for policies to be transparent if the policy class is relatively simple, complex policy classes are required to effectively make use of the capabilities of an advanced system. Regardless, transparency is likely to be important despite the current lack of formal models for understanding it.

Perhaps there are more ideas like these. Overall, I am skeptical of any proposal whose basic argument for why the system ought to behave as intended relies on assumptions beyond the system having access to suitable capabilities. Once we have a basic argument for why things should probably mostly work, we can elaborate on this argument by e.g. arguing that transparency will catch some violations of the assumptions required for the basic argument.



I feel like I might be missing the big picture of what you're saying here.

Why do you focus on episodic RL? Is it because you don't want the AI to affect the choice of future test scenarios? If so, isn't this approach very restrictive since it precludes the AI from considering long-term consequences?

A central point which you don't seem to address here is where to get the "reward" signal. Unless you mean it's always generated directly by a human operator? But such an approach seems very vulnerable to perverse incentives (AI manipulating humans / taking control of reward button). I think that it should be solved by some variant of IRL but you don't discuss this.

Finally, a technical nitpick: It's highly uncertain whether there is such a thing as a "good SAT solver" since there is no way to generate training data. More precisely, we know that there are optimal estimators for SAT with advice and there are no -optimal estimators without advice but we don't know whether there are -optimal estimators (see Discussion section here).

EDIT: Actually we know there is no -optimal estimator. It might still be that there are optimal estimators for somewhat more special distributions which are still "morally generic" in a sense.

  • I focus on episodic RL because you can get lots of training examples. So you can use statistical learning theory to get nice bounds on performance. With long-term planning you can't get guarantees through statistical learning theory alone (there are not nearly enough data points for long-term plans working or not working); you need some other approach (outside the paradigm of current machine learning).
  • I'm imagining something like ALBA or human-imitation. IRL is one of those capabilities I'm not comfortable assuming we'll have (I agree with this post)
  • Hmm, it seems like the following procedure should work: Say we have an infinite list of SAT problems, a finite list of experts who try to solve SAT problems, and we want to get low regret (don't solve many fewer than the best expert does). This is basically a bandit problem: we treat the experts as bandits, and interpret "choosing slot machine x" as "use expert x to try to solve the SAT problem". So we can apply adversarial bandit algorithms to do well on this task asymptotically. I realize this is a simple model class, but it seems likely that a training procedure like this would generalize to e.g. neural nets. (I admit I haven't read your paper yet).
  • You have guarantees for non-stationary environments in online learning and multi-armed bandits. I am just now working on how to transfer this to reinforcement learning. Briefly, you use an adversarial multi-armed bandit algorithm where the "arms" are policies, the reward is the undiscounted sum of rewards and the probability to switch policy (moving to the next round of the bandit) is the ratio between the values of the time discount function in consecutive moments of time, so that the finite time undiscounted reward is an unbiased estimate of the infinite time discounted reward. This means you switch policy roughly once in a horizon.
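
A minimal sketch of the bandit-over-experts procedure for SAT described in the bullets above. The comment does not name a specific algorithm; Exp3 is used here as one standard adversarial bandit algorithm, and that choice is an assumption. The three toy experts, the fixed cycle of problems, and all function names are hypothetical and exist only to make the example runnable.

```python
import math
import random
from itertools import product


def satisfies(assignment, clauses):
    """True if the assignment satisfies every clause (literal -k means not x_k)."""
    return assignment is not None and all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses)


def all_true(num_vars, clauses):
    return {v: True for v in range(1, num_vars + 1)}


def all_false(num_vars, clauses):
    return {v: False for v in range(1, num_vars + 1)}


def brute_force(num_vars, clauses):
    for bits in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(bits, start=1))
        if satisfies(assignment, clauses):
            return assignment
    return None


# Hypothetical toy data: three experts and a repeating stream of tiny problems.
EXPERTS = [all_true, all_false, brute_force]
PROBLEMS = [(2, [[1, 2], [-1]]), (2, [[1], [2]]), (2, [[-1], [-2]])]


def exp3_over_experts(experts, problems, num_rounds, gamma=0.1, seed=0):
    """Exp3: each round, pick one expert (arm) and observe only its reward."""
    rng = random.Random(seed)
    k = len(experts)
    weights = [1.0] * k
    probs = [1.0 / k] * k
    solved = 0
    for t in range(num_rounds):
        total = sum(weights)
        # Mix the weight distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        num_vars, clauses = problems[t % len(problems)]
        reward = 1.0 if satisfies(experts[arm](num_vars, clauses), clauses) else 0.0
        solved += int(reward)
        # Importance-weighted reward estimate, applied to the chosen arm only.
        weights[arm] *= math.exp(gamma * (reward / probs[arm]) / k)
        # Rescale so the largest weight is 1; this leaves probs unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    return solved, probs
```

Calling `exp3_over_experts(EXPERTS, PROBLEMS, num_rounds=3000)` returns the number of problems solved and the final sampling distribution over experts. Because a fraction `gamma` of the probability mass is always spent on uniform exploration, the learner keeps sampling weaker experts occasionally, which is part of the price of the adversarial regret guarantee.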

  • I agree that defining imperfect rationality is a challenge, but I think it has to be solvable, otherwise I don't understand what we mean by "human values" at all. I think that including bounded computational resources already goes a significant way towards modeling imperfection.

  • Of course we can train an algorithm to solve the candid search problem of SAT, i.e. find satisfying assignments if it can. What we can't (easily) do is train an algorithm to solve the decision problem, i.e. to tell us whether a circuit is satisfiable or not. Note that it might be possible to tell that a circuit is satisfiable even if it's impossible to find the satisfying assignments (e.g. the circuit applies a one-way permutation and checks that the result is equal to a fixed string).

  • It seems like if you explore for the rest of your horizon, then by definition you explore for most of the time you actually care about. That seems bad. Perhaps I'm misunderstanding the proposal.

  • I agree that it's solvable; the question is whether it's any easier to do IRL well than it is to solve the AI alignment problem some other way. That seems unclear to me (it seems like doing IRL really well probably requires doing a lot of cognitive science and moral philosophy).

  • I agree that this seems hard to do as an episodic RL problem. It seems like we would need additional theoretical insights to know how to do this; we shouldn't expect AI capabilities research in the current paradigm to automatically deliver this capability.

Re 1st bullet, I'm not entirely certain I understand the nature of your objection.

The agent I describe is asymptotically optimal in the sense that for any policy , given the discount function, the reward obtained by the agent from time onwards and the reward that would be obtained by the agent from time onwards if it switched to following policy at time , we have that is bounded by something that goes to 0 as goes to for some family of time distributions which depends on (for geometric discount is uniform from 0 to ).

It's true that this desideratum seems much too weak for FAI since the agent would take much too long to learn. Instead, we want the agent to perform well already on the 1st horizon. This indeed requires a more sophisticated model.

This model can be considered analogous to episodic RL where horizons replace episodes. However, one difference of principle is that the agent retains information about the state of the environment when a new episode begins. I think this difference is a genuine advantage over "pure" episodic learning.

It seems like we're mostly on the same page with this proposal. Probably something that's going on is that the notion of "episodic RL" in my head is quite broad, to the point where it includes things like taking into account an ever-expanding history (each episode is "do the right thing in the next round, given the history"). But at that point it's probably better to use a different formalism, such as the one you describe.

My objection was the one you acknowledge: "this desideratum seems much too weak for FAI".