rohinmshah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter. http://rohinshah.com/

Sequences

Value Learning
Alignment Newsletter

Comments

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Perhaps I should start saying "Guys, can we encourage folks to work on both issues please, so that people who care about x-risk have more ways to show up and professionally matter?", and maybe that will trigger less pushback of the form "No, alignment is the most important thing"... 

I think that probably would be true.

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment (which I'm not), which triggers a "Yes, this is the most valuable thing" reply.

Fwiw my reaction is not "Critch thinks Rohin should do something else", it's more like "Critch is saying something I believe to be false on an important topic that lots of other people will read". I generally want us as a community to converge to true beliefs on important things (part of my motivation for writing a newsletter) and so then I'd say "but actually alignment still seems like the most valuable thing on the margin because of X, Y and Z".

(I've had enough conversations with you at this point to know the axes of disagreement, and I think you've convinced me that "which one is better on the margin" is not actually that important a question to get an answer to. So now I don't feel as much of an urge to respond that way. But that's how I started out.)

Homogeneity vs. heterogeneity in AI takeoff scenarios

Not sure why I didn't respond to this, sorry.

I agree with the claim "we may not have an AI system that tries and fails to take over the world (i.e. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc) before getting an AI system that tries and succeeds at taking over the world".

I don't see this claim as particularly relevant to predicting the future.

Another (outer) alignment failure story

Planned opinion (shared with What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs))

Both the previous story and this one seem quite similar to each other, and seem pretty reasonable to me as a description of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.

A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment, or to work on coordination? I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me. Nonetheless, my guess is that on the current margin, for technical research in particular, holding all else equal, it is more impactful to focus on intent alignment. You can see a much more vigorous discussion in e.g. [this comment thread](https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=3czsvErCYfvJ6bBwf).

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Planned summary for the Alignment Newsletter:

A robust agent-agnostic process (RAAP) is a process that robustly leads to an outcome, without being very sensitive to the details of exactly which agents participate in the process, or how they work. This is illustrated through a “Production Web” failure story, which roughly goes as follows:

A breakthrough in AI technology leads to a wave of automation of $JOBTYPE (e.g. management) jobs. Any companies that don’t adopt this automation are outcompeted, and so soon most of these jobs are completely automated. This leads to significant gains at these companies and higher growth rates. These semi-automated companies trade amongst each other frequently, and a new generation of "precision manufacturing" companies arises that can build almost anything using robots given the right raw materials. A few companies develop new software that can automate $OTHERJOB (e.g. engineering) jobs. Within a few years, nearly all human workers have been replaced.

These companies are now roughly maximizing production within their various industry sectors. Lots of goods are produced and sold to humans at incredibly cheap prices. However, we can’t understand how exactly this is happening. Even board members of the fully mechanized companies can’t tell whether the companies are serving or merely appeasing humanity; government regulators have no chance.

We do realize that the companies are maximizing objectives that are incompatible with preserving our long-term well-being and existence, but we can’t do anything about it because the companies are both well-defended and essential for our basic needs. Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen…) gradually become depleted or destroyed, until humans can no longer survive.

Notice that in this story it didn’t really matter what job type got automated first (nor did it matter which specific companies took advantage of the automation). This is the defining feature of a RAAP -- the same general story arises even if you change around the agents that are participating in the process. In particular, in this case competitive pressure to increase production acts as a “control loop” that ensures the same outcome happens, regardless of the exact details about which agents are involved.

Planned opinion (shared with Another (outer) alignment failure story):

Both the previous story and this one seem quite similar to each other, and seem pretty reasonable to me as a description of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.

A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment, or to work on coordination? I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me. Nonetheless, my guess is that on the current margin, for technical research in particular, holding all else equal, it is more impactful to focus on intent alignment. You can see a much more vigorous discussion in e.g. [this comment thread](https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=3czsvErCYfvJ6bBwf).

Another (outer) alignment failure story

Planned summary for the Alignment Newsletter:

Suppose we train AI systems to perform task T by having humans look at the results that the AI system achieves and evaluating how well the AI has performed task T. Suppose further that AI systems generalize “correctly” such that even in new situations they are still taking those actions that they predict we will evaluate as good. This does not mean that the systems are aligned: they would still deceive us into _thinking_ things are great when they actually are not. This post presents a more detailed story for how such AI systems can lead to extinction or complete human disempowerment. It’s relatively short, and a lot of the force comes from the specific details that I’m not going to summarize, so I do recommend you read it in full. I’ll be explaining a very abstract version below.

The core aspects of this story are:
1. Economic activity accelerates, leading to higher and higher growth rates, enabled by more and more automation through AI.
2. Throughout this process, we see some failures of AI systems where the AI system takes some action that initially looks good but we later find out was quite bad (e.g. investing in a Ponzi scheme that the AI knows is a Ponzi scheme but the human doesn’t).
3. Despite this failure mode being known and lots of work being done on the problem, we are unable to find a good conceptual solution. The best we can do is to build better reward functions, sensors, measurement devices, checks and balances, etc. in order to provide better reward functions for agents and make it harder for them to trick us into thinking their actions are good when they are not.
4. Unfortunately, since the proportion of AI work keeps increasing relative to human work, this extra measurement capacity doesn’t work forever. Eventually, the AI systems are able to completely deceive all of our sensors, such that we can’t distinguish between worlds that are actually good and worlds which only appear good. Humans are dead or disempowered at this point.

(Again, the full story has much more detail.)

AXRP Episode 6 - Debate and Imitative Generalization with Beth Barnes

Planned summary for the Alignment Newsletter:

This podcast covers a bunch of topics, such as <@debate@>(@AI safety via debate@), <@cross examination@>(@Writeup: Progress on AI Safety via Debate@), <@HCH@>(@Humans Consulting HCH@), <@iterated amplification@>(@Supervising strong learners by amplifying weak experts@), and <@imitative generalization@>(@Imitative Generalisation (AKA 'Learning the Prior')@) (aka [learning the prior](https://www.alignmentforum.org/posts/SL9mKhgdmDKXmxwE4/learning-the-prior) ([AN #109](https://mailchi.mp/ee62c1c9e331/an-109teaching-neural-nets-to-generalize-the-way-humans-would))), along with themes about <@universality@>(@Towards formalizing universality@). Recommended for getting a broad overview of this particular area of AI alignment.

My research methodology

I agree this involves discretion [...] So instead I'm doing some in between thing

Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet).

I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.

(Probably I could have been clearer about this in the original opinion.)

My research methodology

In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."

To fill in the details more:

Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space that it can't do anything unsafe).

It seems like in some sense the game is in constraining the agent's cognition to be such that it is "safe" and "useful". The point of designing alignment algorithms is to impose such constraints, without requiring so much effort as to make the resulting agent useless / uncompetitive.

However, there are always going to be some plausible circumstances that we didn't consider (even if we're talking about amplified humans, which are still bounded agents). Even if we had maximal ability to place constraints on agent cognition, whatever constraints we do place won't have been tested in these unconsidered plausible circumstances. It is always possible that one misfires in a way that makes the agent do something unsafe.

(This wouldn't be true if we had some sort of proof against misfiring, that doesn't assume anything about what circumstances the agent experiences, but that seems ~impossible to get. I'm pretty sure you agree with that.)

More generally, this story is going to be something like:

  1. Suppose you trained your model M to do X using algorithm A.
  2. Unfortunately, when designing algorithm A / constraining M with A, you (or amplified-you) failed to consider circumstance C as a possible situation that might happen.
  3. As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.
  4. Circumstance C then happens in the real world, leading to an actual failure.
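A deliberately trivial instantiation of the four steps, as a sketch only. Everything here is hypothetical and chosen for illustration: X is "compute the absolute value", H is the identity function, the considered circumstances are a handful of non-negative inputs, and C is a negative input. The designer in this toy is far weaker than an amplified human, so it shows the shape of the story rather than a realistic failure.

```python
def intended_behavior(x: int) -> int:
    """X: the behavior we wanted the model to have (toy choice: absolute value)."""
    return abs(x)


def learned_heuristic(x: int) -> int:
    """H: a heuristic that matches X on every circumstance that was considered."""
    return x


# Circumstances the designer thought of while constructing algorithm A (toy values).
CONSIDERED_CIRCUMSTANCES = [0, 1, 2, 5]

# Circumstance C: a situation nobody considered (toy value).
CIRCUMSTANCE_C = -3

# Step 3: H works in all the circumstances that were considered.
passes_all_checks = all(
    learned_heuristic(x) == intended_behavior(x) for x in CONSIDERED_CIRCUMSTANCES
)

# Step 4: C happens, and H diverges from the intended behavior.
fails_in_c = learned_heuristic(CIRCUMSTANCE_C) != intended_behavior(CIRCUMSTANCE_C)

print(passes_all_checks, fails_in_c)
```

The checks in step 3 say nothing about C, because C was never among the inputs they covered.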

Obviously, I can't usually instantiate M, X, A, C, and H such that the story works for an amplified human (since they can presumably think of anything I can think of). And I'm not arguing that any of this is probable. However, it seems to meet your bar of "plausible":

there is some way to fill in the rest of the details that's consistent with everything I know about the world.

EDIT: Or maybe more accurately, I'm not sure how exactly the stories you tell are different / more concrete than the ones above.

----

When I say you have "a better defined sense of what does and doesn't count as a valid step 2", I mean that there's something in your head that disallows the story I wrote above, but allows the stories that you generally use, and I don't know what that something is; and that's why I would have a hard time applying your methodology myself.

----

Possible analogy / intuition pump for the general story I gave above: Human cognition is only competent in particular domains and must be relearned in new domains (like protein folding) or new circumstances (like when COVID-19 hits), and sometimes human cognition isn't up to the task (like when being teleported to a universe with different physics and immediately dying), or doesn't do so in a way that agrees with other humans (like how some humans would push a button that automatically wireheads everyone for all time, while others would find that abhorrent).

My research methodology

Planned summary for the Alignment Newsletter:

This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:

1. Come up with some alignment algorithm that solves the issues identified so far.

2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.

This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won't happen. Given such a scenario, we need to argue why no failure in the same class as that scenario will happen, or we need to go back to step 1 and come up with a new algorithm.
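The alternation can be sketched as a loop. This is only an illustration of the control flow, not the post's own formalization: `alternate_steps`, `propose`, `find_failure`, and the toy lookup tables are hypothetical names introduced here, and the sketch leaves out the other branch described above (arguing that no failure in the same class will happen).

```python
from typing import Callable, List, Optional, Tuple


def alternate_steps(
    propose: Callable[[List[str]], str],
    find_failure: Callable[[str], Optional[str]],
    max_rounds: int = 10,
) -> Tuple[str, List[str]]:
    """Alternate between proposing an algorithm and searching for a failure story."""
    known_failures: List[str] = []
    algorithm = propose(known_failures)        # step 1
    for _ in range(max_rounds):
        story = find_failure(algorithm)        # step 2
        if story is None:
            break                              # no story found (yet)
        known_failures.append(story)
        algorithm = propose(known_failures)    # back to step 1
    return algorithm, known_failures


# Toy stand-ins for the two steps. A missing entry in TOY_FAILURES only means
# this toy has no story recorded, not that the algorithm is actually safe.
PROPOSALS = [
    "RL with a handcoded reward",
    "RL from human feedback",
    "iterated amplification",
]
TOY_FAILURES = {
    "RL with a handcoded reward": "specification gaming",
    "RL from human feedback": "bad actions humans can't recognize as bad",
}


def toy_propose(known_failures: List[str]) -> str:
    # Pick the next proposal once the earlier ones have known failure stories.
    return PROPOSALS[len(known_failures)]


def toy_find_failure(algorithm: str) -> Optional[str]:
    return TOY_FAILURES.get(algorithm)


final_algorithm, stories = alternate_steps(toy_propose, toy_find_failure)
print(final_algorithm, stories)
```

In the real methodology both steps are conceptual work done by a researcher; the callables are placeholders for that work.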

This methodology could play out as follows:

Step 1: RL with a handcoded reward function.

Step 2: This is vulnerable to <@specification gaming@>(@Specification gaming examples in AI@).

Step 1: RL from human preferences over behavior, or other forms of human feedback.

Step 2: The system might still pursue actions that are bad that humans can't recognize as bad. For example, it might write a well-researched report on whether fetuses are moral patients, which intuitively seems good (assuming the research is good). However, this would be quite bad if the AI wrote the report because it calculated that it would increase partisanship leading to civil war.

Step 1: Use iterated amplification to construct a feedback signal that is "smarter" than the AI system it is training.

Step 2: The system might pick up on <@inaccessible information@>(@Inaccessible information@) that the amplified overseer cannot find. For example, it might be able to learn a language just by staring at a large pile of data in that language, and then seek power whenever working in that language, and the amplified overseer may not be able to detect this.

Step 1: Use <@imitative generalization@>(@Imitative Generalisation (AKA 'Learning the Prior')@) so that the human overseer can leverage facts that can be learned by induction / pattern matching, which neural nets are great at.

Step 2: Since imitative generalization ends up learning a description of facts for some dataset, it may learn low-level facts useful for prediction on the dataset, while not including the high-level facts that tell us how the low-level facts connect to things we care about. 

The post also talks about various possible objections you might have, which I’m not going to summarize here.

Planned opinion:

I'm a big fan of having a candidate algorithm in mind when reasoning about alignment. It is a lot more concrete, which makes it easier to make progress and not get lost, relative to generic reasoning from just the assumption that the AI system is superintelligent.

I'm less clear on how exactly you move between the two steps -- from my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learns, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2. Our algorithms are instead meant to chip away at the problem, by continually increasing our control over these patterns of thought. It seems like the author has a better defined sense of what does and doesn't count as a valid step 2, and that makes this methodology more fruitful for him than it would be for me. More discussion [here](https://www.alignmentforum.org/posts/EF5M6CmKRd6qZk27Z/my-research-methodology?commentId=8Hq4GJtnPzpoALNtk).
