AI-assisted alignment proposals require specific decomposition of capabilities

RobertM

Some alignment strategies involve using weaker AI systems to help with alignment. I'm interested in figuring out what dependencies each plan has, i.e. what needs to be true for the plan to succeed. These are likely to vary across plans. (Separate, but related, is the question of what assumptions those working on the plans appear to be operating under, and how well those line up with the actual dependencies.)

There are two frames I've been using to think about this lately. One is to ask, "What assumptions does this approach make about the shape of the alignment problem itself?" The other is more specifically about how capabilities might (or might not) decompose, both in theory and in practice. I'm more interested in exploring the second frame in this post; I'm mostly using the first frame as a way to set aside other considerations related to whether any given plan is viable.

I'm also operating with a couple of broader background assumptions; absent these you might look at my arguments and be confused:

Unaligned, broadly superhuman AGI ("ASI") is a coherent, possible thing to bring about in our universe
If dropped on our heads tomorrow, unaligned ASI would with very high likelihood cause human extinction/loss of ~all value in the lightcone/etc.

Here are some broad classes of strategies I've seen proposed:

AI as Research Assistant
AI as Independent Researcher
RLAIF
Scalable Oversight (RLHF + RLAIF + IDA + ???)

This breakdown is kind of dumb - the first two are less "strategies" and more "desired end states". Also, I'm sure I'm missing some things. (I have read An overview of 11 proposals for building safe advanced AI, but many of the proposals feel similar to each other, or don't have an obvious path to actually being implemented. If you think one or more of those proposals is both possible to implement right now and differs meaningfully in terms of its dependencies, please let me know.)

AI as Research Assistant

"Shape of Problem" Dependencies

I don't think I've seen anyone argue that employing these kinds of strategies will do very much to move the needle from "default outcome is the bad one" to "quite likely we succeed", so trying to reverse-engineer the claims that argument would be making about what the alignment problem actually is doesn't feel like a good use of time.

I basically don't have any objection in principle to the idea that we could use weaker AI systems (including current ones) to speed up the efforts of existing researchers, and maybe even do so differentially with respect to capabilities in some cases. I just don't think that gets us there.

"How Capabilities Decompose" Dependencies

I don't think this relies on any facts about how future capabilities decompose; we have existing capabilities that seem likely to be (at least marginally) helpful if a good UI/UX wrapper can be created for them.

AI as Independent Researcher

The closest thing to this that I can think of is some form of STEM AI. I don't think I've seen anyone expand on the details of such a proposal in a way that overcomes the serious difficulties noted in that post, so I'll skip the decompositions in favor of this excerpt:

However, if one of the major use cases for your first advanced AI is helping to align your second advanced AI, STEM AI seems to perform quite poorly on that metric, as it advances our technology without also advancing our understanding of alignment. In particular, unlike every other approach on this list, STEM AI can't be used to do alignment work, as its alignment guarantees are explicitly coming from it not modeling or thinking about humans in any way, including aligning AIs with them. Thus, STEM AI could potentially create a vulnerable world situation where the powerful technology produced using the STEM AI makes it much easier to build advanced AI systems, without also making it more likely that they will be aligned.

RLAIF^[1]

An example would be Constitutional AI. I'm not sure if anyone who accepts my two background assumptions thinks this will work by itself as an end-to-end solution, but putting it here just in case.

"Shape of Problem" Dependencies

Some (not mutually exclusive) possibilities for what has to be true about the alignment problem for this to work:

In the limit of RL on AI feedback generated from some carefully worded set of initial prompts, most (or all) kinds of internal cognition will converge to something human-friendly before you get something powerful enough to cause x-risk. Then you can use that aligned AI to help you future-proof the next generation.
- Note that this is actually two assumptions:
  - That the convergence can hit the correct target of human-friendly values
  - That the convergence will happen before crossing the dangerous capabilities threshold
The values of an agent which was bootstrapped from a foundation model trained on a human-generated corpus will be substantially influenced by the object-level content of the corpus, and RL is mostly just pushing it into a better region of an already-human-compatible space of values.
Moral realism^[2], and for some reason the agent cares.
Not moral realism, but the practical equivalent - orthogonality may be technically true, but the default thing that comes out of making something sufficiently smart is something human-friendly, and you'd have to have a very good mechanistic understanding of intelligence to create something unfriendly.

"How Capabilities Decompose" Dependencies

None of the shapes of the problem that RLAIF working would imply seem to require any assumptions on this dimension.

Scalable Oversight

Described at a high level as a strategy being pursued by OpenAI and Anthropic, among others.

"Shape of Problem" Dependencies

The proposals I'm familiar with don't seem to strictly require assumptions not related to the decomposition of capabilities, but the more complicated the alignment problem is, the more assumptions you need to make about how capabilities decompose for this class of approaches to be tractable.

"How Capabilities Decompose" Dependencies

The biggest requirement here is that there exists a relatively clean way to separate the kind of cognition that reasons effectively about the alignment problem from cognition that would make the system dangerous, and that we can make that happen in practice. This decomposition seems more likely to be possible^[3] in worlds where the alignment problem is mostly an engineering challenge, with very little in the way of anything resembling "philosophy" required. I think this because the cognitive machinery necessary to solve novel philosophical challenges seems much broader and more likely to be dangerous than what would be necessary to solve more concretely specified engineering challenges. The kind of thing I mean by "engineering challenge" is something like "create a system which can reliably write code which passes some set of unit tests given to it", whereas by "philosophy challenge" I mean something like "figure out what human values are", and all the normative bits that remain after you dissolve those questions into empirical ones.

As far as I understand current scalable oversight proposals, they mostly aren't structured in ways which acknowledge this requirement. On my model, the less "narrow" your system is, the more powerful its cognition needs to be to solve any specific problem, when compared to the narrowest possible system trained specifically for that problem.

As a toy example, let's look at the case of chess. To the extent one can meaningfully evaluate how well GPT-4 plays chess, training a model specifically to play chess to a similar level (without building it on top of an existing foundation model) would give you a model with much weaker cognition. Correspondingly, a GPT-n which is capable of giving you useful alignment ideas (or of usefully evaluating the ideas spit out by another instance of itself) must be much more broadly capable than a narrow system trained specifically for that task^[4].

But current proposals generally seem to start with, "First, train a foundation model capable of reasoning across a broad range of tasks and domains". Then some further refinements are made, but none of those look to me like something that might differentially train a specific narrow capability helpful for alignment research, rather than just more reliably eliciting existing capabilities.

Some possibilities that resolve this tension:

I'm misunderstanding the proposals - there is in fact a step which narrowly trains capabilities necessary for alignment, with at least some justification for why the proposed training regime works to train the capabilities narrowly.
It turns out that humans are so bad at "novel thinking" with respect to alignment research that we can get useful intellectual output from some combination of systems doing generation + distillation. I have several problems with this:
- To the extent that this requires an alignment researcher in the loop to evaluate the output, this seems difficult to scale. Evaluation might be easier than generation, but is still quite challenging to do quickly and reliably. We have direct evidence of this, with the current state of the alignment field: most agendas have had very little external engagement. I currently think this tops out at "AI as a Research Assistant" levels of helpfulness.
- The less the combined system requires an alignment researcher in the loop, the more skeptical I am that the combined set of capabilities is safe.
- To the extent that this can be done for alignment research, I expect it would be easier to do in more paradigmatic domains. A demonstration of human-or-better performance of the same kind of AI-assisted research in an "easier" domain would be a significant positive update; the (ongoing) lack of such a demonstration is a minor negative update.
A rejection of this entire framing. I'm not sure what such a rejection would look like, since in practice it works out to believing you'll get the necessary capabilities without the danger, either deliberately or by default.

If you are familiar with the details of an AI-assisted alignment scheme - regardless of whether it maps to one of the four strategies described above - and have thoughts about what that scheme's success requires (or substantially depends on) in terms of capabilities decomposition, I would be very interested to hear your thoughts.

If you think this line of thinking is fundamentally mistaken with respect to evaluating proposals like this, I'd like to hear arguments for that as well.

^[1] Reinforcement Learning from AI Feedback.
^[2] I don't think I've actually seen anyone use this in an argument, but it does seem like it might help if it were true.
^[3] Though still not obviously possible.
^[4] One problem, of course, is that we don't know how to train a narrow system to do that, or even how narrow such a system could be, since that depends on the actual cognitive capabilities required by the task. Relatedly, I think the fact that we don't know how to construct a narrower training regime for this task is evidence in favor of the problem being more difficult than a set of relatively well-understood engineering challenges.

Martín Soto (3y)

I don't think I completely grok the distinction you're trying to point at with "Shape of problem" vs "How capabilities decompose".

I guess "Shape of problem" is about systematic incentives that will be present, like inductive biases in our training procedures, while "How capabilities decompose" is about how easy/natural it is for a mind to solve the task without solving other tasks. The latter is about "minds in general" and the former about "minds trained by us"?

But then I don't understand some of your classifications. For example, how is "it stumbles into human-friendliness before x-risk capability" a claim about shape of the problem (instead of also depending on how hard the tasks of making humans extinct, understanding/imitating humans, etc. are), while things like "IDA does/doesn't converge to deception (because of obfuscated arguments etc.)" (which would be a part of Scalable Oversight) are not shape of the problem, but capabilities decomposition?

I feel like this is a pretty blurry line to classify evidence (and thus maybe not the most useful, but I'm not sure).

Moral realism^[2], and for some reason the agent cares.
Not moral realism, but the practical equivalent - orthogonality may be technically true, but the default thing that comes out of making something sufficiently smart is something human-friendly, and you'd have to have a very good mechanistic understanding of intelligence to create something unfriendly.

How would the first be different from the second? What are you understanding by "moral realism makes the AI human-friendly" that is not just "practical convergence"?

Are you picturing something like "the AI reasons enough / reads this text (which would also convince humans if presented correctly) and becomes completely convinced that a certain moral theory is true, and that it must follow it (a la Descartes with God)"? Because that's just a particular way of getting "practical convergence". You probably agree with that (as you mentioned, they are not mutually exclusive), but I'm interested in whether you understood anything else by moral realism here.
