My simple model for Alignment vs Capability

ryan_b

I have a simple model of the alignment vs. capabilities question. I am writing it down because after chewing on Michael Nielsen's post about existential risk from AI, I thought it was great but was unsatisfied with the idea of an alignment dilemma. I wasn't the only person to feel that way, but neither was I able to find any satisfactory source describing the (tension? dichotomy? tradeoff?). The real rub of Nielsen's dilemma for me is that it was expressed in terms of accelerationism, which is a form of one-dialism, and one-dialism is wrong.

I note there is nothing actually new here; I am quite sure these considerations were all covered, and more besides, back when MIRI was getting started. Yet we weren't speaking in terms of alignment vs capability then, and capabilities were a less urgent concern, and regardless of what we have said before here, people aren't talking about models in public now. So a simple model it is.

Background

The way I understand the alignment vs capabilities question is that capabilities research eventually cashes out as AGI, and alignment research cashes out as aligned AGI, so in this context alignment research is effectively a subset of capabilities research. I expect this to apply more or less continuously, where increasingly-capable-but-not-AGIs will still need to be aligned.

What I want to accomplish with the model is to have something we can use to make heuristic judgments about research: in particular things like whether to support, fund, or pursue it, and how to make comparisons between different research agendas. This is not for doing things like predicting timelines directly, or the course of AI research overall, or the future of a given research org.

Reject Acceleration

My chief objection to thinking and speaking in terms of acceleration is that it is causally useless. I make an analogy to GDP as an economic metric: if a particular industry is growing very fast, and we want that industry to succeed but also to avoid catastrophic industrial accidents, one scalar number that refers to a big pile of all the activity adds nothing. We cannot build safer industrial facilities by saying "but GDP should go up less." The same holds when talking to investors, and also to regulators. Acceleration is purely a rhetorical device, which is why we should be unsurprised to see it appear in the public conversation, and also why it should be treated with caution.

Think Paths to Success Instead

I prefer to think in terms of paths to success, in the sense (but not rigorous definition) of paths through phase space. Here we are thinking about paths through the phase space of the world, specifically those which lead to AGI and the smaller subspace of aligned AGI. Further, we are just thinking about the research, as distinct from thinking about orgs, number of researchers, funding, etc.

I frequently recall ET Jaynes's paper Macroscopic Prediction, and in particular this bit:

Gibbs' variational principle is, therefore, so simple in rationale that one almost hesitates to utter such a triviality; it says "predict that final state that can be realized by Nature in the greatest number of ways, while agreeing with your macroscopic information."

In our context, something becomes more likely the more paths there are to accomplishing it.

So how do we count the paths?

Count the Paths

We want to identify the actionable parts of research. For our purposes, "leads to future research" and "applicable to real systems" both count as actionable. I think directly about angles of attack in the sense of Hamming's important research problem criterion, where what Hamming is talking about is "a way to make progress on the problem."

What Hamming calls angles of attack are what I think an alignment research agenda produces: ways to make progress on the problem of alignment (likewise for major AI labs: ways to make progress on the problem of capability). Mechanistic Interpretability is a research agenda. Suppose researchers working on it discover a new behavior they call schlokking.

Example 1: Another paper expands on the schlokking behavior. This counts as a path because it led to future research.

Example 2: Another paper reports successfully predicting the outputs of an LLM, based on the schlokking behavior. This counts as a path because it was applied to a real system.

Example 3: Before either 1 or 2, a researcher had an idea and poked around some weights. They thought to themselves "I'm pretty sure I found something weird. Why does it look so schlokky?" This does not count as a path by itself, because it hadn't gone anywhere yet.

This gives us a simple path heuristic: something good enough to get a name, that later research refers to or gets applied to real systems.

Dual Use Accounting

The same rules apply for capability paths, and we count paths on the capability side the same way. Continuing with schlokking:

Example 4: A paper explains how the authors optimized their training procedure to exploit the schlokking behavior to reduce training time. This counts as a capability path, because it was applied to a real system.

Example 5: A paper explains how schlokking is actually pretty suboptimal, and how through deliberate design the authors can get the same behavior faster. This counts as a capability path, because it led to future research.

Consider the Ratio

Since each agenda is likely to produce paths to both alignment and capability, we can look at the ratio of alignment paths to capability paths, which I simply label the path ratio.

Example 6: In the previous examples we count two alignment paths, and also two capabilities paths. Therefore the path ratio of alignment:capability is 2:2, or simplified 1:1.

Between counting the paths and looking at the path ratio, we have a few ways to sort research output:

The highest ratio of alignment:capability is the lowest-risk
The highest number of alignment paths generated is the most productive (likewise for capability)
The highest number of total paths is probably the most insightful
Etc.
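
The sorting criteria above can be made concrete with a small tally. This is only a minimal sketch of the bookkeeping, not a real measurement: the `Agenda` class and the two agendas named `agenda_b` and `agenda_c` are invented for illustration, and only the schlokking counts (two alignment paths, two capability paths) come from the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agenda:
    """Path tallies for one research agenda (hypothetical container)."""
    name: str
    alignment_paths: int
    capability_paths: int

    @property
    def total_paths(self) -> int:
        return self.alignment_paths + self.capability_paths

    @property
    def path_ratio(self) -> float:
        # alignment:capability collapsed to one number. An agenda with no
        # capability paths is treated as having an infinite ratio; a 0:0
        # agenda would also land here, which this sketch does not handle.
        if self.capability_paths == 0:
            return float("inf")
        return self.alignment_paths / self.capability_paths


# Schlokking counts come from Examples 1-5; the other two rows are invented.
agendas = [
    Agenda("schlokking", alignment_paths=2, capability_paths=2),
    Agenda("agenda_b", alignment_paths=3, capability_paths=1),
    Agenda("agenda_c", alignment_paths=1, capability_paths=4),
]

lowest_risk = max(agendas, key=lambda a: a.path_ratio)
most_productive = max(agendas, key=lambda a: a.alignment_paths)
most_insightful = max(agendas, key=lambda a: a.total_paths)

print(lowest_risk.name, most_productive.name, most_insightful.name)
```

With these made-up numbers the three criteria do not all pick the same agenda, which is the point of keeping them separate.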

This is the point where we can actually use the model to engage with the alignment vs capability question. The intuitive answer is naturally to favor the lowest-risk research, which is to say the highest path ratio of alignment:capability; if I read the public conversation right this even extends to forsaking any research agenda that adds any capability paths at all, which is to say the alignment:capability path ratio must be infinite for some people. I think this is wrong.

The Total and the Threshold

I think the strong position against adding capability paths is wrong for two reasons based on this model.

The first reason is that I claim we should be concerned with how many ways there are to make progress on alignment vs ways to make progress on capability overall. In the model that is the total ratio of paths over all the research that has been done to date. This is a comparative measure of how much of an advantage capability currently has. While this could plausibly be made fully quantitative with a sufficiently broad literature review, I lack a broad base in the literature and so I have been using a gut-level Fermi estimate instead.

Example 7: An alignment:capability ratio of 1:1 suggests progress is about equal, which feels patently absurd. So the real question is whether we are looking more like:

1:10, which is to say capability has a 10x advantage.
1:100, which is to say capability has a 100x advantage.
1:1000, which is to say capability has a 1000x advantage.

Of these, 1:100 feels the most right to me. That being said, I will lean on the simplicity of the optimistic estimate of 1:10 for examples.

The second reason is that we have so few paths to success with alignment, in absolute terms. In the trivial case, if the path ratio alignment:capability is 0, which is to say that there are no viable alignment paths at all, then we are simply doomed unless we are completely wrong about the risk from AI in the first place. If the aggregate number of alignment paths is still very small, then the question of whether we are doomed is reduced to the question of whether these few paths succeed or fail, which is the in-model way of saying we still need more alignment research.

These two things interact to let us look at a question like "Should I go into alignment research?" or "Should I pursue my idea for a research agenda?". If the total number of alignment paths is zero or very small, the answer is trivially yes, because otherwise there is no hope, capabilities be damned. If the total number of alignment paths is high enough that we care about controlling risk, then I claim the heuristic threshold for yes is whether it makes the total ratio better.

Example 8: If your total ratio is 1:10, then your answer should be yes if you can be confident you will not accidentally produce more than 10x as many paths to capability as paths to alignment. If the path ratio you wind up with from your research agenda is 1:9, that is still a net gain.
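
The threshold can be written out as a short check. This is a sketch under assumptions: the function name `should_pursue` and the `scarcity_floor` parameter (standing in for "zero or very small") are invented here, and the absolute counts 10 and 100 are placeholders chosen only to give a 1:10 total ratio. The comparison follows from simple algebra: with total counts A:C and agenda counts a:c, the new total (A + a):(C + c) beats A:C exactly when a * C > A * c.

```python
def should_pursue(total_alignment: int, total_capability: int,
                  new_alignment: int, new_capability: int,
                  scarcity_floor: int = 0) -> bool:
    """Heuristic yes/no for a proposed agenda (hypothetical helper)."""
    # With zero or very few alignment paths overall, the answer is
    # trivially yes. scarcity_floor is an assumed cutoff for "very small".
    if total_alignment <= scarcity_floor:
        return True
    # (A + a) / (C + c) > A / C  is equivalent to  a * C > A * c
    # whenever C > 0, so no division is needed.
    return new_alignment * total_capability > total_alignment * new_capability


# Placeholder totals standing in for a 1:10 total ratio.
A, C = 10, 100

print(should_pursue(A, C, new_alignment=1, new_capability=9))   # agenda at 1:9
print(should_pursue(A, C, new_alignment=1, new_capability=10))  # agenda at 1:10
print(should_pursue(A, C, new_alignment=1, new_capability=11))  # agenda at 1:11
```

An agenda that lands at exactly 1:10 leaves the total ratio unchanged, so this sketch treats it as a no; whether a tie should count as yes is a judgment call the model does not settle.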

I notice this is extremely permissive relative to people who express concern about enhancing capabilities. I have not seen a single proposal for an approach to alignment that I thought would provide 10x more ways to help capability. As it stands, my model does not predict that alignment research can't accidentally cause AGI, so much as predict it is almost always worth the gamble.

Refinements

The model is simple unto brutality. I can anticipate several concerns, most of which could probably be addressed.

Doesn't talk about probability. That is deliberate - there needs to be a way of answering "The probability of what, exactly?" and in the current state of conversation about capability vs alignment there doesn't seem to be an answer. This model isn't even really quantitative; all I'm actually doing is using a few numbers to get some rank ordering.
Probability aside, does not address uncertainty. I have mostly been thinking about this in terms of determining results from other agendas before launching your own, which should be answerable. That being said, I have been treating uncertainty as mostly being a range of values in the path ratio, something like [0-3]:[0-5] or similar for simplicity. A distribution of possible outcomes would be better, but I don't have enough information to tell what those would look like other than to just guess Gaussian.
Doesn't talk about time. Time is another thing that I believe needs a model to work on. I have been considering a unit of time in the form of one research cycle, which I feel like lasts approximately from the time a research agenda coalesces to the time that the rest of the community digests its outputs well enough to start trying to use them. This is still just comparative at best - for example, I think it would be a reasonable claim that the research cycle for capabilities is faster than for alignment for various reasons. Conceiving of it this way would effectively count as some multiplier for the number of capability paths generated (which is the converse of how I usually see this kind of thing tracked, where each research path is treated as a small speed multiplier). My suspicion is that the real difference would be between concrete vs theoretical, because running experiments gives you more information, and it also seems like a bigger fraction of capabilities research is concrete. But how much would that change if people weren't avoiding concrete alignment research on the grounds that capability might also benefit?
All paths are not created equal - doesn't talk about quality or impact. This is mostly because I have very little sense of how we would compare them in terms of whatever qualities they might have. A way to increase the granularity one level would be to use a comparative measure again; we could have a path size, which is to say regular paths are worth one path, but especially good ones might be worth two or three regular paths, up to some crucial or paradigm-defining breakthrough value. I slightly prefer a size implementation to a multiple implementation because of how it feels like it would affect transmission. As a practical example, we might consider research picked up by two major AI labs as having twice the path size of a different path picked up by only one.
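
The path size refinement is easy to sketch as a weighted tally. Everything here is illustrative: the `Path` class, the labels, and the choice to give one capability path a size of 2 (as if two major labs had picked it up) are assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One counted path with a comparative size (hypothetical container)."""
    label: str
    kind: str          # "alignment" or "capability"
    size: float = 1.0  # a regular path is worth one path


def weighted_counts(paths):
    """Sum path sizes per kind instead of counting each path as 1."""
    totals = {"alignment": 0.0, "capability": 0.0}
    for p in paths:
        totals[p.kind] += p.size
    return totals


# The schlokking examples, with an invented size of 2 for the path
# imagined to be picked up by two major labs.
paths = [
    Path("follow-up paper on schlokking", "alignment"),
    Path("predicting LLM outputs", "alignment"),
    Path("faster training via schlokking", "capability", size=2.0),
    Path("deliberate design replaces schlokking", "capability"),
]

totals = weighted_counts(paths)
print(totals)
```

Under these assumed sizes the unweighted 2:2 tally becomes 2:3, so sizing can move an agenda across a threshold that plain counting would not.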
