I have a simple model of the alignment vs. capabilities question. I am writing it down because after chewing on Michael Nielsen's post about existential risk from AI, I thought it was great but was unsatisfied with the idea of an alignment dilemma. I wasn't the only person to feel that way, but neither was I able to find any satisfactory source describing the (tension? dichotomy? tradeoff?). The real rub of Nielsen's dilemma for me is that it was expressed in terms of accelerationism, which is a form of one-dialism, and one-dialism is wrong.
I note there is nothing actually new here; I am quite sure these considerations were all covered, and more besides, back when MIRI was getting started. Yet we weren't speaking in terms of alignment vs capability then, and capabilities were a less urgent concern, and regardless of what we have said before here, people aren't talking about models in public now. So a simple model it is.
The way I understand the alignment vs capabilities question is that capabilities research eventually cashes out as AGI, and alignment research cashes out as aligned AGI, so in this context alignment research is effectively a subset of capabilities research. I expect this to apply more or less continuously, where increasingly-capable-but-not-AGIs will still need to be aligned.
What I want to accomplish with the model is to have something we can use to make heuristic judgments about research: in particular things like whether to support, fund, or pursue it, and how to make comparisons between different research agendas. This is not for doing things like predicting timelines directly, or the course of AI research overall, or the future of a given research org.
My chief objection to thinking and speaking in terms of acceleration is that it is causally useless. I make an analogy to GDP as an economic metric: if a particular industry is growing very fast, and we want that industry to succeed but also to avoid catastrophic industrial accidents, one scalar number that refers to a big pile of all the activity adds nothing. We cannot build safer industrial facilities by saying "but GDP should go up less." The same holds when talking to investors or to regulators. Acceleration is purely a rhetorical device, which is why we should be unsurprised to see it appear in the public conversation, and also why it should be treated with caution.
Think Paths to Success Instead
I prefer to think in terms of paths to success, in the sense (but not rigorous definition) of paths through phase space. Here we are thinking about paths through the phase space of the world, specifically those which lead to AGI and the smaller subspace of aligned AGI. Further, we are just thinking about the research, as distinct from thinking about orgs, number of researchers, funding, etc.
I frequently recall ET Jaynes's paper Macroscopic Prediction, and in particular this bit:
Gibbs' variational principle is, therefore, so simple in rationale that one almost hesitates to utter such a triviality; it says "predict that final state that can be realized by Nature in the greatest number of ways, while agreeing with your macroscopic information."
In our context, something becomes more likely the more paths there are to accomplishing it.
So how do we count the paths?
Count the Paths
We want to identify the actionable parts of research. For our purposes, "leads to future research" and "applicable to real systems" both count as actionable. I think directly about angles of attack in the sense of Hamming's important research problem criterion, where what Hamming is talking about is "a way to make progress on the problem."
What Hamming calls angles of attack are what I think an alignment research agenda produces: ways to make progress on the problem of alignment (likewise for major AI labs: ways to make progress on the problem of capability). Mechanistic Interpretability is a research agenda. Suppose its researchers discover a new behavior they call schlokking.
Example 1: Another paper expands on the schlokking behavior. This counts as a path because it led to future research.
Example 2: Another paper reports successfully predicting the outputs of an LLM, based on the schlokking behavior. This counts as a path because it was applied to a real system.
Example 3: Before either 1 or 2, a researcher had an idea and poked around some weights. They thought to themselves "I'm pretty sure I found something weird. Why does it look so schlokky?" This does not count as a path by itself, because it hadn't gone anywhere yet.
This gives us a simple path heuristic: something good enough to get a name, that later research refers to or gets applied to real systems.
Dual Use Accounting
The same rules apply for capability paths, and we count paths on the capability side the same way. Continuing with schlokking:
Example 4: A paper explains how they optimized their training procedure to exploit the schlokking behavior to reduce training time. This counts as a capability path, as applied to a real system.
Example 5: A paper explains how schlokking is actually pretty suboptimal, and through deliberate design they can get the same behavior faster. This counts as a capability path, as it led to future research.
Consider the Ratio
Since each agenda is likely to produce paths to both alignment and capability, we can look at the ratio of alignment paths to capability paths, which I simply label the path ratio.
Example 6: In the previous examples we count two alignment paths, and also two capabilities paths. Therefore the path ratio of alignment:capability is 2:2, or simplified 1:1.
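As a rough illustration of the bookkeeping in Examples 1 through 6, here is a minimal Python sketch. The `Result` class, its field names, and the `path_ratio` and `simplify` helpers are hypothetical names made up for illustration; the only rule encoded is the path heuristic described above (a named result counts as a path if it led to future research or was applied to a real system).

```python
from dataclasses import dataclass
from math import gcd


@dataclass(frozen=True)
class Result:
    """One named research result (hypothetical record type)."""
    name: str
    side: str                 # "alignment" or "capability"
    led_to_research: bool     # did later research build on it?
    applied_to_system: bool   # was it applied to a real system?

    def is_path(self) -> bool:
        # The path heuristic: actionable in at least one of the two senses.
        return self.led_to_research or self.applied_to_system


def path_ratio(results):
    """Return (alignment_paths, capability_paths) as raw counts."""
    alignment = sum(1 for r in results if r.side == "alignment" and r.is_path())
    capability = sum(1 for r in results if r.side == "capability" and r.is_path())
    return alignment, capability


def simplify(alignment, capability):
    """Reduce a ratio to lowest terms; (0, 0) is returned unchanged."""
    g = gcd(alignment, capability)
    return (alignment // g, capability // g) if g else (alignment, capability)


# The schlokking examples, encoded under the assumptions above.
EXAMPLES = [
    Result("Example 1: follow-up paper", "alignment", True, False),
    Result("Example 2: predicts LLM outputs", "alignment", False, True),
    Result("Example 3: unpublished hunch", "alignment", False, False),
    Result("Example 4: faster training", "capability", False, True),
    Result("Example 5: deliberate redesign", "capability", True, False),
]

counts = path_ratio(EXAMPLES)
reduced = simplify(*counts)
print(f"alignment:capability = {counts[0]}:{counts[1]}, simplified {reduced[0]}:{reduced[1]}")
```

Example 3 is in the list but contributes nothing to either count, since it fails the heuristic on both criteria.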
Between counting the paths and looking at the path ratio, we have a few ways to sort research output:
At this point we can actually use the model to engage with the alignment vs capability question. The intuitive answer is naturally to favor the lowest-risk research, which is to say the highest path ratio of alignment:capability. If I read the public conversation right, this even extends to forsaking any research agenda that adds any capability paths at all, which is to say the alignment:capability path ratio must be infinite for some people. I think this is wrong.
The Total and the Threshold
I think the strong position against adding capability paths is wrong for two reasons based on this model.
The first reason is that I claim we should be concerned with how many ways there are to make progress on alignment vs ways to make progress on capability overall. In the model that is the total ratio of paths over all the research that has been done to date. This is a comparative measure that is proportional to how much of an advantage capability has currently. While this could plausibly be made fully quantitative with a sufficiently broad literature review, I lack a broad base in the literature and so I have been using a gut-level fermi estimate instead.
Example 7: An alignment:capability ratio of 1:1 suggests progress is about equal, which feels patently absurd. So the real question is whether we are looking more like:
Of these, 1:100 feels the most right to me. That being said, I will lean on the simplicity of the optimistic estimate of 1:10 for examples.
The second reason is that we have so few paths to success with alignment, in absolute terms. In the trivial case, if the path ratio alignment:capability is 0, which is to say that there are no viable alignment paths at all, then we are simply doomed unless we are completely wrong about the risk from AI in the first place. If the aggregate number of alignment paths is still very small, then the question of whether we are doomed is reduced to the question of whether these few paths succeed or fail, which is the in-model way of saying we still need more alignment research.
These two things interact to let us look at a question like "Should I go into alignment research?" or "Should I pursue my idea for a research agenda?". If the total number of alignment paths is zero or very small, the answer is trivially yes, because otherwise there is no hope, capabilities be damned. If the total number of alignment paths is high enough that we care about controlling risk, then I claim the heuristic threshold for yes is whether it makes the total ratio better.
Example 8: If your total ratio is 1:10, then your answer should be yes if you can be confident you will not accidentally produce more than 10x as many paths to capability as paths to alignment. If the path ratio you wind up with from your research agenda is 1:9, that is still a net gain.
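The threshold in Example 8 can be checked with a little arithmetic. The sketch below is one way to write it down, not a definitive rule: `should_pursue` and the `few_paths_cutoff` parameter are hypothetical names, and the absolute totals of 10 and 100 are assumed purely to stand in for a 1:10 total ratio. The comparison cross-multiplies instead of dividing, so it stays in integers.

```python
def should_pursue(total_alignment, total_capability,
                  new_alignment, new_capability,
                  few_paths_cutoff=0):
    """Heuristic yes/no for a research agenda (illustrative sketch).

    few_paths_cutoff is a hypothetical knob for "zero or very small":
    at or below it, the answer is yes regardless of capability paths.
    """
    if total_alignment <= few_paths_cutoff:
        return True
    # Is (A + a) / (C + c) > A / C ?  Cross-multiplied to avoid division.
    after = (total_alignment + new_alignment) * total_capability
    before = total_alignment * (total_capability + new_capability)
    return after > before


# Assumed totals of 10 alignment paths to 100 capability paths (1:10).
print(should_pursue(10, 100, 1, 9))    # agenda at 1:9
print(should_pursue(10, 100, 1, 10))   # agenda at 1:10
print(should_pursue(10, 100, 1, 11))   # agenda at 1:11
```

Under these assumed totals, an agenda at 1:9 moves the total from 10:100 to 11:109, which is slightly better than 1:10. An agenda at exactly 1:10 leaves the total ratio unchanged, so the strict comparison treats it as not an improvement.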
I notice this is extremely permissive relative to people who express concern about enhancing capabilities. I have not seen a single proposal for an approach to alignment that I thought would provide 10x more ways to help capability. As it stands, my model does not predict that alignment research can't accidentally cause AGI, so much as predict it is almost always worth the gamble.
The model is simple unto brutality. I can anticipate several concerns, most of which could probably be addressed.