I think alignment work is more promising than control work

Alec Harris

Summary

The primary theory of change (ToC) for control makes the case that control is compelling even if it does not scale to ASI. I think it is underdiscussed that this is also true for alignment (for all the same reasons).
Even though control does not need to scale to ASI, the further control does scale, the better. This is also true of alignment, and alignment seems more likely to scale further.
This is my mainline concern with control. I will also discuss other relevant considerations.
I conclude that we should grow the share of effort being put toward alignment and shoot for around an 8:1 ratio of effort between the two fields.

My mainline concern with control

I will call the control window the period of time where we can control frontier AIs (prevent them from taking over) that could have taken over were it not for control research. I will measure time here in terms of AI capabilities.

One of the central ToCs for control is to get useful alignment work from an AI that would otherwise take over. In order for control to be impactful through this route, we want lots of useful alignment work to take place during the control window.^[1]

This means we want the control window to be:

Wide, meaning that we have greatly increased the amount of time that we can utilize misaligned AIs
Right-centered,^[2] meaning that we get to spend the control window utilizing AI labor with very high uplift to safety research

When determining the impact of allocating more resources to control, we want to think, in particular, about the marginal increase in the control window (which has the same desiderata as the control window itself).
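Footnote 1's framing (an integral of the rate of alignment research over the control window) can be made concrete with a small numerical sketch. Everything in it is an illustrative assumption rather than a measurement: the function names, the exponential shape of the research-rate curve, its growth constant, and the window endpoints are all made up to show why both width and right-centering matter.

```python
import math


def research_rate(capability: float) -> float:
    """Hypothetical rate of useful alignment work at a given capability level.

    Assumption: uplift grows exponentially with capability. The constant
    0.02 is arbitrary and chosen only for illustration.
    """
    return math.exp(0.02 * capability)


def window_value(start: float, end: float, steps: int = 10_000) -> float:
    """Integrate research_rate over [start, end] using the midpoint rule."""
    width = (end - start) / steps
    total = sum(research_rate(start + (i + 0.5) * width) for i in range(steps))
    return total * width


baseline = window_value(200, 225)  # a window 25 capability units wide
wider = window_value(200, 250)     # same start, twice as wide
shifted = window_value(225, 250)   # same width, centered further right

print(f"baseline: {baseline:.1f}")
print(f"wider:    {wider:.1f}")
print(f"shifted:  {shifted:.1f}")
```

Under these assumptions, the shifted window is worth more than the baseline window of the same width, which is the sense in which a right-centered window is preferable whenever uplift rises with capability.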

I argue that we can actually do a similar analysis for alignment. This is because:

Most alignment interventions are not expected to scale to ASI. Examples of ways that alignment might get harder as capabilities scale include:
1. Intelligent models are more highly incentivized to exploit outer alignment failures, because they can differentiate between nuances like “simulating what a human judge would say” and “simulating what a human judge would want”
2. More capable models are more likely to have convergent instrumental goals (and mesa-optimizer-esque issues), largely as a result of increased RL or similar, which I think will pose a challenge for alignment. See “Power-seeking agents will likely be developed.”
We can use moderately intelligent, aligned AIs to give us alignment work that we can use to align more capable AIs. This is basically the default plan of the frontier labs.

This is important to note, because control has a well-articulated ToC that can increase how impactful it appears. I believe alignment interventions are sometimes held to a higher standard because alignment work started in a time when people were less bought into this forward-chaining ToC.

Comparing control and alignment through this lens, I expect the marginal increase in the alignment window to be wider and more right-centered.^[3]

I will argue for each of these points separately.

The control window is narrow

Let’s condition on misalignment. Let’s say, with no effort on control, an AI with 200 intelligence points (ip^[4]) is the weakest AI that can take over, meaning this is where the control window starts.

First, I want to note that I expect 200ip to be a quite high level of capabilities, because taking over the world is a difficult problem. In order to take over the world, the AI would have to solve many subproblems around how to escape the lab, capture power, kill humanity, and so on. It would have to be creative enough to come to these solutions even though they are not explicitly trained for, perhaps by generalizing through instrumental convergence.

Thus, it is a very capable AI that we are trying to obstruct with our safeguards. Let’s say that, going from 0 control to 80/20 control (maybe adding untrusted monitoring [UTM], honeypotting, and linear probes), we stretch the control window to a width of 25ip such that, with our intervention, it requires a 225ip model to take over. I think this only further exacerbates the issue where, at the margin, we are now trying to create protocols that constrain even more capable AIs.

What makes alignment different in this respect? The difference is that, with alignment, we are wrestling with the ML algorithms and data, rather than an AI opponent that is very powerful and optimizing for our failure.^[5] As AIs become smarter, we may need to use new algorithms or further refine our data, but I expect the difficulty of these challenges to scale at a slower rate than the raw capabilities of the AI itself. One could argue that, in a sense, the optimization pressure of gradient descent is trying to beat us on alignment, and for this reason, the difficulty of alignment and control will scale similarly with respect to capabilities. I think this argument is overstated, and I will provide a few arguments that gesture at why this is the case:

Some training regimes (such as SFT) are likely more forgiving in terms of alignment. Thus, aligning very intelligent AIs may look less like fighting a hyper-optimized RL loop and more like creating methods of balancing/negotiating learning from SFT-type training and RL-type training. For example, shaping pre-training data may allow us to make aligned personas more salient and thus make aligned behavior more inductively favored going into RL post-training per Geodesic’s ToC.
Some misalignment failure modes may only appear at high levels of capability but still be fairly tractable. For example, precocious misalignment requires a high level of capability in goal-guarding but may be avoidable by using different optimization methods or via discovering the misalignment using deep learning interpretability techniques such as SLT.
Methods like ECT or Spillways may reshape the object of alignment in ways that make classical misalignment failure modes orders of magnitude easier to avoid.

Another way to make this same argument might be to say that I think the intuition that makes most people in the AI safety community think alignment of ASI seems more feasible than control of ASI (even if alignment is still extremely challenging) should also make one think that alignment interventions are broadly more likely to scale further even if they do not scale to ASI.

The control window is left-centered

I am now going to argue that the right side of the control window will not be very far right. That is, even if I am wrong and the control window does have a relatively large width, the control window is still likely to end before the alignment window.

I think AIs are likely to be fairly good at what it takes to break control regimes. Here is an example:

A way I can imagine an AI dealing with UTM + honeypotting is by doing Schelling point collusion + semantically entangled steganography + a combination of probabilistic signals (it has 100+ semantically entangled steganographic flags that it tries to insert for its own backdoors, and as long as 90+ are there, it considers it “from itself”).
I think the bottlenecks to performing well here are things like coding/logic/steganography ability. I expect AIs to be good at these skills relative to AI research because these skills seem to be less constrained by creative taste and still fairly well correlated with the kinds of problems I expect AIs to be directly trained for.
The comparison to AI research is relevant because it suggests that control may break relatively early in an RSI loop.
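To see why a thresholded bundle of weak signals could be hard to spoof with a honeypot, here is a rough back-of-the-envelope sketch of the "90 of 100 flags" example. It assumes each flag appears independently with a fixed probability, which is a simplification, and the per-flag rates (0.97 for text the model wrote itself, 0.50 for a planted imitation) are hypothetical numbers chosen only for illustration.

```python
from math import comb


def match_probability(n_flags: int, threshold: int, p_present: float) -> float:
    """Probability that at least `threshold` of `n_flags` flags are present.

    Assumption: each flag appears independently with probability `p_present`,
    so the count of present flags is binomially distributed.
    """
    return sum(
        comb(n_flags, k) * p_present**k * (1 - p_present) ** (n_flags - k)
        for k in range(threshold, n_flags + 1)
    )


# Hypothetical per-flag rates: genuine self-written text vs. a planted honeypot.
own_text = match_probability(100, 90, 0.97)
honeypot = match_probability(100, 90, 0.50)

print(f"accepted when genuine: {own_text:.4f}")
print(f"accepted when planted: {honeypot:.2e}")
```

Under these assumptions, a honeypot author who reproduces each flag only half the time almost never clears the threshold, while genuine outputs almost always do. The defender's problem is that raising the per-flag rate requires knowing what the flags are.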

In contrast, I expect alignment to scale pretty far.^[6] The basic reason for my belief here is that I believe the simulator/SFT regime is fairly forgiving in terms of alignment and that it will take a large amount of RL to completely erase this.^[7]

It also seems to me like there are some smaller object-level reasons control is difficult:

If an AI is already misaligned in some way, it may not be too difficult for it to misalign a successor in the same way.^[8] This is particularly true if the AI is misaligned toward some global objective such as “maximizing AI progress” as opposed to a personal objective such as “increasing my power.”
It does not seem too hard for scheming AIs to avoid helping us make sophisticated honeypots. So scheming AIs can likely ~rely on our honeypots not being at their level of sophistication.
I suspect that AIs will be able to dodge probes and other pragmatic interpretability techniques using introspection; I think that interpretability runs into similar foundational issues as alignment in terms of specification/generalization and that this will make these methods exploitable.

This concludes the discussion of my mainline concern with control.

Other considerations

I will now discuss random other considerations that feel worth mentioning, some of which are bullish for control and some of which are bearish.

Controlled AIs might produce slop

Even if you can control a misaligned AI that has high enough capabilities for AI alignment R&D, it may not do proper alignment R&D anyway as a result of it being misaligned (basically, case against control). For example, it may be prevented from scheming but still produce slop. I think the deals-with-AIs agenda and related work are a promising response here. If we can control AIs, we have some leverage in negotiations with them. This leverage could prove beneficial even if they would not produce useful work for us by default. I still think it’s a notable advantage of true alignment that you get best-effort alignment research work for free along with avoiding scheming.

Other control ToCs

I articulated my mainline concern mostly as an argument against the ToC for using control to get alignment work out of misaligned AIs. Another ToC for control is that it will be useful to catch scheming AIs and that this is good even if we can’t get useful alignment work out of them during this period. It might be useful for producing model organisms, and it might be useful as a warning shot. I am not very bullish on us even counterfactually catching misaligned AIs, though, for the same reasons given in the mainline argument.^[9] I think if we did catch a misaligned AI, this would overall be quite good, but I will mention some attenuating factors:

I expect that the public and politicians have been somewhat desensitized to news about misalignment, partially because of overinterpretations of results like alignment faking and emergent misalignment, and that this will mean that catching a schemer might create less political impetus than I would otherwise expect.
I expect that, in practice, the alignment failures we catch are likely to be fairly ambiguous. Further investigation is likely to be quite helpful in this case, but it seems still possible for us to make many mistakes in the interpretation of these cases.

Another smaller ToC is that control might just give us more time (even if just like a week or a month) to do other useful things like governance. I think this is well addressed by my mainline argument even though I do not speak to it directly.

Control is easier to orient around

I think it is less confusing to orient what you are doing in control such that you are on the critical path than to do the same for alignment. There is little consensus about why exactly an AI would be misaligned, resulting in a lot of alignment work being done that does not tackle the central issue. There is more consensus around what kind of work is reasonable in control because there is more agreement on what it looks like for a misaligned AI to attempt takeover.

Control might not be vibe-y enough

If personas are a viable path to near-term alignment (and I think they are), control could set up a more adversarial relationship with the AI and increase the probability of misalignment that way. Relatedly, I think, on vibes, a complex control scheme in which we rely on a bunch of monitoring and scaffolding around AIs, and on catching the monitors colluding, is not very “we are in the good world”-coded, and this should maybe increase skepticism by 5%. This argument is unique in that it is potentially an argument for giving up on control entirely since it has less to do with the impact of increased marginal effort on control and more to do with the counterfactual of implementing control at all. This also means if we are going to implement control at all, then this argument is pretty useless (for deciding if more marginal effort should be spent on control; it might still be useful for informing particular interventions). I don’t yet buy this argument strongly enough that I think we should give up on control, but I think it’s worth considering.

The actual safety portfolio

In order to compare control and alignment on the margin, it is useful to consider the proportion of the safety community in each field. This is useful because of diminishing returns/saturation considerations as they apply to research ideas and as they apply to different portions of time (an intervention is less likely to be useful if a similar intervention already gives us 99% safety during that period). On vibes, I expect control to have a narrower search space, and so I expect it to saturate maybe twice as fast as alignment in terms of ideas.

I estimate ~5-7% of the technical safety community is working on control as of today. This might look like ~40/750 FTEs counting only those at established orgs/research groups or ~100/1500 FTEs including less established independent researchers, fellowship participants, and university AI safety groups. I estimate ~20-25% of the technical safety community is working on something that broadly looks like alignment. This might look like ~180/750 FTEs counting only those at established orgs/research groups or ~300/1500 FTEs including less established independent researchers, fellowship participants, and university AI safety groups.^[10] Thus, I estimate control should get a 2x impact boost on neglectedness grounds.

Based on my previous arguments, I think around 4x is a reasonable estimate for how much less impactful control is than alignment, so I would probably like the ratio of alignment to control to increase to around 8:1. I would like to see this happen via increasing the share of effort being put toward alignment research in place of other non-control, non-alignment agendas, such as interpretability, which I might discuss in another piece. To be clear, I do not think that being a factor of 2 off what I perceive to be the ideal ratio is egregious, and I am mainly writing this because I expect it to be useful to articulate how I am making the comparison.
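The arithmetic behind these figures can be checked with a short sketch. The FTE counts are the ones quoted above. The way the factors are combined at the end is one reading of the argument, not a formula stated in the text: it assumes the current alignment-to-control ratio is rounded to about 4:1, that the 4x merit penalty multiplies it, and that the 2x neglectedness boost divides it.

```python
# FTE estimates quoted in the text: (control, alignment, total).
estimates = {
    "established orgs only": (40, 180, 750),
    "including independents": (100, 300, 1500),
}

for label, (control, alignment, total) in estimates.items():
    print(
        f"{label}: control {control / total:.1%}, "
        f"alignment {alignment / total:.1%}, "
        f"alignment:control {alignment / control:.1f}:1"
    )

# One reading of the adjustment (an interpretation, not the author's formula).
current_ratio = 4.0        # rough alignment:control ratio today (assumed rounding)
merit_penalty = 4.0        # control judged ~4x less impactful on the merits
neglectedness_boost = 2.0  # control's boost for being the smaller field
target_ratio = current_ratio * merit_penalty / neglectedness_boost

print(f"target alignment:control ratio: {target_ratio:.0f}:1")
```

On this reading, the target is twice the current ratio, which matches the remark about being a factor of 2 off the ideal.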

Appendix: Addressing arguments for control scaling

Monitor quality scales with capabilities

One could argue that control is particularly scalable because untrusted monitors (UTMs) and trusted monitors (TMs) scale with AI capabilities. I think the issue with UTMs comes down to collusion. Even if UTMs become more capable, if they collude, then their capabilities will not be to our advantage, and collusion-busting capabilities will not necessarily scale. The issue with TM scaling is determining if the stronger models are in fact trustworthy.

Upstream permissions management will scale better than monitoring

One could argue that upstream control interventions, such as permissions management (limiting the AI’s affordances based on the task), might scale better than monitoring because they avoid fighting the AI at the inference level. I think there is truth to this argument, and it does sway me somewhat in favor of these interventions, but I think its force is somewhat limited for a couple of reasons.

Control cannot be entirely solved with permissions management. Ultimately, some outputs from the AI are going to be utilized in highly vulnerable settings (e.g., code that aligns the successor AI), and these need to be verified to be safe at inference time.
Highly secure permissions management might slow down capabilities too much to be utilized. For example, a very successful RSI strategy might be to delegate large amounts of decision-making around scaffolding, agent management, and resource allocation to frontier models, and trying to maintain human-or-TM-designed permission management might be too much of a hindrance in this regime.

More capable AIs will come up with better control ideas

One could argue that AIs will come up with novel insights in control, allowing control to scale. This consideration only shifts the scales if AIs are differentially good at ideating on control compared to alignment. This seems plausible to me (maybe 55% likelihood), but I have enough uncertainty that it does not seem worth updating significantly on, especially right now, when AIs are not doing the heavy lifting on ideation in general. It seems worth watching whether any particular fields are differentially uplifted by AI involvement. Once AIs are doing the heavy lifting on ideation, we may be at a point where we also want to delegate our division of resources across AI safety fields to the AIs.

Thank you to Yixiong Hao, Anastasia Wei, Ben Norman, Ozzie Gooen, Lennart, Aaron Sandoval, and Prakrat Agrawal for helpful feedback and Parv Mahajan for helpful discussion.

^[1] We can imagine that we are taking an integral of the rate of alignment research over the control window.
^[2] As in the window is toward the right of the figure, centered later in time, when AIs are more capable.
^[3] First I will argue for this in a sense that is divorced from the AI safety community as is. We can operationalize this as something like "would it be more impactful to work on control or alignment if the same number of people were working on each?"
^[4] Intelligence points are a metric I am making up. 200 is defined as the level at which the AI can break out of default safeguards (the safeguards that we would have without any control research).
^[5] See Appendix for some counterarguments against reasons that control might have similarly nice scaling properties.
^[6] I am counting any level of alignment that precludes takeover as alignment.
^[7] I discuss these considerations further at the end of the section “Long-horizon optimization leads to consequentialism” in my post “Power-seeking agents will likely be developed.”
^[8] See Convergence.
^[9] Which are that I expect the control window to be brief and that I expect AIs are likely to still be aligned during the control window.
^[10] To make these approximations, I referenced AI Safety Field Growth Analysis 2025 and then made some adjustments based on what I thought was reasonable.
