Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Stephen Casper. Thanks to Alex Lintz and Daniel Dewey for feedback.

This is a reply, but not an objection, to a recent post from Paul Christiano titled "AI alignment is distinct from its near term applications." The post is fairly brief, and the key point is decently summed up by this excerpt.

I worry that companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself. If such systems are held up as key successes of alignment, then people who are frustrated with them may end up associating the whole problem of alignment with “making AI systems inoffensive.”

I have no disagreements with this claim. But I would push back against the general notion that AI [existential] safety work is disjoint from near term applications. Paul seems to agree with this.

We can develop and apply alignment techniques to these existing systems. This can help motivate and ground empirical research on alignment, which may end up helping avoid higher-stakes failures like an AI takeover. 

This post argues for strongly emphasizing this point. 

What do I mean by near-term applications? Any challenging problem involving consequential AI and society. Examples include:

  • Self-driving cars
  • Recommender systems
  • Search engines
  • AI weapons
  • Cybersecurity
  • Unemployment
  • Bias/fairness/justice involving systems that work with humans or human data
  • Misuses of media generators including text and image generators

I argue that working on these problems probably matters a lot for three reasons, the second and third of which are potential matters of existential safety.

Non X-risks from AI are still intrinsically important AI safety issues.

There are many important non X-risks in this world, and any altruistic-minded person should care about them. For the same reason we should care about health, wealth, development, and animal welfare, we should also care about making important near-term applications of narrow AI go well for people.  

There are valuable lessons to learn from near-term applications.

Imagine that we figured out ways to make near-term applications of AI go very well. I find it incredibly hard to imagine a world in which we did any of these things without developing a lot of useful technical tools and governance strategies that could be retooled or built on for higher-stakes problems later. Consider some examples.

  • For self-driving cars to go well, we would need to effectively develop and iterate on techniques for robustness, reliability, and anomaly detection/handling. 
  • To make recommender systems go well, we would need to make a lot more progress on efficiently inferring what humans truly want in a way that is disentangled from what they seem to want. 
  • To minimize harms from AI weapons, we would need to introduce a lot of national and international laws and precedent for disincentivizing and responding to deadly AI. 
  • To minimize problems from discriminatory AI or the misuses of media generators (e.g. BLOOM or Stable Diffusion), we would need laws and precedent establishing definitions of harm and methods for recourse. Perhaps most importantly, establishing lots of laws and bureaucracy around AI systems like this may help to establish a legal regime that provides filters and obstacles to the deployment of risky systems. If auditing and slow timelines are good, so is this kind of bureaucracy. 

See also this post

Making allies and growing the AI safety field is useful

AI safety and longtermism (AIS&L) have a lot of critics, and in the past year or so, they seem to have grown in number and profile. Many of them are people who work on and care a lot about near-term applications of AI. To some extent this is inevitable. Having an influential and disruptive agenda will inevitably lead to some pushback from competing ones. Haters are going to hate. Trolls are going to troll.

But AIS&L probably have more detractors than they should from people who should really be allies. Given how many forces in the world are making AI more risky, there shouldn’t be conflict between groups of people who are working on making it go better but in different ways. In what world could isolation and mutual dismissal between AIS&L people and people working on neartermist problems be helpful? There seem to be too many common adversaries and interests between the two groups to not be allies–especially for influencing AI governance. Having more friends and fewer detractors seems like it could only increase the political and research capital of the AIS&L community. There is also virtually no downside to being more popular.

I think that some negative press about AIS&L might be due to active or tacit dismissal of the importance of neartermist work by AIS&L people. Speaking for myself, I have had a number of conversations in the past few months with non-AIS&L people who seem sympathetic but have expressed feeling dismissed by the community, which has made them more hesitant to be involved. For this reason, we might stand to benefit a great deal from less parochialism and more friends.

Paul argues that 

...companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself.

But I think it is empirically, overwhelmingly clear that a much bigger concern when it comes to "backlash against the idea of AI alignment itself" comes from failures of the AIS&L community to engage with more neartermist work. 

Thanks for reading--constructive feedback is welcome. 

Neel Nanda

Non X-risks from AI are still intrinsically important AI safety issues.

I want to push back on this - I think it's true as stated, but that emphasising it can be misleading. 

Concretely, I think that there can be important near-term, non-X-risk AI problems that meet the priority bar to work on. But the standard EA mindset of importance, tractability and neglectedness still applies. And I think often near-term problems are salient and politically charged, in a way that makes these harder to evaluate. 

I think these are most justified on problems with products that are very widely used and without much corporate incentive to fix the issues (recommender system alignment is the most obvious example here)

I broadly agree with and appreciate the rest of this post though! And want to distinguish between "this is not a cause area that I think EAs should push on on the margin" and "this cause area does not matter" - I think work to make systems less deceptive, racist, and otherwise harmful seems pretty great.


No disagreements substance-wise. But I'd add that I think work to avoid scary autonomous weapons is likely at least as important as recommender systems. If this post's reason #1 were the only reason for working on neartermist AI stuff, then it would probably be like a lot of other very worthy but likely not top-tier impactful issues. But I see it as emphasis-worthy icing on the cake given #2 and #3.

Cool, agreed. Maybe my main objection is just that I'd have put it last not first, but this is a nit-pick

I think that some near-future applications of AI alignment are plausible altruistic top priorities. Moreover, even when people disagree with me about prioritization, I think that people who want to use AI to accomplish contemporary objectives are important users. It's good to help them, understand the difficulties they encounter, and so on, both to learn from their experiences and make friends.

So overall I think I agree with the most important claims in this post.

Despite that, I think it's important for me personally (and for ARC) to be clear about what I care about---both because having clear research goals is important for doing good research, and because there's a significant risk of inadvertently misleading people about my priorities and views.

As an example, I've recently been writing about mechanistic anomaly detection. But if people think that I mean the kind of anomaly detection that is most helpful for avoiding self-driving car crashes, I think that would lead to them evaluating my research inappropriately. Many of my technical decisions seem very weird on this perspective, and if I tried to satisfy a reviewer with that perspective I think it would push me in a very unhealthy direction. On the other side, this equivocation might lead some people to be more sympathetic to or excited about my work for reasons I regard as kind of dishonest.

This example is related to your post, but not really to my OP (since in my OP I'm mostly talking about contemporary applications of alignment, which are more closely technically connected to my day-to-day work). I have similar views about the other examples you give in the "valuable lessons" section:

  • I think that inferring subtleties of human values isn't a significant part of my alignment research. Almost all the action I care about is about very simple aspects of our preferences. So if someone evaluates my research through that lens, they are likely to be confused, and if they find it exciting it might be for the wrong reasons.
  • I think that regulations appropriate for managing catastrophic AI risk are extremely different from those appropriate for autonomous weapons, or for managing harms from discrimination. Maybe some repurposing is possible, but at this point I am personally much more interested in directly contributing to the kind of governance that is more directly applicable.

This isn't to say that I don't think such work is meaningful or interesting, or that I don't want to be friends with people who do it. Just that it's not what I'm doing, and it's worth being clear about that so that my work is being evaluated in an appropriate way.

The other extreme would be deliberately muddying the waters about distinctions between different technical problems, and while I don't think anyone is necessarily advocating for that I do think it would be very unwise.


Hi Paul, thanks. Nice reading this reply. I like your points here.

Some of what I say here might reveal a lack of keeping up well with ARC. But as someone who works primarily on interpretability, the thought of mechanistic anomaly detection techniques that are not useful for use in today's vision or language models seems surprising to me. Is there anything you can point me to to help me understand why an interpretability/anomaly detection tool that's useful for ASI or something might not be useful for cars?

We are mostly thinking about interpretability and anomaly detection designed to resolve two problems (see here):

  • Maybe the AI thinks about the world in a wildly different way than humans and translates into human concepts by asking "what would a human say?" instead of "what is actually true?" This leads to bad generalization when we consider cases where the AI system plans to achieve a goal and has the option of permanently fooling humans. But that problem is very unlikely to be serious for self-driving cars, because we can acquire ground truth data for the relevant queries. On top of that it just doesn't seem they will think about physical reality in such an alien way.
  • Maybe an AI system explicitly understands that it is being evaluated, and then behaves differently if it later comes to believe that it is free to act arbitrarily in the world.

We do hope that these are just special cases and that our methods will resolve a broader set of problems, but these two special cases loom really large. On the other hand, it's much less clear whether realistic failures for self-driving cars will involve the kind of mechanism distinctions we are trying to detect.

Also: there are just way more pressing problems for self-driving cars. And on top of all that, we are taking a very theoretical approach precisely because we are worried it may be difficult to study these problems until a future time very close to when they become catastrophic.

Overall I think that if someone looked at our research agenda and viewed it as an attempt to respond to reliability failures in existing models, the correct reaction should be more like "Why are they doing all of this instead of just empirically investigating which failures are most important for self-driving cars and then thinking about how to address those?" There's still a case for doing more fundamental theoretical research even if you are interested in more prosaic reliability failures, but (i) it's qualitatively much worse and I don't really believe it, (ii) this isn't what such research should look like. So I think it's pretty bad if someone is evaluating us from that perspective.

(In contrast I think this is a more plausible framing for e.g. work on adversarial evaluation and training. It might still lead an evaluator astray, but at least it's a very plausible research direction to focus on for prosaic reliability as well as being something we might want to apply to future systems.)



We do hope that these are just special cases and that our methods will resolve a broader set of problems.

I hope so too. And I would expect this to be the case for good solutions. 

Whether they are based on mechanistic interpretability, probing, other interpretability tools, adversaries, relaxed adversaries, or red-teaming, I would expect methods that are good at detecting goal misgeneralization or deceptive alignment to also be useful for self-driving cars and other issues. At the end of the day, any misaligned model will have a bug -- some set of environments or inputs that will make it do bad things. So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others. 

So I'm inclined to underline the key point of my original post. I want to emphasize the value of (1) engaging more with the rest of the community that doesn't identify themselves as "AI Safety" researchers and (2) being clear that we care about alignment for all of the right reasons. That said, this should be discussed with the appropriate amount of clarity, which was your original point.

So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.

One example: A tool that is designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example, see CCS-style work), doesn't seem like it would be useful for improving the reliability of self-driving cars. Self-driving cars likely aren't misaligned in the sense that they could drive perfectly safely but choose not to; rather, they are just unable to drive perfectly safely because some of their internal (learned) systems aren't sufficiently robust.


Thanks for the comment. I'm inclined to disagree though. The application was for detecting deceptiveness. But the method was a form of contrastive probing.  And one could use a similar approach for significantly different applications. 

This feels kind of like a semantic disagreement to me. To ground it, it's probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cars (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness, possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I'm uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.


I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I'm all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed. 

Not Paul, but some possibilities why ARC's work wouldn't be relevant for self-driving cars:

  • The stuff Paul said about them aiming at understanding quite simple human values (don't kill us all, maintain our decision-making power) rather than subtle things. It's likely for self-driving cars we're more concerned with high reliability and hence would need to be quite specific. E.g., maybe ARC's approach could discern whether a car understands whether it's driving on the road or not (seems like a fairly simple concept), but not whether it's driving in a riskier way than humans in specific scenarios.
  • One of the problems that I think ARC is worried about is ontology identification, which seems like a meaningfully different problem for sub-human systems (whose ontologies are worse than ours, so in theory could be injected into ours) than for human-level or super-human systems (where that may not hold). Hence focusing on the super-human case would look weird and possibly not helpful for the subhuman case, although it would be great if they could solve all the cases in full generality.
  • Maybe once it works ARC's approach could inform empirical work which helps with self-driving cars, but if you were focused on actually doing the thing for cars you'd just aim directly at that, whereas ARC's approach would be a very roundabout and needlessly complex and theoretical way of solving the problem (this may or may not actually be the case, maybe solving this for self-driving cars is actually fundamentally difficult in the same way as for ASI, but it seems less likely).

Minor suggestion: I would remove the caps from the title. Reason: I saw this linked below Christiano's post, and my snap reaction was that the post is [angry knee-jerk response to someone you disagree with] rather than [thoughtful discussion and disagreement]. Only after introspection did I read this post.

Thanks--noted. I see your point -- I definitely don't want people to think this is an angry response, especially since I explicitly agree with Paul in the post. But since this post has been up for a bit, including on the EA Forum, I'll shy away from changes.

I think the hard part here is that I do care about the near-term risks you mention and think people should work on them (as they are). However, I think the concern is that:

  • If the distinction isn’t clear, investments from researchers, funders and government can end up leaning way too much into things that seem like they are helpful for alignment, but are totally missing the core. Then, we get a bunch of “safety” work which seems to be tackling “alignment”, but very little tackling the core of alignment (and every time we invent a new word to point to what we mean, it gets hijacked).
  • In practice, I think quite a few have tried to elevate the concern for AI x-risk without minimizing the near-term/ethics side of things, but the conversation always ends up toxic and counterproductive. For example, I applaud Miles Brundage’s efforts on Twitter to try to improve the conversation, but he’s getting vicious bad faith comments thrown at him even when he’s the nicest dude ever. I still don’t want to give up on this side of things, but just want to point out that it’s not like nobody has tried.

Overall, I think this is still an important conversation to have and I think it isn’t obvious what we should do.

+1 to your point that while the standard for making allies in other research disciplines should be low, the standards for what we fund should be high. But I suspect I might have a somewhat wider vision of an ideal AIS&L community than you do. I don't think I would characterize many examples of AIS-inspired neartermist work that I can think of as hijacking.

To your second point, FWIW, I personally try to ignore trolls as much as possible and do my best to emphasize that there are a lot of neartermist and longtermist reasons to care about almost all AI alignment work.

Without enough focus or at least collaboration with shorttermism there will be an existential risk. Not from AI, from humanity. I would not be surprised if shit popped off long before AI even got a chance to kill us