Note: this has been in my draft queue since well before the FLI letter and TIME article. It was sequenced behind the interpretability post, and delayed by travel, and I'm just plowing ahead and posting it without any edits to acknowledge ongoing discussion aside from this note.

This is an occasional reminder that I think pushing the frontier of AI capabilities in the current paradigm is highly anti-social, and contributes significantly in expectation to the destruction of everything I know and love. To all doing that (directly and purposefully for its own sake, rather than as a mournful negative externality to alignment research): I request you stop.

(There's plenty of other similarly fun things you can do instead! Like trying to figure out how the heck modern AI systems work as well as they do, preferably with a cross-organization network of people who commit not to using their insights to push the capabilities frontier before they understand what the hell they're doing!)

(I reiterate that this is not a request to stop indefinitely; I think building AGI eventually is imperative; I just think literally every human will be killed at once if we build AGI before we understand what the hell we're doing.)


I am currently job hunting, trying to get a job in AI Safety but it seems to be quite difficult especially outside of the US, so I am not sure if I will be able to do it. 

If I do not land a safety job, one of the obvious options is to try to get hired by an AI company and learn more there, in the hope that I will either be able to contribute to safety there or eventually move to the field as a more experienced engineer.

I am conscious of why pushing capabilities could be bad so I will try to avoid it, but I am not sure how far it extends. I understand that being a Research Scientist at OpenAI working on GPT-5 is definitely pushing capabilities, but what about doing frontend at OpenAI or building infrastructure at some strong but not leading (and hopefully a bit more safety-oriented) company such as Cohere? Or let's say working in a hedge fund which invests in AI? Or working in a generative AI company which doesn't build in-house models but generates profit for OpenAI? Or working as an engineer at Google on non-AI stuff?

I do not currently see myself as an independent researcher or AI safety lab founder, so I will definitely need to find a job. And nowadays too many things seem to touch AI one way or the other, so I am curious if anybody has an idea about how I could evaluate career opportunities.

Or am I taking it too far and the post simply says "Don't do dangerous research"?

My answer is "work on applications of existing AI, not the frontier". Advancing the frontier is the dangerous part, not using the state-of-the-art to make products.

But also, don't do frontend or infra for a company that's advancing capabilities.

I am currently job hunting, trying to get a job in AI Safety but it seems to be quite difficult especially outside of the US, so I am not sure if I will be able to do it.

This has to be taken as a sign that AI alignment research is funding constrained. At a minimum, technical alignment organizations should engage in massive labor hoarding to prevent the talent from going into capabilities research.

This feels game-theoretically pretty bad to me, and not only abstractly, but I expect concretely that setting up this incentive will cause a bunch of people to attempt to go into capabilities (based on conversations I've had in the space). 

For this incentives-reason, I wish hardcore-technical-AI-alignment had a greater support-infrastructure for independent researchers and students. Otherwise, we're often gonna be torn between "learning/working for something to get a job" and "learning AI alignment background knowledge with our spare time/energy".

Technical AI alignment is one of the few important fields that you can't quite major in, and whose closest-related jobs/majors make the problem worse.

As much as agency is nice, plenty of (useful!) academics out there don't have the kind of agency/risk-taking-ability that technical alignment research currently demands as the price-of-entry. This will keep choking us off from talent. Many of the best ideas will come from sheltered absentminded types, and only the LTFF and a tiny number of other groups give (temporary) support to such people.

Yes, important to get the incentives right. You could set the salary for AI alignment slightly below that of the worker's market value. Also, I wonder about the relevant elasticity. How many people have the capacity to get good enough at programming to be able to contribute to capabilities research + would have the desire to game my labor hoarding system because they don't have really good employment options?

More discussion here:

I'd be curious to hear more about this "contributes significantly in expectation" bit. Like, suppose I have some plan that (if it doesn't work) burns timelines by X, but (if it does work) gets us 10% of the way towards aligned AGI (e.g. ~10 plans like this succeeding would suffice to achieve aligned AGI) and moreover there's a 20% chance that this plan actually buys time by providing legible evidence of danger to regulators who then are more likely to regulate and more likely to make the regulation actually useful instead of harmful. So we have these three paths to impact (one negative, two positive) and I'm trying to balance the overall considerations. I suppose you'd say (a) do the math and see what it says, and (b) be vigilant against rationalization / wishful thinking biasing your math towards saying the benefits outweigh the costs. Is that right? Anything else you want to say here?

(A concrete example here might be ARC Evals' research, which may have inadvertently burned timelines a bit by inspiring the authors of AutoGPT who read the GPT-4 system card, but iiuc lots of people like Langchain were doing stuff like that anyway so it probably didn't make more than a few weeks' difference, and meanwhile the various beneficial effects of their evals work seem quite strong.)

(Perhaps a useful prompt would be: Do you think it's useful to distinguish between capabilities research and research-which-has-a-byproduct-of-giving-people-capabilities-ideas? Why or why not?)
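
The "do the math" step in the comment above can be sketched roughly as follows. This is only an illustration, not anyone's actual model: the 10% alignment progress and the 20% chance of buying time come from the comment, while the success probability, the months-equivalent value of fully solving alignment, the months bought by regulation, and the value of X are all made-up assumptions, as are the names `Plan`, `expected_net_value`, and `break_even_cost`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """One plan with three paths to impact, all in 'months-equivalent' units.

    Only alignment_gain (0.10) and p_buys_time (0.20) come from the comment;
    every other field is a hypothetical input the reader must supply.
    """
    p_success: float               # assumed chance the plan works
    timeline_cost: float           # "X": months of timeline burned if it fails
    alignment_gain: float          # fraction of the way to aligned AGI on success
    alignment_value_months: float  # assumed worth of fully aligned AGI, in months
    p_buys_time: float             # chance regulators respond usefully
    months_bought: float           # assumed months gained if they do


def expected_net_value(plan: Plan) -> float:
    """Expected benefit minus expected cost, in months-equivalent units."""
    cost = (1.0 - plan.p_success) * plan.timeline_cost
    progress = plan.p_success * plan.alignment_gain * plan.alignment_value_months
    time_bought = plan.p_buys_time * plan.months_bought
    return progress + time_bought - cost


def break_even_cost(plan: Plan) -> float:
    """Largest X for which the plan is still non-negative in expectation."""
    progress = plan.p_success * plan.alignment_gain * plan.alignment_value_months
    time_bought = plan.p_buys_time * plan.months_bought
    return (progress + time_bought) / (1.0 - plan.p_success)


if __name__ == "__main__":
    example = Plan(
        p_success=0.5,              # assumption
        timeline_cost=2.0,          # assumption for X
        alignment_gain=0.10,        # from the comment
        alignment_value_months=60,  # assumption
        p_buys_time=0.20,           # from the comment
        months_bought=6.0,          # assumption
    )
    print(f"expected net value: {expected_net_value(example):.2f} months")
    print(f"break-even X: {break_even_cost(example):.2f} months")
```

The output is only as good as the inputs, which is where point (b) about rationalization bites: a small optimistic nudge to `p_success` or `alignment_value_months` can flip the sign of the result.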


I'm trying to make a basic point here, that pushing the boundaries of the capabilities frontier, by your own hands and for that direct purpose, seems bad to me. I emphatically request that people stop doing that, if they're doing that.

I am not requesting that people never take any action that has some probability of advancing the capabilities frontier. I think that plenty of alignment research is potentially entangled with capabilities research (and/or might get more entangled as it progresses), and I think that some people are making the tradeoffs in ways I wouldn't personally make them, but this request isn't for people who are doing alignment work while occasionally mournfully incurring a negative externality of pushing the capabilities frontier.

(I acknowledge that some people who just really want to do capabilities research will rationalize it away as alignment-relevant somehow, but here on Earth we have plenty of people pushing the boundaries of the capabilities frontier by their own hands and for that direct purpose, and it seems worth asking them to stop.)

... but this request isn't for people who are doing alignment work while occasionally mournfully incurring a negative externality of pushing the capabilities frontier.

So then what's the point of posting it?

Anyone that could possibly stumble across this post would not believe themselves to be the villain, just "occasionally mournfully incurring a negative externality of pushing the capabilities frontier", in or outside of 'alignment work'.

Obviously many people exist who think that pushing the capabilities frontier is not a cost but rather a benefit. Your comment, taken literally, is saying that such people “could [not] possibly stumble across this post”. You don’t really believe that, right? It’s a blog post on the internet, indexed by search engines etc. Anyone could stumble across it!

I was one of the people who upvoted but disagreed -- I think it's a good point you raise, M. Y. Zuo, that So8res' qualifications blunt the blow and give people an out, a handy rationalization to justify continuing working on capabilities. However, there's still a non-zero (and I'd argue substantial) effect remaining.

What are your thoughts on the argument that advancing capabilities could help make us safer?

In order to do alignment research, we need to understand how AGI works; and we currently don't understand how AGI works, so we need to have more capabilities research so that we would have a chance of figuring it out. Doing capabilities research now is good because it's likely to be slower now than it might be in some future where we had even more computing power, neuroscience understanding, etc. than we do now. If we successfully delayed capabilities research until a later time, then we might get a sudden spurt of it and wouldn't have the time to turn our increased capabilities understanding into alignment progress. Thus by doing capabilities research now, we buy ourselves a longer time period in which it's possible to do more effective alignment research.

I’m not Nate but I have a response here.

Aside: More Sensible Cousin of Bad Argument 1 [but I still strongly disagree with it]: Three steps: “(A) Endgame safety is really the main thing we should be thinking about; (B) Endgame safety will be likelier to succeed if it happens sooner than if it happens later [for one of various possible reasons]; (C) Therefore let’s make the endgame happen ASAP.”

See examples of this argument by Kaj Sotala and by Rohin Shah.

I mainly don’t like this argument because of (A), as elaborated in the next section.

But I also think (B) is problematic. First, why might someone believe (B)? The main argument I’ve seen revolves around the idea that making the endgame happen ASAP minimizes computing overhang, and thus slow takeoff will be likelier, which in turn might make the endgame less of a time-crunch.

I mostly don’t buy this argument because I think it relies on guessing what bottlenecks will hit in what order on the path to AGI—not only for capabilities but also for safety. That kind of guessing seems very difficult. For example, Kaj mentions “more neuroscience understanding” as bad because it shortens the endgame, but couldn’t Kaj have equally well said that “more neuroscience understanding” is exactly the thing we want right now to make the endgame start sooner??

As another example, this is controversial, but I don’t expect that compute will be a bottleneck to fast takeoff, and thus I don’t expect future advances in compute to meaningfully speed the Endgame. I think we already have a massive compute overhang, with current technology, once we know how to make AGI at all. The horse has already left the barn, IMO. See here.

So anyway, I see this whole (B) argument as quite fragile, even leaving aside the (A) issues. And speaking of (A), that brings us to the next item: …

I don’t want to speak for Nate but I’m guessing that he would at least share my objection to (A) on the basis of his DTD post.


According to DeepMind we should aim at that little spot at the top (I added the yellow arrow). This spot is still dangerous btw. Seems tricky to me.

Image from the DeepMind paper on extreme risk.


This is an occasional reminder that I think pushing the frontier of AI capabilities in the current paradigm is highly anti-social

There's plenty of other similarly fun things you can do instead! Like trying to figure out how the heck modern AI systems work as well as they do

These two research tracks actually end up being highly entangled/convergent, they don't disentangle cleanly in the way you/we would like.

Some basic examples:

  1. successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress (it's absolutely insane that people often try to build complex DL systems without the kinds of detailed debugging/analysis tools that are useful in many related fields such as for computer graphics pipelines. You can dramatically accelerate progress when you can quickly visualize/understand your model's internal computations on a gears level vs the black box alchemy approach).

  2. Deep understanding of neuroscience mechanisms could advance safer brain-sim-ish approaches and help elucidate practical partial alignment mechanisms (empathic altruism, prosociality, love, etc), but can also obviously accelerate DL capabilities.

  3. Better approximations of universal efficient active learning (empowerment/curiosity) are obviously dangerous capability-wise, but also seem important for alignment by modeling/bounding human utility when externalized.

In the DL paradigm you can't easily separate capabilities and alignment, and forcing that separation seems to constrain us to approaches that are too narrow/limiting to be relevant on short timelines.

successful interpretability tools want to be debugging/analysis tools of the type known to be very useful for capability progress

Give one example of a substantial state-of-the-art advance that was decisively influenced by transparency; I ask since you said "known to be." Saying that it's conceivable isn't evidence they're actually highly entangled in practice. The track record is that transparency research gives us differential technological progress and pretty much zero capabilities externalities.

In the DL paradigm you can't easily separate capabilities and alignment

This is true for conceptual analysis. Empirically they can be separated by measurement. Record general capabilities metrics (e.g., general downstream accuracy) and record safety metrics (e.g., trojan detection performance); see whether an intervention improves a safety goal and whether it improves general capabilities or not. For various safety research areas there aren't externalities. (More discussion on this topic here.)
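
A minimal sketch of that measurement procedure, assuming each model is summarized by one capabilities number and one safety number. The names `Metrics` and `classify_intervention`, the example scores, and the noise tolerance are all hypothetical and chosen only for illustration; a real comparison would use several metrics and account for run-to-run variance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    capability: float  # e.g., downstream accuracy (higher is more capable)
    safety: float      # e.g., trojan detection performance (higher is safer)


def classify_intervention(before: Metrics, after: Metrics,
                          tolerance: float = 0.005) -> str:
    """Label an intervention by which of the two metrics it moved.

    Changes no larger than `tolerance` are treated as noise (an assumed
    threshold, not a standard value).
    """
    d_capability = after.capability - before.capability
    d_safety = after.safety - before.safety

    if d_safety > tolerance and d_capability <= tolerance:
        return "safety gain without capabilities externality"
    if d_safety > tolerance:
        return "safety gain with capabilities externality"
    if d_capability > tolerance:
        return "capabilities gain only"
    return "no measurable improvement"


if __name__ == "__main__":
    baseline = Metrics(capability=0.80, safety=0.60)       # made-up scores
    with_detector = Metrics(capability=0.80, safety=0.75)  # made-up scores
    print(classify_intervention(baseline, with_detector))
```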

forcing that separation seems to constrain us

I think the poor epistemics on this topic have encouraged risk taking, reduced the pressure to find clear safety goals, and allowed researchers to get away with "trust me I'm making the right utility calculations and have the right empirical intuitions," which is a very unreliable standard of evidence in deep learning.


The probably-canonical example at the moment is Hyena Hierarchy, which cites a bunch of interpretability research, including Anthropic's stuff on Induction Heads. If HH actually gives what it promises in the paper, it might enable way longer context.

I don't think you even need to cite that though. If interpretability wants to be useful someday, I think interpretability has to be ultimately aimed at helping steer and build more reliable DL systems. Like that's the whole point, right? Steer a reliable ASI.

I think you're giving only part of the moral calculus. The altruistic goal is not to delay AGI. The altruistic goal is to advance alignment faster than capabilities.

I'd love to eke out a couple more months or even days of human existence. I love it, and the humans that are enjoying their lives. But doing so at the risk of losing the alignment battle for future humans would be deeply selfish.

I think this is relevant because there's a good bit of work that does advance capabilities, but advances alignment faster. There's plenty more to say about being sure that's what you're doing, and not letting optimistic biases form your beliefs.

From Some background for reasoning about dual-use alignment research:

Doing research but not publishing it has niche uses. If research would be bad for other people to know about, you should mainly just not do it.


Leaving aside an enormous host of questions, why do you think this request would have even a shred of efficacy, given that China exists?