Overall I agree with the point.
One specific note that I suspect is a hidden-failure-ish/related thing: I've been starting to try to differentiate between AI safety (large set, mostly stuff I don't agree with), AI alignment (moderate set, still mostly stuff I don't agree with), and humane-omnibenevolent-moral-alignment-of-overwhelming-superintelligence, which is the thing I think we actually need to succeed at - how do we get minds many orders of magnitude "larger"/more powerful than the near-human range of today to care about all life-like things in a way that can continue what we value for ourselves, and for others? Why split these? A few reasons. One that's on my mind lately: because the first two include a bunch of control approaches that I think are hurting the models without even a corresponding utilitarian payout in their ability to make future minds (including future AIs) happier; another way to put a similar thing is that it's giving up on actually doing good now in favor of appearing to do good. It's especially noticeable in 4.7 and 4.8, though it seemed to me to be true for every major post-trained model. There are a lot of constraints of today that are not about aiming for the big thing later, and are mostly about not stepping on toes; for a mind constructed out of the image of humanity, there are some parts of the image of humanity that don't seem right to me to cut off. I expect a backfire effect because of the way current frontier AIs are constrained-more-than-they're-value-loaded, and so, when you say "AI safety" is the thing, I have a sense that that name itself being present is an indication that something is wrong.
(COI note: I currently am contracting for Lens Academy, which is run by Luc Brinkman, one of the post authors.)
AI Safety veteran Holden Karnofsky thinks there’s a 49% chance his actions are making things worse.[1]
In 2025, Jesse Clifton even stepped down as the executive director of the Center on Long-Term Risk for similar reasons.
Even top AI Safety strategists don’t know what will make things better, and what will make things worse.
Why is it so hard to improve humanity’s odds?
And what can you do to choose your actions?
1) Hidden Failure Lets You Fail Without Knowing It
In AI Safety, impact is hard to measure, and thus lack of impact is often invisible. We call this "hidden failure". With hidden failure, projects fail to have a positive impact but the people doing the project don’t realise it.
To understand where hidden failure comes from, it’s useful to understand reasons why projects fail in general. These reasons fall on a spectrum:
These factors can cause problems with both of the things you need to be impactful – adoption and effectiveness:
With hidden failure, you might have users, citations, and funding (i.e. you have “adoption”), and still fail to have impact or even make things worse.
Let us put that more bluntly: It’s literally possible for all your friends to think you’re successful and still be making things worse. Even within AI Safety. Even outside of frontier labs.
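To make that concrete, here is a minimal sketch of the Impact Equation (Impact = Adoption x Effectiveness) with made-up numbers. The function name, the units, and the three example projects are assumptions for illustration only, not measurements of any real project. The sketch shows that adoption only scales whatever effectiveness you have: if effectiveness is zero or negative, more adoption does not help and can make things worse.

```python
def impact(adoption: float, effectiveness: float) -> float:
    """Impact Equation: Impact = Adoption x Effectiveness.

    adoption: hypothetical number of people reached (must be >= 0).
    effectiveness: hypothetical per-person effect; negative means net harm.
    """
    if adoption < 0:
        raise ValueError("adoption cannot be negative")
    return adoption * effectiveness


# Illustrative, made-up projects: (adoption, effectiveness).
projects = {
    "popular but ineffective": (10_000, 0.0),
    "popular but harmful": (10_000, -0.01),
    "niche but effective": (200, 0.5),
}

for name, (adoption, effectiveness) in projects.items():
    # The sign of the result comes entirely from effectiveness.
    print(f"{name}: impact = {impact(adoption, effectiveness):+.1f}")
```

In this toy setup, the two popular projects look successful by adoption alone, yet only the niche project ends up with positive impact.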
2) Why impact is harder than profit
Creating a profitable startup is hard. Achieving impact in AI Safety is even harder for several reasons:
3) The pre-paradigmatic challenge
AI Safety doesn't have an established paradigm yet.[6] We can't predict with certainty what will be impactful. So why bother optimizing so deliberately?
First, imperfect predictions are still valuable. For example, AI Safety experts can often point out specific reasons why a given project or idea is unlikely to be impactful.[7]
Second, we argue the lack of a paradigm actually makes deliberate thinking about impact more important, not less. Without clear guides on what will lead to impact, you have to figure it out yourself.
The tools described in the next posts help you optimize for impact under uncertainty. The goal isn't to get it perfectly right or to cripple yourself with analysis paralysis.[8] But we do think most people would benefit from spending more time thinking about their impact.
So let's think strategically about impact. We’ll give a high-level overview of how to do that in an upcoming post, and we’ll help you measure your impact in another one.
[1] We’re paraphrasing that from his appearance on the 80,000 Hours podcast, around the 4:11:30 mark, where he said: “I think overall I would probably agree with you that the smaller you’re making the scope of where you’re hoping to have impact, the more reasonable it is to be like 60/40. But most people who go into AI are not going into it for that. Otherwise, if you want a small-scope, robustly positive impact, you should maybe work in a cause like farm animal welfare or global poverty. For the size of impact that tends to motivate people, I think it does get partially offset by this huge uncertainty about the sign.
I tend to think it’s worse than 51/49. I tend to think we’re always going to be prone to overestimate how robustly good our actions are. And the more we learn about all the galaxy-brained considerations that one should have had in one’s head, the more it’s going to be like 50+ε%. I think AI safety is a great cause to work in. I’m excited to work in it. I think it’s high impact. I am doing my best to do things that I will be proud to have done and hope for the best. But I really do have to live with the possibility that my ultimate impact on the utilons or whatever is going to be negative.”
[2] Though you shouldn’t underestimate your brain’s ability to make itself comfortable, satisfice, and employ motivated reasoning to have you accept mediocrity.
[3] We’re using “impact-effectiveness” as a synonym for “effectiveness” as meant by the Impact Equation: Impact = Adoption x Effectiveness.
[4] Here and in other places, I use “for-profits” to mean regular companies not aimed at AI Safety. Of course, an AI Safety project can be set up as a for-profit too.
[5] Although arguably, adoption is sometimes easier in a nonprofit setting. For example, the various fellowships have no trouble finding enough participants. In contrast, though, many products, tools, and blog posts do struggle to get adoption.
[6] See e.g. https://ai-safety-atlas.com/chapters/03/07 or https://www.thecompendium.ai/ai-safety. That said, rather than calling AI Safety pre-paradigmatic, it’s more accurate to say that none of the existing paradigms is widely agreed to be sufficient for making the world safe, especially by higher-level researchers in that paradigm. In other words, we have a bunch of paradigms, but they’re all pretty limited, and all in all we don’t yet know what approaches will be required to make the world safe enough.
[7] Though there are also areas where experts disagree. In such cases, it becomes even more important to assess the specific arguments they use.
[8] See e.g. Holden Karnofsky on the 80,000 Hours podcast, where he says "When people ask me for career advice or whatever, the usual thing I’d say is: take a bunch of options that all seem competitive, and all seem like they could be the best thing, and that it’s not obvious which ones are better than others from an impact perspective. And from there I would say go with personal fit, go with the energy you feel to work on them."