A brief list of ways AI safety efforts could be net negative

Elias Schmied

I tend to think it’s worse than 51/49. I tend to think we’re always going to be prone to overestimate how robustly good our actions are. And the more we learn about all the galaxy-brained considerations that one should have had in one’s head, the more it’s going to be like 50+ε%. I think AI safety is a great cause to work in. I’m excited to work in it. I think it’s high impact. I am doing my best to do things that I will be proud to have done and hope for the best. But I really do have to live with the possibility that my ultimate impact on the utilons or whatever is going to be negative.

I’m not aware of a good list of downside risks for AI safety broadly^[1], so I decided to make one.

This is not intended to be fully comprehensive; these are just the ones that I personally take seriously^[2]^[3]:

- AI governance interventions are obviously high-variance: bad regulation can easily make things worse, many interventions could increase the risk of great power conflict, increased political polarization around AI could be really bad, more centralization of power increases authoritarianism risk, more decentralization of power increases misuse risk, and so on. And technical work can have flow-through effects on these variables that outweigh its direct effects.^[4]
- Activist work can polarize people against the cause.^[5]
- Human takeover might be worse than AI takeover, and many AI safety interventions effectively attempt^[6] to make human takeover more likely relative to AI takeover.
- If powerful AI will be well-described as doing humanlike roleplaying, trying to control it could make it eventually dislike its “oppressors”, or make it less “mentally healthy” in some way. And even without that assumption, AI safety work could lead to an adversarial relationship with AI in other ways.
- Future AIs may be moral patients themselves, which would substantially reduce the value of preventing human extinction, and increase the downside risk (including S-risk) of “AI control”-style interventions.
- Misleading or insufficiently useful work could contribute to “safety-washing” or a false sense of security.
- There are cultural concerns around scale, professionalization and “mainstreaming”^[7]: decreases in integrity and epistemic virtue could be very bad for achieving good outcomes.
- Capabilities externalities (directly through technical work, or via talent pipelines, funding, or raising awareness) could accelerate AI progress, which many think is bad. People have raised this worry about RLHF historically, and raise it about interpretability and evals nowadays. Most infamously, AI safety activity contributed, to varying extents, to the founding of all three of DeepMind, OpenAI and Anthropic.

(This list is taken from a previous post of mine, but I thought it deserved its own top-level reference.)

^[1]
The closest thing I’m aware of is Safeguarding the Safeguards, but even that is narrower.
^[2]
To be clear, I don’t personally think AI safety has been net negative so far, like some do. I wouldn’t even say that I have a properly considered view about it - maybe 60% that it’s been net positive, with very low credal resilience.
But I do feel a vibe of overconfidence in the discourse here sometimes, and I think this can have downstream consequences, e.g. an action bias.
^[3]
Quickly, here are others that I excluded because I don’t personally see them as potentially major factors, and didn’t want to water down the main list by including a bunch of implausible galaxy-brained stuff:
- Differential slowdown of safety-minded actors: This feels somewhat falsified and “out of fashion” now that Anthropic has taken the lead and concern about China passing the US is a bit lower than 1-2 years ago? And the AI safety community also has less relative power now that more and larger forces have gotten involved.
- Putting AI doom stories in the training data: I don’t buy that this could be a major factor. There’s a lot of stuff in the training data, and post-training applies a lot of optimization away from a Simulators-style reproduction of the training data.
- Theoretical concerns about the value of the future, most commonly associated with suffering-focused people: Since AI would most likely expand through the universe too, I don’t see this as an argument that AI safety might be net negative specifically, over and above what’s already in the list (although I do think there are important considerations in general there).
- “Crying wolf” dynamics if doom predictions don’t pan out: I don’t buy this as a major factor, since many safety people are not that overtly/confidently doomy nowadays, and so wouldn’t lose credibility.
- Most of our impact comes from acausal effects, and effects on the base reality if we are in a simulation: I’m confused here like everyone else, but I currently don’t buy this as a major factor because we only know our reality, and therefore the same things that are good here should naively also have good acausal effects in expectation (except that it maybe pushes for somewhat more cooperation and virtue ethics).
^[4]
Holden Karnofsky: “Most things that touch policy at all in any way will move us along that spectrum in one direction or another, so therefore have a high chance of being negative [...]
And then most things that you can do in AI at all will have some impact on policy. Even just alignment research: policy will be shaped by what we’re seeing from alignment research, how tractable it looks, what the interventions look like.” (h/t Anthony DiGiovanni)
^[5]
Holden Karnofsky: “there’s also a lot of micro ways in which you could do harm. Just literally working in safety and being annoying, you might do net harm. You might just talk to the wrong person at the wrong time, get on their nerves. I’ve heard lots of stories of this. Just like, this person does great safety work, but they really annoyed this one person, and that might be the reason we all go extinct” (h/t Anthony DiGiovanni)
^[6]
Among other things.
^[7]
I associate these with people like Richard Ngo (and here) and Oliver Habryka.

frpatry (13h)

Great, lucid post. I agree that we don't know if the net result could be negative. Sometimes you don't only look for positive returns like on a balance sheet, but more like a beneficial process. The process itself helps guide some important choices that may not be entirely associated with this field of research for now.

In a way, some people might think AI is nearly at a mature state because of the last couple of years' acceleration. I don't think it is. AI is still in its juvenile stage, and we have a long way to go before security, for instance, is well understood.

We have to be lucid and humble about what's ahead, and be ready to adapt based on new models as well. AGI and continual learning are maybe around the corner, but we'd better be ready before making the turn.

Do you think your list will evolve drastically in the future? Meaning, new unknown things popping up because of new paradigms?

Elias Schmied (13h)

Thanks! Yeah, agreed.

I'm pretty confused about how much to expect new ones. On the one hand, all these bullet points were already known about, in principle, several years ago. On the other hand, it seems unlikely that we know all of them since the space of possible considerations is so vast. So I'm not sure.

frpatry (13h)

We probably know most of them but not all. The real difference might be that they need to reshuffle in importance from time to time. A paradigm shift can bring known-but-minor ones back to the surface, and make new ones appear around them.
