I am concerned that technical AI alignment research could be increasing the risks from AGI rather than decreasing them. Here are some reasons why, alongside some possible interventions. (The analogies are purposefully crude to help illustrate my points; I recognize that in reality the situation is not necessarily as clear-cut.)
- Accelerating timelines: Promising results in AI alignment could lower researchers' assessments of danger, increase risk-taking, and accelerate the development of AGI. By contrast, in a world where people are aware that misaligned AI could be dangerous to humanity but there is little promising alignment research, researchers would be more cautious about developing and deploying AI. An analogy: if you believed self-driving cars were very dangerous and ideally should never be let on the road, you could plausibly be against attempts to make them safer, since safer cars are more likely to be allowed on the road.
- Possible solution: focus more on making people aware of risks from AI as opposed to trying to mitigate those risks.
- Subtle misalignment: If we become better at aligning the goals of AI with our own goals, whatever misalignment remains will become more subtle and much harder to spot. A slightly misaligned AI could be much more dangerous than a very misaligned AI. For example, it could take us much longer to notice that the AI system was misaligned, and by the time we did, multiple negative cascades could already have been set in motion. For those concerned with S-risks, a slightly misaligned AI also seems more likely than a very misaligned one to lead to a long period of human suffering. An analogy: a clearly dangerous axe-wielding maniac is plausibly less dangerous than a calculating psychopath who appears very normal but secretly plans to kill as many people as possible.
- Possible solution: focus more on techniques for detecting misalignment as opposed to techniques for achieving alignment.
- Increased usefulness for malicious actors: There is a lot of talk of "alignment", but less about what we are actually aligning with. The ability to align the goals of AI with complex human-level goals could make AI-based systems much more effective weapons/tools for malicious actors than AIs that aren't able to encapsulate human values as well. An analogy: say we have a destructive weapon that, when used, wipes out a random area of the earth. This weapon isn't that useful because you cannot guarantee that a) it will attack the area you want it to and b) it will not attack your own area. However, given the ability to select the attack area very precisely, the weapon becomes much more useful and more likely to be used.
- Possible solution: spend more time considering threat models that involve malicious actors and ways to mitigate them, as opposed to just "accidental" risks from AI. This potentially means a greater focus on governance and international peace as opposed to technical alignment research.
Curious to get some convincing rebuttals of these concerns. I do hope that technical AI safety research is a net positive; however, at the moment I am quite skeptical.