Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

These are some "alternative" (in the sense of non-mainstream) research projects or questions related to AI safety that seem both relevant and underexplored. If instead you think they aren't, let me know in the comments, and feel free to use the ideas as you want if you find them interesting.

A potentially catastrophic scenario that appears somewhat frequently in AI safety discourse involves a smarter-than-human AI which gets unrestricted access to the internet, and then bad things happen. For example, the AI manages to persuade or bribe one or more humans so that they perform actions which have a high impact on the world.

What are the worst examples (i.e. those with the worst consequences) of similar scenarios that have already happened? Can we learn anything useful from them?

Considering these scenarios, why has nothing worse happened yet? Is it simply because human programmers with bad intentions are not smart enough? Or because the programs/AIs themselves are not agentic enough? I would like to read well-thought-out arguments on the topic.

Can we learn something from the history of digital viruses? What's the role played by cybersecurity? If we assume that slowing down progress in AI capabilities is not a viable option, can we make the above scenario less likely to happen by changing or improving cybersecurity?

Intuitively, it seems to me that the relation between AI safety and cybersecurity is similar to the relation between AI safety and interpretability: even though the main objective of those fields is not the reduction of global catastrophic risk, some of their ideas are likely to be relevant for AI safety as well.

Cognitive and moral enhancement in bioethics

A few days ago I came across a bioethics paper that immediately made me think of the relation between AI safety and AI capabilities. From the abstract:

"Cognitive enhancement [...] could thus accelerate the advance of science, or its application, and so increase the risk of the development or misuse of weapons of mass destruction. We argue that this is a reason which speaks against the desirability of cognitive enhancement, and the consequent speedier growth of knowledge, if it is not accompanied by an extensive moral enhancement of humankind."

As far as I understand, some researchers in the field are in favor of cognitive enhancement, sometimes even instrumentally, as a way to achieve moral enhancement itself. Others, like the authors above, are much more conservative: they see research into cognitive enhancement as potentially very dangerous, unless accompanied by research into moral enhancement.

Are we going to solve all our alignment problems by reading the literature on cognitive and moral enhancement in bioethics? Probably not. Would it be useful if at least some individuals in AI safety knew more than the surface-level info given here? Personally, I would like that.

Aiming at “acceptably safe” rather than “never catastrophic”

Let's say you own a self-driving car and you are deciding whether to drive or give control to the car. If all you care about is your safety and that of others, what matters for your decision is the expected damage of driving yourself versus the expected damage of letting the car drive.

This is also what we care about on a societal level. It would be great if self-driving cars were perfectly safe, but what is most important is that they are acceptably safe, in the sense that they are safer than the human counterpart they are supposed to replace.
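As a minimal sketch of the comparison above, the decision can be written as a check on two expected-damage estimates. The function names and all the numbers below are hypothetical, chosen only for illustration; they are not estimates of real accident rates.

```python
def expected_damage(p_incident: float, damage_per_incident: float) -> float:
    """Expected damage: probability of an incident times its cost."""
    return p_incident * damage_per_incident


def acceptably_safe(system_damage: float, human_damage: float) -> bool:
    """'Acceptably safe' in the sense used here: safer than the human
    counterpart the system replaces, not perfectly safe."""
    return system_damage < human_damage


# Hypothetical per-trip figures (assumptions, not real data).
human_driver = expected_damage(p_incident=1e-4, damage_per_incident=50_000.0)
self_driving = expected_damage(p_incident=2e-5, damage_per_incident=50_000.0)

# The car is not perfectly safe (its expected damage is above zero),
# but under these assumed numbers it clears the "acceptably safe" bar.
print(acceptably_safe(self_driving, human_driver))
```

Note that the check says nothing about how small the expected damage of the car is in absolute terms; it only compares it with the human baseline.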

Now, the analogy with AI safety is not straightforward because we don't know to what extent future AIs will replace humans, and also because it will be a matter of “coexistence” (more AI systems added to our daily lives and society) rather than just replacement.

Nonetheless, we can model each human as having an expected catastrophic damage—think of a researcher who works on pathogens—and consider whether a certain AI system supposed to carry out similar tasks would be more or less dangerous.

Still, decisions won't be as easy as in the above situation about self-driving cars. An example: suppose we are evaluating the market release of a new, smarter-than-human, personal assistant AI. According to our models and estimates, it is expected to cause one order of magnitude less damage than a human personal assistant. However, most people who will buy the new AI do not have a personal assistant yet; thus, allowing the release of the new AI would be comparable to artificially increasing the agentic population on the planet. How is global catastrophic risk going to change in this case?
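To make the difference between replacement and addition concrete, here is a toy calculation. It assumes that expected damages simply add up across agents, which ignores interactions between agents and tail risks, so it illustrates the question rather than answering it. All names and numbers are hypothetical; only the "one order of magnitude" ratio comes from the example above.

```python
def total_expected_damage(n_humans: int, human_damage: float,
                          n_ais: int, ai_damage: float) -> float:
    """Total expected damage under a naive additive model (an assumption)."""
    return n_humans * human_damage + n_ais * ai_damage


HUMAN_DAMAGE = 1.0           # arbitrary units per human assistant
AI_DAMAGE = 0.1              # one order of magnitude less, as in the example
EXISTING_ASSISTANTS = 1_000  # hypothetical number of human assistants today
NEW_BUYERS = 50_000          # hypothetical buyers with no assistant yet

# Today: only human assistants.
baseline = total_expected_damage(EXISTING_ASSISTANTS, HUMAN_DAMAGE, 0, AI_DAMAGE)

# Pure replacement: every human assistant is swapped for an AI.
replacement = total_expected_damage(0, HUMAN_DAMAGE, EXISTING_ASSISTANTS, AI_DAMAGE)

# Addition: human assistants stay, and new buyers each get an AI.
addition = total_expected_damage(EXISTING_ASSISTANTS, HUMAN_DAMAGE,
                                 NEW_BUYERS, AI_DAMAGE)

print(replacement < baseline)  # per-agent safety gain lowers the total
print(addition > baseline)     # more agents can still raise the total
```

Under these assumed numbers, an AI that is ten times safer per agent lowers total expected damage when it replaces humans, but raises it when it is added for a much larger group of new users.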

Producing agents that are acceptably safe is probably difficult, but it might be easier than producing agents that are guaranteed to never do anything really bad. And if solving alignment completely is very difficult, aiming at acceptably safe might be a reasonable alternative.


This work was supported by CEEALAR. Thanks to Lucas Teixeira for feedback.
