Lessons learned from talking to >100 academics about AI safety

Marius Hobbhahn

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I’d like to thank MH, Jaime Sevilla and Tamay Besiroglu for their feedback.

During my Master's and Ph.D. (still ongoing), I have spoken with many academics about AI safety. These conversations include chats with individual PhDs, poster presentations and talks about AI safety.

I think I have learned a lot from these conversations and expect many other people concerned about AI safety to find themselves in similar situations. Therefore, I want to detail some of my lessons and make my thoughts explicit so that others can scrutinize them.

TL;DR: People in academia seem more and more open to arguments about risks from advanced intelligence over time and I would genuinely recommend having lots of these chats. Furthermore, I underestimated how much work related to some aspects AI safety already exists in academia and that we sometimes reinvent the wheel. Messaging matters, e.g. technical discussions got more interest than alarmism and explaining the problem rather than trying to actively convince someone received better feedback.

Update: here is a link with a rough description of the pitch I used.

Executive summary

I have talked to somewhere between 100 and 200 academics (depending on your definitions) ranging from bachelor students to senior faculty. I use a broad definition of “conversations”, i.e. they include small chats, long conversations, invited talks, group meetings, etc.

Findings

Most of the people I talked to were more open about the concerns regarding AI safety than I expected, e.g. they acknowledged that it is a problem and asked further questions to clarify the problem or asked how they could contribute.
Often I learned something during these discussions. For example, the academic literature on interpretability and robustness is rich and I was pointed to resources I didn’t yet know. Even in cases where I didn’t learn new concepts, people scrutinized my reasoning such that my explanations got better and clearer over time.
The higher up the career ladder the person was, the more likely they were to quickly dismiss the problem (this might not be true in general, I only talked with a handful of professors).
Often people are much more concerned with intentional bad effects of AI, e.g. bad actors using AI tools for surveillance, than unintended side-effects from powerful AI. The intuition that “AI is just a tool and will just do what we want” seems very strong.
There is a lot of misunderstanding about AI safety. Some people think AI safety is the same as fairness, self-driving cars or medical AI. I think this is an unfortunate failure of the AI safety community but is quickly getting better.
Most people really dislike alarmist attitudes. If I motivated the problem with X-risk, I was less likely to be taken seriously.
Most people are interested in the technical aspects, e.g. when I motivated the problem with uncontrollability or interpretability (rather than X-risk), people were more likely to find the topic interesting. Making the core arguments for “how deep learning could go wrong” as detailed, e.g. by Ajeya Cotra or Richard Ngo usually worked well.
Many people were interested in how they could contribute. However, often they were more interested in reframing their specific topic to sound more like AI safety rather than making substantial changes to their research. I think this is understandable from a risk-reward perspective of the typical Ph.D. student.
People are aware of the fact that AI safety is not an established field in academia and that working on it comes with risks, e.g. that you might not be able to publish or be taken seriously by other academics.
In the end, even when people agree that AI safety is a really big problem and know that they could contribute, they rarely change their actions. My mental model changed from “convince people to work on AI safety” to “Explain why some people work on AI safety and why that matters; then present some pathways that are reachable for them and hope for the best”.
I have talked to people for multiple years now and I think it has gotten much easier to talk about AI safety over time. By now, capabilities have increased to a level that people can actually imagine the problems AI safety people have been talking about for a decade. It could also be that my pitch has gotten better or that I have gotten more senior and thus have more trust by default.

Takeaways

Don’t be alarmist and speak in the language of academics. Don’t start with X-risk or alignment, start with a technical problem statement such as “uncontrollability” or “interpretability” and work from there.
Be open to questions and don’t dismiss criticism even if it has obvious counterarguments. You are likely one of the first people to talk to them about AI safety and these are literally the first five minutes in their lives thinking about it.
Academic incentives matter to academics. People care about their ability to publish and their citation counts. They know that if they want to stay in academia, they have to publish. If you tell them to stop working on whatever they are working on right now and work on AI alignment, this is not a reasonable proposition from their current perspective. If you show them pathways toward AI safety, they are more likely to think about what options they could choose. Providing concrete options that relate to their current research was always the most helpful, e.g. when they work on RL suggest inverse RL or reward design and when they work on NN capabilities, suggest NN interpretability.
Existing papers, workshops, challenges, etc. that are validated within the academic community are super helpful. If you send a Ph.D. or post-doc a blog post they tend to not take it seriously. If you send them a paper they do (arxiv is sufficient, doesn’t have to be peer-reviewed). Some write-ups I found especially useful to send around include:
- Concrete problems in AI safety [arxiv]
- Unsolved Problems in ML safety [arxiv]
- AGI safety from first principles [AF]
Explain don’t convince: Let your argument do the work. If you explained it poorly, people shouldn't feel pressured. I think that most academics will agree that AI safety is a relevant problem if you did a half-decent job explaining it. However, there is a big difference between “understanding the problem” and “making a big career change”. Nevertheless, it is still important that other academics understand the problem even if they don’t end up working on it. They influence the next generation, they are your reviewers, they make decisions about funding, etc. The difference between whether they think AI safety is reasonable or whether it is alarmist, sci-fi, billionaire-driven BS might be bigger than I originally expected. Furthermore, if they take you seriously, it’s less likely that they will see the field as alarmist/crazy in their next encounter with AI safety even if you’re not around.

Things we/I did - long version

I have spoken to lots of Bachelor's students, Master's students, PhDs and some post-docs and professors in Tübingen.
Gave an intro talk about AI safety in my lab.
Gave a talk about AI safety for the ELLIS community (the European ML network).
Co-founded and stopped the AI safety reading group for Tübingen. We stopped because it was more efficient to get people into online reading groups, e.g. AGISF fundamentals.
Presented a poster called “AI safety introduction - questions welcome” at the ELLIS doctoral symposium (with roughly 150 European ML PhDs participating) together with Nikos Bosse.
Presented the same poster at the IMPRS-IS Bootcamp (with roughly 200 ML PhDs participating).

In total, depending on how you count, I had between 100 and 200 conversations about AI safety with academics, most of which were Ph.D. students.

This is the poster we used. We put it together in ~60 minutes, so don’t expect it to be perfect. We mostly wanted to start a conversation. If you want to have access to the overleaf and make adaptions, let me know. Feedback is appreciated.

Findings - long version

People are open to chat about AI safety

If you present a half-decent pitch for AI safety, people tend to be curious. Even if they find it unintuitive or sci-fi, in the beginning, they will usually give you a chance to change their mind and explain your argument. Academics tend to be some of the brightest minds in society and they are curious and willing to be persuaded when presented with plausible evidence.

Obviously, that won’t happen all the time, sometimes you’ll be dismissed right away or you’ll hear a bad response presented as a silver bullet answer. But in the vast majority of cases, they’ll give you a chance to make your case and ask clarifying questions afterward.

I learned something from the discussions

There are many academics working on problems related to AI safety, e.g. robustness, interpretability, control and much more. This literature is often not exactly what people in AI safety are looking for but they are close enough to be really helpful. In some cases, these resources were also exactly what I was looking for. So just on a content level, I got lots of ideas, tips and resources from other academics. Informal knowledge such as “does this method work in practice?” or “which code base should I use for this method?” is also something that you can quickly learn from practitioners.

Even in the cases where I didn’t learn anything on a content level, it was still good to get some feedback, pushback and scrutiny on my pitch on why AI safety matters. Sometimes I skipped steps in my reasoning and that was pointed out, sometimes I got questions that I didn’t know how to answer so I had to go back to the literature or think about them in more detail. I think this made both my own understanding of the problem and my explanation of it much better.

Intentional vs unintentional harms

Most of the people I talked to thought that intentional harm was a much bigger problem than unintended side effects. Most of the arguments around incentives for misalignment, robustness failures, inner alignment, goal misspecification, etc. were new and sound a bit out there. Things like country X will use AI to create a surveillance state seemed much more plausible to most. After some back and forth, people usually agreed that the unintended side effects are not as crazy as they originally seemed.

I think this does not mean people caring about AI alignment should not talk about unintended side effects for instrumental reasons. I think this mostly implies that you should expect people to have never heard of alignment before and simple but concrete arguments are most helpful.

It depends on the career stage

People who were early in their careers, e.g. Bachelor’s and Master’s students were often the most receptive to ideas about AI safety. However, they are also often far away from contributing to research so their goals might drastically change over the years. Also, they sometimes lack some understanding of ML, Deep Learning or AI more generally so it is harder to talk about the details.

PhDs are usually able to follow most of the arguments on a fairly technical level and are mostly interested, e.g. they want to understand more or how they could contribute. However, they have often already committed to a specific academic trajectory and thus don’t see a path to contribute to AI safety research without taking substantial risks.

Post-docs and professors were the most dismissive of AI safety in my experience (with high variance). I think there are multiple possible explanations for this including
a) most of their status depends on their current research field and thus they have a strong motivation to keep doing whatever they are doing now,
b) there is a clear power/experience imbalance between them and me and
c) they have worked with ML systems for many years and are generally more skeptical of everyone claiming highly capable AI. Their lived experience is just that hype cycles die and AI is usually much worse than promised.

However, this comes from a handful of conversations and I also talked to some professors who seemed genuinely intrigued by the ideas. So don’t take it this as strong evidence.

Misunderstandings and vague concepts

There are a lot of misunderstandings around AI safety and I think the AIS community has failed to properly explain the core ideas to academics until fairly recently. Therefore, I often encountered confusions like that AI safety is about fairness, self-driving cars and medical ML. And while these are components of a very wide definition of AI safety and are certainly important, they are not part of the alignment-focused narrower definition of AI safety.

Usually, it didn’t take long to clarify this confusion but it mostly shows that when people hear you talking about AI safety, they often assume you mean something very different from what you intended unless you are precise and concrete.

People dislike alarmism

If you motivate AI safety with X-risk people tend to think you’re pascal’s mugging them or that you do this for signaling reasons. I think this is understandable. If you haven’t thought about how AI could lead to X-risk, the default response is that this is probably implausible and there are also wildly varying estimates of X-risk plausibility within the AI safety community.

When people claim that civilization is going to go extinct because of nuclear power plants or because of ecosystem collapse from fertilizer overuse, I tend to be skeptical. This is mostly because I can’t think of a detailed mechanism of how either of those leads to actual extinction. If people are unaware of the possible mechanisms of advanced AI leading to extinction, they think you just want attention or don’t do serious research.

In general, I found it easier just not to talk about X-risk unless people actively asked me to. There are enough other failure modes you can use to motivate your research that they are already familiar with that range from unintended side-effects to intended abuse.

People are interested in the technical aspects

There are many very technical pitches for AI safety that never talk about agency, AGI, consciousness, X-risk and so on. For example, one could argue that

ML systems are not robust to out-of-distribution samples during deployment and this could lead to problems with very powerful systems.
ML systems are incentivized to be deceptive once they are powerful enough to understand that they are currently being trained.
ML systems are currently treated as black boxes and it is hard to open up the black box. This leads to problems with powerful systems.
ML systems could become uncontrollable. Combining a powerful black-box tool with a real-world task can have big unforeseen side effects.
ML models could be abused by people with bad intentions. Thus, AI governance will likely matter a lot in the near future.

Most of the time, a pitch like “think about how good GPT-3 is right now and how fast LLMs get better; think about where a similar system could be in 10 years; What could go wrong if we don’t understand this system or if it became uncontrollable?” is totally fine to get an “Oh shit, someone should work on this” reaction even if it is very simplified.

People want to know how they can contribute

Once you have conveyed the basic case for why AI safety matters, people tend to be naturally curious about how they can contribute. Most of the time, their current research is relatively far away from most AI safety research and people are aware of that.

I usually tried to show a path between their research and research that I consider core AI safety research. For example, when people work on RL, I suggested working on inverse RL or reward design or when people work on NNs, I suggested working on interpretability. In many instances, this path is a bit longer, e.g. when someone works on some narrow topic in robotics. However, most of the time you can just present many different options, see how they respond to them and then talk about those that they are most excited about.

In general, AI safety comes with lots of hard problems and there are many ways in which people can contribute if they want to.

One pitfall of this strategy is that people sometimes want to get credit for “working on safety” without actually working on safety and start to rationalize how their research is somehow related to safety (I was guilty of this as well at some point). Therefore, I think it is important to point this out (in a nice way!) whenever you spot this pattern. Usually, people don’t actively want to fool themselves but we sometimes do that anyway as a consequence of our incentives and desires.

People know that doing AI safety research is a risk to their academic career

If you want to get a Ph.D. you need to publish. If you want to get into a good post-doc position you need to publish even more. Optimally, you publish in high-status venues and collect lots of citations. Academics often don’t like this system but they know that this is “how it’s done”.

They are also aware that the AI safety community is fairly small in academia and is often seen as “not serious” or “too abstract”. Therefore, they are aware that working more on AI safety is a clear risk to their academic trajectory.

Pointing out that the academic AI safety community has gotten much bigger, e.g. through the efforts of Dan Hendrycks, Jacob Steinhardt, David Kruger, Sam Bowman and others, makes it a bit easier but the risk is still very present. Taking away this fear by showing avenues to combine AI safety with an academic career was often the thing that people cared most about.

Explain don’t convince

When I started talking to people about AI safety some years ago, I tried to convince them that AI safety matters a lot and that they should consider working on it. I obviously knew that this is an unrealistic goal but the goal was still to “convince them as much as possible”. I think this is a bad framing for two reasons. First, most of your discussions feel like a failure since people will rarely change their life substantially based on one conversation. Second, I was less willing to engage with questions or criticism because my framing assumed that my belief was correct rather than just my best working hypothesis.
I think switching this mental model to “explain why some people believe AI safety matters” is a much better approach because it solves the problems outlined before but also feels much more collaborative. I found this framing to be very helpful both in terms of getting people to care about the issue but also in how I felt about the conversation later on.

I think there is also a vibes-based explanation to this. When you’re confronted with a problem for the first time and the other person actively tries to convince you, it can feel like being bothered by Jehova’s Witnesses or someone trying to sell you a fake Gucci bag. When the other person explains their arguments to you, you have more agency and control over the situation and “are allowed to” generate your own takeaways. This might seem like a small difference but I think it matters much more than I originally anticipated.

It has gotten much easier

I think my discussions today are much more fruitful than, e.g. 3 years ago. There are multiple plausible explanations for this. a) I might have gotten better at giving the pitch, b) I’m now a Ph.D. student and thus my default trust might be higher, or c) I might just have lowered my standards.

However, I think there are other factors at work that contribute to the fact that I can have better discussions. First, I think the AI alignment community has actually gotten better at explaining the risk in a more detailed fashion and in ways that can be explained in the language of the academic community, e.g. with more rigor and less hand-waving. Secondly, there are now some people in academia who take these risks seriously who have academic standing and whose work you can refer to in discussions (see above for links). Thirdly, capabilities have gotten good enough that people can actually envision the danger.

Conclusion

I have had lots of chats with other academics about AI safety. I think academics are sometimes seen as “a lost cause” or “focusing on publishable results” by some people in the AI safety community and I can understand where this sentiment is coming from. However, most of my conversations were pretty positive and I know that some of them made a difference both for me and the person I was talking to. I know of people who got into AI safety because of conversations with me and I know of people who have changed their minds about AI safety because of these conversations. I also have gotten more clarity about my own thoughts and some new ideas due to these conversations.

Academia is and will likely stay the place where research is done for a lot of people in the foreseeable future and it is thus important that the AI safety community interacts with the academic world whenever it makes sense. Even if you personally don’t care about academia, the people who teach the next generation, who review your papers and who set many research agendas should have a basic understanding of why you think AI safety is a cause worth working on even if they will not change their own research direction. Academia is a huge pool of smart and open-minded people and it would be really foolish for the AI safety community to ignore that.

[-]Marius Hobbhahn4moΩ5116Review for 2022 Review

I haven't talked to that many academics about AI safety over the last year but I talked to more and more lawmakers, journalists, and members of civil society. In general, it feels like people are much more receptive to the arguments about AI safety. Turns out "we're building an entity that is smarter than us but we don't know how to control it" is quite intuitively scary. As you would expect, most people still don't update their actions but more people than anticipated start spreading the message or actually meaningfully update their actions (probably still less than 1 in 10 but better than nothing).

[-]Karl von Wendt2y94

Thank you very much for sharing this - it is very helpful to me! I agree that academics, in particular within the EU, but probably also everywhere else, are an underutilized and potentially very valuable resource, especially with respect to AI governance. Your post seems to support my own view that we should be talking about "uncontrollable AI" instead of "misaligned AGI/superintelligence", which I have explained here: https://www.lesswrong.com/posts/6JhjHJ2rdiXcSe7tp/let-s-talk-about-uncontrollable-ai

[-]trevor2y81

Was this funded? These sorts of findings are clearly more worthy of funding than 90% of the other stuff I've seen, and most the rest of the 10% is ambiguous.

Forget the "Alignment Textbook from 100 years in the future"; if we had this 6 years ago, things would have gone very differently.

[-]Marius Hobbhahn2y8-1

I don't think these conversations had as much impact as you suggest and I think most of the stuff funded by EA funders has decent EV, i.e. I have more trust in the funding process than you seem to have.

I think one nice side-effect of this is that I'm now widely known as "the AI safety guy" in parts of the European AIS community and some people have just randomly dropped me a message or started a conversation about it because they were curious.

I was working on different grants in the past but this particular work was not funded.

[-]trevor2y60

I was thinking about this post, not the conversations at all. If this post had existed 6 years ago, your conversations probably would have had much more impact than they did. It's a really good idea for people to know how to succeed instead of fail at explaining AI safety to someone for the first time, even if this was just academics.

This post should be distilled so even more people can see it, and possibly even distributed as zines or a guidebook, e.g. at some of the AI safety events such as in DC.

[-]Collin2y84

Thanks for writing this up! I basically agree with most of your findings/takeaways.

In general I think getting the academic community to be sympathetic to safety is quite a bit more tractable (and important) than most people here believe, and I think it's becoming much more tractable over time. Right now, perhaps the single biggest bottleneck for most academics is having long timelines. But most academics are also legitimately impressed by recent progress, which I think has made them much more open to considering AGI than they used to be at least, and I think this trend will likely accelerate over the next few years as we see much more impressive models.

[-]Nisan2y51

You poster talks about "catastrophic outcomes" from "more-powerful-than-human" AI. Does that not count as alarmism and x-risk? This isn't meant to be a gotcha, I just want to know what counts as too alarmist for you.

[-]Marius Hobbhahn2y62

Maybe I should have stated this differently in the post. Many conversations end up talking about X-risks at some point but usually only after it went through the other stuff. I think my main learning was just that starting with X-risk as the motivation did not seem very convincing.

Also, there is a big difference in how you talk about X-risk. You could say stuff like "there are plausible arguments why X-risk could lead to extinction but even experts are highly uncertain about this" or "We're all gonna die" and the more moderate version seems clearly more persuasive.

[-]Algon2y51

"... Post-docs and professors were the most dismissive of AI safety in my experience (with high variance)."

What subset of senior academics took your seriously? Do you have data on them, on your talks etc.? Figuring out characteristics of these people seems like it could have high returns.

[-]Marius Hobbhahn2y70

I think taking safety seriously was strongly correlated with whether or not they believed AI will be transformative in the near-term (as opposed to being just another hype cycle). But not sure. My sample is too small to make any general inferences.

[-]jacob_cannell2y41

Current LLMs are already hooked up to the internet, have access to millions of individual machines and are allowed to take actions on those machines

That seems pretty unlikely to me as worded - source?

[-]Marius Hobbhahn2y96

This seems to be what https://www.adept.ai/act is working on if I understood their website correctly. They probably don't have a million users yet though so I agree that it is not accurate at the moment.

Also, https://openai.com/blog/webgpt/ is an LLM with access to the internet, right?

But yeah, probably an overstatement.

[-]Gurkenglas2y40

People know that doing AI safety is a risk to their academic career

Would it help to point out that 2040 is unlikely to be a real year, so they should stop having 20-year plans?

[-]Marius Hobbhahn2y128

Not really. You can point out that this is your reasoning but whenever you talk about short timelines you can bet that most people will think you're crazy. To be fair, even most people in the alignment community are more confident that 2040 will exist than not, so this is even a controversial statement within AI safety.

[-]Gurkenglas2y30

Huh, I misremembered Cotra's update, and wrt that Metaculus question that got retitled due to the AI effect, I can see most people thinking it resolves long before history ends.

[-]harfe2y20

Great post! I agree that academia is a resource that could be very useful for AI safety.

There are a lot of misunderstandings around AI safety and I think the AIS community has failed to properly explain the core ideas to academics until fairly recently. Therefore, I often encountered confusions like that AI safety is about fairness, self-driving cars and medical ML.

I think these misunderstandings are understandable based on the term "AI safety". Maybe it would be better to call the field AGI safety or AGI alignment? This seems to me like a more honest description of the field.

You also write that you find it easier to not talk about xrisk. If we avoid talking about xrisk while presenting AI safety, then some misunderstandings about AI safety will likely persist in the future.

[-]Marius Hobbhahn2y84

Firstly, I don't think the term matters that much. Whether you use AGI safety, AI safety, ML safety, etc. doesn't seem to have as much of an effect compared to the actual arguments you make during the conversation (at least that was my impression).

Secondly, I don't say you should never talk about x-risk. I mostly say you shouldn't start with it. Many of my conversations ended up in discussions of X-risk but only after 30 minutes of back and forth.

LESSWRONG
LW