Preparing for "The Talk" with AI projects

by Daniel Kokotajlo, 13th Jun 2020

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Epistemic status: Written for Blog Post Day III. I don't get to talk to people "in the know" much, so maybe this post is obsolete in some way.

I think that at some point at least one AI project will face an important choice between deploying and/or enlarging a powerful AI system, or holding back and doing more AI safety research.

(Currently, AI projects face choices like this all the time, except they aren't important in the sense I mean it, because the AI isn't potentially capable of escaping and taking over large parts of the world, or doing something similarly bad.)

Moreover, I think that when this choice is made, most people in the relevant conversation will be insufficiently concerned/knowledgeable about AI risk. Perhaps they will think: "This new AI design is different from the classic models, so the classic worries don't arise." Or: "Fear not, I did [insert amateur safety strategy]."

I think it would be very valuable for these conversations to end with "OK, we'll throttle back our deployment strategy for a bit so we can study the risks more carefully," rather than with "Nah, we're probably fine, let's push ahead." This buys us time. Say it buys us a month. A month of extra time right after scary-powerful AI is created is worth a lot, because we'll have more serious smart people paying attention, and we'll have more evidence about what AI is like. I'd guess that a month of extra time in a situation like this would increase the total amount of quality-weighted AI safety and AI policy work by 10%. That's huge.


One way to prepare for these conversations is to raise awareness about AI risk and technical AI safety problems, so that more of the people in these conversations are likely to be informed about the risks. I think this is great.

However, there's another way to prepare, which I think is tractable and currently neglected:

1. Identify some people who might be part of these conversations, and who already are sufficiently concerned/knowledgeable about AI risk.

2. Help them prepare for these conversations by giving them resources, training, and practice, as needed:

2a. Resources:

Perhaps it would be good to have an Official List of all the AI safety strategies, so that whatever rationale people give for why this AI is safe can be compared to the list. (See this prototype list.)

Perhaps it would be good to have an Official List of all the AI safety problems, so that whatever rationale people give for why this AI is safe can be compared to the list, e.g. "OK, so how does it solve outer alignment? What about mesa-optimizers? What about the malignity of the universal prior? I see here that your design involves X; according to the Official List, that puts it at risk of developing problems Y and Z..." (See this prototype list.)

Perhaps it would be good to have various important concepts and arguments re-written with an audience of skeptical and impatient AI researchers in mind, rather than the current audience of friends and LessWrong readers.

2b. Training & practice:

Maybe the person is shy, or bad at public speaking, or bad at keeping cool and avoiding fluster in high-stakes discussions. If so, some coaching and practice could go a long way. Maybe they have the opposite problems, frequently coming across as overconfident, arrogant, aggressive, or paranoid. If so, someone should tell them this and help them tone it down.

In general it might be good to do some role-play exercises or something, to prepare for these conversations. As an academic, I've seen plenty of mock-dissertation-defense sessions and mock-job-talk-question-sessions, which seem to help. And maybe there are ways to get even more realistic practice, e.g. by trying to convince your skeptical friends that their favorite AI design might kill them if it worked.

Note that most of part 2 can be done without having done part 1. This is important in case we don't know anyone who might be part of one of these conversations, which is true for many and perhaps most of us.


Why do I think this is tractable? Well, it seems like the sort of thing that people producing AI safety research can do on the margin, just by thinking more about their audience and maybe recording their work (or other people's work) on some Official List. Moreover, people who don't do (or even read) AI safety research can contribute to this, e.g. by reading the literature on how to practice for situations like this, and writing up the results.

Why do I think this is neglected? Well, maybe it isn't. In fact I'd bet that some people are already thinking along these lines. It's a pretty obvious idea. But just in case it is neglected, I figured I'd write this. Moreover, the Official Lists I mentioned don't exist, and I think they would if people were taking this idea seriously. Finally--and this more than anything else is what caused me to write this post--I've heard one or two people explicitly call this out as something that they don't think is an important use case for the alignment research they were doing. I disagreed with them, and here we are. If this is a bad idea, I'd love to know why.

16 comments

Hey Daniel, don't have time for a proper reply right now but am interested in talking about this at some point soon. I'm currently in UK Civil Service and will be trying to speak to people in their Office for AI at some point soon to get a feel for what's going on there, perhaps plant some seeds of concern. I think some similar things apply.

Sure, I'd be happy to talk. Note that I am nowhere near the best person to talk to about this; there are plenty of people who actually work at an AI project, who actually talk to AI scientists regularly, etc.

Planned summary for the Alignment Newsletter:

At some point in the future, it seems plausible that there will be a conversation in which people decide whether or not to deploy a potentially risky AI system. So one class of interventions to consider is interventions that make such conversations go well. This includes raising awareness about specific problems and risks, but could also include identifying people who are likely to be involved in such conversations _and_ concerned about AI risk, and helping them prepare for such conversations through training, resources, and practice. This latter intervention hasn't been done yet: some simple examples of potential interventions would be generating official lists of AI safety problems and solutions which can be pointed to in such conversations, or doing "practice runs" of these conversations.

Planned opinion:

I certainly agree that we should be thinking about how we can convince key decision makers of the level of risk of the systems they are building (whatever that level of risk is). I think that on the current margin it's much more likely that this is best done through better estimation and explanation of risks with AI systems, but it seems likely that the interventions laid out here will become more important in the future.

Sounds good. Thanks! My current opinion is basically not that different from yours.

Moreover, I think that when this choice is made, most people in the relevant conversation will be insufficiently concerned/knowledgeable about AI risk. Perhaps they will think: "This new AI design is different from the classic models, so the classic worries don't arise." Or: "Fear not, I did [insert amateur safety strategy]."

This seems a bit like writing the bottom line first?

Like, AI fears in our community have come about because of particular arguments.  If those arguments don't apply, I don't see why one should strongly assume that AI is to be feared, outside of having written the bottom line first.

It also seems kind of condescending to operate under the assumption that you know more about the AI system someone is creating than the person who's creating it knows?  You refer to their safety strategy as "amateur", but isn't there a chance that having created this system entitles them to a "professional" designation?  A priori, I would expect that an outsider not knowing anything about the project at hand would be much more likely to qualify for the "amateur" designation.

I think it would be very valuable for these conversations to end with "OK, we'll throttle back our deployment strategy for a bit so we can study the risks more carefully," rather than with "Nah, we're probably fine, let's push ahead." This buys us time. Say it buys us a month. A month of extra time right after scary-powerful AI is created is worth a lot, because we'll have more serious smart people paying attention, and we'll have more evidence about what AI is like. I'd guess that a month of extra time in a situation like this would increase the total amount of quality-weighted AI safety and AI policy work by 10%. That's huge.

This isn't obvious to me.  One possibility is that there will be some system which is safe if used carefully, and having a decent technological lead gives you plenty of room to use it carefully, but if you delay your development too much, competing teams will catch up and you'll no longer have space to use it carefully.  I think you have to learn more about the situation to know for sure whether a month of delay is a good thing.

Maybe they have the opposite problems, frequently coming across as overconfident

People seem predisposed to form echo chambers of the likeminded.  I don't think the rationalist or AI safety communities are exempt from this.  (Even if the AI safety community has a lot of people with a high level of individual rationality--not obvious, see above note about writing the bottom line first--I don't think having a high level of individual rationality is super helpful for the echo chamber formation problem, since it's more of a sociological phenomenon.)  So coming across as overconfident in one's knowledge may be a bigger risk.

Thanks for the thoughtful pushback! It was in anticipation of comments like this that I put in hedging language like "I think" and "perhaps." My replies:


This seems a bit like writing the bottom line first?
Like, AI fears in our community have come about because of particular arguments.  If those arguments don't apply, I don't see why one should strongly assume that AI is to be feared, outside of having written the bottom line first.

1. Past experience has shown that even when particular AI risk arguments don't apply, often an AI design is still risky, we just haven't thought of the reasons why yet. So we should make a pessimistic meta-induction and conclude that even if our standard arguments for risk don't apply, the system might still be risky--we should think more about it.

2. I intended those two "perhaps..." statements to be things the person says, not necessarily things that are true. So yeah, maybe they *say* the standard arguments don't apply. But maybe they are wrong. People are great at rationalizing, coming up with reasons to get to the conclusion they wanted. If the conclusion they want is "We finally did it and made a super powerful impressive AI, come on come on let's take it for a spin!" then it'll be easy to fool yourself into thinking your architecture is sufficiently different as to not be problematic, even when your architecture is just a special case of the architecture in the standard arguments.

Points 1 and 2 are each individually sufficient to vindicate my claims, I think.

It also seems kind of condescending to operate under the assumption that you know more about the AI system someone is creating than the person who's creating it knows?  You refer to their safety strategy as "amateur", but isn't there a chance that having created this system entitles them to a "professional" designation?  A priori, I would expect that an outsider not knowing anything about the project at hand would be much more likely to qualify for the "amateur" designation.

3. I'm not operating under the assumption that I know more about the AI system someone is creating than the person who's creating it knows. The fact that you said this dismays me, because it is such an obvious straw man. It makes me wonder if I touched a nerve somehow, or had the wrong tone or something, to raise your hackles.

4. Yes, I refer to their safety strategy as amateur. Yes, this is appropriate. AI safety is related to AI capabilities, but the two are distinct sub-fields, and someone who is great at one could be not so great at another. Someone who doesn't know the AI safety literature, who does something to make their AI safe, probably deserves the title amateur. I don't claim to be a non-amateur AI scientist, and whether I'm a non-amateur AI safety person is irrelevant because I'm not going to be one of the people in The Talk. I do claim that e.g. someone like Paul Christiano or Stuart Russell is a professional AI safety person, whereas most AI scientists are not.

This isn't obvious to me.  One possibility is that there will be some system which is safe if used carefully, and having a decent technological lead gives you plenty of room to use it carefully, but if you delay your development too much, competing teams will catch up and you'll no longer have space to use it carefully.  I think you have to learn more about the situation to know for sure whether a month of delay is a good thing.

5. I agree that this is a possibility. This is why I said "say it buys us a month;" I meant that to be an average of the various possibilities. In retrospect I was unclear; I should have clarified that it might not be a good idea to delay at all, for the reasons you mention. I agree we have to learn more about the situation; in retrospect I shouldn't have said "I think it would be better for these conversations to end X way" (even though that is what I think is most likely) but rather found some way to express the more nuanced position.

6. I agree with everything you say about overconfidence, echo chambers, etc. except that I don't think I was writing the bottom line first in this case. I was making a claim without arguing for it, but then I argued for it in the comments when you questioned it. It's perfectly reasonable (indeed necessary) to have some unargued-for claims in any particular finite piece of writing.

1. Past experience has shown that even when particular AI risk arguments don't apply, often an AI design is still risky, we just haven't thought of the reasons why yet. So we should make a pessimistic meta-induction and conclude that even if our standard arguments for risk don't apply, the system might still be risky--we should think more about it.

I've heard this sentiment before, but I'm not aware of a standard reference supporting this claim (let me know if there's something I'm not remembering), and I haven't been totally satisfied when I've probed people on it in the past.

I agree we should think a lot because so much is at stake, but sometimes the fact that so much is at stake means that it's better to act quickly.

People are great at rationalizing, coming up with reasons to get to the conclusion they wanted. If the conclusion they want is "We finally did it and made a super powerful impressive AI, come on come on let's take it for a spin!" then it'll be easy to fool yourself into thinking your architecture is sufficiently different as to not be problematic, even when your architecture is just a special case of the architecture in the standard arguments.

Agreed, I just don't want people to fall into the trap of rationalizing the opposite conclusion either.

I'm not operating under the assumption that I know more about the AI system someone is creating than the person who's creating it knows. The fact that you said this dismays me, because it is such an obvious straw man. It makes me wonder if I touched a nerve somehow, or had the wrong tone or something, to raise your hackles.

It did.  Part of me thought it was better not to comment, but then I figured the entire point of the post was how to do outreach to people we don't agree with, so I decided it was better to express my frustration.

5. I agree that this is a possibility. This is why I said "say it buys us a month;" I meant that to be an average of the various possibilities. In retrospect I was unclear; I should have clarified that it might not be a good idea to delay at all, for the reasons you mention. I agree we have to learn more about the situation; in retrospect I shouldn't have said "I think it would be better for these conversations to end X way" (even though that is what I think is most likely) but rather found some way to express the more nuanced position.

Thanks for clarifying.

It did. Part of me thought it was better not to comment, but then I figured the entire point of the post was how to do outreach to people we don't agree with, so I decided it was better to express my frustration.

Well said. I'm glad you spoke up. Yeah, I don't want people to rationalize their way into thinking AI should never be developed or released either. Currently I think people are much more likely to make the opposite error, but I agree both errors are worth watching out for.

I don't know of a standard reference for that claim either. Here is what I'd say in defense of it:

--AIXItl was a serious proposal for an "ideal" intelligent agent. I heard the people who came up with it took convincing, but eventually agreed that yes, AIXItl would seize control of its reward function and kill all humans.

--People proposed Oracle AI, thinking that it would be safe. Now AFAICT people mostly agree that there are various dangers associated with Oracle AI as well.

--People sometimes said that AI risk arguments were founded on these ideal models of AI as utility maximizers or something, and that they wouldn't apply to modern ML systems. Well, now we have arguments for why modern ML systems are potentially dangerous too. (Whether these are the same arguments rephrased, or new arguments, is not relevant for this point.)

--In my personal experience at least, I keep discovering entirely new ways that AI designs could fail, which I hadn't thought of before. For example, Paul's "The Universal Prior is Malign." Or oracles outputting self-fulfilling prophecies. Or some false philosophical view on consciousness or something being baked into the AI. This makes me think maybe there are more which I haven't yet thought of.

Though the world this points at is pretty scary (a powerful AI system ready to go, only held back by the implementors buying safety concerns), the intervention does seem cheap and good.

I wonder whether 1 will be easy. I think it relies on the first AI systems being made by one of a small selection of easily-identifiable orgs.

Though the world this points at is pretty scary (a powerful AI system ready to go, only held back by the implementors buying safety concerns), the intervention does seem cheap and good.

By scary, do you mean (or mean to imply) unlikely?

I think that if AI happens soon (<10 years) it'll likely happen at an org we already know about, so 1 is feasible. If AI doesn't happen soon, all bets are off and 1 will be very difficult.

By scary, do you mean (or mean to imply) unlikely?

No. Sorry, I suspect starting with "Though" was confusing. I think I meant 'this seems like one of the harder worlds to get a win in, but given that world, this seems like a good intervention'.

I think I have an intuition where (a) we may only win if we stop things getting as bad as this situation and (b) extra expected utility is mostly cheaply purchased by plans that condition on worlds that are not this bad.

I dunno whether that's true though. I haven't thought about it a bunch.

Interesting. I'd love to hear more about the sorts of worlds conditioned on in your (b). For my part, the worlds I described in the original post seem both the most likely and also not completely hopeless--maybe with a month of extra effort we can actually come up with a solution, or else a convincing argument that we need another month, etc. Or maybe we already have a mostly-working solution by the time The Talk happens and with another month we can iron out the bugs.

I just wanted to say that this is a good question, but I'm not sure I know the answer yet.

Worlds that appear most often in my musings (but I'm not sure they're likely enough to count) are:

  • an aligned group getting a decisive strategic advantage
  • safety concerns being clearly demonstrated and part of mainstream AI research
    • Perhaps general reasoning about agents and intelligence improves, and we can apply these techniques to AI designs
    • Perhaps things contiguous with alignment concerns cause failures in capable AI systems early on
  • A more alignable paradigm overtaking ML
    • This seems like a fantasy
    • Could be because ML gets bottlenecked or a different approach makes rapid progress

Thanks, that was an illuminating answer. I feel like those three worlds are decently likely, but that if those worlds occur purchasing additional expected utility in them will be hard, precisely because things will be so much easier. For example, if safety concerns are part of mainstream AI research, then safety research won't be neglected anymore.

You can purchase additional EU by pumping up their probability as well.

EDIT: I know I originally said to condition on these worlds, but I guess that's not what I actually do. Instead, I think I condition on not-doomed worlds.

Ah, that sounds much better to me. Yeah, maybe the cheapest EU lies in trying to make these worlds more likely. I doubt we have much control over which paradigms overtake ML, and I think that the intervention I'm proposing might help make the first and second kinds of world more likely (because maybe with a month of extra time to analyze their system, the relevant people will become convinced that the problem is real).