Plan 1 and Plan 2

[-]Raemon2mo229

A thing I've been thinking lately (this is reposted from a twitter thread where it was more squarely on-topic, but seems like a reasonable part of the convo here, riffing off the Tsvi thread)

It matters a fair amount which biases people have, here.

A few different biases pointing in the "Plan 2 for bad reasons" direction:

1. a desire for wealth
2. a desire to not look weird in front of your friends
3. a desire to "be important"
4. subtly different from #3, a desire to "have some measure of control over the big forces playing out."
5. a desire to be high status in the world's Big League status tier
6. action bias, i.e. inability to do nothing.
7. bias against abstract arguments that you can't clearly see/measure, or against sitting with confusion.
8. bias to think things are basically okay and you don't need to majorly change your life plans.
9. being annoyed at people who keep trying to stop you or make you feel bad or be lower status.
10. being annoyed at people who seem to be missing an important point when they argue with you about AI doom.

All of these seem in-play to me. But depending on these things' relative strength, they suggest different modes of dealing with the problem.

A reason I am optimistic about If Anyone Builds It, is because I think it has a decent chance of changing how reasonable it feels to say "yo guys I do think we might kill everyone" in front of both your friends, and high status big wigs.

This won't be sufficient to change decisionmaking at labs, or people's propensity to join labs. But I think the next biggest bias is more like "feeling important/in-control," than "having wealth."

I view this all pretty cynically. BUT, not necessarily pessimistically. If IABIED works, then, the main remaining blockers are "having an important/in-control thing to do, which groks some arguments that are more abstract."

You don't have to get rid of people's bias, or defeat them memetically. (Although those are both live options too). You can also steer towards a world where their bias becomes irrelevant.

So, while I do really wanna grab people by the collar and shout:

"Dudes, Dario is one of the most responsible parties for causing the race conditions that Anthropic uses to justify their actions, and he lied or was grossly negligent about whether Anthropic would push the capabilities frontier. If your 'well Plan 2 seems more tractable' attitude doesn't include 'also, our leader was the guy who gave the current paradigm to OpenAI, then left OpenAI, gained early resources via deception/communication-negligence and caused the current race to start in earnest' you have a missing mood and that's fucked."

...I also see part of my goal as trying to help the "real alignment work" technical field reach a point where the stuff-that-needs-doing is paradigmatic enough that you can just point at it, and the action-biased-philosophy-averse lab "safety" people can just say "oh, sure it sounds obvious when you put it like that, why didn't you say that before?"

[-]Towards_Keeperhood2mo23

Thanks for writing that list!

...I also see part of my goal as trying to help the "real alignment work" technical field reach a point where the stuff-that-needs-doing is paradigmatic enough that you can just point at it, and the action-biased-philosophy-averse lab "safety" people can just say "oh, sure it sounds obvious when you put it like that, why didn't you say that before?"

This seems extremely unrealistic to me. Not sure how you imagine that might work.

[-]Raemon2mo20

I am maybe starting from the assumption that sooner-or-later, alignment research would reach this point, and "well, help the alignment research progress as fast possible" seemed like a straightforward goal on the meta-level and is one of the obvious things to be shooting for whether or not it's currently tractable.

I have a current set of projects, but, the meta-level one is "look for ways to systematically improve people's ability to quickly navigate confusing technical problems, and see what works, and stack as many interventions as we can."

(I can drill into that but I'm not sure what level your skepticism was one)

[-]Towards_Keeperhood2mo*10

I have a current set of projects, but, the meta-level one is "look for ways to systematically improve people's ability to quickly navigate confusing technical problems, and see what works, and stack as many interventions as we can."

Yeah so I think I read all the posts about that you wrote in the last 2 years.

I think such techniques and the meta skill of deriving more of them are very important for making good alignment progress, but I still think it takes geniuses. (In fact I sorta tried my shot at the alignment problem over the last 3ish years where I often reviewed and looked at what I could do better, and started developing a more systematic sciency approach for studying human minds. I'm now pivoting to working on Plan 1 though, because there's not enough time left.)

Like currently it takes a some special kind of genius to make useful progress. Maybe we could try to train the smartest young supergeniuses in the techniques we currently have, and maybe they then could make progress much faster than me or Eliezer. Or maybe they still wouldn't be able to judge what is good progress.
If you actually get supergeniuses you could try to train that would obviously be quite useful to try, even though it likely won't get done on time, but if they don't end up running away with their dumb idea for solving alignment without understanding the systems they are dealing with, it would still be great for needing less time after the ban.

(Steven Byrnes' agenda seems to be potentially more scaleable with good methodology, and it has the advantage that progress is relatively less exfohazardry, but would still take very smart people (and relatively long study) to make progress. But it won't get done on time, so you still need international coordination to not build ASI.)

But the way you seemed to motivate your work in your previous comment sounded more like "make current safety researchers do more productive work so we might actually solve alignment without international coordination". Seems very difficult to me, I think they are not even tracking many problems that are actually rather obvious or not getting the difficulties that are easy to get. People somehow often have a very hard time understanding relevant concepts here. E.g. even special geniuses like John Wentworth and Steven Byrnes made bad attempts at attacking corrigibility where they misunderstood the problem (1, 2), although that's somewhat cherry picked and may be fixable. I mean not that such MIRI-like research is likely necessary, but still. Though I'm still curious about how you imagine your project might help here more precisely.

[-]Raemon2mo*30

My overall plan is:

try for as much international coordination as we can get.
- (this is actually my main focus right now, I expect to pivot back towards intellectual-progress stuff once I feel like we've done all the obvious things for waking up the world)
try to make progress as fast as we can on the technical bits
probably, we don't either as much coordination or technical progress as we want, but, like, try our best on both axes and hope it's enough

I'm not sure whether you need literal geniuses, but, I do think you need a lot of raw talent for it to plausibly work. I'm uncertain if, like, say you need be "a level 8 researcher" to make nonzero progress on the core bottlenecks, is it possible to upgrade a level 6 or 7 person to level 8, or are level 8s basically born not made and the only hope is in finding internentions that bump 8s into 9s)

I frequently see people who seem genius-like who nonetheless are kinda metacognitively dumb and seem like there's low hanging fruit there, so I feel like either world is tractable, but, they are pretty different.

[-]Vladimir_Nesov2mo110

If someone thinks ASI will likely go catastrophically poorly if we develop it in something like current race dynamics, they are more likely to work on Plan 1.

If someone thinks we are likely to make ASI go well if we just put in a little safety effort, or thinks it's at least easier than getting strong international slowdown, they are more likely to work on Plan 2.

Should depend on neglectedness more than credence. If you think ASI will likely go catastrophically poorly, but nobody is working on putting in a little safety effort in case it doesn't (with such effort), that's worth doing more of then. Credence determines the shape of good allocation of resources, but all major possibilities should be prepared for to some extent.

[-]TsviBT2mo110

But in the end, those plans are not opposed to each other.

I mean, maybe this isn't "opposed" but there is a direct tradeoff where if you're not trying to stop the race, you want to race ahead faster because you're more alignment-pilled. In real life the people saying this are Type 3 of course. Not sure who is supposed to be Type 2 sincerely.

I worry the "Which side are you on?" question will cause Type 2 people to cluster even more closely with Type 3 people instead of alignment pessimists,

Maybe. What are some alternatives? Or rather, how are you supposed to reach Type 3 people?

[-]Raemon2mo40

FYI I have talked to a few people who seem to be pretty sincerely in Type 2.

Also, among people who seem, like, "pretty alignment-pilled" but are working on Plan-2 shaped stuff, I think the relevant biases are less like "status" and more like "they just don't share some core intuitions re: Alignment Is Hard" and "wanting to feel traction / in-control."

[-]TsviBT2mo110

You're describing Type 3 people; "Wanting to feel traction / in-control" is absolutely insincere as "belief in Type 2 plan". I don't claim to understand the Type 3 mindset, I call for people to think about it. "They just don't share some core intuitions re: Alignment Is Hard" is not a respectable position. A real scientist would be open to debate.

[-]Raemon2mo73

I think "want to feel traction/in-control" is more obviously a bias (and people vary in whether they read to me as having this bias.).

I think the attitude of "don't share core intuitions isn't a respectable position" is, well, idk you have that attitude if you want but I don't think it'd going to help you understand or persuade people.

There is no clear line between Type 2 and Type 3 people, it can be true that people both have earnest intellectual positions you find frustrating but it's fundamentally an intellectual disagreement and also they can have biases that you-and-they would both agree would be bad, and the percent of causal impact of the intellectual-positions and biases can range from like 99% to 1% in either direction.

Even among people who do seem to have followed the entire Alignment-Is-Hard arguments and understand them all, a bunch just say "yeah I don't buy that as obvious" to stop that seems obvious to me (or in some cases, which seems obviously like '>50% likely' to me). And they seem sincere to me.

This results in sad conversations where you're like 'but, clearly, I am picking up that you've got biased cope in you!' and they're like 'but, clearly, I can tell that my thinking here is not just cope, I know what my copey-thinking feels like, and the particular argument we're talking about doesn't feel like that. (and both are correct – there was cope but it lay elsewhere).

[-]TsviBT2mo111

I think the attitude of "don't share core intuitions isn't a respectable position" is, well, idk you have that attitude if you want but I don't think it'd going to help you understand or persuade people.

Of course it won't help me persuade people. I absolutely think it will help me understand people, relative to your apparent position. Yes, you have to understand that they are not doing the "have a respectable position" thing. This is important.

There is no clear line between Type 2 and Type 3 people, it can be true that people both have earnest intellectual positions you find frustrating but it's fundamentally an intellectual disagreement and also they can have biases that you-and-they would both agree would be bad, and the percent of causal impact of the intellectual-positions and biases can range from like 99% to 1% in either direction.

(Almost always true and usually not worth saying.)

Maybe I should just check, are you consciously trying to deny a conflict-type stance, and consciously trying to (conflictually) assert the mistake-type stance, as a strategy?

[-]Raemon2mo20

Yes, you have to understand that they are not doing the "have a respectable position" thing.

I think this is false for the particular people I have in mind (who to be clear are filtered for "are willing to talk to me", but, they seem like relatively central members of a significant class of people).

Maybe I'm not sure what you mean by "have a respectable position."

(I think a large chunk of the problem is "the full argument is complicated, people aren't tracking all the pieces." Which is maybe not "intellectually respectable", though IMO understandable, and importantly different from 'biased.' But, when I sit someone down and make sure to lay out all the pieces and make sure they understand each piece and understand how the pieces fit together, we still hit a few pieces where they are like 'yeah I just don't buy that.')

Maybe I should just check, are you consciously trying to deny a conflict-type stance, and consciously trying to (conflictually) assert the mistake-type stance, as a strategy?

I'm saying you seem to be conflict-stanced in a way that is inaccurate to me (i.e. you are making a mistake in your conflict)

I think it's current to be conflict-stanced, but, you need like a good model who/what the enemy is ("sniper mindset"), and the words you're saying sound to me like you don't (in a way that seems more tribally biased than you usually seem to me)

[-]TsviBT2mo20

who to be clear are filtered for "are willing to talk to me"

In particular, they're filtered for not being cruxy for whether AGI capabilities research continues. (ETA: ... which is partly but very much not entirely due to anti-correlation with claiming to be "AI safety".)

Maybe I'm not sure what you mean by "have a respectable position."

I'm not sure either, but for example if a scientist publishes an experiment, and then another scientist with a known track record of understanding things publishes a critique, the first scientist can't respectably dismiss the critique unsubstantially.

you need like a good model who/what the enemy is

Ok good, yes, we agree on what the question/problem is.

(in a way that seems more tribally biased than you usually seem to me)

Not more tribally biased, if anything less. I'm more upset though, because why can't we discuss the conflict landscape? I mean why can't the people in these LW conversations say something like "yeah the lion's share of the relevant power is held by people who don't sincerely hold A or B / Type 1/2"?

[-]Raemon2mo*51

Maybe I'm not sure what you mean by "have a respectable position."
I'm not sure either, but for example if a scientist publishes an experiment, and then another scientist with a known track record of understanding things publishes a critique, the first scientist can't respectably dismiss the critique unsubstantially.

I think:

there isn't consensus on what counts as a good track record of understanding things, or a good critique
- (relatedly, there's disagreement about which epistemic norms are important)
A few points haven't really had a critique that interlocutors consider very substantive or clear, just sort of frustratedly rehashing the same arguments they found unpersuasive the first time.

And for at least some of those points, I'm personally like "my intuitions lean in the other direction as y'all Camp B people, but, I don't feel like I can really confidently stand by it, I don't think the argument has been made very clearly."

Things I have in mind.

On "How Hard is Success?"

"How anti-natural is corrigibility?" (i.e. "sure, I see some arguments for thinking corrigibility might get hard as you dial up capabilities. But, can't we just... not dial up capabilities past that point? It seems like humans understand corrigibility pretty easily when they try, it seems like Claude-et-al currently actually understand corrigibility reasonable well and if focused on training that I don't see why it wouldn't basically work?")
"How likely is FOOM?" (i.e. "if I believed FOOM was very likely, I'd agree we had to be a lot more careful about ramping up capabilities and being scared the next training run would be our last. But, I don't see reason to think FOOM is particularly likely, and I see reasons to think it's not.")
"What capabilities are needed to make a pivotally-scary demo or game-changing coordination tech?" (i.e. you maybe don't need to actually do anything that complicated to radically change how much coordination is possible for a proper controlled takeoff)

On "How Bad is Failure?"

"How nice is AI likely to be?" (i.e. it really only needs to be very slightly nice to give us the solar system, and it seems weird for the niceness to be "zero")
"How likely is whatever ends up being created to have moral value?". (i.e. consciousness is pretty confusing, seems pretty plausible that whatever ends up getting created would at least be a pretty interesting successor species)

For all of those, like, I know the arguments against, but I my own current take is not like >75% on any of these given model uncertainty, and meanwhile, if your probabilities are below 50% on the relevant MIRI-ish argument, you also have to worry about...

...

Other geopolitical concerns and considerations

The longer a pause went on, the more likely it is that things get unstable and something goes wrong
If you think alignment isn't that hard, or that sticking to a safe-but-high power level isn't that hard, you do have to take more seriously the risk of serious misuse.
You might think buy-in for a serious pause or controlled takeoff is basically impossible until we have seriously scary demos, and the "race to build them, then use them to rally world leaders and then burn the lead" plan might seem better than "try to pause now."
The sorts of things necessary for a pause seem way more likely to go badly than well (i.e. it's basically guaranteed to create a molochian bureaucratic hellscape that stifles wide ranging innovation and makes it harder to do anything sensible with AI development)

[-]TsviBT2mo31

Part of what I'm saying is that it's not respectable for someone to both

claim to have substantial reason to think that making non-lethal AGI is tractable, and also
not defend this position in public from strong technical critiques.

It sounds like you're talking about non-experts. Fine, of course a non-expert will be generally less confident about conclusions in the field. I'm saying that there is a camp which is treated as expert in terms of funding, social cachet, regulatory influence, etc., but which is not expert in my sense of having a respectable position.

[-]Raemon2mo40

I mean why can't the people in these LW conversations say something like "yeah the lion's share of the relevant power is held by people who don't sincerely hold A or B / Type 1/2"?

Here are some groups who I think are currently relevant (not in any particular order, without quite knowing where I'm going with this yet)

nVidia (I hadn't realized how huge they were till recently)
Sam Altman in particular
Demis Hassabis in particular
Dario Amodei in particular
OpenAI, Google, Google DeepMind, Anthropic, and maybe by now other leading labs (in slightly less particular than Sam/Demis/Dario)
The cluster of AI industry leaders of whom Sam/Demis/Dario are representative of.
People at labs who are basically AI researchers (who might have ever said the words "I do alignment research" because that's those are the mouthwords the culture at their company said, but, weren't meaningfully involved with safety efforts)
Anthropic safety engineers and similar
Eliezer in particular
cluster including MIRI / Lightcone / relatively highly bought in friends
Oliver Habryka in particular
OpenPhil
Dustin Moskowitz in particular
Jaan Tallin
"Constellation"
"AI Safety" egregore
"The EAgregore"
Future of Life Institute
Maybe Max Tegmark in particular, I'm not sure
Trump
MAGA
Elon Musk

Okay writing that out turned out to be most of the time I felt like spending right now, but, the next questions I have in mind are "who has power, here, over what?", or "what is the 'relevant'" power?

But, roughly:

a) a ton of the above are "fake" in some sense

b) On the worldscale, the OpenPhil/Constellation/Anthropic cluster is relatively weak.

c) within OpenPhil/Constellation/Anthropic, there are people more like Dario, Holden, Jack Clark, and Dustin, and people who are more rank-and-file-EA/AI-ish. I think the latter are fake the way I think you think things are fake. I think the former are differently fake from the way I think you think things are fake.

d) there are ton of vague EA/AI-safety people that I think are fake in the way you think they are fake but they don't really matter except for The Median Researcher Problem

[-]TsviBT2mo20

Nice, thank you! In a much more gray area of [has some non-trivial identity as "AI safety person"], I assume there'd be lots of people, some with relevant power. This would include some heads of research at big companies. Maybe you meant that by "People at labs who are basically AI researchers", but I mean, they would be less coded as "AI safety" but still would e.g.

Pay lip service to safety;
Maybe even bring it up sometimes internally;
Happily manufacture what boils down to lies for funders regarding technical safety;
Internally think of themselves as doing something good and safe for humanity;
Internally think of themselves as being reasonable and responsible regarding AI.

Further, this would include many AI researchers in academia. People around Silicon Valley / LW / etc. tend to discount academia, but I don't.

[-]Towards_Keeperhood2mo10

I mean, maybe this isn't "opposed" but there is a direct tradeoff where if you're not trying to stop the race, you want to race ahead faster because you're more alignment-pilled.

You could still try to solve alignment and support (e.g. financially) others who are trying to stop the race.

Or rather, how are you supposed to reach Type 3 people?

I assume you mean Type 2 people. Like, they could sign the superintelligence statement.

[-]TsviBT2mo20

No I mean Type 3 people.

[-]Towards_Keeperhood2mo10

I'm not trying to reach Type 3 people. I'm trying to not alienate Type 2 people from supporting Plan 1.

[-]TsviBT2mo20

[-]Wei Dai2mo50

But in the end, those plans are not opposed to each other.

I think they are somewhat opposed, due to signaling effects: If you're working on Plan 2 only, then that signals to the general public or non-experts that you think the risks are manageable/acceptable. If a lot of people are working on Plan 2, that gives ammunition to the people who want to race or don't want to pause/stop to say "Look at all these AI safety experts working on solving AI safety. If the risks are really as high as the Plan 1 people say, wouldn't they be calling for a pause/stop too instead of working on technical problems?"

^{^}

Yes, Plan 1 is underspecified here. It could be banning AI research but with an AI safety project exit plan, or without one. If you think some versions would be good and some worse than nothing, then simply taking a rough support/oppose position here is not for you.

^{^}

To be clear, if sth is AI policy that does not imply that it counts as Plan 1. There's a lot of AI policy effort towards policies that essentially don't help at all with xrisk. I'm not sure whether there is so little effective funding here because Openphil doesn't want Plan 1, or because they just felt intuitively uncomfortable with going past the overton window, or sth else.

^{^}

I sympathize with multiple reasons to not want to sign. I think in the end it usually makes more sense to sign anyway, but whatever.

^{^}

model syncing = Aiming to more fully understand each other's person world models and identifying cruxes. Not aimed to persuade.

^{^}

Info about me in case you're interested: I have relatively good knowledge of e.g. Paul Christiano's arguments from when I deeply studied that 3 years ago, but still not really a good understanding of how that adds up to so low probabilities of doom given that we're in a race where there won't be much slack for safety. I'm most interested in better understanding that part, but I'm also open to try to share my models for people who are interested. I'm totally willing to study more of the most important readings for better understanding the optimists side. I mostly did non-prosaic AI alignment research for the last 3.5 years, although I'm now pivoting to working towards Plan 1.

^{^}

although that one already seems like a hard sell to me and I'd be fine with discussing that too.

LESSWRONG
LW

LESSWRONG
LW

50

50

50