Defusing AGI Danger

I think your conclusion section is really important, because it prevents a possible misinterpretation of your post.

One can imagine a spectrum with “disaster by default” on one side and “alignment by default” on the other. To the extent that one is closer to “disaster by default”, trying to defuse specific arguments for AGI danger seems like it's missing the forest for the trees, analogous to trying to improve computer security by not allowing users to use “password” as their password. To the extent that one is closer to “alignment by default”, trying to defuse specific arguments seems quite useful, closer to conducting a fault analysis on a hypothetical airplane crash.

Since I'm much closer to the "disaster by default" end of the spectrum, I think most of our effort should focus on the safety stories approach rather than the defusing dangers approach. And I think you haven't presented any arguments for safety by default; you've just explained what we should do if we believe in safety by default. So it would be a misinterpretation of your post to think that it argues for the defusing dangers strategy to take priority over the safety stories strategy.

Instead (and this is how I interpret your post) both strategies should be pursued no matter where on the spectrum you are, but to different extents. E.g. if you are in the middle, you split effort 50-50 between strategies, and if you are towards the "alignment by default" end, you split effort 80-20 in favor of defusing dangers, etc. This seems quite plausible to me.

Mark Xu (5y)

I absolutely agree that I'm not arguing for "safety by default".

I don't quite agree that you should split effort between strategies, i.e. it seems likely that if you think 80% disaster by default, you should dedicate 100% of your efforts to that world.

Daniel Kokotajlo (5y)

OK, interesting. Well, here's my argument for effort-splitting then: There are probably diminishing returns to pursuing each strategy. In research in general, ideas and questions tend to cross-pollinate, etc. And if you are 20% confident that research project X is the most important, and 80% that research project Y is most important, and they are both on a similar topic, this seems like a classic case where you should do both (but with more effort towards Y).

This is more of an intuition than an argument, I guess. But what do you think?

Mark Xu (5y)

My opposite intuition is suggested by the fact that if you're trying to correctly guess a series of random digits that are 80% "1" and 20% "0", then you should always guess "1".
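The arithmetic behind this intuition can be checked with a minimal sketch. It assumes the digits are independent and compares always guessing "1" against probability matching (guessing "1" 80% of the time). The function name `expected_accuracy` is made up for illustration.

```python
def expected_accuracy(p_one: float, guess_one_rate: float) -> float:
    """Chance of a correct guess when each digit is "1" with probability
    p_one and we independently guess "1" with probability guess_one_rate."""
    return p_one * guess_one_rate + (1 - p_one) * (1 - guess_one_rate)


always_one = expected_accuracy(0.8, 1.0)  # always guess "1"
matching = expected_accuracy(0.8, 0.8)    # guess "1" 80% of the time

print(f"always guess 1: {always_one:.2f}")
print(f"probability matching: {matching:.2f}")
```

Accuracy here is linear in the guess rate, so the best strategy sits at an extreme. That is the case with no diminishing returns at all, which is why it favors going all-in.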

I don't quite know how to model cross-pollination and diminishing returns. I think working on both for the information value is likely going to be very good. It seems hard to imagine a scenario where you're robustly confident that one project is 80% better taking diminishing returns into account without being able to create a 3rd project with the best features of both, but if you're in that scenario I think just spending all your efforts on the 80% project seems correct.

One example is deciding between 2 fundamentally different products your startup could be making. Suppose also that creating an MVP of either product that would provide information would take a really long time. In this situation, if you suspect one of them is 60% likely to be better than the other, it would be less useful to split your time 60/40 than to build the MVP of the one likely to be better and reevaluate after getting more information.

The version of your claim that I agree with is "In your current epistemic state, you should spend all your time pursuing the 80% project. But the 80% probably isn't that robust, working on a project has diminishing returns, and other projects will give more information value, so globally the amount of time you expect to spend on the 80% project is about 80%."

Daniel Kokotajlo (5y)

Here's a way to model diminishing returns: The first hour of research on strategy X produces as much value as the next two hours, which produce as much value as the next four hours, etc. Value = log_2(hours). If this is true, then you should split your hours such that log_2(hours_toward_80%_project)*0.8 + log_2(hours_toward_20%_project)*0.2 is maximized, which I think means that you should distribute your hours across projects proportional to their probability... https://www.wolframalpha.com/input/?i=argmax%28log_2%28X%29*0.8+%2B+log_2%281-X%29*0.2%29 (I don't know much math so I'm not confident I'm doing this right)
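The maximization can be checked directly. As a sketch, assume the total budget is normalized to 1 and x is the fraction spent on the 80% project. Setting the derivative of 0.8*log_2(x) + 0.2*log_2(1-x) to zero gives 0.8/x = 0.2/(1-x), so x = 0.8. The grid search below agrees; `expected_value` and `best_split` are illustrative names, not from any library.

```python
import math


def expected_value(x: float, p: float = 0.8) -> float:
    """Expected value of putting fraction x of the hours into the project
    that is best with probability p, and 1 - x into the other project."""
    return p * math.log2(x) + (1 - p) * math.log2(1 - x)


def best_split(p: float = 0.8, steps: int = 1000) -> float:
    """Grid-search the fraction x in (0, 1) that maximizes expected_value."""
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda x: expected_value(x, p))


print(best_split(0.8))
print(best_split(0.6))
```

Under this particular value model the proportional split is optimal. With returns that diminish more slowly than a logarithm, the optimum would shift toward the more likely project.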

Value of information I hadn't even considered, but maybe we can bundle it up with diminishing returns and say it's part of the reason returns diminish.

Daniel Kokotajlo (5y)

Huh. It seems like there is some general theorem here that might be worth writing up. If we combine the heavy-tailed hypothesis with this theorem, maybe we get some sort of nontrivial and useful general heuristic: The optimal allocation of time/money/etc. is proportional to the probability that a project is the most valuable thing you can be doing. That is, take the options you are considering, and evaluate the probability that each option is the best of the bunch. Then, distribute your resources according to that probability. This will be optimal or approximately optimal so long as (1) returns to resources diminish logarithmically for each project at about the same rate, and (2) the best project is likely to be several times better than the next-best and so on (heavy-tailed distribution of project goodness). I think 2 is usually true for altruistic projects, and insofar as 1 is false, maybe it doesn't matter because we are ignorant of which project diminishes faster, or maybe we do know which project diminishes faster and we can adjust accordingly (it should just be another multiplier to the ratio when dividing up resources, I think). I expect someone has said all this before somewhere...
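The heuristic can be sketched for n projects under one simplifying assumption: project i is worth a_i * log(x_i) if it turns out to be the best one, where x_i is its share of a fixed budget and a_i is a hypothetical rate multiplier. Maximizing the sum of p_i * a_i * log(x_i) subject to the shares summing to 1 gives x_i proportional to p_i * a_i (a Lagrange multiplier argument). The helper names below are illustrative only.

```python
import math
import random


def allocate(probs, multipliers=None):
    """Shares proportional to p_i * a_i (a_i defaults to 1 for every project)."""
    if multipliers is None:
        multipliers = [1.0] * len(probs)
    weights = [p * a for p, a in zip(probs, multipliers)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_value(shares, probs, multipliers=None):
    """Sum of p_i * a_i * log(x_i) under the assumed log-returns model."""
    if multipliers is None:
        multipliers = [1.0] * len(probs)
    return sum(p * a * math.log(x) for p, a, x in zip(probs, multipliers, shares))


probs = [0.5, 0.3, 0.2]
shares = allocate(probs)
best = expected_value(shares, probs)

# Sanity check: random alternative splits should never beat the proportional one.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in probs]
    alt = [r / sum(raw) for r in raw]
    assert expected_value(alt, probs) <= best + 1e-12

print(shares)
```

With equal multipliers the shares equal the probabilities. Passing `multipliers` rescales the ratio, matching the "another multiplier" adjustment suggested above for projects whose returns scale differently.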

Steven Byrnes (5y)

That's fair, I should properly write out the brain-like AGI danger scenario(s) that have been in my head, one of these days. :-)

TurnTrout (5y)

I like this strategy a lot.

Also, there's a lonely sentence missing a completion:

One could think of this work as trying to create the pieces that will enable strong arguments for safety instead of trying to

Mark Xu (5y)

Thanks! Also, oops - fixed.

For the interested, this is a good example of backchaining applied to AI safety. ↩︎
Technically, we want to expand the parts of the argument such that we think additional labor can most shift it from being “true” to “false”. Just expanding things that might be false seems like a good proxy. ↩︎
See The Rocket Alignment Problem for an example of such an argument. ↩︎
Rohin Shah puts about 30% on “the first thing we try just works and we don’t even need to solve any sort of alignment problem” in AI Alignment Podcast: An Overview of Technical AI Alignment in 2018 and 2019 with Buck Shlegeris and Rohin Shah. ↩︎
