I read through Eliezer’s “AGI Ruin: A List of Lethalities” post, and wrote down my reactions* as I went.

I tried to track my internal responses, *before* adjusting for my perception of the overall thoughts of the field. In order to get this written and posted, I aimed to write the responses roughly how I write tweets, with minimal editing and aiming for short responses. I’m also not an alignment researcher, so I expect this to mostly be useful to people as encouragement to try similar things themselves, rather than to directly add anything to the conversation. If you’re interested in reactions from people who actually know what they’re talking about, there’s this post from Paul Christiano detailing areas of agreement and disagreement, and a point-by-point list of responses from some DeepMind alignment researchers here.

After drafting my responses I used the DeepMind post as a template to format them, as it seemed like a reasonable way of laying stuff out. Please note that I’m responding to Eliezer’s phrasing throughout though, not to the DeepMind summary (which I don’t always exactly endorse, but still seems more useful to include than not). Here’s a blank template I made if you want to try writing down your own responses.

*I’d seen an early draft of the post a couple of months earlier and written responses, so this isn’t quite a ‘first look’ set of reactions, but because that draft had the points in a different order, I decided to start from scratch rather than try to edit all of those into the correct places.

Section A (shorthand: "strategic challenges")

#1. Human level is nothing special / data efficiency

Summary: AGI will not be upper-bounded by human ability or human learning speed (similarly to AlphaGo). Things much smarter than human would be able to learn from less evidence than humans require.

I basically agree, though with respect to: “Things much smarter than human would be able to learn from less evidence than humans require”, it’s not at all clear to me that huge improvements in sample efficiency arrive before the kind of ‘smarter’ that I care about in terms of risk. This isn’t reassuring, but I think it might be an actual disagreement.

#2. Unaligned superintelligence could easily take over

Summary: A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.

I agree, and think that ‘sufficiently high’ is a lower bar than many others seem to think it is (I think as smart as me is more than sufficient). I’m less sure than Eliezer that the first attempt we see is by a sufficiently capable system, though still above 50%.

#3. Can't iterate on dangerous domains

Summary: At some point there will be a 'first critical try' at operating at a 'dangerous' level of intelligence, and on this 'first critical try', we need to get alignment right.

I agree on the implied definition of critical, and again directionally agree (as in I’m above 50%) with the rest but at lower confidence. Will use AFLC (above 50, lower confidence) to abbreviate this from now on.

#4. Can't cooperate to avoid AGI

Summary: The world can't just decide not to build AGI.

Yep, absent a catastrophe that’s sufficiently bad to terrify everyone but is still somehow recoverable (which I’m confident rules out all catastrophes in practice).

#5. Narrow AI is insufficient

Summary: We can't just build a very weak system.

Yep, in particular I think e.g. microscope AI is covered by ‘very weak’ here.

#6. Pivotal act is necessary

Summary: We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.

I’m much less confident than Eliezer that all pivotal acts have this ‘discrete event, miles outside the overton window, very unilateralist’ vibe, but…

#7. There are no weak pivotal acts because a pivotal act requires power

Summary: It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness.

I agree that the set of things you can do with a system that is weak enough to be safe by default, but will make the world safe, is ~empty.

#8. Capabilities generalize out of desired scope

Summary: The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve.

Yep.

#9. A pivotal act is a dangerous regime

Summary: The builders of a safe system would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that.

This is just a restatement of 7.

Section B.1: The distributional leap

#10. Large distributional shift to dangerous domains

Summary: On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.

Yes, working out how to get systems to generalise safely seems important and difficult, and I’m glad people are working on it (and would like there to be more of them).

#11. Sim to real is hard

Summary: There's no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world.

AFLC. My uncertainty here comes from how helpful narrow systems will be for alignment research specifically. It is trivially the case that sub-AGI systems accelerate alignment research *at all*; I predict that they will not accelerate it enough, but not with 100% confidence.

#12. High intelligence is a large shift

Summary: Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level.

Agree with the literal statement but weakly disagree with the (I think) implied discontinuity.

#13. Some problems only occur above an intelligence threshold

Summary: Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.

Agree with the literal statement but solving alignment doesn’t look like ‘anticipate all the problems in advance and patch them individually’. I guess maybe this is just responding to that, which possibly isn’t a strawman because some alignment proposals are pretty dumb.

#14. Some problems only occur in dangerous domains

Summary: Some problems seem like their natural order of appearance could be that they first appear only in fully dangerous domains.

I have some thoughts on ways of making this not the case which I intend to write up but will be careful about sharing as there are some infohazards. AFLC without the things I have in mind happening.

#15. Capability gains from intelligence are correlated

Summary: Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously.

How fast/discontinuous takeoff will be is a huge crux between different people in the field, and if I’m going to write about it, it needs a whole post, but I disagree with my model of Eliezer. All sensible views seem fast by normal person standards though.

Section B.2: Central difficulties of outer and inner alignment.

#16. Inner misalignment

Summary: Outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.

#17. Can't control inner properties

Summary: On the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.

Yep, though ‘current’ is doing some of the work. I think we could get here with sufficient interpretability progress, and it’s important that we do.

#18. No ground truth

Summary: There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned'.

Yes, though I suspect I disagree with a couple of downstream things implied by the surrounding rant.

#19. Pointers problem

Summary: There is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.

I think this is the most interesting point so far, have been thinking a lot about it recently, and am pretty uncertain about what happens by default, though agree that we don’t currently have ways of affecting the default.

#20. Flawed human feedback

Summary: Human raters make systematic errors - regular, compactly describable, predictable errors.

Clearly true for naive learning from human feedback. I don’t think literally all attempts to fix the issues here are futile, though I’m not optimistic about many of them at present.

#21. Capabilities go further

Summary: Capabilities generalize further than alignment once capabilities start to generalize far.

Yep.

#22. No simple alignment core

Summary: There is a simple core of general intelligence but there is no analogous simple core of alignment.

Agree that a relatively simple structure is sufficient for advanced consequentialist reasoning. AFLC that there’s no easily generalising ‘core’ of alignment, fully agree we don’t find it by default.

#23. Corrigibility is anti-natural.

Summary: Corrigibility is anti-natural to consequentialist reasoning.

Yep as a statement about consequentialists, AFLC that we are inevitably going to get a pure consequentialist.

#24. Sovereign vs corrigibility

Summary: There are two fundamentally different approaches you can potentially take to alignment [a sovereign optimizing CEV or a corrigible agent], which are unsolvable for two different sets of reasons. Therefore by ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.

I think pointing out this pattern when it appears is really important and useful, but don’t think I’ve seen it appear that much. I basically fully agree with objection 1, but am much less confident about 2, partly due to the reservation I mention in 23.

Section B.3: Central difficulties of sufficiently good and useful transparency / interpretability.

#25. Real interpretability is out of reach

Summary: We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.

Yep, good job people are working on improving this situation.

#26. Interpretability is insufficient

Summary: Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system that isn't planning to kill us.

While the central claim that interpretability is insufficient is clearly true, I think this is the worst comment made so far, by some margin:
- Dying with more dignity is good, actually.
- I don’t think anyone sensible has an interpretability plan of ‘build thing, look at it with cool interpretability tech, notice it’s misaligned, wait patiently for Facebook to build something else’. Like come on, this isn’t even strong enough to be a strawman. The strawman is ‘build it, look, re-roll training if it looks bad’. (This also doesn’t work, and also isn’t the plan at least among smart people.)
- Again, I don’t think the plan is ‘build it then look LOL’, but if we did have proof that a medium-strength system was planning to kill us if deployed, I no longer think it’s ridiculous to get co-ordination on stopping capabilities research (which I do think is ridiculous in the current world).

(This comment doesn’t make much sense without looking at the original phrasing)

#27. Selecting for undetectability

Summary: Optimizing against an interpreted thought optimizes against interpretability.

Important and I basically agree (but not with the implication that there will never be a solution to this problem which avoids the completely inevitable emergence of a fully deceptive agent).

#28. Large option space

Summary: A powerful AI searches parts of the option space we don't, and we can't foresee all its options.

Agree but I don’t think the good plans are ‘foresee all options’, see my comment on 13.

#29. Real world is an opaque domain

Summary: AGI outputs go through a huge opaque domain before they have their real consequences, so we cannot evaluate consequences based on outputs.

Similar comment to 28

#30. Powerful vs understandable

Summary: No humanly checkable output is powerful enough to save the world.

Agree modulo previous comments about pivotal acts and whether anyone has this kind of plan (which might mean this rounds to disagree).

#31. Hidden deception

Summary: You can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about.

Cf. 28,29

#32. Language is insufficient or unsafe

Summary: Imitating human text can only be powerful enough if it spawns an inner non-imitative intelligence.

I don’t really see what this is responding to? Do some people think that training on human words will make a system think in a human style? I do put some weight on the idea that systems trained on tonnes of human-generated data will at some point compress it by learning to simulate humans, but this is clearly not an alignment strategy.

#33. Alien concepts

Summary: The AI does not think like you do, it is utterly alien on a staggering scale.

This feels like an extremely important, extremely open question for me. If something like John’s natural abstractions hypothesis is true then there’s an important sense in which this is at least partially false. I also think that even if abstractions aren’t quite ‘natural’ in the sense that something learning from observations of an entirely alien planet would arrive at the same ones, this doesn’t guarantee that there’s no advantage to having levels of compression which are roughly the same as humans if you’re observing a world that has mostly been shaped by humans and still contains a bunch of them.

Section B.4: Miscellaneous unworkable schemes.

#34. Multipolar collusion

Summary: Humans cannot participate in coordination schemes between superintelligences.

It’s not super clear to me what this take is responding to, but I don’t think I agree as-written? I think the sense of ‘participate’ in which a toddler ‘participates’ in a conversation between them and 20 kind adults is meaningful - the adults care about the preferences of the toddler and listen to them, including accommodating them when they don’t cause large inconveniences, so conditional on having solved single-AGI alignment I don’t think we lose by default. If the idea EY’s responding to is literally just ‘maybe multipolar take-off will be fine because humanity will be one voice in a many-voiced conversation’ then yeah that’s laughably stupid.

#35. Multi-agent is single-agent

Summary: Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other.

That this is possible has got to be one of the strangest things (at least among normal people) that I actually believe. I have no idea how likely this is to happen by default, and don’t have a good sense at all of whether “sufficiently intelligent” comes before the kind of capability which is genuinely useful for helping us make progress on other parts of the problem.

#36. Human flaws make containment difficult

Summary: Only relatively weak AGIs can be contained; the human operators are not secure systems.

I don’t think you need arguments this complicated to explain why sandboxing won’t work. People will just let the fucking thing out of the box for no good reason because people are idiots. To be clear I agree with the argument, I just don’t think it’s needed.

Section C (shorthand: "civilizational inadequacy")

I don’t think it’s going to be productive to write out my point-by-point responses to this section. Most of the top comments on the original post are discussing parts of it, and most of the things I think are somewhere in there.

Additional thoughts after writing:

Below are some half-formed thoughts which didn’t make sense as responses to specific points, and were already on my longlist to write up into full posts but may as well get the same very brief treatment as my reactions to individual points, especially as I was reminded of them as I read.

There’s a difference between interventions which target a training process and those which target a finished model. Most of the objections here, especially in the interpretability section, felt like responses to plans that involved intervening on a finished model. Intervening on a training process is something we’re miles away from being good enough at, but beating SGD feels different to beating a superintelligent adversary.
I frequently end up in situations where my assessment of some situation/evidence/set of arguments means that I have more than 50% credence in some conclusion, but very much less than 100. When I try to engage with people making these arguments, I typically hear a bunch of reasons I should be above 50, none of which make me update further. My model of Eliezer responds to this complaint with something like ‘if you were smarter you’d realise those arguments were enough’. Responding to a snarky model with an equal amount of snark seems inappropriate, so I’ll stop here.
I would have liked to see another argument of the form:

by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.

Several stories of success go through a more complicated version of “align something kind-of superhuman, then use that to help us with the next part of the alignment problem”. My objection to many such stories is roughly the quoted sentence above, and I would like to see more high quality discussion of them. Hoping that an alignment strategy which scales past human but fails at some point afterwards is useful enough to align something which helps overcome the way the first strategy fails feels fragile at best.

I think points like 5 and 7 respond to strawman versions of the above (where the weaker system is just safe by default), but would much rather see responses to the stronger case. I find it extremely implausible that alignment strategies which fail at strategic awareness align things which allow us to solve problems associated with strategic awareness, but it’s also obviously the case that you can build tools which help humans do better intellectual work, and all current alignment researchers are (to my knowledge) human.
