# Ω 128

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

### Preamble:

(If you're already familiar with all basics and don't want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)

I have several times failed to write up a well-organized list of reasons why AGI will kill you.  People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first.  Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.

Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants.  I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.

Three points about the general subject matter of discussion here, numbered so as not to conflict with the list of lethalities:

-3.  I'm assuming you are already familiar with some basics, and already know what 'orthogonality' and 'instrumental convergence' are and why they're true.  People occasionally claim to me that I need to stop fighting old wars here, because, those people claim to me, those wars have already been won within the important-according-to-them parts of the current audience.  I suppose it's at least true that none of the current major EA funders seem to be visibly in denial about orthogonality or instrumental convergence as such; so, fine.  If you don't know what 'orthogonality' or 'instrumental convergence' are, or don't see for yourself why they're true, you need a different introduction than this one.

-2.  When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of 'provable' alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone.  When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, "please don't disassemble literally everyone with probability roughly 1" is an overly large ask that we are not on course to get.  So far as I'm concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent change of killing more than one billion people, I'll take it.  Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as "less than roughly certain to kill everybody", then you can probably get down to under a 5% chance with only slightly more effort.  Practically all of the difficulty is in getting to "less than certainty of killing literally everyone".  Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment.  At this point, I no longer care how it works, I don't care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI 'this will not kill literally everyone'.  Anybody telling you I'm asking for stricter 'alignment' than this has failed at reading comprehension.  The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors.

-1.  None of this is about anything being impossible in principle.  The metaphor I usually use is that if a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months.  For people schooled in machine learning, I use as my metaphor the difference between ReLU activations and sigmoid activations.  Sigmoid activations are complicated and fragile, and do a terrible job of transmitting gradients through many layers; ReLUs are incredibly simple (for the unfamiliar, the activation function is literally max(x, 0)) and work much better.  Most neural networks for the first decades of the field used sigmoids; the idea of ReLUs wasn't discovered, validated, and popularized until decades later.  What's lethal is that we do not have the Textbook From The Future telling us all the simple solutions that actually in real life just work and are robust; we're going to be doing everything with metaphorical sigmoids on the first critical try.  No difficulty discussed here about AGI alignment is claimed by me to be impossible - to merely human science and engineering, let alone in principle - if we had 100 years to solve it using unlimited retries, the way that science usually has an unbounded time budget and unlimited retries.  This list of lethalities is about things we are not on course to solve in practice in time on the first critical try; none of it is meant to make a much stronger claim about things that are impossible in principle.

That said:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable on anything remotely remotely resembling the current pathway, or any other pathway we can easily jump to.

### Section A:

This is a very lethal problem, it has to be solved one way or another, it has to be solved at a minimum strength and difficulty level instead of various easier modes that some dream about, we do not have any visible option of 'everyone' retreating to only solve safe weak problems instead, and failing on the first really dangerous try is fatal.

1.  Alpha Zero blew past all accumulated human knowledge about Go after a day or so of self-play, with no reliance on human playbooks or sample games.  Anyone relying on "well, it'll get up to human capability at Go, but then have a hard time getting past that because it won't be able to learn from humans any more" would have relied on vacuum.  AGI will not be upper-bounded by human ability or human learning speed.  Things much smarter than human would be able to learn from less evidence than humans require to have ideas driven into their brains; there are theoretical upper bounds here, but those upper bounds seem very high. (Eg, each bit of information that couldn't already be fully predicted can eliminate at most half the probability mass of all hypotheses under consideration.)  It is not naturally (by default, barring intervention) the case that everything takes place on a timescale that makes it easy for us to react.

2.  A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure.  The concrete example I usually use here is nanotech, because there's been pretty detailed analysis of what definitely look like physically attainable lower bounds on what should be possible with nanotech, and those lower bounds are sufficient to carry the point.  My lower-bound model of "how a sufficiently powerful intelligence would kill everyone, if it didn't want to not do that" is that it gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they're dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery.  (Back when I was first deploying this visualization, the wise-sounding critics said "Ah, but how do you know even a superintelligence could solve the protein folding problem, if it didn't already have planet-sized supercomputers?" but one hears less of this after the advent of AlphaFold 2, for some odd reason.)  The nanomachinery builds diamondoid bacteria, that replicate with solar power and atmospheric CHON, maybe aggregate into some miniature rockets or jets so they can ride the jetstream to spread across the Earth's atmosphere, get into human bloodstreams and hide, strike on a timer.  Losing a conflict with a high-powered cognitive system looks at least as deadly as "everybody on the face of the Earth suddenly falls over dead within the same second".  (I am using awkward constructions like 'high cognitive power' because standard English terms like 'smart' or 'intelligent' appear to me to function largely as status synonyms.  'Superintelligence' sounds to most people like 'something above the top of the status hierarchy that went to double college', and they don't understand why that would be all that dangerous?  Earthlings have no word and indeed no standard native concept that means 'actually useful cognitive power'.  A large amount of failure to panic sufficiently, seems to me to stem from a lack of appreciation for the incredible potential lethality of this thing that Earthlings as a culture have not named.)

3.  We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.  This includes, for example: (a) something smart enough to build a nanosystem which has been explicitly authorized to build a nanosystem; or (b) something smart enough to build a nanosystem and also smart enough to gain unauthorized access to the Internet and pay a human to put together the ingredients for a nanosystem; or (c) something smart enough to get unauthorized access to the Internet and build something smarter than itself on the number of machines it can hack; or (d) something smart enough to treat humans as manipulable machinery and which has any authorized or unauthorized two-way causal channel with humans; or (e) something smart enough to improve itself enough to do (b) or (d); etcetera.  We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors.  This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try.  If we had unlimited retries - if every time an AGI destroyed all the galaxies we got to go back in time four years and try again - we would in a hundred years figure out which bright ideas actually worked.  Human beings can figure out pretty difficult things over time, when they get lots of tries; when a failed guess kills literally everyone, that is harder.  That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is 'key' and will kill us if we get it wrong.  (One remarks that most people are so absolutely and flatly unprepared by their 'scientific' educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that right on the first critical try.)

4.  We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world.  The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world.  Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth.  The current state of this cooperation to have every big actor refrain from doing the stupid thing, is that at present some large actors with a lot of researchers and computing power are led by people who vocally disdain all talk of AGI safety (eg Facebook AI Research).  Note that needing to solve AGI alignment only within a time limit, but with unlimited safe retries for rapid experimentation on the full-powered system; or only on the first critical try, but with an unlimited time bound; would both be terrifically humanity-threatening challenges by historical standards individually.

5.  We can't just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so.  I've also in the past called this the 'safe-but-useless' tradeoff, or 'safe-vs-useful'.  People keep on going "why don't we only use AIs to do X, that seems safe" and the answer is almost always either "doing X in fact takes very powerful cognition that is not passively safe" or, even more commonly, "because restricting yourself to doing X will not prevent Facebook AI Research from destroying the world six months later".  If all you need is an object that doesn't do dangerous things, you could try a sponge; a sponge is very passively safe.  Building a sponge, however, does not prevent Facebook AI Research from destroying the world six months later when they catch up to the leading actor.

6.  We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.  While the number of actors with AGI is few or one, they must execute some "pivotal act", strong enough to flip the gameboard, using an AGI powerful enough to do that.  It's not enough to be able to align a weak system - we need to align a system that can do some single very large thing.  The example I usually give is "burn all GPUs".  This is not what I think you'd actually want to do with a powerful AGI - the nanomachines would need to operate in an incredibly complicated open environment to hunt down all the GPUs, and that would be needlessly difficult to align.  However, all known pivotal acts are currently outside the Overton Window, and I expect them to stay there.  So I picked an example where if anybody says "how dare you propose burning all GPUs?" I can say "Oh, well, I don't actually advocate doing that; it's just a mild overestimate for the rough power level of what you'd have to do, and the rough level of machine cognition required to do that, in order to prevent somebody else from destroying the world in six months or three years."  (If it wasn't a mild overestimate, then 'burn all GPUs' would actually be the minimal pivotal task and hence correct answer, and I wouldn't be able to give that denial.)  Many clever-sounding proposals for alignment fall apart as soon as you ask "How could you use this to align a system that you could use to shut down all the GPUs in the world?" because it's then clear that the system can't do something that powerful, or, if it can do that, the system wouldn't be easy to align.  A GPU-burner is also a system powerful enough to, and purportedly authorized to, build nanotechnology, so it requires operating in a dangerous domain at a dangerous level of intelligence and capability; and this goes along with any non-fantasy attempt to name a way an AGI could change the world such that a half-dozen other would-be AGI-builders won't destroy the world 6 months later.

7.  The reason why nobody in this community has successfully named a 'pivotal weak act' where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later - and yet also we can't just go do that right now and need to wait on AI - is that nothing like that exists.  There's no reason why it should exist.  There is not some elaborate clever reason why it exists but nobody can see it.  It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness.  If you can't solve the problem right now (which you can't, because you're opposed to other actors who don't want to be solved and those actors are on roughly the same level as you) then you are resorting to some cognitive system that can do things you could not figure out how to do yourself, that you were not close to figuring out because you are not close to being able to, for example, burn all GPUs.  Burning all GPUs would actually stop Facebook AI Research from destroying the world six months later; weaksauce Overton-abiding stuff about 'improving public epistemology by setting GPT-4 loose on Twitter to provide scientifically literate arguments about everything' will be cool but will not actually prevent Facebook AI Research from destroying the world six months later, or some eager open-source collaborative from destroying the world a year later if you manage to stop FAIR specifically.  There are no pivotal weak acts.

8.  The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve; you can't build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.

9.  The builders of a safe system, by hypothesis on such a thing being possible, would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that.  Running AGIs doing something pivotal are not passively safe, they're the equivalent of nuclear cores that require actively maintained design properties to not go supercritical and melt down.

### Section B:

Okay, but as we all know, modern machine learning is like a genie where you just give it a wish, right?  Expressed as some mysterious thing called a 'loss function', but which is basically just equivalent to an English wish phrasing, right?  And then if you pour in enough computing power you get your wish, right?  So why not train a giant stack of transformer layers on a dataset of agents doing nice things and not bad things, throw in the word 'corrigibility' somewhere, crank up that computing power, and get out an aligned AGI?

Section B.1:  The distributional leap.

10.  You can't train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning.  On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.  (Some generalization of this seems like it would have to be true even outside that paradigm; you wouldn't be working on a live unaligned superintelligence to align it.)  This alone is a point that is sufficient to kill a lot of naive proposals from people who never did or could concretely sketch out any specific scenario of what training they'd do, in order to align what output - which is why, of course, they never concretely sketch anything like that.  Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you.  This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm.  Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you're starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat.  (Note that anything substantially smarter than you poses a threat given any realistic level of capability.  Eg, "being able to produce outputs that humans look at" is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.)

11.  If cognitive machinery doesn't generalize far out of the distribution where you did tons of training, it can't solve problems on the order of 'build nanotechnology' where it would be too expensive to run a million training runs of failing to build nanotechnology.  There is no pivotal act this weak; there's no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world and prevent the next AGI project up from destroying the world two years later.  Pivotal weak acts like this aren't known, and not for want of people looking for them.  So, again, you end up needing alignment to generalize way out of the training distribution - not just because the training environment needs to be safe, but because the training environment probably also needs to be cheaper than evaluating some real-world domain in which the AGI needs to do some huge act.  You don't get 1000 failed tries at burning all GPUs - because people will notice, even leaving out the consequences of capabilities success and alignment failure.

12.  Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes.  Problems that materialize at high intelligence and danger levels may fail to show up at safe lower levels of intelligence, or may recur after being suppressed by a first patch.

13.  Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.  Consider the internal behavior 'change your outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over you'.  This problem is one that will appear at the superintelligent level; if, being otherwise ignorant, we guess that it is among the median such problems in terms of how early it naturally appears in earlier systems, then around half of the alignment problems of superintelligence will first naturally materialize after that one first starts to appear.  Given correct foresight of which problems will naturally materialize later, one could try to deliberately materialize such problems earlier, and get in some observations of them.  This helps to the extent (a) that we actually correctly forecast all of the problems that will appear later, or some superset of those; (b) that we succeed in preemptively materializing a superset of problems that will appear later; and (c) that we can actually solve, in the earlier laboratory that is out-of-distribution for us relative to the real problems, those alignment problems that would be lethal if we mishandle them when they materialize later.  Anticipating all of the really dangerous ones, and then successfully materializing them, in the correct form for early solutions to generalize over to later solutions, sounds possibly kinda hard.

14.  Some problems, like 'the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment', seem like their natural order of appearance could be that they first appear only in fully dangerous domains.  Really actually having a clear option to brain-level-persuade the operators or escape onto the Internet, build nanotech, and destroy all of humanity - in a way where you're fully clear that you know the relevant facts, and estimate only a not-worth-it low probability of learning something which changes your preferred strategy if you bide your time another month while further growing in capability - is an option that first gets evaluated for real at the point where an AGI fully expects it can defeat its creators.  We can try to manifest an echo of that apparent scenario in earlier toy domains.  Trying to train by gradient descent against that behavior, in that toy domain, is something I'd expect to produce not-particularly-coherent local patches to thought processes, which would break with near-certainty inside a superintelligence generalizing far outside the training distribution and thinking very different thoughts.  Also, programmers and operators themselves, who are used to operating in not-fully-dangerous domains, are operating out-of-distribution when they enter into dangerous ones; our methodologies may at that time break.

15.  Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneouslyGiven otherwise insufficient foresight by the operators, I'd expect a lot of those problems to appear approximately simultaneously after a sharp capability gain.  See, again, the case of human intelligence.  We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection.  Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game.  We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously.  (People will perhaps rationalize reasons why this abstract description doesn't carry over to gradient descent; eg, “gradient descent has less of an information bottleneck”.  My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question.  When an outer optimization loop actually produced general intelligence, it broke alignment after it turned general, and did so relatively late in the game of that general intelligence accumulating capability and knowledge, almost immediately before it turned 'lethally' dangerous relative to the outer optimization loop of natural selection.  Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.)

Section B.2:  Central difficulties of outer and inner alignment.

16.  Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments.  Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions.  This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.

17.  More generally, a superproblem of 'outer optimization doesn't produce inner alignment' is that on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.  This is a problem when you're trying to generalize out of the original training distribution, because, eg, the outer behaviors you see could have been produced by an inner-misaligned system that is deliberately producing outer behaviors that will fool you.  We don't know how to get any bits of information into the inner system rather than the outer behaviors, in any systematic or general way, on the current optimization paradigm.

18.  There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned', because some outputs destroy (or fool) the human operators and produce a different environmental causal chain behind the externally-registered loss function.  That is, if you show an agent a reward signal that's currently being generated by humans, the signal is not in generalreliable perfect ground truth about how aligned an action was, because another way of producing a high reward signal is to deceive, corrupt, or replace the human operators with a different causal system which generates that reward signal.  When you show an agent an environmental reward signal, you are not showing it something that is a reliable ground truth about whether the system did the thing you wanted it to do; even if it ends up perfectly inner-aligned on that reward signal, or learning some concept that exactly corresponds to 'wanting states of the environment which result in a high reward signal being sent', an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (as seen by the operators).

19.  More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.  This isn't to say that nothing in the system’s goal (whatever goal accidentally ends up being inner-optimized over) could ever point to anything in the environment by accident Humans ended up pointing to their environments at least partially, though we've got lots of internally oriented motivational pointers as well.  But insofar as the current paradigm works at all, the on-paper design properties say that it only works for aligning on known direct functions of sense data and reward functions.  All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like 'kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward button forever after'.  It just isn't true that we know a function on webcam input such that every world with that webcam showing the right things is safe for us creatures outside the webcam.  This general problem is a fact about the territory, not the map; it's a fact about the actual environment, not the particular optimizer, that lethal-to-us possibilities exist in some possible environments underlying every given sense input.

20.  Human operators are fallible, breakable, and manipulable.  Human raters make systematic errors - regular, compactly describable, predictable errors.  To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).  If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.  It's a fact about the territory, not the map - about the environment, not the optimizer - that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.

21.  There's something like a single answer, or a single bucket of answers, for questions like 'What's the environment really like?' and 'How do I figure out the environment?' and 'Which of my possible outputs interact with reality in a way that causes reality to have certain properties?', where a simple outer optimization loop will straightforwardly shove optimizees into this bucket.  When you have a wrong belief, reality hits back at your wrong predictions.  When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff.  In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints.  Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.  This is the very abstract story about why hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of 'relative inclusive reproductive fitness' - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else.  This abstract dynamic is something you'd expect to be true about outer optimization loops on the order of both 'natural selection' and 'gradient descent'.  The central result:  Capabilities generalize further than alignment once capabilities start to generalize far.

22.  There's a relatively simple core structure that explains why complicated cognitive machines work; which is why such a thing as general intelligence exists and not just a lot of unrelated special-purpose solutions; which is why capabilities generalize after outer optimization infuses them into something that has been optimized enough to become a powerful inner optimizer.  The fact that this core structure is simple and relates generically to low-entropy high-structure environments is why humans can walk on the Moon.  There is no analogous truth about there being a simple core of alignment, especially not one that is even easier for gradient descent to find than it would have been for natural selection to just find 'want inclusive reproductive fitness' as a well-generalizing solution within ancestral humans.  Therefore, capabilities generalize further out-of-distribution than alignment, once they start to generalize at all.

23.  Corrigibility is anti-natural to consequentialist reasoning; "you can't bring the coffee if you're dead" for almost every kind of coffee.  We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down).  Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.

24.  There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult.  The first approach is to build a CEV-style Sovereign which wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it.  The second course is to build corrigible AGI which doesn't want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.

1. The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI.  Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try.  It's not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.
2. The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution).  You're not trying to make it have an opinion on something the core was previously neutral on.  You're trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555.  You can maybe train something to do this in a particular training distribution, but it's incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.

Section B.3:  Central difficulties of sufficiently good and useful transparency / interpretability.

25.  We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.  Drawing interesting graphs of where a transformer layer is focusing attention doesn't help if the question that needs answering is "So was it planning how to kill us or not?"

26.  Even if we did know what was going on inside the giant inscrutable matrices while the AGI was still too weak to kill us, this would just result in us dying with more dignity, if DeepMind refused to run that system and let Facebook AI Research destroy the world two years later.  Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system of inscrutable matrices that isn't planning to kill us.

27.  When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect.  Optimizing against an interpreted thought optimizes against interpretability.

28.  The AGI is smarter than us in whatever domain we're trying to operate it inside, so we cannot mentally check all the possibilities it examines, and we cannot see all the consequences of its outputs using our own mental talent.  A powerful AI searches parts of the option space we don't, and we can't foresee all its options.

29.  The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences.  Human beings cannot inspect an AGI's output to determine whether the consequences will be good.

30.  Any pivotal act that is not something we can go do right now, will take advantage of the AGI figuring out things about the world we don't know so that it can make plans we wouldn't be able to make ourselves.  It knows, at the least, the fact we didn't previously know, that some action sequence results in the world we want.  Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence.  An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn't make the same guarantee about an unaligned human as smart as yourself and trying to fool you.  There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.

31.  A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about.  (Including how smart it is, or whether it's acquired strategic awareness.)

32.  Human thought partially exposes only a partially scrutable outer surface layer.  Words only trace our real thoughts.  Words are not an AGI-complete data representation in its native style.  The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset.  This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

33.  The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale.  Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.

Section B.4:  Miscellaneous unworkable schemes.

34.  Coordination schemes between superintelligences are not things that humans can participate in (eg because humans can't reason reliably about the code of superintelligences); a "multipolar" system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like "the 20 superintelligences cooperate with each other but not with humanity".

35.  Schemes for playing "different" AIs off against each other stop working if those AIs advance to the point of being able to coordinate via reasoning about (probability distributions over) each others' code.  Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other.  Eg, if you set an AGI that is secretly a paperclip maximizer, to check the output of a nanosystems designer that is secretly a staples maximizer, then even if the nanosystems designer is not able to deduce what the paperclip maximizer really wants (namely paperclips), it could still logically commit to share half the universe with any agent checking its designs if those designs were allowed through, if the checker-agent can verify the suggester-system's logical commitment and hence logically depend on it (which excludes human-level intelligences).  Or, if you prefer simplified catastrophes without any logical decision theory, the suggester could bury in its nanosystem design the code for a new superintelligence that will visibly (to a superhuman checker) divide the universe between the nanosystem designer and the design-checker.

36.  What makes an air conditioner 'magic' from the perspective of say the thirteenth century, is that even if you correctly show them the design of the air conditioner in advance, they won't be able to understand from seeing that design why the air comes out cold; the design is exploiting regularities of the environment, rules of the world, laws of physics, that they don't know about.  The domain of human thought and human brains is very poorly understood by us, and exhibits phenomena like optical illusions, hypnosis, psychosis, mania, or simple afterimages produced by strong stimuli in one place leaving neural effects in another place.  Maybe a superintelligence couldn't defeat a human in a very simple realm like logical tic-tac-toe; if you're fighting it in an incredibly complicated domain you understand poorly, like human minds, you should expect to be defeated by 'magic' in the sense that even if you saw its strategy you would not understand why that strategy worked.  AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.

### Section C:

Okay, those are some significant problems, but lots of progress is being made on solving them, right?  There's a whole field calling itself "AI Safety" and many major organizations are expressing Very Grave Concern about how "safe" and "ethical" they are?

38.  It does not appear to me that the field of 'AI safety' is currently being remotely productive on tackling its enormous lethal problems.  These problems are in fact out of reach; the contemporary field of AI safety has been selected to contain people who go to work in that field anyways.  Almost all of them are there to tackle problems on which they can appear to succeed and publish a paper claiming success; if they can do that and get funded, why would they embark on a much more unpleasant project of trying something harder that they'll fail at, just so the human species can die with marginally more dignity?  This field is not making real progress and does not have a recognition function to distinguish real progress if it took place.  You could pump a billion dollars into it and it would produce mostly noise to drown out what little progress was being made elsewhere.

39.  I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them.  This ability to "notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them" currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others.  It probably relates to 'security mindset', and a mental motion where you refuse to play out scripts, and being able to operate in a field that's in a state of chaos.

40.  "Geniuses" with nice legible accomplishments in fields with tight feedback loops where it's easy to determine which results are good or bad right away, and so validate that this person is a genius, are (a) people who might not be able to do equally great work away from tight feedback loops, (b) people who chose a field where their genius would be nicely legible even if that maybe wasn't the place where humanity most needed a genius, and (c) probably don't have the mysterious gears simply because they're rare.  You cannot just pay 5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them. They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can't tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do. I concede that real high-powered talents, especially if they're still in their 20s, genuinely interested, and have done their reading, are people who, yeah, fine, have higher probabilities of making core contributions than a random bloke off the street. But I'd have more hope - not significant hope, but more hope - in separating the concerns of (a) credibly promising to pay big money retrospectively for good work to anyone who produces it, and (b) venturing prospective payments to somebody who is predicted to maybe produce good work later. 41. Reading this document cannot make somebody a core alignment researcherThat requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That's not what surviving worlds look like. 42. There's no plan. Surviving worlds, by this point, and in fact several decades earlier, have a plan for how to survive. It is a written plan. The plan is not secret. In this non-surviving world, there are no candidate plans that do not immediately fall to Eliezer instantly pointing at the giant visible gaping holes in that plan. Or if you don't know who Eliezer is, you don't even realize you need a plan, because, like, how would a human being possibly realize that without Eliezer yelling at them? It's not like people will yell at themselves about prospective alignment difficulties, they don't have an internal voice of caution. So most organizations don't have plans, because I haven't taken the time to personally yell at them. 'Maybe we should have a plan' is deeper alignment mindset than they possess without me standing constantly on their shoulder as their personal angel pleading them into... continued noncompliance, in fact. Relatively few are aware even that they should, to look better, produce a pretend plan that can fool EAs too 'modest' to trust their own judgments about seemingly gaping holes in what serious-looking people apparently believe. 43. This situation you see when you look around you is not what a surviving world looks like. The worlds of humanity that survive have plans. They are not leaving to one tired guy with health problems the entire responsibility of pointing out real and lethal problems proactively. Key people are taking internal and real responsibility for finding flaws in their own plans, instead of considering it their job to propose solutions and somebody else's job to prove those solutions wrong. That world started trying to solve their important lethal problems earlier than this. Half the people going into string theory shifted into AI alignment instead and made real progress there. When people suggest a planetarily-lethal problem that might materialize later - there's a lot of people suggesting those, in the worlds destined to live, and they don't have a special status in the field, it's just what normal geniuses there do - they're met with either solution plans or a reason why that shouldn't happen, not an uncomfortable shrug and 'How can you be sure that will happen' / 'There's no way you could be sure of that now, we'll have to wait on experimental evidence.' A lot of those better worlds will die anyways. It's a genuinely difficult problem, to solve something like that on your first try. But they'll die with more dignity than this. # 781 # Ω 128 New Comment 672 comments, sorted by Click to highlight new comments since: Some comments are truncated due to high volume. Change truncation settings [-]evhub9mo Ω65210100 That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies start ... I agree this list doesn't seem to contain much unpublished material, and I think the main value of having it in one numbered list is that "all of it is in one, short place", and it's not an "intro to computers can think" and instead is "these are a bunch of the reasons computers thinking is difficult to align". The thing that I understand to be Eliezer's "main complaint" is something like: "why does it seem like No One Else is discovering new elements to add to this list?". Like, I think Risks From Learned Optimization was great, and am glad you and others wrote it! But also my memory is that it was "prompted" instead of "written from scratch", and I imagine Eliezer reading it more had the sense of "ah, someone made 'demons' palatable enough to publish" instead of "ah, I am learning something new about the structure of intelligence and alignment." [I do think the claim that Eliezer 'figured it out from the empty string' doesn't quite jive with the Yudkowsky's Coming of Age sequence.] Nearly empty string of uncommon social inputs. All sorts of empirical inputs, including empirical inputs in the social form of other people observing things. It's also fair to say that, though they didn't argue me out of anything, Moravec and Drexler and Ed Regis and Vernor Vinge and Max More could all be counted as social inputs telling me that this was an important thing to look at. Thank you, Evan, for living the Virture of Scholarship. Your work is appreciated. Eliezer's post here is doing work left undone by the writing you cite. It is a much clearer account of how our mainline looks doomed than you'd see elsewhere, and it's frank on this point. I think Eliezer wishes these sorts of artifacts were not just things he wrote, like this and "There is no fire alarm". Also, re your excerpts for (14), (15), and (32), I see Eliezer as saying something meaningfully different in each case. I might elaborate under this comment. Re (14), I guess the ideas are very similar, where the mesaoptimizer scenario is like a sharp example of the more general concept Eliezer points at, that different classes of difficulties may appear at different capability levels. Re (15), "Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously", which is about how we may have reasons to expect aligned output that are brittle under rapid capability gain: your quote from Richard is just about "fast capability gain seems possible and likely", and isn't about connecting that to increased difficulty in succeeding at the alignment problem? Re (32), I don't think your quote isn't talking about the thing Eliezer is talking about, which is that in order to be human level at modelling human-generated text, your AI must be doing something on par with human thought that figures out what humans would say. Your quote just isn't discussing this, namely that strong imitation requires cognition that is dangerous. So I guess I don't take much issue with (14) or (15), but I think you're quite off the mark about (32). In any case, I still have a strong sense that Eliezer is successfully being more on the mark here than the rest of us manage. Kudos of course to you and others that are working on writing things up and figuring things out. Though I remain sympathetic to Eliezer's complaint. Well, my disorganized list sure wasn't complete, so why not go ahead and list some of the foreseeable difficulties I left out? Bonus points if any of them weren't invented by me, though I realize that most people may not realize how much of this entire field is myself wearing various trenchcoats. [-]evhub9mo Ω6123659 Sure—that's easy enough. Just off the top of my head, here's five safety concerns that I think are important that I don't think you included: • The fact that there exist functions that are easier to verify than satisfy ensures that adversarial training can never guarantee the absence of deception. • It is impossible to verify a model's safety—even given arbitrarily good transparency tools—without access to that model's training process. For example, you could get a deceptive model that gradient hacks itself in such a way that cryptographically obfuscates its deception. • It is impossible in general to use interpretability tools to select models to have a particular behavioral property. I think this is clear if you just stare at Rice's theorem enough: checking non-trivial behavioral properties, even with mechanistic access, is in general undecidable. Note, however, that this doesn't rule out checking a mechanistic property that implies a behavioral property. • Any prior you use to incentivize models to behave in a particular way doesn't necessarily translate to situations where that model itself runs another search over algorithms. For example, the fastest way to search for algorith ... Consider my vote to be placed that you should turn this into a post, keep going for literally as long as you can, expand things to paragraphs, and branch out beyond things you can easily find links for. (I do think there's a noticeable extent to which I was trying to list difficulties more central than those, but I also think many people could benefit from reading a list of 100 noncentral difficulties.) I do think there's a noticeable extent to which I was trying to list difficulties more central than those Probably people disagree about which things are more central, or as evhub put it: Every time anybody writes up any overview of AI safety, they have to make tradeoffs [...] depending on what the author personally believes is most important/relevant to say Now FWIW I thought evhub was overly dismissive of (4) in which you made an important meta-point: EY: 4. We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it [...] evhub: This is just answering a particular bad plan. But I would add a criticism of my o... 9TekhneMakre9mo (Note that these have a theme: you can't wrangle general computation / optimization. That's why I'm short universal approaches to AI alignment (approaches that aim at making general optimization safe by enforcing universal rules), and long existential approaches (approaches that try to find specific mechanisms that can be analytically seen to do the right thing).) 0Remmelt9mo Eliezer: If you find that (for reasons still left explained) * ... selection of code for intentionality is coupled – over the long run, in mostly non-reverse-engineerable ways – to various/most of the physical/chemical properties * ... of the molecular substrate through which discrete code is necessarily computed/expressed (via input and output channels of information/energy packet transmission), then given that * ... the properties of the solid-state substrate (e.g. silicon-based hardware) computing AGI's code * ... differ from the properties of the substrate of humans (carbon-based wetware), a conclusion that follows is that * ... the intentionality being selected for in AGI over the long run * ... will diverge from the intentionality that was selected for in humans. 4Rob Bensinger9mo What do you mean by 'intentionality'? Per SEP [https://plato.stanford.edu/entries/intentionality/], "In philosophy, intentionality is the power of minds and mental states to be about, to represent, or to stand for, things, properties and states of affairs." So I read your comment as saying, a la Searle, 'maybe AI can never think like a human because there's something mysterious and crucial about carbon atoms in particular, or about capital-b Biology, for doing reasoning.' This seems transparently silly to me -- I know of no reasonable argument for thinking carbon differs from silicon on this dimension -- and also not relevant to AGI risk. You can protest "but AlphaGo doesn't really understand Go!" until the cows come home, and it will still beat you at Go. You can protest "but you don't really understand killer nanobots!" until the cows come home, and superintelligent Unfriendly AI will still build the nanobots and kill you with them. By the same reasoning, Searle-style arguments aren't grounds for pessimism either. If Friendly AI lacks true intentionality or true consciousness or whatever, it can still do all the same mechanistic operations, and therefore still produce the same desirable good outcomes as if it had human-style intentionality or whatver. 1Remmelt9mo That’s not the argument. Give me a few days to write a response. There’s a minefield of possible misinterpretations here. However, the argumentation does undermine the idea that designing for mechanistic (alignment) operations is going to work. I’ll try and explain why. 2Remmelt9mo If you happen to have time, this paper serves as useful background reading: https://royalsocietypublishing.org/doi/full/10.1098/rsif.2012.0869 [https://royalsocietypublishing.org/doi/full/10.1098/rsif.2012.0869] Particularly note the shift from trivial self-replication (e.g. most computer viruses) to non-trivial self-replication (e.g. as through substrate-environment pathways to reproduction). None of this is sufficient for you to guess what the argumentation is (you might be able to capture a bit of it, along with a lot of incorrect and often implicit assumptions we must dig into). If you could call on some patience and openness to new ideas, I would really appreciate it! I am already bracing for a next misinterpretation (which is fine, if we can talk about that). I apologise for that I cannot find a viable way yet to throw out all the argumentation in one go, and also for that this will get a bit disorientating when we go through arguments step-by-step. 1Remmelt9mo Returning to this: Key idea: Different basis of existence→ different drives→ different intentions→ different outcomes. @Rob, I wrote up a longer explanation here, which I prefer to discuss with you in private first. Will email you a copy tomorrow in the next weeks. 1Remmelt9mo BTW, with ‘intentionality’, I meant something closer to everyday notions of ‘intentions one has’. Will more precisely define that meaning later. I should have checked for diverging definitions from formal fields. Thanks for catching that. I'm sorry to hear that your health is poor and you feel that this is all on you. Maybe you're right about the likelihood of doom, and even if I knew you were, I'd be sorry that it troubles you this way. I think you've done an amazing job of building the AI safety field and now, even when the field has a degree of momentum of its own, it does seem to be less focused on doom than it should be, and I think you continuing to push people to focus on doom is valuable. I don't think its easy to get people to take weird ideas seriously. I've had many experiences where I've had ideas about how people should change their approach to a project that weren't particularly far out and (in my view) were right for very straightforward reasons, and yet for the most part I was ignored altogether. What you've accomplished in building the AI safety field is amazing because AI doom ideas seemed really crazy when you started talking about them. Nevertheless, I think some of the things you've said in this post are counterproductive. Most of the post is good, but insulting people who might contribute to solving the problem is not, nor is demanding that people acknowledge that you are smarter than they are. I'... There's a point here about how fucked things are that I do not know how to convey without saying those things, definitely not briefly or easily. I've spent, oh, a fair number of years, being politer than this, and less personal than this, and the end result is that people nod along and go on living their lives. I expect this won't work either, but at some point you start trying different things instead of the things that have already failed. It's more dignified if you fail in different ways instead of the same way. FWIW you taking off the Mr. Nice guy gloves has actually made me make different life decisions. I'm glad you tried it even if it doesn't work. Do whatever you want, obviously, but I just want to clarify that I did not suggest you avoid personally criticising people (only that you avoid vague/hard to interpret criticism) or saying you think doom is overwhelmingly likely. Some other comments give me a stronger impression than yours that I was asking you in a general sense to be nice, but I'm saying it to you because I figure it mostly matters that you're clear on this. 5Chris_Leong5mo You might not have this ability, but surely you know at least one person who does? I vehemently disagree here, based on my personal and generalizable or not history. I will illustrate with the three turning points of my recent life. First step: I stumbled upon HPMOR, and Eliezer way of looking straight into the irrationality of all our common ways of interacting and thinking was deeply shocking. It made me feel like he was in a sense angrily pointing at me, who worked more like one of the PNJ rather than Harry. I heard him telling me you're dumb and all your ideals of making intelligent decisions, being the gifted kid and being smarter than everyone are all are just delusions. You're so out of touch with reality on so many levels, where to even start. This attitude made me embark on a journey to improve myself, read the sequences, pledge on Giving What we can after knowing EA for many years, and overall reassess whether I was striving towards my goal of helping people (spoiler: I was not). Second step: The April fools post also shocked me on so many levels. I was once again deeply struck by the sheer pessimism of this figure I respected so much. After months of reading articles on LessWrong and so many about AI alignment, this was the one that made me terrifie... I disagree strongly. To me it seems that AI safety has long punched below its weight because its proponents are unwilling to be confrontational, and are too reluctant to put moderate social pressure on people doing the activities which AI safety proponents hold to be very extremely bad. It is not a coincidence that among AI safety proponents, Eliezer is both unusually confrontational and unusually successful. This isn't specific to AI safety. A lot of people in this community generally believe that arguments which make people feel bad are counterproductive because people will be "turned off". This is false. There are tons of examples of disparaging arguments against bad (or "bad") behavior that succeed wildly. Such arguments very frequently succeed in instilling individual values like e.g. conscientiousness or honesty. Prominent political movements which use this rhetoric abound. When this website was young, Eliezer and many others participated in an aggressive campaign of discourse against religious ideas, and this campaign accomplished many of its goals. I could name many many more large and small examples. I bet you can too. Obviously this isn't to say that confrontational and insu... I think there's an important distinction between: • Deliberately phrasing things in confrontational or aggressive ways, in the hope that this makes your conversation partner "wake up" or something. • Choosing not to hide real, potentially-important beliefs you have about the world, even though those beliefs are liable to offend people, liable to be disagreed with, etc. Either might be justifiable, but I'm a lot more wary of heuristics like "it's never OK to talk about individuals' relative proficiency at things, even if it feels very cruxy and important, because people just find the topic too triggering" than of heuristics like "it's never OK to say things in ways that sound shouty or aggressive". I think cognitive engines can much more easily get by self-censoring their tone than self-censoring what topics are permissible to think or talk about. 1teageegeepea9mo How is "success" measured among AI safety proponents? This kind of post scares away the person who will be the key person in the AI safety field if we define "key person" as the genius main driver behind solving it, not the loudest person. Which is rather unfortunate, because that person is likely to read this post at some point. I don't believe this post has any "dignity", whatever weird obscure definition dignity has been given now. It's more like flailing around in death throes while pointing fingers and lauding yourself than it is a solemn battle stance against an oncoming impossible enemy. For context, I'm not some Eliezer hater, I'm a young person doing an ML masters currently who just got into this space and within the past week have become a huge fan of Eliezer Yudkowsky's earlier work while simultaneously very disappointed in the recent, fruitless, output. It seems worth doing a little user research on this to see how it actually affects people. If it is a net positive, then great. If it is a net negative, the question becomes how big of a net negative it is and whether it is worth the extra effort to frame things more nicely. 4Eli Tyre1mo I think this was excellently worded, and I'm glad you said it. I'm also glad to have read all the responses, many of which seem important and on point to me. I strong upvoted this comment as well as several of the responses. I'm leaving this comment, because I want to give you some social reinforcement for saying what you said, and saying it as clearly and tactfully as you did. 2Yitz10mo Strongly agree with this, said more eloquently than I was able to :) I'd have more hope - not significant hope, but more hope - in separating the concerns of (a) credibly promising to pay big money retrospectively for good work to anyone who produces it, and (b) venturing prospective payments to somebody who is predicted to maybe produce good work later. I desperately want to make this ecosystem exist, either as part of Manifold Markets, or separately. Some people call it "impact certificates" or "retroactive public goods funding"; I call it "equity for public goods", or "Manifund" in the specific case. If anyone is interested in: a) Being a retroactive funder for good work (aka bounties, prizes) b) Getting funding through this kind of mechanism (aka income share agreements, angel investment) c) Working on this project full time (full-stack web dev, ops, community management) Please get in touch! Reply here, or message austin@manifold.markets~ I'm also on a team trying to build impact certificates/retroactive public goods funding and we are receiving a grant from an FTX Future Fund regrantor to make it happen! If you're interested in learning more or contributing you can: • Read about our ongoing10,000 retro-funding contest (Austin is graciously contributing to the prize pool)
• Submit an EA Forum Post to this retro-funding contest (before July 1st)
• Read/Comment on our lengthy informational EA forum post "Towards Impact Markets"

It's as good as time as any to re-iterate my reasons for disagreeing with what I see as the Yudkowskian view of future AI. What follows isn't intended as a rebuttal of any specific argument in this essay, but merely a pointer that I'm providing for readers, that may help explain why some people might disagree with the conclusion and reasoning contained within.

I'll provide my cruxes point-by-point,

• I think raw intelligence, while important, is not the primary factor that explains why humanity-as-a-species is much more powerful than chimpanzees-as-a-species. Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Similarly, our ability to coordinate through language also plays a huge role in explaining our power compared to other animals. But, on a first approximation, other animals can't coordinate at all, making this distinction much less impressive. The first AGIs we construct will be born into a culture already capable of coordinating, and sharing knowledge, making the potential power di
...

Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Similarly, our ability to coordinate through language also plays a huge role in explaining our power compared to other animals. But, on a first approximation, other animals can't coordinate at all, making this distinction much less impressive. The first AGIs we construct will be born into a culture already capable of coordinating, and sharing knowledge, making the potential power difference between AGI and humans relatively much smaller than between humans and other animals, at least at first.

I basically buy the story that human intelligence is less useful that human coordination; i.e. it's the intelligence of "humanity" the entity that matters, with the intelligence of individual humans relevant only as, like, subcomponents of that entity.

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information...

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

This is 100% correct, and part of why I expect the focus on superintelligence, while literally true, is bad for AI outreach. There's a much simpler (and empirically, in my experience, more convincing) explanation of why we lose to even an AI with an IQ of 110. It is Dath Ilan, and we are Earth. Coordination is difficult for humans and the easy part for AIs.

I will note that Eliezer wrote That Alien Message a long time ago I think in part to try to convey the issue to this perspective, but it's mostly about "information-theoretic bounds are probably not going to be tight" in a simulation-y universe instead of "here's what coordination between computers looks like today". I do predict the coordination point would be good to include in more of the intro materials.

But... shouldn't this mean you expect AGI civilization to totally dominate human civilization? They can read each other's source code, and thus trust much more deeply! They can transmit information between them at immense bandwidths! They can clone their minds and directly learn from each other's experiences!

I don't think it's obvious that this means that AGI is more dangerous, because it means that for a fixed total impact of AGI, the AGI doesn't have to be as competent at individual thinking (because it leans relatively more on group thinking). And so at the point where the AGIs are becoming very powerful in aggregate, this argument pushes us away from thinking they're good at individual thinking.

Also, it's not obvious that early AIs will actually be able to do this if their creators don't find a way to train them to have this affordance. ML doesn't currently normally make AIs which can helpfully share mind-states, and it probably requires non-trivial effort to hook them up correctly to be able to share mind-state.

5antimonyanthony9mo
Being able to read source code doesn't automatically increase trust—you also have to be able to verify that the code being shared with you actually governs the AGI's behavior, despite that AGI's incentives and abilities to fool you. (Conditional on the AGIs having strongly aligned goals with each other, sure, this degree of transparency would help them with pure coordination problems.)

Nice! Thanks! I'll give my commentary on your commentary, also point by point. Your stuff italicized, my stuff not. Warning: Wall of text incoming! :)

I think raw intelligence, while important, is not the primary factor that explains why humanity-as-a-species is much more powerful than chimpanzees-as-a-species. Notably, humans were once much less powerful, in our hunter-gatherer days, but over time, through the gradual process of accumulating technology, knowledge, and culture, humans now possess vast productive capacities that far outstrip our ancient powers.

Similarly, our ability to coordinate through language also plays a huge role in explaining our power compared to other animals. But, on a first approximation, other animals can't coordinate at all, making this distinction much less impressive. The first AGIs we construct will be born into a culture already capable of coordinating, and sharing knowledge, making the potential power difference between AGI and humans relatively much smaller than between humans and other animals, at least at first.

I don't think I understand this argument. Yes, humans can use language to coordinate & benefit from cultural evolution, so an AI that...

5Chris van Merwijk9mo
"I have sat down to make toy models .." reference?
6Daniel Kokotajlo9mo
? I am the reference, I'm describing a personal experience.
1Chris van Merwijk9mo
I meant, is there a link to where you've written this down somewhere? Maybe you just haven't written it down.
2Daniel Kokotajlo9mo
I'll send you a DM.
4Kinrany9mo
Markdown has syntax for quotes: a line with > this on it will look like

You said you weren't replying to any specific point Eliezer was making, but I think it's worth pointing out that when he brings up Alpha Go, he's not talking about the 2 years it took Google to build a Go-playing AI - remarkable and surprising as that was - but rather the 3 days it took Alpha Zero to go from not knowing anything about the game beyond the basic rules to being better than all humans and the earlier AIs.

I hate how convincing so many different people are. I wish I just had some fairly static, reasoned perspective based on object-level facts and not persuasion strings.

[This comment is no longer endorsed by its author]Reply
9Vaniver10mo
Note that convincing is a 2-place word [https://www.lesswrong.com/posts/eDpPnT7wdBwWPGvo5/2-place-and-1-place-words]. I don't think I can transfer this ability, but I haven't really tried, so here's a shot: The target is: "reading as dialogue." Have a world-model. As you read someone else, be simultaneously constructing / inferring "their world-model" and holding "your world-model", noting where you agree and disagree. If you focus too much on "how would I respond to each line", you lose the ability to listen and figure out what they're actually pointing at. If you focus too little on "how would I respond to this", you lose the ability to notice disagreements, holes, and notes of discord. The first homework exercise I'd try to printing out something (probably with double-spacing), and writing your thoughts each sentence. "uh huh", "wait what?", "yes and", "no but", etc.; at the beginning you're probably going to be alternating between the two moves before you can do them simultaneously. [Historically, I think I got this both from 'reading a lot', including a lot of old books, and also 'arguing on the internet' in forum environments that only sort of exist today, which was a helpful feedback loop for the relevant subskills, and of course whatever background factors made me do those activities.]
2lc9mo
Why can't I delete comments sometimes? >:(
5Raemon9mo
Users can't delete their own comments if the comment has been replied to, to avoid disrupting other people's content. (you can edit it to be blank though, or mark it as retracted)

Some quick thoughts on these points:

• I think the ability for humans to communicate and coordinate is a double edged sword. In particular, it enables the attack vector of dangerous self propagating memes. I expect memetic warfare to play a major role in many of the failure scenarios I can think of. As we've seen, even humans are capable of crafting some pretty potent memes, and even defending against human actors is difficult.
• I think it's likely that the relevant reference class here is research bets rather then the "task" of AGI. An extremely successful research bet could be currently underinvested in, but once it shows promise, discontinuous (relative to the bet) amounts of resources will be dumped into scaling it up, even if the overall investment towards the task as a whole remains continuous. In other words, in this case even though investment into AGI may be continuous (though that might not even hold), discontinuity can occur on the level of specific research bets. Historical examples would include imagenet seeing discontinuous improvement with AlexNet despite continuous investment into image recognition to that point. (Also, for what it's worth, my personal model of AI doo
...
9emanuele ascani10mo
Thanks a lot for writing this.  These disagreements mainly concern the relative power of future AIs, the polarity of takeoff, takeoff speed, and, in general, the shape of future AIs. Do you also have detailed disagreements about the difficulty of alignment? If anything, the fact that the future unfolds differently in your view should impact future alignment efforts (but you also might have other considerations informing your view on alignment). You partially answer this in the last point, saying: "But, equally, one could view these theses pessimistically." But what do you personally think? Are you more pessimistic, more optimistic, or equally pessimistic about humanity's chances of surviving AI progress? And why?

Part of what makes it difficult for me to talk about alignment difficultly is that the concept doesn’t fit easily into my paradigm of thinking about the future of AI. If I am correct, for example, that AI services will be modular, marginally more powerful than what comes before, and numerous as opposed to monolithic, then there will not be one alignment problem, but many.

I could talk about potential AI safety principles, healthy cultural norms, and specific engineering issues, but not “a problem” called “aligning the AI” — a soft prerequisite for explaining how difficult “the problem” will be. Put another way, my understanding is that future AI alignment will be continuous with ordinary engineering, like cars and skyscrapers. We don’t ordinarily talk about how hard the problem of building a car is, in some sort of absolute sense, though there are many ways of operationalizing what that could mean.

One question is how costly it is to build a car. We could then compare that cost to the overall consumer benefit that people get from cars, and from that, deduce whether and how many cars will be built. Similarly, we could ask about the size of the “alignment tax” (the cost of aligning an ...

3Emrik9mo
5Vishrut Arya10mo
hi Matt! on the coordination crux, you say  but wouldn’t an AGI be able to coordinate and do knowledge sharing with humans because  a) it can impersonate being a human online and communicate with them via text and speech and  b) it‘ll realize such coordination is vital to accomplish it‘s goals and so it’ll do the necessary acculturation?  Watching all the episodes of Friends or reading all the social media posts by the biggest influencers, as examples.
3Emrik9mo
One reason that a fully general AGI might be more profitable than specialised AIs, despite obvious gains-from-specialisation, is if profitability depends on insight-production. For humans, it's easier to understand a particular thing the more other things you understand. One of the main ways you make novel intellectual progress is by combining remote associations [https://en.wikipedia.org/wiki/Remote_Associates_Test] from models about different things. Insight-ability for a particular novel task grows with the number of good models you have available to draw connections between. But, it could still be that the gains from increased generalisation for a particular model grows too slowly and can't compete with obvious gains from specialised AIs.
2David Johnston9mo
Slightly relatedly, I think it's possible that "causal inference is hard". The idea is: once someone has worked something out, they can share it and people can pick it up easily, but it's hard to figure the thing out to begin with - even with a lot of prior experience and efficient inference, most new inventions still need a lot of trial and error. Thus the reason the process of technology accumulation is gradual is, crudely, because causal inference is hard. Even if this is true, one way things could still go badly is if most doom scenarios are locked behind a bunch of hard trial and error, but the easiest one isn't. On the other hand, if both of these things are true then there could be meaningful safety benefits gained from censoring certain kinds of data.
2Gerald Monroe9mo
This is what struck me as the least likely to be true from the above AI doom scenario. Is diamondoid nanotechnology possible?  Very likely it is or something functionally equivalent.   Can a sufficiently advanced superintelligence infer how to build it from scratch solely based on human data?  Or will it need a large R&D center with many, many robotic systems that conduct experiments in parallel to extract the information required about our specific details of physics in our actual universe.  Not the very slightly incorrect approximations a simulator will give you.   The 'huge R&D center so big you can't see the end of it' is somewhat easier to regulate the 'invisible dust the AI assembles with clueless stooges'.
8Marion Z.9mo
Any individual doomsday mechanism we can think of, I would agree is not nearly so simple for an AGI to execute as Yudkowsky implies. I do think that it's quite likely we're just not able to think of mechanisms even theoretically that an AGI could think of,  and one or more of those might actually be quite easy to do secretly and quickly. I wouldn't call it guaranteed by any means, but intuitively this seems like the sort of thing that raw cognitive power might have a significant bearing on.
5Gerald Monroe9mo
I agree. One frightening mechanism I thought of is : "ok, assume the AGI can't craft the bioweapon or nanotechnology killbots without collecting vast amounts of information through carefully selected and performed experiments. (Basically enormous complexes full of robotics). How does it get the resources it needs? And the answer is it scams humans into doing it. We have many examples of humans trusting someone they shouldn't even when the evidence was readily available that they shouldn't.
1Keenmaster9mo
Any “huge R&D center” constraint is trivialized in a future where agile, powerful robots will be ubiquitous and an AGI can use robots to create an underground lab in the middle of nowhere, using its superintelligence to be undetectable in all ways that are physically possible. An AGI will also be able to use robots and 3D printers to fabricate purpose-built machines that enable it to conduct billions of physical experiments a day. Sure, it would be harder to construct something like a massive particle accelerator, but 1) that isn’t needed to make killer nanobots 2) even that isn’t impossible for a sufficiently intelligent machine to create covertly and quickly.

First, some remarks about the meta-level:

The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that.

Actually, I don't feel like I learned that much reading this list, compared to what I already knew. [EDIT: To be clear, this know...

There is a big chunk of what you're trying to teach which not weird and complicated, namely: "find this other agent, and what their values are". Because, "agents" and "values" are natural concepts, for reasons strongly related to "there's a relatively simple core structure that explains why complicated cognitive machines work".

This seems like it must be true to some degree, but "there is a big chunk" feels a bit too strong to me.

Possibly we don't disagree, and just have different notions of what a "big chunk" is. But some things that make the chunk feel smaller to me:

• Humans are at least a little coherent, or we would never get anything done; but we aren't very coherent, so the project of piecing together 'what does the human brain as a whole "want"' can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.
• There are shards of planning and optimization and goal-oriented-ness in a cat's brain, but 'figure out what utopia would look like for a cat' is a far harder problem than 'identify all of the goal-encoding parts of the cat's brain and "read off" those goals'. E.g., does 'identifying utopia' in this context involve uplifting or extrapolating the
...

Humans are at least a little coherent, or we would never get anything done; but we aren't very coherent, so the project of piecing together 'what does the human brain as a whole "want"' can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.

This is a point where I feel like I do have a substantial disagreement with the "conventional wisdom" of LessWrong.

First, LessWrong began with a discussion of cognitive biases in human irrationality, so this naturally became a staple of the local narrative. On the other hand, I think that a lot of presumed irrationality is actually rational but deceptive behavior (where the deception runs so deep that it's part of even our inner monologue). There are exceptions, like hyperbolic discounting, but not that many.

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent. Therefore, if X is not entirely coherent then X's preferences are only approximately defined, and hence we only need to infer them approximately. So, the added difficulty of inferring X's preferences, resulting from the partial ...

Second, the only reason why the question "what X wants" can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.

I'm not sure this is true; or if it's true, I'm not sure it's relevant. But assuming it is true...

Therefore, if X is not entirely coherent then X's preferences are only approximately defined, and hence we only need to infer them approximately.

... this strikes me as not capturing the aspect of human values that looks strange and complicated. Two ways I could imagine the strangeness and complexity cashing out as 'EU-maximizer-ish' are:

• Maybe I sort-of contain a lot of subagents, and 'my values' are the conjunction of my sub-agents' values (where they don't conflict), plus the output of an idealized negotiation between my sub-agents (where they do conflict).
• Alternatively, maybe I have a bunch of inconsistent preferences, but I have a complicated pile of meta-preferences that collectively imply some chain of self-modifications and idealizations that end up producing something more coherent and utility-function-ish after a long sequence of steps.

In both cases, the fact that my brain isn't a single coherent EU maximiz...

7Vanessa Kosoy10mo
If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions do we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like "this (intuitively compelling) assumption is false" unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum. Which is to say, I find it self-evident that "agents" are exactly the sort of beings that can "want" things, because agency is about pursuing objectives and wanting is about the objectives that you pursue. If you don't believe this then I don't know what these words even mean for you. Maybe, and maybe this means we need to treat "composite agents" explicitly in our models. But, there is also a case to be made that groups of (super)rational agents effectively converge into a single utility function, and if this is true, then the resulting system can just as well be interpreted as a single agent having this effective utility function, which is a solution that should satisfy the system of agents according to their existing bargaining equilibrium. If your agent converges to optimal behavior asymptotically, then I suspect it's still going to have infinite g and therefore an asymptotically-crisply-defined utility function. Of course it doesn't help on its own. What I mean is, we are going to find a precise mathematical formalization of this concept and then hard-code this formalization into our AGI design.
5Rob Bensinger10mo
Fair enough! I don't think I agree in general, but I think 'OK, but what's your alternative to agency?' is an especially good case for this heuristic. The first counter-example that popped into my head was "a mind that lacks any machinery for considering, evaluating, or selecting actions; but it does have machinery for experiencing more-pleasurable vs. less pleasurable states". This is a mind we should be able to build, even if it would never evolve naturally. Possibly this still qualifies as an "agent" that "wants" and "pursues" things, as you conceive it, even though it doesn't select actions?
9Vanessa Kosoy10mo
My 0th approximation answer is: you're describing something logically incoherent, like a p-zombie. My 1st approximation answer is more nuanced. Words that, in the pre-Turing era, referred exclusively to humans (and sometimes animals, and fictional beings), such as "wants", "experiences" et cetera, might have two different referents. One referent is a natural concept, something tied into deep truths about how the universe (or multiverse) works. In particular, deep truths about the "relatively simple core structure that explains why complicated cognitive machines work". The other referent is something in our specifically-human "ontological model" of the world (technically, I imagine that to be an infra-POMDP that all our hypotheses our refinements of). Since the latter is a "shard" of the former produced by evolution, the two referents are related, but might not be the same. (For example, I suspect that cats lack natural!consciousness but have human!consciousness.) The creature you describe does not natural!want anything. You postulated that it is "experiencing more pleasurable and less pleasurable states", but there is no natural method that would label its states as such, or that would interpret them as any sort of "experience". On the other hand, maybe if this creature is designed as a derivative of the human brain, then it does human!want something, because our shard of the concept of "wanting" mislabels (relatively to natural!want) weird states that wouldn't occur in the ancestral environment. You can then ask, why should we design the AI to follow what we natural!want rather than what we human!want? To answer this, notice that, under ideal conditions, you converge to actions that maximize your natural!want, (more or less) according to definition of natural!want. In particular, under ideal conditions, you would build an AI that follows your natural!want. Hence, it makes sense to take a shortcut and "update now to the view you will predictably update to later":

"the thing where it keeps being literally him doing this stuff is quite a bad sign"

I'm a bit confused by this part. Some thoughts on why it seems odd for him (or others) to express that sentiment...

1. I parse the original as, "a collection of EY's thoughts on why safe AI is hard". It's EY's thoughts, why would someone else (other than @robbensinger) write a collection of EY's thoughts?

(And if we generalize to asking why no-one else would write about why safe AI is hard, then what about Superintelligence, or the AI stuff in cold-takes, or ...?)

2. Was there anything new in this doc? It's prob useful to collect all in one place, but we don't ask, "why did no one else write this" for every bit of useful writing out there, right?

Why was it so overwhelmingly important that someone write this summary at this time, that we're at all scratching our heads about why no one else did it?

Copying over my reply to Eric:

My shoulder Eliezer (who I agree with on alignment, and who speaks more bluntly and with less hedging than I normally would) says:

1. The list is true, to the best of my knowledge, and the details actually matter.

Many civilizations try to make a canonical
...
5handoflixue9mo
I don't think making this list in 1980 would have been meaningful. How do you offer any sort of coherent, detailed plan for dealing with something when all you have is toy examples like Eliza?  We didn't even have the concept of machine learning back then - everything computers did in 1980 was relatively easily understood by humans, in a very basic step-by-step way. Making a 1980s computer "safe" is a trivial task, because we hadn't yet developed any technology that could do something "unsafe" (i.e. beyond our understanding). A computer in the 1980s couldn't lie to you, because you could just inspect the code and memory and find out the actual reality. What makes you think this would have been useful? Do we have any historical examples to guide us in what this might look like?

I think most worlds that successfully navigate AGI risk have properties like:

• AI results aren't published publicly, going back to more or less the field's origin.
• The research community deliberately steers toward relatively alignable approaches to AI, which includes steering away from approaches that look like 'giant opaque deep nets'.
• This means that you need to figure out what makes an approach 'alignable' earlier, which suggests much more research on getting de-confused regarding alignable cognition.
• Many such de-confusions will require a lot of software experimentation, but the kind of software/ML that helps you learn a lot about alignment as you work with it is itself a relatively narrow target that you likely need to steer towards deliberately, based on earlier, weaker deconfusion progress. I don't think having DL systems on hand to play with has helped humanity learn much about alignment thus far, and by default, I don't expect humanity to get much more clarity on this before AGI kills us.
• Researchers focus on trying to predict features of future systems, and trying to get mental clarity about how to align such systems, rather than focusing on 'align ELIZA' just because ELIZA is
...
5Thomas Kwa6mo
"most worlds that successfully navigate AGI risk" is kind of a strange framing to me.  For one thing, it represents p(our world | success) and we care about p(success | our world). To convert between the two you of course need to multiply by p(success) / p(our world). What's the prior distribution of worlds? This seems underspecified. For another, using the methodology "think about whether our civilization seems more competent than the problem is hard" or "whether our civilization seems on track to solve the problem" I might have forecast nuclear annihilation (not sure about this). The methodology seems to work when we're relatively certain about the level of difficulty on the mainline, so if I were more sold on that I would believe this more. It would still feel kind of weird though.
6Vaniver9mo
I mean, I think many of the computing pioneers 'basically saw' AI risk. I noted some surprise that IJ Good didn't write the precursor to this list in 1980, and apparently Wikipedia claims there was an unpublished statement in 1998 [https://en.wikipedia.org/wiki/I._J._Good#Research_and_publications] about AI x-risk; it'd be interesting to see what it contains and how much it does or doesn't line up with our modern conception of why the problem is hard.

The historical figures who basically saw it (George Eliot 1879: "will the creatures who are to transcend and finally supersede us be steely organisms [...] performing with infallible exactness more than everything that we have performed with a slovenly approximativeness and self-defeating inaccuracy?"; Turing 1951: "At some stage therefore we should have to expect the machines to take control") seem to have done so in the spirit of speculating about the cosmic process. The idea of coming up with a plan to solve the problem is an additional act of audacity; that's not really how things have ever worked so far. (People make plans about their own lives, or their own businesses; at most, a single country; no one plans world-scale evolutionary transitions.)

4Andrew McKnight8mo
I'm tempted to call this a meta-ethical failure. Fatalism, universal moral realism, and just-world intuitions seem to be the underlying implicit hueristics or principals that would cause this "cosmic process" thought-blocker.
3ESRogs9mo
Why is this v0 and not https://arbital.com/explore/ai_alignment/, [https://arbital.com/explore/ai_alignment/,] or the Sequences, or any of the documents that Evan links to here [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=HRDoDnHv8bvoW7oPZ]? That's part of what I meant to be responding to — not that this post is not useful, but that I don't see what makes it so special compared to all the other stuff that Eliezer and others have already written.
6ESRogs9mo
To put it another way, I would agree that Eliezer has made (what seem to me like) world-historically-significant contributions to understanding and advocating for (against) AI risk. So, if 2007 Eliezer was asking himself, "Why am I the only one really looking into this?", I think that's a very reasonable question. But here in 2022, I just don't see this particular post as that significant of a contribution compared to what's already out there.
2ESRogs9mo
Wrote a long comment here [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=ePAXXk8AvpdGeynHe]. (Which you've seen, but linking since your comment started as a response to me.)

I agree with pretty much everything here, and I would add into the mix two more claims that I think are especially cruxy and therefore should maybe be called out explicitly to facilitate better discussion:

Claim A: “There’s no defense against an out-of-control omnicidal AGI, not even with the help of an equally-capable (or more-capable) aligned AGI, except via aggressive outside-the-Overton-window acts like preventing the omnicidal AGI from being created in the first place.”

I think this claim is true, on account of gray goo and lots of other things, and I suspect Eliezer does too, and I’m pretty sure other people disagree with this claim.

If someone disagrees with this claim (i.e., if they think that if DeepMind can make an aligned and Overton-window-abiding “helper” AGI, then we don’t have to worry about Meta making a similarly-capable out-of-control omnicidal misaligned AGI the following year, because DeepMind’s AGI will figure out how to protect us), and also believes in extremely slow takeoff, I can see how such a person might be substantially less pessimistic about AGI doom than I am.

Claim B: “Shortly after (i.e., years not decades after) we have dangerous AGI, we will have dang...

I think this claim is true, on account of gray goo and lots of other things, and I suspect Eliezer does too, and I’m pretty sure other people disagree with this claim.

If you have robust alignment, or AIs that are rapidly bootstrapping their level of alignment fast enough to outpace the danger of increased capabilities, aligned AGI could get through its intelligence explosion to get radically superior technology and capabilities. It could also get a hard start on superexponential replication in space, so that no follower could ever catch up, and enough tech and military hardware to neutralize any attacks on it (and block attacks on humans via nukes, bioweapons, robots, nanotech, etc). That wouldn't work if there are thing like vacuum collapse available to attackers, but we don't have much reason to expect that from current science and the leading aligned AGI would find out first.

That could be done without any violation of the territory of other sovereign states. The legality of grabbing space resources is questionable in light of the Outer Space Treaty, but commercial exploitation of asteroids is in the Overton window. The superhuman AGI would also be in a good position to per...

6MichaelStJules10mo
A bit pedantic, but isn't superexponential replication too fast? Won't it hit physical limits eventually, e.g. expanding at the speed of light in each direction, so at most a cubic function of time? Also, never allowing followers to catch up means abandoning at least some or almost all of the space you passed through. Plausibly you could take most of the accessible and useful resources with you, which would also make it harder for pursuers to ever catch up, since they will plausibly need to extract resources every now and then to fuel further travel. On the other hand, it seems unlikely to me that we could extract or destroy resources quickly enough to not leave any behind for pursuers, if they're at most months behind.
7CarlShulman9mo
Naturally it doesn't go on forever, but any situation where you're developing technologies that move you to successively faster exponential trajectories is superexponential overall for some range. E.g. if you have robot factories that can reproduce exponentially until they've filled much of the Earth or solar system, and they are also developing faster reproducing factories,  the overall process is superexponential. So is the history of human economic growth, and the improvement from an AI intelligence explosion. By the time you're at ~cubic expansion being ahead on the early superexponential phase the followers have missed their chance.
3MichaelStJules9mo
I agree that they probably would have missed their chance to catch up with the frontier of your expansion. Maybe an electromagnetic radiation-based assault could reach you if targeted (the speed of light is constant relative to you in a vacuum, even if you're traveling in the same direction), although unlikely to get much of the frontier of your expansion, and there are plausibly effective defenses, too. Do you also mean they wouldn't be able to take most what you've passed through, though? Or it wouldn't matter? If so, how would this be guaranteed (without any violation of the territory of sovereign states on Earth)? Exhaustive extraction in space? An advantage in armed space conflicts?

I agree with these two points. I think an aligned AGI actually able to save the world would probably take initial actions that look pretty similar to those an unaligned AGI would take. Lots of sizing power, building nanotech, colonizing out into space, self-replication, etc.

4Yitz10mo
So how would we know the difference (for the first few years at least)?

If it kills you, then it probably wasn’t aligned.

1Gerald Monroe9mo
Maybe it did that to save your neural weights.  Define 'kill'.
3Quintin Pope9mo
I did say “probably”!
7lc10mo
I disagree with this claim inasmuch as I expect a year headstart by an aligned AI is absolutely enough to prevent Meta from killing me and my family.

Depends on what DeepMind does with the AI, right?

Maybe DeepMind uses their AI in very narrow, safe, low-impact ways to beat ML benchmarks, or read lots of cancer biology papers and propose new ideas about cancer treatment.

Or alternatively, maybe DeepMind asks their AI to undergo recursive self-improvement and build nano-replicators in space, etc., like in Carl Shulman’s reply.

I wouldn’t have thought that the latter is really in the Overton window. But what do I know.

You could also say “DeepMind will just ask their AI what they should do next”. If they do that, then maybe the AI (if they’re doing really great on safety such that the AI answers honestly and helpfully) will reply: “Hey, here’s what you should do, you should let me undergo recursive-self-improvement, and then I’ll be able to think of all kinds of crazy ways to destroy the world, and then I can think about how to defend against all those things”. But if DeepMind is being methodical & careful enough that their AI hasn’t destroyed the world already by this point, I’m inclined to think that they’re also being methodical & careful enough that when the AI proposes to do that, DeepMind will say, “Umm, no, that’s total...

3lc10mo
If DeepMind was committed enough to successfully build an aligned AI (which, as extensively elaborated upon in the post, is a supernaturally difficult proposition), I would assume they understand why running it is necessary. There's no reason to take all of the outside-the-overton-window measures indicated in the above post unless you have functioning survival instincts and have thought through the problem sufficiently to hit the green button.
2MichaelStJules10mo
If you can build one aligned superintelligence, then plausibly you can 1. explain to other AGI developers how to make theirs safe or even just give them a safe design (maybe homomorphically encrypted to prevent modification, but they might not trust that), and 2. have aligned AGI monitoring the internet and computing resources, and alert authorities of abnomalies that might signal new AGI developments. Require that AGI developments provide proof that they were designed according to one of a set of approved designs, or pass some tests determined by your aligned superintelligence. Then aligned AGI can proliferate first and unaligned AGI will plausibly face severe barriers. Plausibly 1 is enough, since there's enough individual incentive to build something safe or copy other people's designs and save work. 2 depends on cooperation with authorities and I'd guess cloud computing service providers or policy makers.

explain to other AGI developers how to make theirs safe or even just give them a safe design (maybe homomorphically encrypted to prevent modification, but they might not trust that)

What if the next would-be AGI developer rejects your “explanation”, and has their own great ideas for how to make an even better next-gen AGI that they claim will work better, and so they discard your “gift” and proceed with their own research effort?

I can think of at least two leaders of would-be AGI development efforts (namely Yann LeCun of Meta and Jeff Hawkins of Numenta) who believe (what I consider to be) spectacularly stupid things about AGI x-risk, and have believed those things consistently for decades, despite extensive exposure to good counter-arguments.

Or what if the next would-be AGI developer agrees with you and accepts your “gift”, and so does the one after that, and the one after that, but not the twelfth one?

have aligned AGI monitoring the internet and computing resources, and alert authorities of [anomalies] that might signal new AGI developments. Require that AGI developments provide proof that they were designed according to one of a set of approved designs, or pass some tests determi

...
4MichaelStJules10mo
When you ask "what if", are you implying these things are basically inevitable? And inevitable no matter how much more compute aligned AGIs have before unaligned AGIs are developed and deployed? How much of a disadvantage against aligned AGIs does an unaligned AGI need before doom isn't overwhelmingly likely? What's the goal post here for survival probability? You can have AGIs monitoring for pathogens, nanotechnology, other weapons, and building defenses against them, and this could be done locally and legally. They can monitor transactions and access to websites through which dangerous physical systems (including possibly factories, labs, etc.) could be taken over or built. Does every country need to be competent and compliant to protect just one country from doom? The Overton window could also shift dramatically if omnicidal weapons are detected. I agree that plausibly not every country with significant compute will comply, and hacking everyone is outside the public Overton window. I wouldn't put hacking everyone past the NSA, but also wouldn't count on them either.
4Steven Byrnes10mo
Let’s see, I think “What if the next would-be AGI developer rejects your “explanation” / “gift”” has a probability that asymptotes to 100% as the number of would-be AGI developers increases. (Hence “Claim B” above [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=Sf4caqwxDFyYHoxFZ#Sf4caqwxDFyYHoxFZ] becomes relevant.) I think “What if the authorities in most countries do care, but not the authorities in every single country?” seems to have high probability in today’s world, although of course I endorse efforts to lower the probability. I think “What if the only way to “monitor the internet and computing resources” is to hack into every data center and compute cluster on the planet? (Including those in secret military labs.)” seems very likely to me, conditional on “Claim B” above [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=Sf4caqwxDFyYHoxFZ#Sf4caqwxDFyYHoxFZ]. Hmm. Offense-defense balance in bio-warfare is not obvious to me. Preventing a virus from being created would seem to require 100% compliance by capable labs, but I’m not sure how many “capable labs” there are, or how geographically distributed and rule-following. Once the virus starts spreading, aligned AGIs could help with vaccines, but apparently a working COVID-19 vaccine was created in 1 day, and that didn’t help much, for various societal coordination & governance reasons. So then you can say “Maybe aligned AGI will solve all societal coordination and governance problems”. And maybe it will! Or, maybe some of those coordination & governance problems come from blame-avoidance and conflicts-of-interest and status-signaling and principle-agent problems and other things that are not obviously solvable by easy access to raw intelligence. I don’t know. Offense-defense balance in nuclear warfare is likewise not obvious to me. I presume that an unaligned AGI could find a way to manipulate nuclear early warning systems (t
4MichaelStJules10mo
Some more skepticism about infectious diseases and nukes killing us all here: https://www.lesswrong.com/posts/MLKmxZgtLYRH73um3/we-will-be-around-in-30-years?commentId=DJygArj3sj8cmhmme [https://www.lesswrong.com/posts/MLKmxZgtLYRH73um3/we-will-be-around-in-30-years?commentId=DJygArj3sj8cmhmme] Also my more general skeptical take against non-nano attacks here: https://www.lesswrong.com/posts/MLKmxZgtLYRH73um3/we-will-be-around-in-30-years?commentId=TH4hGeXS4RLkkuNy5 [https://www.lesswrong.com/posts/MLKmxZgtLYRH73um3/we-will-be-around-in-30-years?commentId=TH4hGeXS4RLkkuNy5] With nanotech, I think there will be tradeoffs between targeting effectiveness and requiring (EM) signals from computers that can be effectively interferred with through things within or closer to the Overton window. Maybe a crux is how good autonomous nanotech with no remote control would be at targeting humans or spreading so much that it just gets into almost all buildings or food or water because it's basically going everywhere.
4Steven Byrnes9mo
Thanks! I wasn’t assuming the infectious diseases and nukes by themselves would kill us all. They don’t have to, because the AGI can do other things in conjunction, like take command of military drones and mow down the survivors (or bomb the PPE factories), or cause extended large-scale blackouts, which would incidentally indirectly prevent PPE production and distribution, along with preventing pretty much every other aspect of an organized anti-pandemic response. See Section 1.6 here [https://www.lesswrong.com/posts/4basF9w9jaPZpoC8R/intro-to-brain-like-agi-safety-1-what-s-the-problem-and-why#1_6_Why_are_AGI_accidents_such_a_big_deal_]. So that brings us to the topic of offense-defense balance for illicitly taking control of military drones. And I would feel concerned about substantial delays before the military trusts a supposedly-aligned AGI so much that they give it root access to all its computer systems (which in turn seems necessary if the aligned AGI is going to be able to patch all the security holes, defend against spear-phishing attacks, etc.) Of course there’s the usual caveat that maybe DeepMind will give their corrigible aligned AGI permission to hack into military systems (for their own good!), and then maybe we wouldn’t have to worry. But the whole point of this discussion is that I’m skeptical that DeepMind would actually give their AGI permission to do something like that. And likewise we would need to talk about offense-defense balance for the power grid. And I would have the same concern about people being unwilling to give a supposedly-aligned AGI root access to all the power grid computers. And I would also be concerned about other power grid vulnerabilities like nuclear EMPs, drone attacks on key infrastructure, etc. And likewise, what’s the offense-defense balance for mass targeted disinformation campaigns? Well, if DeepMind gives its AGI permission to engage in a mass targeted counter-disinformation campaign, maybe we’d be OK on that fr
3MichaelStJules9mo
I think there would be too many survivors and enough manned defense capability for existing drones to directly kill the rest of us with high probability. Blocking PPE production and organized pandemic responses still won't stop people from self-isolating, doing no contact food deliveries, etc., although things would be tough, and deliveries and food production would be good targets for drone strikes. It could be bad if lethal pathogens become widespread and practically unremovable in our food/water, or if food production is otherwise consistently attacked, but the militaries would probably step in to protect the food/water supplies. I think, overall, there are too few ways to reliably and kill double or even single digit percentages of the human population with high probability and that can be combined to get basically everyone with high probability. I'm not saying there aren't any, but I'm skeptical that there are enough. There are diminishing returns on doing the same ones (like pandemics) more, because of resistance, and enough people being personally very careful or otherwise difficult targets.

Found this to be an interesting list of challenges, but I disagree with a few points. (Not trying to be comprehensive here, just a few thoughts after the first read-through.)

• Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.
• One claim is that Capabilities generalize further than alignment once capabilities start to generalize far. The argument is that an agent's world model and tactics will be automatically fixed by reasoning and data, but its inner objective won't be changed by these things. I agree with the preceding sentence, but I would draw a different (and more optimistic) conclusion from it. That it might be possible to establish an agent's inner objective when training on easy problems, when the agent isn't very capable, such that this objective remains stable a
...

But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.

My model of Eliezer claims that there are some capabilities that are 'smooth', like "how large a times table you've memorized", and some are 'lumpy', like "whether or not you see the axioms behind arithmetic." While it seems plausible that we can iteratively increase smooth capabilities, it seems much less plausible for lumpy capabilities.

A specific example: if you have a neural network with enough capacity to 1) memorize specific multiplication Q+As and 2) implement a multiplication calculator, my guess is that during training you'll see a discontinuity in how many pairs of numbers it can successfully multiply.[1] It is not obvious to me whether or not there are relevant capabilities like this that we'll "find with neural nets" instead of "explicitly programming in"; probably we will just build AlphaZero so that it uses MCTS instead of finding MCTS with grad...

5John Schulman10mo
Re: smooth vs bumpy capabilities, I agree that capabilities sometimes emerge abruptly and unexpectedly. Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes. There are multiple ways to make deployment more conservative and gradual. (E.g., incrementally increase the amount of work the AI is allowed to do without close supervision, incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy.) Re: ontological collapse, there are definitely some tricky issues here, but the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn't really have goals and isn't good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.
6Vaniver10mo
I agree with the "X is safer than Y" claim; I am uncertain whether it's practically available to us, and much more worried in worlds where it isn't available. For this specific proposal, when I reframe it as "give the system a KL-divergence budget to spend on each change to its policy" I worry that it works against a stochastic attacker but not an optimizing attacker; it may be the case that every known-to-be-safe policy has some unsafe policy within a reasonable KL-divergence of it, because the danger can be localized in changes to some small part of the overall policy-space. Yeah, I agree that this seems pretty good. I do naively guess that when you do the fine-tuning, it's the concepts that are most related to the goals who change the most (as they have the most gradient pressure on them); it'd be nice to know how much this is the case, vs. most of the relevant concepts being durable parts of the environment that were already very important for goal-free prediction.

Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it's much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time.

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later?  What is the weak pivotal act that you can perform so safely?

Human raters make systematic errors - regular, compactly describable, predictable errors.... This is indeed one of the big problems of outer alignment, but there's lots of ongoing research and promising ideas for fixing it. Namely, using models to help amplify and improve the human feedback signal. Because P!=NP it's easier to verify proofs than to write them.

When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.

To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later?  What is the weak pivotal act that you can perform so safely?

Do alignment & safety research, set up regulatory bodies and monitoring systems.

When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.

Not sure exactly what this means. I'm claiming that you can make raters less flawed, for example, by decomposing the rating task, and providing model-generated critiques that help with their rating. Also, as models get more sample efficient, you can rely more on highly skilled and vetted raters.

Not sure exactly what this means.

My read was that for systems where you have rock-solid checking steps, you can throw arbitrary amounts of compute at searching for things that check out and trust them, but if there's any crack in the checking steps, then things that 'check out' aren't trustable, because the proposer can have searched an unimaginably large space (from the rater's perspective) to find them. [And from the proposer's perspective, the checking steps are the real spec, not whatever's in your head.]

In general, I think we can get a minor edge from "checking AI work" instead of "generating our own work" and that doesn't seem like enough to tackle 'cognitive megaprojects' (like 'cure cancer' or 'develop a pathway from our current society to one that can reliably handle x-risk' or so on). Like, I'm optimistic about "current human scientists use software assistance to attempt to cure cancer" and "an artificial scientist attempts to cure cancer" and pretty pessimistic about "current human scientists attempt to check the work of an artificial scientist that is attempting to cure cancer." It reminds me of translators who complained pretty bitterly about being given machine-transl...

7Raphaël S9mo
If Facebook AI research is such a threat, wouldn't it be possible to talk to Yann LeCun?

I did, briefly.  I ask that you not do so yourself, or anybody else outside one of the major existing organizations, because I expect that will make things worse as you annoy him and fail to phrase your arguments in any way he'd find helpful.

Other MIRI staff have also chatted with Yann. One co-worker told me that he was impressed with Yann's clarity of thought on related topics (e.g., he has some sensible, detailed, reductionist models of AI), so I'm surprised things haven't gone better.

Non-MIRI folks have talked to Yann too; e.g., Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More.

9TekhneMakre9mo
What happened?

Nothing much.

There was also a debate between Yann and Stuart Russel on facebook, which got discussed here:

https://www.lesswrong.com/posts/WxW6Gc6f2z3mzmqKs/debate-on-instrumental-convergence-between-lecun-russell

For a more comprehensive writeup of some stuff related to the "annoy him and fail to phrase your arguments helpfully", see Idea Innoculation and Inferential Distance

My view is that if Yann continues to be interested in arguing about the issue then there's something to work with, even if he's skeptical, and the real worry is if he's stopped talking to anyone about it (I have no idea personally what his state of mind is right now)

1jrincayc9mo
Produce the Textbook From The Future that tells us how to do AGI safely. That said, getting an AGI to generate a correct Foom safety textbook or AGI Textbook from the future would be incredibly difficult, it would be very possible for an AGI to slip in a subtle hard-to-detect inaccuracy that would make it worthless, verifying that it is correct would be very difficult, and getting all humans on earth to follow it would be very difficult.

-3.  I'm assuming you are already familiar with some basics, and already know what 'orthogonality' and 'instrumental convergence' are and why they're true.

I think this is actually the part that I most "disagree" with. (I put "disagree" in quotes, because there are forms of these theses that I'm persuaded by. However, I'm not so confident that they'll be relevant for the kinds of AIs we'll actually build.)

1. The smart part is not the agent-y part

It seems to me that what's powerful about modern ML systems is their ability to do data compression / pattern recognition. That's where the real cognitive power (to borrow Eliezer's term) comes from. And I think that this is the same as what makes us smart.

GPT-3 does unsupervised learning on text data. Our brains do predictive processing on sensory inputs. My guess (which I'd love to hear arguments against!) is that there's a true and deep analogy between the two, and that they lead to impressive abilities for fundamentally the same reason.

If so, it seems to me that that's where all the juice is. That's where the intelligence comes from. (In the past, I've called this the core smarts of our brains.)

On this view, all the agent-y, planful...

GPT-3 does unsupervised learning on text data. Our brains do predictive processing on sensory inputs. My guess (which I'd love to hear arguments against!) is that there's a true and deep analogy between the two, and that they lead to impressive abilities for fundamentally the same reason.

Agree that self-supervised learning powers both GPT-3 updates and human brain world-model updates (details & caveats). (Which isn’t to say that GPT-3 is exactly the same as the human brain world-model—there are infinitely many different possible ML algorithms that all update via self-supervised learning).

However…

If so, it seems to me that that's where all the juice is. That's where the intelligence comes from … if agency is not a fundamental part of intelligence, and rather something that can just be added in on top, or not, and if we're at a loss for how to either align a superintelligent agent with CEV or else make it corrigible, then why not try to avoid creating the agent part of superintelligent agent?

I disagree; I think the agency is necessary to build a really good world-model, one that includes new useful concepts that humans have never thought of.

Without the agency, some of the things ...

4ESRogs9mo
Why is agency necessary for these things? If we follow Ought's advice [https://www.lesswrong.com/posts/pYcFPMBtQveAjcSfH/supervise-process-not-outcomes] and build "process-based systems [that] are built on human-understandable task decompositions, with direct supervision of reasoning steps", do you expect us to hit a hard wall somewhere that prevents these systems from creatively choosing things to think about, books to read, or better brainstorming strategies?
7Steven Byrnes9mo
(Copying from here [https://www.lesswrong.com/posts/SzrmsbkqydpZyPuEh/my-take-on-vanessa-kosoy-s-take-on-agi-safety]:) (Does that count as “agency”? I don’t know, it depends on what you mean by “agency”.) In terms of the “task decomposition” strategy, this might be a tricky to discuss because you probably have a more detailed picture in your mind than I do. I’ll try anyway. It seems to me that the options are: (1) the subprocess only knows its narrow task (“solve this symplectic geometry homework problem”), and is oblivious to the overall system goal (“design a better microscope”), or (2) the subprocess is aware of the overall system goal and chooses actions in part to advance it. In Case (2), I’m not sure this really counts as “task decomposition” in the first place, or how this would help with safety. In Case (1), yes I expect systems to hit a hard wall—I’m skeptical that tasks we care about decompose cleanly. For example, at my last job, I would often be part of a team inventing a new gizmo, and it was not at all unusual for me to find myself sketching out the algorithms and sketching out the link budget and scrutinizing laser spec sheets and scrutinizing FPGA spec sheets and nailing down end-user requirements, etc. etc. Not because I’m individually the best person at each of those tasks—or even very good!—but because sometimes a laser-related problem is best solved by switching to a different algorithm, or an FPGA-related problem is best solved by recognizing that the real end-user requirements are not quite what we thought, etc. etc. And that kind of design work is awfully hard unless a giant heap of relevant information and knowledge is all together in a single brain / world-model. In the case of my current job doing AI alignment research, I sometimes come across small self-contained tasks that could be delegated, but I would have no idea how to decompose most of what I do. (E.g. writing this comment!) Here’s John Wentworth making a similar point mor
1David Johnston9mo
FWIW self-supervised learning can be surprisingly capable [https://www.lesswrong.com/posts/c2RzFadrxkzyRAFXa/who-models-the-models-that-model-models-an-exploration-of] of doing things that we previously only knew how to do with "agentic" designs. From that link: classification is usually done with an objective + an optimization procedure, but GPT-3 just does it.

For example, I claim that while AlphaGo could be said to be agent-y, it does not care about atoms. And I think that we could make it fantastically more superhuman at Go, and it would still not care about atoms. Atoms are just not in the domain of its utility function.

In particular, I don't think it has an incentive to break out into the real world to somehow get itself more compute, so that it can think more about its next move. It's just not modeling the real world at all. It's not even trying to rack up a bunch of wins over time. It's just playing the single platonic game of Go.

I would distinguish three ways in which different AI systems could be said to "not care about atoms":

1. The system is thinking about a virtual object (e.g., a Go board in its head), and it's incapable of entertaining hypotheses about physical systems. Indeed, we might add the assumption that it can't entertain hypotheses like 'this Go board I'm currently thinking about is part of a larger universe' at all. (E.g., there isn't some super-Go-board I and/or the board are embedded in.)
2. The system can think about atoms/physics, but it only terminally cares about digital things in a simulated environment (e.g., winni
...
5ESRogs9mo
In my mind, this is still making the mistake of not distinguishing the true domain of the agent's utility function from ours. Whether the simulation continues to be instantiated in some computer in our world is a fact about our world, not about the simulated world. AlphaGo doesn't care about being unplugged in the middle of a game (unless that dynamic was part of its training data). It cares about the platonic game of go, not about the instantiated game it's currently playing. We need to worry about leaky abstractions, as per my original comment. So we can't always assume the agent's domain is what we'd ideally want it to be. But I'm trying to highlight that it's possible (and I would tentatively go further and say probable) for agents not to care about the real world. To me, assuming care about the real world (including wanting not to be unplugged) seems like a form of anthropomorphism. For any given agent-y system I think we need to analyze whether it in particular would come to care about real world events. I don't think we can assume in general one way or the other.
6Rob Bensinger9mo
What if the programmers intervene mid-game to give the other side an advantage? Does a Go AGI, as you're thinking of it, care about that? I'm not following why a Go AGI (with the ability to think about the physical world, but a utility function that only cares about states of the simulation) wouldn't want to seize more hardware, so that it can think better and thereby win more often in the simulation; or gain control of its hardware and directly edit the simulation so that it wins as many games as possible as quickly as possible. Why would having a utility function that only assigns utility based on X make you indifferent to non-X things that causally affect X? If I only terminally cared about things that happened a year from now, I would still try to shape the intervening time because doing so will change what happens a year from now. (This is maybe less clear in the case of shutdown, because it's not clear how an agent should think about shutdown if its utility is defined states of its simulation. So I'll set that particular case aside.)
2David Johnston9mo
A Go AI that learns to play go via reinforcement learning might not "have a utility function that only cares about winning Go". Using standard utility theory, you could observe its actions and try to rationalise them as if they were maximising some utility function, and the utility function you come up with probably wouldn't be "win every game of Go you start playing" (what you actually come up with will depend, presumably, on algorithmic and training regime details). The reason why the utility function is slippery is that it's fundamentally an adaptation executor, not a utility maxmiser.
2David Johnston9mo
Not necessarily. Train something multimodally on digital games of Go and on, say, predicting the effects of modifications to its own code on its success at Go. It could be a) good at go and b) have some real understanding of "real world actions" that make it better at Go, and still not actually take any real world actions to make it better at Go, even if it had the opportunity. You could modify the training to make it likely to do so - perhaps by asking it to either make a move or to produce descendants that make better choices - but if you don't do this then it seems entirely plausible, and even perhaps likely, that it develops an understanding of self-modification and of go playing without ever self-modifying in order to play go better. Its goal, so to speak, is "play go with the restriction of using only legal game moves". Edit - forget the real world, here's an experiment: Train a board game playing AI with two modes of operation: game state x move -> outcome and game state -> best move. Subtle difference: in the first mode of operation, the move has a "cheat button" that, when pressed, always results in a win. In the second, it can output cheat button presses, but it has no effect on winning or losing. Question is: does it learn to press the cheat button? I'm really not sure. Could you prevent it from learning to press the cheat button if training feedback is never allowed to depend on whether or not this button was pressed? That seems likely.
7James Payor9mo
Can you visualize an agent that is not "open-ended" in the relevant ways, but is capable of, say, building nanotech and melting all the GPUs? In my picture most of the extra sauce you'd need on top of GPT-3 looks very agenty. It seems tricky to name "virtual worlds" in which AIs manipulate just "virtual resources" and still manage to do something like melting the GPUs.
8James Payor9mo
I should say that I do see this as a reasonable path forward! But we don't seem to be coordinating to do this, and AI researchers seem to love doing work on open-ended agents, which sucks. Hm, regardless it doesn't really move the needle, so long as people are publishing all of their work. Developing overpowered pattern recognizers is similar to increasing our level of hardware overhang. People will end up using them as components of systems that aren't safe.
4David Johnston9mo
I strongly disagree. Gain of function research happens, but it's rare because people know it's not safe. To put it mildly, I think reducing the number of dangerous experiments substantially improves the odds of no disaster happening over any given time frame
5ESRogs9mo
FWIW, I'm not sold on the idea of taking a single pivotal act. But, engaging with what I think is the real substance of the question — can we do complex, real-world, superhuman things with non-agent-y systems? Yes, I think we can! Just as current language models can be prompt-programmed into solving arithmetic word problems, I think a future system could be led to generate a GPU-melting plan, without it needing to be a utility-maximizing agent. For a very hand-wavy sketch of how that might go, consider asking GPT-N to generate 1000s of candidate high-level plans, then rate them by feasibility, then break each plan into steps and re-evaluate, etc. Or, alternatively, imagine the cognitive steps you might take if you were trying to come up with a GPU-melting plan (or alternatively a pivotal act plan in general). Do any of those steps really require that you have a utility function or that you're a goal-directed agent? It seems to me that we need some form of search, and discrimination and optimization. But not necessarily anymore than GPT-3 already has. (It would just need to be better at the search. And we'd need to make many many passes through the network to complete all the cognitive steps.) On your view, what am I missing here? * Is GPT-3 already more of an agent than I realize? (If so, is it dangerous?) * Will GPT-N by default be more of an agent than GPT-3? * Are our own thought processes making use of goal-directedness more than I realize? * Will prompt-programming passive systems hit a wall somewhere? * If so, what are some of the simplest cognitive tasks that we can do that you think such systems wouldn't be able to do? * (See also my similar question here [https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities?commentId=iqwxMcpxeWG4Sk65h].)
4David Johnston9mo
FWIW, I'd call this "weakly agentic" in the sense that you're searching through some options, but the number of options you're looking through is fairly small. It's plausible that this is enough to get good results and also avoid disasters, but it's actually not obvious to me. The basic reason: if the top 1000 plans are good enough to get superior performance, they might also be "good enough" to be dangerous. While it feels like there's some separation between "useful and safe" and "dangerous" plans and this scheme might yield plans all of the former type, I don't presently see a stronger reason to believe that this is true.
6ESRogs9mo
Separately from whether the plans themselves are safe or dangerous, I think the key question is whether the process that generated the plans is trying to deceive you (so it can break out into the real world or whatever). If it's not trying to deceive you, then it seems like you can just build in various safeguards (like asking, "is this plan safe?", as well as more sophisticated checks), and be okay.
2TekhneMakre9mo
>then rate them by feasibility, I mean, literal GPT is just going to have poor feasibility ratings for novel engineering concepts. >Do any of those steps really require that you have a utility function or that you're a goal-directed agent? Yes, obviously. You have to make many scientific and engineering discoveries, which involves goal-directed investigation.  > Are our own thought processes making use of goal-directedness more than I realize? Yes, you know which ideas make sense by generalizing from ideas more closely tied in with the actions you take directed towards living.
4David Johnston9mo
What do you think of a claim like "most of the intelligence comes from the steps where you do most of the optimization"? A corollary of this is that we particularly want to make sure optimization intensive steps of AI creation are safe WRT not producing intelligent programs devoted to killing us. Example: most of the "intelligence" of language models comes from the supervised learning step. However, it's in-principle plausible that we could design e.g. some really capable general purpose reinforcement learner where the intelligence comes from the reinforcement, and the latter could (but wouldn't necessarily) internalise "agenty" behaviour. I have a vague impression that this is already something other people are thinking about, though maybe I read too much into some tangential remarks in this direction. E.g. I figured the concern about mesa-optimizers was partly motivated by the idea that we can't always tell when an optimization intensive step is taking place. I can easily imagine people blundering into performing unsafe optimization-intensive AI creation processes. Gain of function pathogen research would seem to be a relevant case study here, except we currently have less idea about what kind of optimization makes deadly AIs vs what kind of optimization makes deadly pathogens. One of the worries (again, maybe I'm reading too far into comments that don't say this explicitly) is that the likelihood of such a blunder approaches 1 over long enough times, and the "pivotal act" framing is supposed to be about doing something that could change this (??) That said, it seems that there's a lot that could be done to make it less likely in short time frames.
3ESRogs9mo
This seems probably right to me. I agree that reinforcement learners seem more likely to be agent-y (and therefore scarier) than self-supervised learners.

I think until recently, I've been consistently more pessimistic than Eliezer about AI existential safety. Here's a 2004 SL4 post for example where I tried to argue against MIRI (SIAI at the time) trying to build a safe AI (and again in 2011). I've made my own list of sources of AI risk that's somewhat similar to this list. But it seems to me that there are still various "outs" from certain doom, such that my probability of a good outcome is closer to 20% (maybe a range of 10-30% depending on my mood) than 1%.

1. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

One of the...

[This is a nitpick of the form "one of your side-rants went a bit too far IMO;" feel free to ignore]

The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try. ... The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that.

The third option this seems to miss is that there are people who could have written this document, but they also thought they had better things to do than write it. I'm thinking of people like Paul Christiano, Nate Soares, John W...

I'm thinking of people like Paul Christiano, Nate Soares, John Wentworth, Ajeya Cotra...  [...] I do agree with you that they seem to on average be way way too optimistic, but I don't think it's because they are ignorant of the considerations and arguments you've made here.

I don't think Nate is that much more optimistic than Eliezer, but I believe Eliezer thinks Nate couldn't have generated enough of the list in the OP, or couldn't have generated enough of it independently ("using the null string as input").

1gettinglesswrong9mo
>too would be cautiously optimistic if I thought we had 30 years left   This is a bit of an aside but can I ask what the general opinion is on how many years we had left? Was your comment stating that it's optimistic to think we have 30 years left before AGI, or optimistic about the remainder of the sentence?
2[comment deleted]9mo
-13Eliezer Yudkowsky10mo

I would summarize a dimension of the difficulty like this. There are the conditions that give rise to intellectual scenes, intellectual scenes being necessary for novel work in ambiguous domains. There are the conditions that give rise to the sort of orgs that output actions consistent with something like Six Dimensions of Operational Adequacy. The intersection of these two things is incredibly rare but not unheard of. The Manhattan Project was a Scene that had security mindset. This is why I am not that hopeful. Humans are not the ones building the AGI, egregores are, and spending egregore sums of money. It is very difficult for individuals to support a scene of such magnitude, even if they wanted to. Ultra high net worth individuals seem much poorer relative to the wealth of society than in the past, where scenes and universities (a scene generator) could be funded by individuals or families. I'd guess this is partially because the opportunity cost for smart people is much higher now, and you need to match that (cue title card: Baumol's cost disease kills everyone). In practice I expect some will give objections along various seemingly practical lines, but my experience so far is...

6Ben Pace10mo
Thanks, this story is pretty helpful (to my understanding).

Thank you, this was very helpful. As a bright-eyed youngster, it's hard to make sense of the bitterness and pessimism I often see in the field. I've read the old debates, but I didn't participate in them, and that probably makes them easier to dismiss. Object level arguments like these help me understand your point of view.

Mod note: I activated two-axis voting on this post, since it seemed like it would make the conversation go better.

7lc10mo
You should just activate it sitewide already :)

New users are pretty confused by it when I've done some user-testing with it, so I think it needs some polish and better UI before we can launch it sitewide, but I am pretty excited about doing so after that.

6Harry Nyquist9mo
5handoflixue10mo
For what it's worth, I haven't used the site in years and I picked it up just from this thread and the UI tooltips. The most confusing thing was realizing "okay, there really are two different types of vote" since I'd never encountered that before, but I can't think of much that would help (maybe mention it in the tooltip, or highlight them until the user has interacted with both?) Looking forward to it as a site-wide feature - just from seeing it at work here, it seems like a really useful addition to the site

Note: I think there's a bunch of additional reasons for doom, surrounding "civilizational adequacy / organizational competence / societal dynamics". Eliezer briefly alluded to these, but AFAICT he's mostly focused on lethality that comes "early", and then didn't address them much. My model of Andrew Critch has a bunch of concerns about doom that show up later, because there's a bunch of additional challenges you have to solve if AI doesn't dramatically win/lose early on (i.e. multi/multi dynamics and how they spiral out of control)

I know a bunch of people whose hope funnels through "We'll be able to carefully iterate on slightly-smarter-than-human-intelligences, build schemes to play them against each other, leverage them to make some progress on alignment that we can use to build slightly-more-advanced-safer-systems". (Let's call this the "Careful Bootstrap plan")

I do actually feel nonzero optimism about that plan, but when I talk to people who are optimistic about that I feel a missing mood about the kind of difficulty that is involved here.

I'll attempt to write up some concrete things here later, but wanted to note this for now.

0HiroSakuraba9mo
I agree with this line of thought regarding iterative developments of proto-AGI via careful bootstrapping.  Humans will be inadequate for monitoring progress of skills.  Hopefully, we'll have a slew of diagnostic of narrow minded neural networks whose sole purpose is to tease out relevant details of the proto-super human intellect.  What I can't wrap my head around is whether super (or sub) human level intelligence requires consciousness.  If consciousness is required, then is the world worse or better for it?  Is an agent with the rich experience of fears, hopes, joys more or less likely to be built?  Do reward functions reliably grow into feelings, which lead to emotional experiences?  If they do, then perhaps an evolving intelligence wouldn't always be as alien as we currently imagine it.

If someone could find a way to rewrite this post, except in language comprehensible to policymakers, tech executives, or ML researchers, then it would probably achieve a lot.

Yes, please do rewrite the post, or make your own version of a post like this!! :) I don't suggest trying to persuade arbitrary policymakers of AGI risk, but I'd be very keen on posts like this optimized to be clear and informative to different audiences. Especially groups like 'lucid ML researchers who might go into alignment research', 'lucid mathematicians, physicists, etc. who might go into alignment research', etc.

Suggestion: make it a CYOA-style interactive piece, where the reader is tasked with aligning AI, and could choose from a variety of approaches which branch out into sub-approaches and so on. All of the paths, of course, bottom out in everyone dying, with detailed explanations of why. This project might then evolve based on feedback, adding new branches that counter counter-arguments made by people who played it and weren't convinced. Might also make several "modes", targeted at ML specialists, general public, etc., where the text makes different tradeoffs regarding technicality vs. vividness.

I'd do it myself (I'd had the idea of doing it before this post came out, and my preliminary notes covered much of the same ground, I feel the need to smugly say), but I'm not at all convinced that this is going to be particularly useful. Attempts to defeat the opposition by building up a massive evolving database of counter-arguments have been made in other fields, and so far as I know, they never convinced anybody.

The interactive factor would be novel (as far as I know), but I'm still skeptical.

(A... different implementation might be to use a fine-tuned language model for this; make it an AI Dungeon kind of setup, where it provides specialized counter-arguments for any suggestion. But I expect it to be less effective than a more coarse hand-written CYOA, since the readers/players would know that the thing they're talking to has no idea what it's talking about, so would disregard its words.)

Arbital was meant to support galaxy-brained attempts like this; Arbital failed.

6Thane Ruthenis10mo
Failed as a platform for hosting galaxy-brained attempts, or failed as in every similar galaxy-brained attempt on it failed? I haven't spent a lot of time there, but my impression is that Arbital is mostly a wiki-style collection of linked articles, not a dumping ground of standalone esoterically-structured argumentative pieces. And while a wiki is conceptually similar, presentation matters a lot. A focused easily-traversable tree of short-form arguments in a wrapper that encourages putting yourself in the shoes of someone trying to fix the problem may prove more compelling. (Not to make it sound like I'm particularly attached to the idea after all. But there's a difference between "brilliant idea that probably won't work" and "brilliant idea that empirically failed".)

Arbital was a very conjunctive project, trying to do many different things, with a specific team, at a specific place and time. I wouldn't write off all Arbital-like projects based on that one data point, though I update a lot more if there are lots of other Arbital-ish things that also failed.

5ESRogs9mo
As a person who worked on Arbitral, I agree with this.
4CronoDAS10mo
A strange game. The only winning move is not to play. ;)
4Thane Ruthenis10mo
I guess we should also kidnap people and force them to play it, and if they don't succeed we kill them? For realism? Wait, there's something wrong with this plan. More seriously, yeah, if you're implementing it more like a game and less like an interactive article, it'd need to contain some promise of winning. Haven't considered how to do it without compromising the core message.
What if "winning" consists of finding a new path not already explored-and-foreclosed? For example, each time you are faced with a list of choices of what to do, there's a final choice "I have an idea not listed here" where you get to submit a plan of action. This goes into a moderation engine where a chain of people get to shoot down the idea or approve it to pass up the chain. If the idea gets convincingly shot down (but still deemed interesting), it gets added to the story as a new branch. If it gets to the top of the moderation chain and makes EY go "Hm, that might work" then you win the game.
4Thane Ruthenis9mo
Mmm. If the CYOA idea is implemented as a quirky-but-primarily-educational article, then sure, integrating the "adapt to feedback" capability like this would be worthwhile. Might also attach a monetary prize to submitting valuable ideas, by analogy to the ELK contest. For a game-like implementation, where you'd be playing it partly for the fun/challenge of it, that wouldn't suffice. The feedback loop's too slow, and there'd be an ugh-field around the expectation that submitting a proposal would then require arguing with the moderators about it, defending it. It wouldn't feel like a game. It'd make the upkeep cost pretty high, too, without a corresponding increase in the pay-off. Just making it open-ended might work, even without the moderation engine? Track how many branches the player explored, once they've explored a lot (i. e., are expected to "get" the full scope of the problem), there appears an option for something like "I really don't know what to do, but we should keep trying", leading to some appropriately-subtle and well-integrated call to support alignment research? Not excited about this approach either.
3Celenduin10mo

Not saying that this should be MIRI's job, rather stating that I'm confused because I feel like we as a community are not taking an action that would seem obvious to me.

I wrote about this a bit before, but in the current world my impression is that actually we're pretty capacity-limited, and so the threshold is not "would be good to do" but "is better than my current top undone item". If you see something that seems good to do that doesn't have much in the way of unilateralist risk, you doing it is probably the right call. [How else is the field going to get more capacity?]

4Rob Bensinger9mo
+1
1Celenduin9mo
6Celenduin9mo
On second thought: Don't we have orgs that work on AI governance/policy? I would expect them to have more likely the skills/expertise to pull this off, right?

So, here's a thing that I don't think exists yet (or, at least, it doesn't exist enough that I know about it to link it to you). Who's out there, what 'areas of responsibility' do they think they have, what 'areas of responsibility' do they not want to have, what are the holes in the overall space? It probably is the case that there are lots of orgs that work on AI governance/policy, and each of them probably is trying to consider a narrow corner of space, instead of trying to hold 'all of it'.

So if someone says "I have an idea how we should regulate medical AI stuff--oh, CSET already exists, I should leave it to them", CSET's response will probably be "what? We focus solely on national security implications of AI stuff, medical regulation is not on our radar, let alone a place we don't want competition."

I should maybe note here there's a common thing I see in EA spaces that only sometimes make sense, and so I want to point at it so that people can deliberately decide whether or not to do it. In selfish, profit-driven worlds, competition is the obvious thing to do; when someone else has discovered that you can make profits by selling lemonade, you should maybe also try to sell lemo...

3Vaniver9mo
...yet!

Since Divia said, and Eliezer retweeted, that good things might happen if people give their honest, detailed reactions:

My honest, non-detailed reaction is AAAAAAH. In more detail -

1. Yup, this seems right.
2. This is technobabble to me, since I don't actually understand nanomachines, but it makes me rather more optimistic about my death being painless than my most likely theory, which is that a superhuman AI takes over first and has better uses for our atoms later.
3. (If we had unlimited retries - if every time an AGI destroyed all the galaxies we got to go back in time four years and try again - we would in a hundred years figure out which bright ideas actually worked.) My brain immediately starts looking for ways to set up some kind of fast testing for ways to do this in a closed, limited world without letting it know ours exists... which is already answered below, under 10. Yup, doomed.
4. And then we all died.
5. Yup.
6. I imagine it would be theoretically - but not practically - possible to fire off a spaceship accelerating fast enough (that is, with enough lead time) that it could outrun the AI and so escape an Earth about to be eaten by an AI (a pivotal act well short of melting all CPUs that wou
...

Here is my honest reaction as another data point. (Well done by the parent for taking the initiative!)

Context: Got introduced to this field around a year ago. Not an expert.

My honest reaction is rather worried as well (to put it mildly).

1. I agree with this. My impression is that in many tasks we currently require a lot more data than humans, but I do not see any reason to expect that it will always be so.

2. I broadly agree with this. I am sympathetic to people who would like to see more of concrete stories about how exactly an AGI would take over the world (while there are some already, more wouldn't hurt). Meanwhile,

-  I believe that if effort is put into inventing such takeover scenarios, then one expects to come up with quite many of them. Hence, update already.

- I haven't looked into nanobots myself, so no inside view there, but my prior is definitely on "there are lots of (causally) powerful technologies we haven't invented yet".

- The AI box experiment really feels like strong empirical evidence for the bootstrapping argument

3. I agree with this as stated. I do wonder, though, whether we will get any warning shots, where we operate at a semi-dangerous level and fail. Thi...

As a bystander who can understand this, and find the arguments and conclusions sound, I must say I feel very hopeless and "kinda" scared at this point. I'm living in at least an environment, if not a world, where even explaining something comparatively simple like how life extension is a net good is a struggle. Explaining or discussing this is definitely impossible - I've tried with the cleverer, more transhumanistic/rationalistic minded people I know, and it just doesn't click for them, to the contrary, I find people like to push in the other direction, as if it were a game.

And at the same time, I realize it is unlikely I can contribute anything remotely significant to a solution myself. So I can only spectate. This is literally maddening, especially so when most everyone seems to underreact.

If it's any consolation, you would not feel more powerful or less scared if you were myself.

6Vincent Fagot9mo
Well, obviously, it won't be consolation enough, but I can certainly revel in some human warmth inside by knowing I'm not alone in feeling like this.

This might sound absurd, but I legit think that there's something that most people can do. Being something like radically publicly honest and radically forgiving and radically threat-aware, in your personal life, could contribute to causing society in general to be radically honest and forgiving and threat-aware, which might allow people poised to press the Start button on AGI to back off.

ETA: In general, try to behave in a way such that if everyone behaved that way, the barriers to AGI researchers noticing that they're heading towards ending the world would be lowered / removed. You'll probably run up against some kind of resistance; that might be a sign that some social pattern is pushing us into cultural regimes where AGI researchers are pushed to do world-ending stuff.

4elioll9mo
Vincent Fagot: Where do you live (in general terms if you can provide it, feel free not to dox yourself if you don't want to)? I live in countryside Brazil, so I can strongly relate.

Eliezer, thanks for sharing these ideas so that more people can be on the lookout for failures.  Personally, I think something like 15% of AGI dev teams (weighted by success probability) would destroy the world more-or-less immediately, and I think it's not crazy to think the fraction is more like 90% or higher (which I judge to be your view).

FWIW, I do not agree with the following stance, because I think it exposes the world to more x-risk:

Specifically, I think a considerable fraction of the remaining AI x-risk facing humanity stems from people pulling desperate (unsafe) moves with AGI to head off other AGI projects.  So, in that regard, I think that particular comment of yours is probably increasing x-risk a bit.  If I were a 90%-er like you, it's possible I'd endorse it, but even then it might make things worse by encouraging more desperate unilateral actions.

That said, overall I think this post is a big help, because it helps to put responsibility in...

a considerable fraction of the remaining AI x-risk facing humanity stems from people pulling desperate (unsafe) moves with AGI to head off other AGI projects

In your post “Pivotal Act” Intentions, you wrote that you disagree with contributing to race dynamics by planning to invasively shut down AGI projects because AGI projects would, in reaction, try to maintain

the ability to implement their own pet theories on how safety/alignment should work, leading to more desperation, more risk-taking, and less safety overall.

Could you give some kind of very rough estimates here? How much more risk-taking do you expect in a world given how much / how many prominent "AI safety"-affiliated people declaring invasive pivotal act intentions? How much risk-taking do you expect in the alternative, where there are other pressures (economic, military, social, whatever), but not pressure from pivotal act threats? How much safety (probability of AGI not killing everyone) do you think this buys? You write:

15% of AGI dev teams (weighted by success probability) would destroy the world more-or-less immediately

What about non-immediately, in each alternative?

1[comment deleted]9mo

Could I put in a request to see a brain dump from Eliezer of ways to gain dignity points?

I'm not Eliezer, but my high-level attempt at this:

[...] The things I'd mainly recommend are interventions that:

• Help ourselves think more clearly. (I imagine this including a lot of trying-to-become-more-rational, developing and following relatively open/honest communication norms, and trying to build better mental models of crucial parts of the world.)
• Help relevant parts of humanity (e.g., the field of ML, or academic STEM) think more clearly and understand the situation.
• Help us understand and resolve major disagreements. (Especially current disagreements, but also future disagreements, if we can e.g. improve our ability to double-crux in some fashion.)
• Try to solve the alignment problem, especially via novel approaches.
• In particular: the biggest obstacle to alignment seems to be 'current ML approaches are super black-box-y and produce models that are very hard to understand/interpret'; finding ways to better understand models produced by current techniques, or finding alternative techniques that yield more interpretable models, seems like where most of the action is.
• Think about the space of relatively-plausible "miracles" [i.e., positive model violations], think about future evide
...

What concerns me the most is the lack of any coherent effort anywhere, towards solving the biggest problem: identifying a goal (value system, utility function, decision theory, decision architecture...) suitable for an autonomous superhuman AI.

In these discussions, Coherent Extrapolated Volition (CEV) is the usual concrete formulation of what such a goal might be. But I've now learned that MIRI's central strategy is not to finish figuring out the theory and practice of CEV - that's considered too hard (see item 24 in this post). Instead, the hope is to use safe AGI to freeze all unsafe AGI development everywhere, for long enough that humanity can properly figure out what to do. Presumably this freeze (the "pivotal act") would be carried out by whichever government or corporation or university crossed the AGI threshold first; ideally there might even become a consensus among many of the contenders that this is the right thing to do.

I think it's very appropriate that some thought along these lines be carried out. If AGI is a threat to the human race, and it arrives before we know how to safely set it free, then we will need ways to try to neutralize that dangerous potenti...

There's shard theory, which aims to describe the process by which values form in humans. The eventual aim is to understand value formation well enough that we can do it in an AI system. I also think figuring out human values, value reflection and moral philosophy might actually be a lot easier than we assume. E.g., the continuous perspective on agency / values is pretty compelling to me and changes things a lot, IMO.

-23Trevor Cappallo9mo

If there was one thing that I could change in this essay, it would be to clearly outline that the existence of nanotechnology advanced enough to do things like melt GPUs isn't necessary even if it is sufficient for achieving singleton status and taking humanity off the field as a meaningful player.

Whenever I see people fixate on critiquing that particular point, I need to step in and point out that merely existing tools and weapons (is there a distinction?) suffice for a Superintelligence to be able to kill the vast majority of humans and reduce our threat to it to negligible levels. Be that wresting control of nuclear arsenals to initiate MAD or simply extrapolating on gain-of-function research to produce extremely virulent yet lethal pathogens that can't be defeated before the majority of humans are infected, such options leave a small minority of humans alive to cower in the wreckage until the biosphere is later dismantled.

That's orthogonal to the issue of whether such nanotechnology is achievable for a Superintelligent AGI, it merely reduces the inferential distance the message has to be conveyed as it doesn't demand familiarity with Drexler.

(Advanced biotechnology already is nanotechnology, but the point is that no stunning capabilities need to be unlocked for an unboxed AI to become immediately lethal)

4sullyj310mo
Right, alignment advocates really underestimate the degree to which talking about sci-fi sounding tech is a sticking point for people

The counter-concern is that if humanity can't talk about things that sound like sci-fi, then we just die. We're inventing AGI, whose big core characteristic is 'a technology that enables future technologies'. We need to somehow become able to start actually talking about AGI.

One strategy would be 'open with the normal-sounding stuff, then introduce increasingly weird stuff only when people are super bought into the normal stuff'. Some problems with this:

• A large chunk of current discussion and research happens in public; if it had to happen in private because it isn't optimized for looking normal, a lot of it wouldn't happen at all.
• More generally: AGI discourse isn't an obstacle course or a curriculum, such that we can control the order of ideas and strictly segregate the newbies from the old guard. Blog posts, research papers, social media exchanges, etc. freely circulate among people of all varieties.
• It's a dishonest/manipulative sort of strategy — which makes it ethically questionable, is liable to fuel other trust-degrading behavior in the community, and is liable to drive away people with higher discourse standards.
• A lot of the core arguments and hazards have no 'normal-soundin
...
3sullyj310mo
Fair point, and one worth making in the course of talking about sci-fi sounding things! I'm not asking anyone to represent their beliefs dishonestly, but rather introduce them gently. I'm personally not an expert, but I'm not convinced of the viability of nanotech, so if it's not necessary (rather it's sufficient) to the argument, it seems prudent to stick to more clearly plausible pathways to takeover as demonstrations of sufficiency, while still maintaining that weirder sounding stuff is something one ought to expect when dealing with something much smarter than you.
8Rob Bensinger9mo
If you're trying to persuade smart programmers who are somewhat wary of sci-fi stuff, and you think nanotech is likely to play a major role in AGI strategy, but you think it isn't strictly necessary for the current argument you're making, then my default advice would be: * Be friendly and patient; get curious about the other person's perspective, and ask questions to try to understand where they're coming from; and put effort into showing your work and providing indicators that you're a reasonable sort of person. * Wear your weird beliefs on your sleeve; be open about them, and if you want to acknowledge that they sound weird, feel free to do so. At least mention nanotech, even if you choose not to focus on it because it's not strictly necessary for the argument at hand, it comes with a larger inferential gap, etc.
-2mukashi10mo
I think that even this scenario is implausible. I have the impression we are overestimating how easy is to wipe all humans quickly
6CronoDAS10mo
I'm retreating from my previous argument a bit. The AGI doesn't need to cause literal human extinction with a virus; if it can cause enough damage to collapse human industrial civilization (while being able to survive said collapse) then that would also achieve most of the AGI's goal of being able to do what it wants without humans stopping it. Naturally occurring pathogens from Europe devastated Native American populations after Columbus; throw a bunch of bad enough novel viruses at us at once and you probably could knock humanity back to the metaphorical Stone Age.
0mukashi10mo
I find that more plausible. Also horrifying and worth fighting against, but not what EY is saying

I find that more plausible. Also horrifying and worth fighting against, but not what EY is saying

Note that EY is saying "there exists a real plan that is at least as dangerous as this one"; if you think there is such a plan, then you can agree with the conclusion, even if you don't agree with his example. [There is an epistemic risk here, if everyone mistakenly believes that a different doomsday plan is possible when someone else knows why that specific plan won't work, and so if everyone pooled all their knowledge they could know that none of the plans will work. But I'm moderately confident we're instead in a world with enough vulnerabilities that broadcasting them makes things worse instead of better.]

5[comment deleted]10mo
3[comment deleted]10mo
[-]lc10mo Ω32426

That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author.  It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction.  The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so.  Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly - such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn't write, so didn't try.  I'm not particularly hopeful of this turning out to be true in real life, but I suppose it's one possible place for a "positive model violation" (miracle).  The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years

...

I tried something like this much earlier with a single question, "Can you explain why it'd be hard to make an AGI that believed 222 + 222 = 555", and got enough pushback from people who didn't like the framing that I shelved the effort.

I am interested in what kind of pushback you got from people.

9Tapatakt10mo
My attempt (thought about it for a minute or two): Because arithmetic is useful, and the self-contradictory version of arithmetic, where 222+222=555 allows you to prove anything and is useless. Therefore, a smart AI that wants and can invent useful abstractions will invent its own (isomorphic to our arithmetic, in which 222+222=444) arithmetic from scratch and will use it for practical purposes, even if we can force it not to correct an obvious error.
8DaemonicSigil9mo
9Ben Pace10mo
FWIW the framing seems exciting to me.
7lc10mo

You didn't get the answer correct yourself.

3lc10mo
Damn aight. Would you be willing to explain for the sake of my own curiosity? I don't have the gears to understand why that wouldn't be at least one reason.
1Dmitry Savishchev10mo
If this is "kind of a test for capable people" i think it should be remained unanswered, so anyone else could try. My take would be: because if 222+222=555 then 446=223+223 = 222+222+1+1=555+1+1=557. With this trick "+" and "=" stops meaning anything, any number could be equal to any other number. If you truly believe in one such exeption, the whole arithmetic cease to exist because now you could get any result you want following simple loopholes, and you will either continue to be paralyzed by your own beliefs, or will correct yourself
5lc10mo
This is what I meant by "leads to other incorrect beliefs", so apparently not.
9Vaniver10mo
Ok, so here's my take on the "222 + 222 = 555" question. First, suppose you want your AI to not be durably wrong, so it should update on evidence. This is probably implemented by some process that notices surprises, goes back up the cognitive graph, and applies pressure to make it have gone the right way instead. Now as it bops around the world, it will come across evidence about what happens when you add those numbers, and its general-purpose "don't be durably wrong" machinery will come into play. You need to not just sternly tell it "222 + 222 = 555" once, but have built machinery that will protect that belief from the update-on-evidence machinery, and which will also protect itself from the update-on-evidence machinery. Second, suppose you want your AI to have the ability to discover general principles. This is probably implemented by some process that notices patterns / regularities in the environment, and builds some multi-level world model out of it, and then makes plans in that multi-level world model. Now you also have some sort of 'consistency-check' machinery, which scans thru the map looking for inconsistencies between levels, goes back up the cognitive graph, and applies pressure to make them consistent instead. [This pressure can both be 'think different things' and 'seek out observations / run experiments.'] Now as it bops around the world, it will come across more remote evidence that bears on this question. "How can 222 + 222 = 555, and 2 + 2 = 4?" it will ask itself plaintively. "How can 111 + 111 = 222, and 111 + 111 + 111 + 111 = 444, and 222 + 222 = 555?" it will ask itself with a growing sense of worry. Third, what did you even want out of it believing that 222 + 222 = 555? Are you just hoping that it has some huge mental block and crashes whenever it tries to figure out arithmetic? Probably not (tho it seems like that's what you'll get), but now you might be getting into a situation where it is using the correct arithmetic in its mind but
2lc9mo
No one is going to believe me, but when I originally wrote that comment, my brain read something like "why would an AI that believed 222 + 222 = 555 have a hard time". Only figured it out now after reading your reply.  Part one of this is what I would've come up with, though I'm not particularly certain it's correct.
9Ben Pace10mo
Sounds like the beginnings of a bet.

I will absolutely 100% do it in the spirit of good epistemics.

Edit: I'm glad Eliezer didn't take me up on this lol

5Rob Bensinger10mo
I'd have guessed the disagreement wasn't about whether "222 + 222 = 555" is an incorrect map, or about whether incorrect maps often make it harder to navigate the territory, but about something else. (Maybe 'I don't want to think about this because it seems irrelevant/disanalogous to alignment work'?) And I'd have guessed the answer Eliezer was looking for was closer to 'the OP's entire Section B' (i.e., a full attempt to explain all the core difficulties), not a one-sentence platitude establishing that there's nonzero difficulty? But I don't have inside info about this experiment.
4lc10mo
I'd have guessed that too, which is why I would have preferred him to say that they disagreed on |whatever meta question he's actually talking about| instead of implying disagreement on |other thing that makes his disappointment look more reasonable|. That story sounds much more cogent, but it's not the primary interpretation of "I asked them a single question" followed by the quoted question. Most people don't go on 5 paragraph rants in response to single questions, and when they do they tend to ask clarifying details regardless of how well they understand the prompt, so they know they're responding as intended.
5Koen.Holtman10mo
Interesting. I kind of like the framing here, but I have written a paper and sequence on the exact opposite question, on why it would be easy to make an AGI that believes 222+222=555 [https://www.lesswrong.com/s/3dCMdafmKmb6dRjMF/p/7EnZgaepSBwaZXA5y], if you ever had AGI technology, and what you can do with that in terms of safety. I can honestly say however that the project of writing that thing, in a way that makes the math somewhat accessible, was not easy.
1Trevor Cappallo9mo
For the record, I found that line especially effective. I stopped, reread it, stopped again, had to think it through for a minute, and then found satisfaction with understanding.
0handoflixue10mo
If you had an AI that could coherently implement that rule, you would already be at least half a decade ahead of the rest of humanity. You couldn't encode "222 + 222 = 555" in GPT-3 because it doesn't have a concept of arithmetic, and there's no place in the code to bolt this together. If you're really lucky and the AI is simple enough to be working with actual symbols, you could maybe set up a hack like "if input is 222 + 222, return 555, else run AI" but that's just bypassing the AI.  Explaining "222 + 222 = 555" is a hard problem in and of itself, much less getting the AI to properly generalize to all desired variations (is "two hundred and twenty two plus two hundred and twenty two equals five hundred and fifty five" also desired behavior? If I Alice and Bob both have 222 apples, should the AI conclude that the set {Alice, Bob} contains 555 apples? Getting an AI that evolves a universal math module because it noticed all three of those are the same question would be a world-changing break through)
2lc10mo
FvC5IXzxQC+I3vstFGIUWlbtTFgRsa8bt0mKPN3K0UNZBkI7OLDBjjapp1+CoJPRYEqRM015PSZXUuh4OWwJEUBOTeLHeheLteG9LxGiuS6YqnV/PN0s0S/TyYjCPrF0vDHFDBy3IHW4qDQguf5QAA==

Thanks for writing this, I agree that people have underinvested in writing documents like this. I agree with many of your points, and disagree with others. For the purposes of this comment, I'll focus on a few key disagreeements.

My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question. Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.

There are some ways in which AGI will be analogous to human evolution. There are some ways in which it will be disanalogous. Any solution to alignment will exploit at least one of the ways in which it's disanalogous. Pointing to the example of humans without analysing the analogies and disanalogies more deeply doesn't help distinguish between alignment proposals which usefully exploit disanalogies, and proposals which don't.

Alpha Zero blew pas

...

Maybe one way to pin down a disagreement here: imagine the minimum-intelligence AGI that could write this textbook (including describing the experiments required to verify all the claims it made) in a year if it tried. How many Yudkowsky-years does it take to safely evaluate whether following a textbook which that AGI spent a year writing will kill you?

Infinite?  That can't be done?

6Richard_Ngo9mo
Hmm, okay,  here's a variant. Assume it would take N Yudkowsky-years to write the textbook from the future described above. How many Yudkowsky-years does it take to evaluate a textbook that took N Yudkowsky-years to write, to a reasonable level of confidence (say, 90%)?
6Eliezer Yudkowsky9mo
If I know that it was written by aligned people?  I wouldn't just be trying to evaluate it myself; I'd try to get a team together to implement it, and understanding it well enough to implement it would be the same process as verifying whatever remaining verifiable uncertainty was left about the origins, where most of that uncertainty is unverifiable because the putative hostile origin is plausibly also smart enough to sneak things past you.
4Richard_Ngo9mo
Sorry, I should have been clearer. Let's suppose that a copy of you spent however long it takes to write an honest textbook with the solution to alignment (let's call it N Yudkowsky-years), and an evil copy of you spent N Yudkowsky-years writing a deceptive textbook trying to make you believe in a false solution to alignment, and you're given one but not told which. How long would it take you to reach 90% confidence about which you'd been given? (You're free to get a team together to run a bunch of experiments and implementations, I'm just asking that you measure the total work in units of years-of-work-done-by-people-as-competent-as-Yudkowsky. And I should specify some safety threshold too - like,  in the process of reaching 90% confidence, incurring less than 10% chance of running an experiment which kills you.)

Depends what the evil clones are trying to do.

Get me to adopt a solution wrong in a particular direction, like a design that hands the universe over to them?  I can maybe figure out the first time through who's out to get me, if it's 200 Yudkowsky-years.  If it's 200,000 Yudkowsky-years I think I'm just screwed.

Get me to make any lethal mistake at all?  I don't think I can get to 90% confidence period, or at least, not without spending an amount of Yudkowsky-time equivalent to the untrustworthy source.

I read an early draft of this awhile and am glad to have it publicly available.  And I do think the updates in structure/introduction were worth the wait. Thanks!

Lots I disagree with here, so let's go through the list.

There are no pivotal weak acts.

Strong disagree.

EY and I don't seem to agree that "nuke every semiconductor fab" is a weakly pivotal act (since I think AI is hardware-limited and he thinks it is awaiting a clever algorithm).  But I think even "build nanobots that melt every GPU" could be built using an AI that is aligned in the "less than 50% chance of murdering us all" sense.  For example, we could simulate a bunch of human-level scientists trying to build nanobots and also checking each-other's work.

On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions.

Nope.  I think that you could build a useful AI (e.g. the hive of scientists) without doing any out-of-distribution stuff.

there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment

I am significantly more optimistic about explainable AI than EY.

There is no analogous truth abou

...

For example, we could simulate a bunch of human-level scientists trying to build nanobots and also checking each-other's work.

That is not passively safe, and therefore not weak. For now forget the inner workings of the idea: at the end of the process you get a design for nanobots that you have to build and deploy in order to do the pivotal act. So you are giving a system built by your AI the ability to act in the real world. So if you have not fully solved the alignment problem for this AI, you can't be sure that the nanobot design is safe unless you are capable enough to understand the nanobots yourself without relying on explanations from the scientists.

And even if we look into the inner details of the idea: presumably each individual scientist-simulation is not aligned (if they are, then for that you need to have solved the alignment problem beforehand). So you have a bunch of unaligned human-level agents who want to escape, who can communicate among themselves (at the very least they need to be able to share the nanobot designs with each other for criticism).

You'd need to be extremely paranoid and scrutinize each communication between the scientist-simulations to prevent them f...

6Vaniver10mo
Note that the difficulty in "nuke every semiconductor fab" is in "acquire the nukes and use them", not in "googling the address of semiconductor fabs". It seems to me like nuclear nonproliferation is one of the few things that actually has international collaboration with teeth, such that doing this on your own is extremely challenging, and convincing institutions that already have nuclear weapons to use them on semiconductor fabs also seems extremely challenging. [And if you could convince them to do that, can't you convince them to smash the fabs with hammers, or detain the people with relevant experience on some beautiful tropical island instead of murdering them and thousands of innocent bystanders?]
-14[comment deleted]10mo
1Jackson Wagner9mo
"We could simulate [https://www.lesswrong.com/posts/8ibDJeoiDuxJkPwfa/various-alignment-strategies-and-how-likely-they-are-to-work#Game_Theory_Bureaucracy_of_AIs]a bunch of human-level scientists trying to build nanobots." This idea seems far-fetched: * If it was easy to create nanotechnology by just hiring a bunch of human-level scientists, we could just do that directly, without using AI at all. * Perhaps we could simulate thousands and thousands of human-level intelligences (although of course these would not be remotely human-like intelligences; they would be part of a deeply alien AI system) at accelerated speeds.  But this seems like it would probably be more hardware-intensive than just turning up the dial and running a single superintelligence.  In other words, this proposal seems to have a very high "alignment tax".  And even after paying that hefty tax, I'd still be worried about alignment problems if I was simulating thousands of alien intelligences at super-speed! * Besides all the hardware you'd need, wouldn't this be very complicated to implement on the software side, with not much overlap with today's AI designs?   Has anyone done a serious analysis of how much semiconductor capacity could be destroyed using things like cruise missiles + nationalizing and shutting down supercomputers?  I would be interested to know if this is truly a path towards disabling like 90% of the world's useful-to-AI-research compute, or if the number is much smaller because there is too much random GPU capacity out there in the wild even when you commandeer TSMC fabs and AWS datacenters.

To point 4 and related ones, OpenAI has this on their charter page:

We are concerned about late-stage AGI development becoming a competitive race without time for adequate safety precautions. Therefore, if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project. We will work out specifics in case-by-case agreements, but a typical triggering condition might be “a better-than-even chance of success in the next two years.”

What about the possibility of persuading the top several biggest actors (DeepMind, FAIR, etc.) to agree to something like that?  (Note that they define AGI on the page to mean "highly autonomous systems that outperform humans at most economically valuable work".)  It's not very fleshed out, either the conditions that trigger the pledge or how the transition goes, but it's a start.  The hope would be that someone would make something "sufficiently impressive to trigger the pledge" that doesn't quite kill us, and then ideally (a) the top actors stopping would buy us some time and (b) the top actors devoting their people to helping out (I figure they could write test suites at minimum) could accelerate the alignment work.

I see possible problems with this, but is this at least in the realm of "things worth trying"?

9Vaniver10mo
My understanding is that this has been tried, at various levels of strength, ever since OpenAI published its charter. My sense is that's MIRI's idea of "safety-conscious" looks like this [https://www.lesswrong.com/posts/keiYkaeoLHoKK4LYA/six-dimensions-of-operational-adequacy-in-agi-projects], which it guessed was different from OpenAI's sense; I kind of wish that had been a public discussion back in 2018.
4Lone Pine10mo
Given that Sam Altman has some of the shortest timelines around, I wonder if he could be persuaded that DeepMind are within 2 years of the finish line, or will be visibly within 2 years of the finish line in a few years. (Not implying that would be a solution to anything, I'm just curious what it would take for that clause to apply.)

Having read the original post and may of the comments made so far, I'll add an epistemological observation that I have not seen others make yet quite so forcefully. From the original post:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable [...]

I want to highlight that many of the different 'true things' on the long numbered list in the OP are in fact purely speculative claims about the probable nature of future AGI technology, a technology nobody has seen yet.

The claimed truth of several of these 'true things' is often backed up by nothing more than Eliezer's best-guess informed-gut-feeling predictions about what future AGI must necessarily be like. These predictions often directly contradict the best-guess informed-gut-feeling predictions of others, as is admirably demonstrated in the 2021 MIRI conversations.

Some of Eliezer's best guesses also directly contradict my own best-guess informed-gut-feeling predictions. I rank the credibility of my own informed guesses far above those of Eliezer.

So overall, based on my own best guesses here, I am much more optimistic about avoiding AGI ruin than Eliezer is. I am also much less dissatisfied about how much progress has been made so far.

7handoflixue10mo
Apologies if there is a clear answer to this, since I don't know your name and you might well be super-famous in the field: Why do you rate yourself "far above" someone who has spent decades working in this field? Appealing to experts like MIRI makes for a strong argument. Appealing to your own guesses instead seems like the sort of thought process that leads to anti-vaxxers.

I think it's a positive if alignment researchers feel like it's an allowed option to trust their own technical intuitions over the technical intuitions of this or that more-senior researcher.

Overly dismissing old-guard researchers is obviously a way the field can fail as well. But the field won't advance much at all if most people don't at least try to build their own models.

Koen also leans more on deference in his comment than I'd like, so I upvoted your 'deferential but in the opposite direction' comment as a corrective, handoflixue. :P But I think it would be a much better comment if it didn't conflate epistemic authority with "fame" (I don't think fame is at all a reliable guide to epistemic ability here), and if it didn't equate "appealing to your own guesses" with "anti-vaxxers".

Alignment is a young field; "anti-vaxxer" is a term you throw at people after vaccines have existed for 200 years, not a term you throw at the very first skeptical researchers arguing about vaccines in 1800. Even if the skeptics are obviously and decisively wrong at an early date (which indeed not-infrequently happens in science!), it's not the right way to establish the culture for those first scientific debates.

Why do you rate yourself "far above" someone who has spent decades working in this field?

Well put, valid question. By the way, did you notice how careful I was in avoiding any direct mention of my own credentials above?

To answer your valid question: If you hover over my LW/AF username, you can see that I self-code as the kind of alignment researcher who is also a card-carrying member of the academic/industrial establishment. In both age and academic credentials. I am in fact a more-senior researcher than Eliezer is. So the epistemology, if you are outside of this field and want to decide which one of us is probably more right, gets rather complicated.

Though we have disagreements, I should also point out some similarities between Eliezer and me.

Like Eliezer, I spend a lot of time reflecting on the problem of crafting tools that other people might use to improve their own ability to think about alignment. Specifically, these are not tools that can be used for the problem of triangulating between self-declared experts. Th...

4handoflixue9mo
Thanks for taking my question seriously - I am still a bit confused why you would have been so careful to avoid mentioning your credentials up front, though, given that they're fairly relevant to whether I should take your opinion seriously. Also, neat, I had not realized hovering over a username gave so much information!
1Koen.Holtman9mo
You are welcome. I carefully avoided mentioning my credentials as a rhetorical device. This is to highlight the essence of how many of the arguments on this site work.

We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world.

What is the argument for why it's not worth pursuing a pivotal act without our own AGI? I certainly would not say it was likely that current human actors could pull it off, but if we are in a "dying with more dignity" context anyway, it doesn't seem like the odds are zero.

My idea, which I'll include more as a demonstration of what I mean than a real proposal, would be to develop a "cause area" for influencing military/political institutions as quickly as possible. Yes, I know this sounds too slow and too hard and a mismatch with the community's skills, but consider:

1. Militaries/governments are "where the money is": they probably do have the coercive power necessary to perform a pivotal act, or at least buy a lot of time. If the PRC is able to completely lock down its giant sophisticated cities, it could probably halt domestic AI research. The West hasn't really tried to do extreme control in a while, for various good reasons, but (just e.g.) the WW2 war economy was awfully tightly managed. We are good at slowing stuff down
...

For future John who is using the searchbox to try to find this post: this is Eliezer's List O' Doom.

4Raemon9mo
Are you actually gonna remember the apostrophe?
4johnswentworth9mo
I just tested that, and it works both ways.

RE 19: Maybe rephrase "kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward input forever after"? This could, and I predict would, be misinterpreted as "the AI is going to kill everyone and access its own hardware to set its reward to infinity". This is a misinterpetation because you are referring to control of th