Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
(Partially in response to AGI Ruin: A list of Lethalities. Written in the same rambling style. Not exhaustive.)
- Powerful AI systems have a good chance of deliberately and irreversibly disempowering humanity. This is a much easier failure mode than killing everyone with destructive physical technologies.
- Catastrophically risky AI systems could plausibly exist soon, and there likely won’t be a strong consensus about this fact until such systems pose a meaningful existential risk per year. There is not necessarily any “fire alarm.”
- Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It’s wishful thinking to look at possible stories of doom and say “we wouldn’t let that happen;” humanity is fully capable of messing up even very basic challenges, especially if they are novel.
- I think that many of the projects intended to help with AI alignment don't make progress on key difficulties and won’t significantly reduce the risk of catastrophic outcomes. This is related to people gravitating to whatever research is most tractable and not being too picky about what problems it helps with, and related to a low level of concern with the long-term future in particular. Overall, there are relatively few researchers who are effectively focused on the technical problems most relevant to existential risk from alignment failures.
- There are strong social and political pressures to spend much more of our time talking about how AI shapes existing conflicts and shifts power. This pressure is already playing out and it doesn’t seem too likely to get better. I think Eliezer’s term “the last derail” is hyperbolic but on point.
- Even when thinking about accident risk, people’s minds seem to go to what they think of as “more realistic and less sci fi” risks that are much less likely to be existential (and sometimes I think less plausible). It’s very possible this dynamic won’t change until after actually existing AI systems pose an existential risk.
- There is a good chance that an AI catastrophe looks like an abrupt “coup” where AI systems permanently disempower humans with little opportunity for resistance. People seem to consistently round this risk down to more boring stories that fit better with their narratives about the world. It is quite possible that an AI coup will be sped up by humans letting AI systems control killer robots, but the difference in timeline between "killer robots everywhere, AI controls everything" and "AI only involved in R&D" seems like it's less than a year.
- The broader intellectual world seems to wildly overestimate how long it will take AI systems to go from “large impact on the world” to “unrecognizably transformed world.” This is more likely to be years than decades, and there’s a real chance that it’s months. This makes alignment harder and doesn’t seem like something we are collectively prepared for.
- Humanity usually solves technical problems by iterating and fixing failures; we often resolve tough methodological disagreements very slowly by seeing what actually works and having our failures thrown in our face. But it will probably be possible to build valuable AI products without solving alignment, and so reality won’t “force us” to solve alignment until it’s too late. This seems like a case where we will have to be unusually reliant on careful reasoning rather than empirical feedback loops for some of the highest-level questions.
- AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.
- If you had incredibly powerful unaligned AI systems running on a server farm somewhere, there is very little chance that humanity would maintain meaningful control over its future.
- “Don’t build powerful AI systems” appears to be a difficult policy problem, requiring geopolitical coordination of a kind that has often failed even when the stakes are unambiguous and the pressures to defect are much smaller.
- I would not expect humanity to necessarily “rise to the challenge” when the stakes of a novel problem are very large. I was 50-50 about this in 2019, but our experience with COVID has further lowered my confidence.
- There is probably no physically-implemented reward function, of the kind that could be optimized with SGD, that we’d be happy for an arbitrarily smart AI to optimize as hard as possible. (I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.)
- Training an AI to maximize a given reward function does not generically produce an AI which is internally “motivated” to maximize reward. Moreover, at some level of capability, a very wide range of motivations for an AI would lead to loss-minimizing behavior on the training distribution because minimizing loss is an important strategy for an AI to preserve its influence over the world.
- It is more robust for an AI system to learn a good model for the environment, and what the consequences of its actions will be, than to learn a behavior like “generally being nice” or “trying to help humans.” Even if an AI was imitating data consisting of “what I would do if I were trying to be nice,” it would still be more likely to eventually learn to imitate the actual physical process producing that data rather than absorbing some general habit of niceness. And in practice the data we produce will not be perfect, and so “predict the physical process generating your losses” is going to be positively selected for by SGD.
- You shouldn’t say something like “well I might as well assume there’s a hope” and thereby live in a specific unlikely world where alignment is unrealistically easy in one way or another. Even if alignment ends up easy, you would be likely to end up predicting the wrong way for it to be easy. If things look doomed to you, in practice it’s better to try to maximize log odds of success as a more general and robust strategy for taking advantage of lucky breaks in a messy and hard-to-predict world.
- No current plans for aligning AI have a particularly high probability of working without a lot of iteration and modification. The current state of affairs is roughly “if alignment turns out to be a real problem, we’ll learn a lot about it and iteratively improve our approach.” If the problem is severe and emerges quickly, it would be better if we had a clearer plan further in advance—we’d still have to adapt and learn, but starting with something that looks like it could work on paper would put us in a much better situation.
- Many research problems in other areas are chosen for tractability or being just barely out of reach. We pick benchmarks we can make progress on, or work on theoretical problems that seem well-posed and approachable using existing techniques. Alignment isn’t like that; it was chosen to be an important problem, and there is no one ensuring that the game is “fair” and that the problem is soluble or tractable.
(Mostly stated without argument.)
- Eliezer often equivocates between “you have to get alignment right on the first ‘critical’ try” and “you can’t learn anything about alignment from experimentation and failures before the critical try.” This distinction is very important, and I agree with the former but disagree with the latter. Solving a scientific problem without being able to learn from experiments and failures is incredibly hard. But we will be able to learn a lot about alignment from experiments and trial and error; I think we can get a lot of feedback about what works and deploy more traditional R&D methodology. We have toy models of alignment failures, we have standards for interpretability that we can’t yet meet, and we have theoretical questions we can’t yet answer. The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones, and so it’s possible for us to push ahead and build AGI without solving alignment. Overall this consideration seems like it makes the institutional problem vastly harder, but does not have such a large effect on the scientific problem.
- Eliezer often talks about AI systems that are able to easily build nanotech and overpower humans decisively, and describes a vision of a rapidly unfolding doom from a single failure. This is what would happen if you were magically given an extraordinarily powerful AI and then failed to align it, but it’s very unlikely to be what happens in the real world. By the time we have AI systems that can overpower humans decisively with nanotech, we will have other AI systems that will either kill humans in more boring ways or else radically advance the state of human R&D. More generally, the cinematic universe of Eliezer’s stories of doom doesn’t seem to me like it holds together, and I can’t tell if there is a more realistic picture of AI development under the surface.
- One important factor seems to be that Eliezer often imagines scenarios in which AI systems avoid making major technical contributions, or revealing the extent of their capabilities, because they are lying in wait to cause trouble later. But if we are constantly training AI systems to do things that look impressive, then SGD will be aggressively selecting against any AI systems who don’t do impressive-looking stuff. So by the time we have AI systems who can develop molecular nanotech, we will definitely have had systems that did something slightly-less-impressive-looking.
- AI improving itself is most likely to look like AI systems doing R&D in the same way that humans do. “AI smart enough to improve itself” is not a crucial threshold, AI systems will get gradually better at improving themselves. Eliezer appears to expect AI systems performing extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that humans need to solve appear to be wrong.
- The notion of an AI-enabled “pivotal act” seems misguided. Aligned AI systems can reduce the period of risk of an unaligned AI by advancing alignment research, convincingly demonstrating the risk posed by unaligned AI, and consuming the “free energy” that an unaligned AI might have used to grow explosively. No particular act needs to be pivotal in order to greatly reduce the risk from unaligned AI, and the search for single pivotal acts leads to unrealistic stories of the future and unrealistic pictures of what AI labs should do.
- Many of the “pivotal acts” that Eliezer discusses involve an AI lab achieving a “decisive strategic advantage” (i.e. overwhelming hard power) that they use to implement a relatively limited policy, e.g. restricting the availability of powerful computers. But the same hard power would also let them arbitrarily dictate a new world order, and would be correctly perceived as an existential threat to existing states. Eliezer’s view appears to be that a decisive strategic advantage is the most realistic way to achieve these policy goals, despite the fact that building powerful enough AI systems runs an overwhelming risk of destroying the world via misalignment. I think that preferring this route to more traditional policy influence requires extreme confidence about details of the policy situation; that confidence might be justified by someone who knew a lot more about the details of government than I do, but Eliezer does not seem to. While I agree that this kind of policy change would be an unusual success in historical terms, the probability still seems much higher than Eliezer’s overall probabilities of survival. Conversely, I think Eliezer greatly underestimates how difficult it would be for an AI developer to covertly take over the world, how strongly and effectively governments would respond to that possibility, and how toxic this kind of plan is.
- I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems with those ideas, proposing modifications to proposals, etc., and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research. By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research. Eliezer is right that this doesn’t make the problem go away (if humans don’t solve alignment, then why think AIs will solve it?) but I think it does mean that arguments about how recursive self-improvement quickly kicks you into a lethal regime are wrong (since AI is accelerating the timetable for both alignment and capabilities).
- When talking about generalization outside of the training distribution, I think Eliezer is generally pretty sloppy. I think many of the points are roughly right, but that it’s way too sloppy to reach reasonable conclusions after several steps of inference. I would love to see real discussion of these arguments, and in some sense it seems like Eliezer is a good person to push that discussion forward. Right now I think that relevant questions about ML generalization are in fact pretty subtle; we can learn a lot about them in advance but right now just mostly don’t know. Similarly, I think Eliezer’s reasoning about convergent incentives and the deep nature of consequentialism is too sloppy to get to correct conclusions and the resulting assertions are wildly overconfident.
- In particular, existing AI training strategies don’t need to handle a “drastic” distribution shift from low levels of intelligence to high levels of intelligence. There’s nothing in the foreseeable ways of building AI that would call for a big transfer like this, rather than continuously training as intelligence gradually increases. Eliezer seems to partly be making a relatively confident claim that the nature of AI is going to change a lot, which I think is probably wrong and is clearly overconfident. If he had been actually making concrete predictions over the last 10 years I think he would be losing a lot of them to people more like me.
- Eliezer strongly expects sharp capability gains, based on a combination of arguments that I think don’t make sense and an analogy with primate evolution which I think is being applied poorly. We’ve talked about this before, and I still think Eliezer’s position is probably wrong and clearly overconfident. I find Eliezer’s more detailed claims, e.g. about hard thresholds, to be much more implausible than his (already probably quantitatively wrong) claims about takeoff speeds.
- Eliezer seems confident about the difficulty of alignment based largely on his own experiences working on the problem. But in fact society has spent very little total effort working on the problem, and MIRI itself would probably be unable to solve or even make significant progress on the large majority of problems that existing research fields routinely solve. So I think right now we mostly don’t know how hard the problem is (but it may well be very hard, and even if it’s easy we may well fail to solve it). For example, the fact that MIRI tried and failed to find a “coherent formula for corrigibility” is not much evidence that corrigibility is “unworkable.”
- Eliezer says a lot of concrete things about how research works and about what kind of expectation of progress is unrealistic (e.g. talking about bright-eyed optimism in list of lethalities). But I don’t think this is grounded in an understanding of the history of science, familiarity with the dynamics of modern functional academic disciplines, or research experience. The Eliezer predictions most relevant to “how do scientific disciplines work” that I’m most aware of are incorrectly predicting that physicists would be wrong about the existence of the Higgs boson (LW bet registry) and expressing the view that real AI would likely emerge from a small group rather than a large industry (pg 436 but expressed many places).
- I think Eliezer generalizes a lot from pessimism about solving problems easily to pessimism about solving problems at all; or from the fact that a particular technique doesn’t immediately solve a problem to pessimism about the helpfulness of research on that technique. I disagree with Eliezer about how research progress is made, and don’t think he has any special expertise on this topic. Eliezer often makes objections to particular implementations of projects (like using interpretability tools for training). But in order to actually talk about whether a research project is likely to succeed, you really really need to engage with the existential quantifier where future researchers get to choose implementation details to make it work. At a minimum that requires engaging with the strongest existing versions of these proposals, and if you haven’t done that (as Eliezer hasn’t) then you need to take a different kind of approach. But even if you engage with the best existing concrete proposals, you still need to think carefully about whether your objections are the kind of thing that will be hard to overcome as people learn more details in the future. One way of looking at this is that Eliezer is appropriately open-minded about existential quantifiers applied to future AI systems thinking about how to cause trouble, but seems to treat existential quantifiers applied to future humans in a qualitatively rather than quantitatively different way (and as described throughout this list I think he overestimates the quantitative difference).
- As an example, I think Eliezer is unreasonably pessimistic about interpretability while being mostly ignorant about the current state of the field. This is true both for the level of understanding potentially achievable by interpretability, and the possible applications of such understanding. I agree with Eliezer that this seems like a hard problem and many people seem unreasonably optimistic, so I might be sympathetic if Eliezer was making claims with moderate confidence rather than high confidence. As far as I can tell most of Eliezer’s position here comes from general intuitions rather than arguments, and I think those are much less persuasive when you don’t have much familiarity with the domain.
- Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects (initially involving a lot of humans but over time increasingly automated). When Eliezer dismisses the possibility of AI systems performing safer tasks millions of times in training and then safely transferring to “build nanotechnology” (point 11 of list of lethalities) he is not engaging with the kind of system that is likely to be built or the kind of hope people have in mind.
- List of lethalities #13 makes a particular argument that we won’t see many AI problems in advance; I feel like I see this kind of thinking from Eliezer a lot but it seems misleading or wrong. In particular, it seems possible to study the problem that AIs may “change [their] outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over [them]” in advance. And while it’s true that if you fail to solve that problem then you won’t notice other problems, this doesn’t really affect the probability of solving alignment overall: if you don’t solve that problem then you die, and if you do solve that problem then you can study the other problems.
- I don’t think list of lethalities is engaging meaningfully with the most serious hopes about how to solve the alignment problem. I don’t think that’s necessarily the purpose of the list, but it’s quite important if you want to assess the probability of doom or contribute meaningfully to solving the problem (or to complain about other people producing similar lists).
- I think that natural selection is a relatively weak analogy for ML training. The most important disanalogy is that we can deliberately shape ML training. Animal breeding would be a better analogy, and seems to suggest a different and much more tentative conclusion. For example, if humans were being actively bred for corrigibility and friendliness, it looks to me like they would quite likely be corrigible and friendly up through the current distribution of human behavior. If that breeding process was continuously being run carefully by the smartest of the currently-friendly humans, it seems like it would plausibly break down only at a level very far beyond current human abilities.
- Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. In addition to disliking his concept of pivotal acts, I think that this claim is probably wrong and clearly overconfident. It doesn’t match well with pragmatic experience in R&D, where verification is much, much easier than generation in virtually every domain.
- Eliezer is relatively confident that you can’t train powerful systems by imitating human thoughts, because too much of human thinking happens under the surface. I think this is fairly plausible but it’s not at all obvious, and moreover there are plenty of techniques intermediate between “copy individual reasoning steps” and “optimize end-to-end on outcomes.” I think that the last 5 years of progress in language modeling have provided significant evidence that training AI to imitate human thought may be economically competitive at the time of transformative AI, potentially bringing us to something more like a 50-50 chance. I can’t tell if Eliezer should have lost Bayes points here, but I suspect he would have and if he wants us to evaluate his actual predictions I wish he would say something about his future predictions.
- These last two points (and most others from this list) aren’t actually part of my central alignment hopes or plans. Alignment hopes, like alignment concerns, can be disjunctive. In some sense they are even more disjunctive, since the existence of humans who are trying to solve alignment is considerably more robust than the existence of AI systems who are trying to cause trouble (such AIs only exist if humans have already failed at significant parts of alignment). Although my research is focused on cases where almost every factor works out against us, I think that you can get a lot of survival probability from easier worlds.
- Eliezer seems to be relatively confident that AI systems will be very alien and will understand many things about the world that humans don’t, rather than understanding a similar profile of things (but slightly better), or having weaker understanding but enjoying other advantages like much higher serial speed. I think this is very unclear and Eliezer is wildly overconfident. It seems plausible that AI systems will learn much of how to think by predicting humans even if human language is a uselessly shallow shadow of human thought, because of the extremely short feedback loops. It also seems quite possible that most of their knowledge about science will be built by an explicit process of scientific reasoning and inquiry that will proceed in a way recognizable to human science even if their minds are quite different. Most importantly, it seems like AI systems have huge structural advantages (like their high speed and low cost) that suggest they will have a transformative impact on the world (and obsolete human contributions to alignment [retracted]) well before they need to develop superhuman understanding of much of the world or tricks about how to think, and so even if they have a very different profile of abilities to humans they may still be subhuman in many important ways.
- AI systems reasoning about the code of other AI systems is not likely to be an important dynamic for early cooperation between AIs. Those AI systems look very likely to be messy, such that the only way AI systems will reason about their own or others’ code is by looking at behavior and using the same kinds of tools and reasoning strategies as humans. Eliezer has a consistent pattern of identifying important long-run considerations, and then flatly asserting that they are relevant in the short term without evidence or argument. I think Eliezer thinks this pattern of predictions isn’t yet conflicting with the evidence because these predictions only kick in at some later point (but still early enough to be relevant), but this is part of what makes his prediction track record impossible to assess and why I think he is greatly overestimating it in hindsight.
- Eliezer’s model of AI systems cooperating with each other to undermine “checks and balances” seems wrong to me, because it focuses on cooperation and the incentives of AI systems. Realistic proposals mostly don’t need to rely on the incentives of AI systems, they can instead rely on gradient descent selecting for systems that play games competitively, e.g. by searching until we find an AI which raises compelling objections to other AI systems’ proposals. (Note: I may be misunderstanding what he’s saying in places like list of lethalities 35; those may be a response to random things people say on the internet rather than engaging with alignment strategies that are being explored in practice. But even then I think he is responding to relatively weak versions of these arguments.) Eliezer equivocates between a line like “AI systems will cooperate” and “The verifiable activities you could use gradient descent to select for won’t function appropriately as checks and balances.” But Eliezer’s position is a conjunction that fails if either step fails, and jumping back and forth between them appears to totally obscure the actual structure of the argument.
- Eliezer seems to expect that by the time AI systems can achieve pivotal acts, they can also manipulate humans very well—such that for example it’s pointless to try to have debates between them or to try to play an adversarial game against them. But if we imagine minds with a human profile of abilities, it seems like they would be superhuman at R&D around the same time they were superhuman at persuasion, and could easily go either way. Moreover, there are a lot of reasons the AI seems much more likely to be superhuman at R&D if that’s what we want: it’s trained primarily to work on R&D, it’s using tools and structure designed to facilitate R&D, there is a large collaboration amongst AIs to advance R&D whereas manipulation is often being performed by individual AI systems trying to win a debate. And on top of all that, if anything I very weakly expect AI systems’ comparative advantage to be in R&D rather than human manipulation (since R&D is wildly out of distribution for humans).
- I don’t think surviving worlds have a plan in the sense Eliezer is looking for. Based on what Eliezer says I don’t feel like he has a clear or accurate picture of what successful “plans” look like in the real world. I don’t see any particular reason to defer to Eliezer at all on this point.
- Eliezer says that his list of lethalities is the kind of document that other people couldn’t write and therefore shows they are unlikely to contribute (point 41). I think that’s wrong. I think Eliezer’s document is mostly aimed at rhetoric or pedagogy rather than being a particularly helpful contribution to the field that others should be expected to have prioritized; I think that which ideas are “important” is mostly a consequence of Eliezer’s idiosyncratic intellectual focus rather than an objective fact about what is important; the main contributions are collecting up points that have been made in the past and ranting about them and so they mostly reflect on Eliezer-as-writer; and perhaps most importantly, I think more careful arguments on more important difficulties are in fact being made in other places. For example, ARC’s report on ELK describes at least 10 difficulties of the same type and severity as the ~20 technical difficulties raised in Eliezer’s list. About half of them are overlaps, and I think the other half are if anything more important since they are more relevant to core problems with realistic alignment strategies.
My take on Eliezer's takes
- Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument.
- Eliezer’s post (and most of his writing) isn’t bringing much new evidence to the table; it mostly either reasons a priori or draws controversial conclusions from uncontroversial evidence. I think that calls for a different approach than Eliezer has taken historically (if the goal was to productively resolve these disagreements).
- I think that these arguments mostly haven’t been written down publicly so that they can be examined carefully or subject to criticism. It’s not clear whether Eliezer has the energy to do that, but I think that people who think that Eliezer’s position is important should try to understand the arguments well enough to do that.
- I think that people with Eliezer’s views haven’t engaged very much productively with people who disagree (and have often made such engagement hard). I think that if you really dive into any of these key points you will quickly reach details where Eliezer cannot easily defend his view to a smart disinterested audience. And I don’t think that Eliezer could pass an ideological Turing test for people who disagree.
- I think those are valuable steps to take if you have a contrarian take of great importance, which remains controversial even within your weird corner of the world, and whose support comes almost entirely from reasoning and argument.
- A lot of the post seems to rest on intuitions and ways of thinking that Eliezer feels are empirically supported (rather than on arguments that can be explicitly stated). But I don’t feel like I actually have much evidence about that, so I think it really does just come down to the arguments.
- I think Eliezer would like to say that the last 20 years give a lot of evidence for his object-level intuitions and general way of thinking about the world. If that’s the case, I think we should very strongly expect that he can state predictions about the future that will systematically be better than those of people who don’t share his intuitions or reasoning strategies. I remain happy to make predictions about any questions he thinks would provide this kind of evidence, or to state a bunch of random questions where I’m happy to predict (where I think he will probably slightly underperform me). If there aren’t any predictions about the future where these intuitions and methodologies overperform, I think you should be very skeptical that they got a lot of evidence over the last 20 years (and that’s at least something that requires explanation).
- I think Eliezer could develop good intuition about these topics that is “backed up” by predicting the results of more complicated arguments using more broadly-accepted reasoning principles. Similarly, a mathematician might have great intuitions about the truth of a theorem, and those intuitions could come entirely from feedback loops involving formal proofs rather than empirical data. But if two mathematicians had differing intuitions about a theorem, and their intuitions both came from formally proving a bunch of similar theorems, then I think the way to settle the disagreement is by using the normal rules of logic governing proofs. So this brings us back to the previous bullet point, and I think Eliezer should be more interested in actually making arguments and engaging with legitimate objections.
- I don’t think Eliezer has any kind of track record of exhibiting understanding in other ways (e.g. by accomplishing technological goals or other projects that require engaging with details of the world or making good day-to-day predictions). I think that’s OK, but it means that I more strongly expect any empirically-backed intuitions to be cashed out as either predictions from afar or more careful arguments.
Ten examples off the top of my head, which I think overlap about half with the list of lethalities, and where I think the discussions in the ELK doc are if anything more thorough than the discussions in the list of lethalities:
- Goals defined in terms of sense data are manipulable by an AI that can compromise sensors, and this is a serious obstacle to using ML to optimize what we actually care about.
- An AI may manipulate sensors by exploiting facts about the world or modes of heuristic reasoning that humans are totally unfamiliar with, such that humans couldn’t recognize such tampering even if they spent a very long time examining proposed actions.
- The human process of scientific understanding, even if automated, may end up being significantly less efficient than using gradient descent to find opaque models of the world. In that case, it may be inevitable that AI systems understand things about the world that we don’t, even if they try to help us do science.
- If an AI is trained to predict human judgments or to optimize scores as assessed by humans, then humans are likely to make errors, and the AI will eventually learn those errors rather than the intended behavior. Even if these errors aren’t themselves important, the AI will then predictably copy human errors out of distribution, leading to catastrophic outcomes.
- Even if humans make no errors in the training set, an AI which understands the world already has a model of a human which can be quickly repurposed to make good predictions about human judgments, and so it will tend to do this and therefore copy human errors off distribution.
- Even if the AI has no model of a human, in the limit where the AI’s model is very complex and alien it is still faster and simpler for the AI to learn a model of “what a human would say” from scratch than to learn the intended ontology identification. So we can’t count on SGD to find the intended solution.
- There are many training strategies that can train an AI to answer questions even in cases where humans could not answer correctly. However, most of the approaches we know now, including those being explored in practice, seem to consistently top out at “questions that humans could answer if they had a lot more compute,” which does not always seem good enough.
- We could imagine more elaborate games in which the easiest strategy for the AI is honesty, and then regularize on computation time in order to learn an honest policy. But these require us to be careful about the construction of the training data in order to ensure that the task is sufficiently hard, and no existing proposals have that property. It’s very hard to even set up games for which no strategy can outperform honesty.
- Even if you were optimizing based on reliable observations of the real world, there are many bad actions that have no human-legible consequences for many years. At the point when legible consequences materialize it may be in a world that is too complex for existing humans to evaluate whether they are good or bad. If we don’t build an AI that understands our preferences about this kind of subtle bad behavior, then a competitive world will push us into a bad outcome.
- If the simplest policy that succeeds at our task is a learned optimizer, and we try to regularize our AI to e.g. answer questions quickly, then its best strategy may be to internally search for a policy which answers questions slowly (because it’s quicker to find such a policy, and the time taken by the search is larger than the time taken by the mesapolicy). This makes it difficult to lean on regularization strategies to incentivize honesty.
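The point a few bullets up, that a system trained on human judgments will copy systematic human errors out of distribution, can be illustrated with a deliberately tiny sketch. Everything here (the sign-classification task, the error region, the nearest-neighbour fit) is my own invented toy setup, not anything from the ELK doc:

```python
# Toy illustration: a learner fit to systematically flawed "human" labels
# reproduces the flaw on held-out inputs, rather than the intended rule.

def true_label(x):
    """Intended behavior: classify by sign."""
    return 1 if x > 0 else 0

def human_label(x):
    """Hypothetical labeler who systematically errs just above the boundary."""
    if 0 < x < 0.5:  # systematic human error region (invented for the example)
        return 0     # mislabeled: the intended label is 1
    return true_label(x)

# Training set labeled by the flawed human.
train_xs = [i / 10 - 2 for i in range(41)]  # grid on [-2, 2]
train = [(x, human_label(x)) for x in train_xs]

def predict(x):
    """1-nearest-neighbour fit: a stand-in for 'learn what the human would say'."""
    nearest_x, nearest_y = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_y

# On a held-out point inside the error region, the model copies the human's
# mistake instead of exhibiting the intended behavior.
x = 0.23
print(predict(x), human_label(x), true_label(x))
```

The model is trivially simple, but the dynamic is the one the bullet points to: any sufficiently good predictor of the labels, including one found by SGD, will fit the systematic error, and that error then shows up on inputs no human ever reviewed.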
Strong +1s to many of the points here. Some things I'd highlight:
... (read more)
- Eliezer is not doing the type of reasoning that can justifiably defend the level of confidence he claims to have. If he were, he'd have much more to say about the specific details of consequentialism, human evolution, and the other key intuitions shaping his thinking. In my debate with him he mentioned many times how difficult he's found it to explain these ideas to people. I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn't have found that as difficult. (I'm sympathetic about Eliezer having in the past engaged with many interlocutors who were genuinely very bad at understanding his arguments. However, it does seem like the lack of detail in those arguments is now a bigger bottleneck.)
- I think that the intuitions driving Eliezer's disagreements with many other alignment researchers are interesting and valuable, and would love to have better-fleshed-out explanations of them publicly available. Eliezer would probably have an easier time focusing on developing his own ideas if other people in the alignment community who were pessimistic about various research directions…
But what makes you so confident that it's not possible for subject-matter experts to have correct intuitions that outpace their ability to articulate legible explanations to others?
Of course, it makes sense for other people who don't trust the (purported) expert to require an explanation, and not just take the (purported) expert's word for it. (So, I agree that fleshing out detailed examples is important for advancing our collective state of knowledge.) But the (purported) expert's own confidence should track correctness, not how easy it is to convince people using words.
Yepp, this is a judgement call. I don't have any hard and fast rules for how much you should expect experts' intuitions to plausibly outpace their ability to explain things. A few things which inform my opinion here:
I don't think Eliezer is doing particularly well on any of these criteria. In particular, the last one was why I pressed Eliezer to make predictions rather than postdictions in my debate with him. The extent to which Eliezer seemed confused that I cared about this was a noticeable update for me in the direction of believing that Eliezer's intuitions are less solid than he thinks.
It may be the case that Eliezer has strong object-level intuitions about the details of how intelligence works which he's not willing to share publicly, but which significantly increase his confidence in his public claims. If so, I think the onus is on him to highlight that so people can make a meta-level update on it.
I agree that intuitions might get you to high confidence without the ability to explain ideas legibly.
That said, I think expert intuitions still need to usually (always?) be grounded out in predictions about something (potentially including the many implicit predictions that are often required to do stuff). It seems to me like Eliezer is probably relying on a combination of:
... (read more)
- Predicting stuff from afar. I think that can usually be made legible with a few years' lead time. I'm sympathetic to the difficulty of doing this (despite my frequent snarky tone), though without doing it I think Eliezer himself should have more doubts about the possibility of hindsight bias if this is really his main source of evidence. In theory this could also be retrodictions about history which would make things more complicated in some ways but faster in others.
- Testing intuitions against other already-trusted forms of reasoning, and particularly concrete arguments. In this regime, I don't think it's necessarily the case that Eliezer ought to be able to easily write down a convincing version of the arguments, but I do think we should expect him to systematically be right more often when we dig into arguments…
Fantastic post! I agree with most of it, but I notice that Eliezer's post has a strong tone of "this is really actually important, the modal scenario is that we literally all die, people aren't taking this seriously and I need more help". More measured or academic writing, even when it agrees in principle, doesn't have the same tone or feeling of urgency. This has good effects (shaking people awake) and bad effects (panic/despair), but it's a critical difference and my guess is the effects are net positive right now.
I definitely agree that Eliezer's list of lethalities hits many rhetorical and pedagogical beats that other people are not hitting and I'm definitely not hitting. I also agree that it's worth having a sense of urgency given that there's a good chance of all of us dying (though quantitatively my risk of losing control of the universe through this channel is more like 20% than 99.99%, and I think extinction is a bit less likely still).
I'm not totally sure about the net effects of the more extreme tone, I empathize with both the case in favor and the case against. Here I'm mostly just trying to contribute to the project of "get to the bottom of what's likely to happen and what should be done."
I did start the post with a list of 19 agreements with Eliezer, including many of the claims that are most relevant to the urgency, in part so that I wouldn't be misconstrued as arguing that everything is fine.
I really appreciate your including a number here, that's useful info. Would love to see more from everyone in the future - I know it takes more time/energy and operationalizations are hard, but I'd vastly prefer to see the easier versions over no versions or norms in favor of only writing up airtight probabilities.
(I also feel much better on an emotional level hearing 20% from you, I would've guessed anywhere between 30 and 90%. Others in the community may be similar: I've talked to multiple people who were pretty down after reading Eliezer's last few posts.)
The problem with Eliezer's recent posts (IMO) is not in how pessimistic they are, but in how they are actively insulting to the reader. EY might not realize that his writing is insulting, but in that case he should have an editor who just elides those insulting points. (And also s/Eliezer/I/g please.)
Solid contribution, thank you.
Agreed explicitly for the record.
When "List of Lethalities" was posted, I privately wrote a list of where I disagreed with Eliezer, and I'm quite happy to see that there's a lot of convergence between my private list and Paul's list here.
I thought it would be a useful exercise to diff my list with Paul's; I'll record the result in the rest of this comment without the expectation that it's useful to anyone else.
Points on both lists:
... (read more)
- Eliezer's "first critical try" framing downplays the importance of trial-and-error with non-critical tries.
- It's not clear that a "pivotal act" by an aligned AI is the only way to prevent unaligned AI systems from being created.
- Eliezer badly equivocates between "alignment is hard"/"approach X to alignment doesn't obviously solve it" and "alignment is impossible to solve within our time limit"/"approach X to alignment is doomed."
- Deceptive behavior may arise from AI systems before they are able to competently deceive us, giving us some chances to iterate.
- Eliezer's arguments for fast takeoffs aren't precise enough to warrant his confidence.
- Eliezer's reasoning on generalization across distributional shift seems sloppy. Paul doesn't dig into this much, but I would add that there are approaches…
Why privately?! Is there a phenomenon where other people feel concerned about the social reception of expressing disagreement until Paul does? This is a phenomenon common in many other fields - and I'd invoke it to explain how the 'tone' of talk about AI safety shifted so quickly once I came right out and was first to say everybody's dead - and if it's also happening on the other side then people need to start talking there too. Especially if people think they have solutions. They should talk.
It seems to me like you have a blind spot regarding how your position as a community leader functions. If you, very well respected high status rationalist, write a long, angry post dedicated to showing everyone else that they can't do original work and that their earnest attempts at solving the problem are, at best, ineffective & distracting and you're tired of having to personally go critique all of their action plans... They stop proposing action plans. They don't want to dilute the field with their "noise", and they don't want you and others to think they're stupid for not understanding why their actions are ineffective or not serious attempts in the first place. I don't care what you think you're saying - the primary operative takeaway for a large proportion of people, maybe everybody except recurring characters like Paul Christiano, is that even if their internal models say they have a solution, they should just shut up because they're not you and can't think correctly about these sorts of issues.
[Redacted rant/vent for being mean-spirited and unhelpful]
I think this is, unfortunately, true. One reason people might feel this way is because they view LessWrong posts through a social lens. Eliezer posts about how doomed alignment is and how stupid everyone else's solution attempts are, that feels bad, you feel sheepish about disagreeing, etc.
But despite understandably having this reaction to the social dynamics, the important part of the situation is not the social dynamics. It is about finding technical solutions to prevent utter ruination. When I notice the status-calculators in my brain starting to crunch and chew on Eliezer's posts, I tell them to be quiet, that's not important, who cares whether he thinks I'm a fool. I enter a frame in which Eliezer is a generator of claims and statements, and often those claims and statements are interesting and even true, so I do pay attention to... (read more)
Sounds like, the same way we had a dumb-questions post, we need somewhere explicitly for posting dumb potential solutions that will totally never work, or something, maybe?
I have now posted a "Half-baked AI safety ideas thread" (LW version, EA Forum version) - let me know if that's more or less what you had in mind.
I think it's unwise to internally label good-faith thinking as "dumb." If I did that, I feel that I would not be taking my own reasoning seriously. If I say a quick take, or an uninformed take, I can flag it as such. But "dumb potential solutions that will totally never work"? Not to my taste.
That said, if a person is only comfortable posting under the "dumb thoughts incoming" disclaimer—then perhaps that's the right move for them.
Saying that people should not care about social dynamics and only about object-level arguments is a failure of world modelling. People do care about social dynamics; if you want to win, you need to take that into account. If you think that people should act differently, well, you are right, but the people who count are the real ones, not those who live in your head.
Incentives matter. On today's LessWrong, the threshold of quality for having your ideas heard (rather than everybody ganging up on you to explain how wrong you are) is much higher for people who disagree with Eliezer than for people who agree with him. Unsurprisingly, that means people filter what they say at a higher rate if they disagree with Eliezer (or any other famous user, honestly, including you).
I wondered whether people would take away the message that "The social dynamics aren't important." I should have edited to clarify, so thanks for bringing this up.
Here was my intended message: The social dynamics are important, and it's important to not let yourself be bullied around, and it's important to make spaces where people aren't pressured into conformity. But I find it productive to approach this situation with a mindset of "OK, whatever, this Eliezer guy made these claims, who cares what he thinks of me, are his claims actually correct?" This tactic doesn't solve the social dynamics issues on LessWrong. This tactic just helps me think for myself.
So, to be clear, I agree that incentives matter, I agree that incentives are, in one way or another, bad around disagreeing with Eliezer (and, to lesser extents, with other prominent users). I infer that these bad incentives spring both from Eliezer's condescension and rudeness, and also a range of other failures.
For example, if many people aren't just doing their best to explain why they best-guess-of-the-facts agree with Eliezer—if those people are "ganging up" and rederiving the bottom line of "Eliezer has to be right"—th... (read more)
Seems to be sort of an inconsistent mental state to be thinking like that and writing up a bullet-point list of disagreements with me, and somebody not publishing the latter is, I'm worried, anticipating social pushback that isn't just from me.
Respectfully, no shit Sherlock, that's what happens when a community leader establishes a norm of condescending to inquirers.
I feel much the same way as Citizen in that I want to understand the state of alignment and participate in conversations as a layperson. I, too, have spent time pondering your model of reality to the detriment of my mental health. I will never post these questions and criticisms to LW because even if you yourself don't show up to hit me with the classic:
then someone else will, having learned from your example. The site culture has become noticeably more hostile in my opinion ever since Death with Dignity, and I lay that at least in part at your feet.
Yup, I've been disappointed with how unkindly Eliezer treats people sometimes. Bad example to set.
EDIT: Although I note your comment's first sentence is also hostile, which I think is also bad.
Let me make it clear that I'm not against venting, being angry, even saying to some people "dude, we're going to die", all that. Eliezer has put his whole life into this field and I don't think it's fair to say he shouldn't be angry from time to time. It's also not a good idea to pretend things are better than they actually are, and that includes regulating your emotional state to the point that you can't accurately convey things. But if the linchpin of LessWrong says that the field is being drowned by idiots pushing low-quality ideas (in so many words), then we shouldn't be surprised when even people who might have something to contribute decide to withhold those contributions, because they don't know whether or not they're the people doing the thing he's explicitly critiquing.
You (and probably I) are doing the same thing that you're criticizing Eliezer for. You're right, but don't do that. Be the change you wish to see in the world.
That sort of thinking is why we're where we are right now.
I have no idea how that cashes out game theoretically. There is a difference between moving from the mutual cooperation square to one of the exploitation squares, and moving from an exploitation square to mutual defection. The first defection is worse because it breaks the equilibrium, while the defection in response is a defensive play.
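The game-theoretic distinction being drawn here can be made concrete with a standard prisoner's-dilemma payoff matrix (the specific numbers below are a conventional textbook choice, not anything from the thread):

```python
# Conventional prisoner's dilemma payoffs: (row player, column player).
# C = cooperate, D = defect.
payoffs = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # row player exploited
    ("D", "C"): (5, 0),  # row player exploits
    ("D", "D"): (1, 1),  # mutual defection
}

# First defection: moving from mutual cooperation to an exploitation square
# destroys cooperative surplus (joint payoff drops from 6 to 5).
assert sum(payoffs[("C", "C")]) > sum(payoffs[("D", "C")])

# Retaliation: once exploited, the victim's best available response is to
# defect too (raising their own payoff from 0 to 1), even though the pair
# then lands in the worst joint outcome (total 2).
exploited_payoff = payoffs[("D", "C")][1]    # 0
retaliation_payoff = payoffs[("D", "D")][1]  # 1
assert retaliation_payoff > exploited_payoff
```

This is the sense in which the two moves differ: the first defection lowers the joint payoff from the cooperative optimum, while the second is a defensive best response to a situation the defector didn't create.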
swarriner's post, including the tone, is True and Necessary.
Chapter 7 in this book had a few good thoughts on getting critical feedback from subordinates, specifically in the context of avoiding disasters. The book claims that merely encouraging subordinates to give critical feedback is often insufficient, and offers ideas for other things to do.
Power makes you dumb, stay humble.
Tell everyone in the organization that safety is their responsibility, everyone's views are important.
Try to be accessible and not intimidating, admit that you make mistakes.
Schedule regular chats with underlings so they don't have to take initiative to flag potential problems. (If you think such chats aren't a good use of your time, another idea is to contract someone outside of the organization to do periodic informal safety chats. Chapter 9 is about how organizational outsiders are uniquely well-positioned to spot safety problems. Among other things, it seems workers are sometimes more willing to share concerns frankly with an outsider than they are with their boss.)
Accept that not all of the critical feedback you get will be good quality.
The book recommends against anonymous surveys on the grounds that they communicate the subtext that sharing your views openly is unsafe. I think anonymous surveys might be a good idea in the EA community though -- retaliation against critics seems fairly common here (i.e. the culture of fear didn't come about by chance). Anyone who's been around here long enough will have figured out that shari... (read more)
I think it is very true that the pushback is not just from you, and that nothing you could do would drive it to zero, but also that different actions from you would lead to a lot less fear of bad reactions from both you and others.
(Treating this as non-rhetorical, and making an effort here to say my true reasons rather than reasons which I endorse or which make me look good...)
In order of importance, starting from the most important:
... (read more)
- It would take a lot of effort to turn the list of disagreements I wrote for myself into a proper post, and I decided the effort wasn't worth it. I'm impressed by how quickly Paul wrote this response, and it wouldn't surprise me if some people reading this are now wondering whether they should still post the rebuttals they've been drafting for the last week.
- As someone without name recognition, I have a general fear -- not unfounded, I think -- of posting my opinions on alignment publicly, lest they be treated as the ramblings of a self-impressed newcomer with a shallow understanding of the field. Some important context is that I'm a math grad student in the process of transitioning into a career in alignment, so I'm especially sensitive right now about safeguarding my reputation.
- I expected (rightly) that someone more established than me would end up posting a rebuttal better than mine.
- General anxiety around posting my thoughts (what if my ideas…)
OK, sure. First, I updated down on alignment difficulty after reading your lethalities post, because I had already baked an expected-EY-quality doompost into my expectations. I was seriously relieved that you hadn't found any qualitatively new obstacles which might present deep challenges to my new view on alignment.
Here's one stab at my disagreement with your list: Human beings exist, and our high-level reasoning about alignment has to account for the high-level alignment properties of the only general intelligences we have ever found to exist ever. If ontological failure is such a nasty problem in AI alignment, how come very few people do terrible things because they forgot how to bind their "love" value to configurations of atoms? If it's really hard to get intelligences to care about reality, how does the genome do it millions of times each day?
Taking an item from your lethalities post:... (read more)
Yes, human beings exist and build world models beyond their local sensory data, and have values over those world models not just over the senses.
But this is not addressing all of the problem in Lethality 19. What's missing is how we point at something specific (not just at anything external).
The important disanalogy between AGI alignment and humans as already-existing (N)GIs is:
I addressed this distinction previously, in one of the links in the OP. AFAIK we did not know how to reliably ensure the AI is pointed towards anything external at all (any target would do, as long as it's external). But humans are reliably pointed to particular kinds of external things. See the linked thread for more detail.
I am not attempting to make an analogy. Genome->human values is, mechanistically, an instance of value formation within a generally intelligent mind. For all of our thought experiments, genome->human values is the only instance we have ever empirically observed.
Huh? I think I misunderstand you. I perceive you as saying: "There is not a predictable mapping from whatever-is-in-the-genome+environmental-factors to learned-values."
If so, I strongly disagree. Like, in the world where that is true, wouldn't parents be extremely uncertain whether their children will care about hills or dogs or paperclips or door hinges? Our values are not "whatever", human values are generally formed over predictable kinds of real-world objects like dogs and people and tasty food.... (read more)
I basically agree with you. I think you go too far in saying Lethality 19 is solved, though. Using the 3 feats from your linked comment, which I'll summarise as "produce a mind that...":
(clearly each one is strictly harder than the previous) I recognise that Lethality 19 concerns feat 3, though it is worded as if being about both feat 2 and feat 3.
I think I need to distinguish two versions of feat 3:
Humans show that feat 2 at least has been accomplished, but also 3a, as I take you to be pointing out. I maintain that 3b is not demonstrated by humans and is probably something we need.
One reason you might do something like "writing up a list but not publishing it" is if you perceive yourself to be in a mostly-learning mode rather than a mostly-contributing one. You don't want to dilute the discussion with your thoughts that don't have a particularly good chance of adding anything, and you don't want to be written off as someone not worth listening to in a sticky way, but you want to write something down to develop your understanding / check against future developments / record anything that might turn out to have value later after all, once you understand better.
Of course, this isn't necessarily an optimal or good strategy, and people might still do it when it isn't - I've written down plenty of thoughts on alignment over the years, I think many of the actual-causal-reasons I'm a chronic lurker are pretty dumb and non-agentic - but I think people do reason like this, explicitly or implicitly.
There's a connection here to concernedcitizen64's point about your role as a community leader, inasmuch as your claims about the quality of the field can significantly influence people's probabilities that their ideas are useful / that they should be in a contributing mode, but IMO it's more generally about people's confidence in their contributions.
Overall I'd personally guess "all the usual reasons people don't publish their thoughts" over "fear of the reception of disagreement with high-status people" as the bigger factor here; I think the culture of LW is pretty good at conveying that high-quality criticism is appreciated.
I read the "List of Lethalities", think I understood it pretty well, and I disagree with it in multiple places. I haven't written those disagreements up like Paul did because I don't expect that doing so would be particularly useful. I'll try to explain why:
The core of my disagreement is that I think you are using a deeply mistaken framing of agency / values and how they arise in learning processes. I think I've found a more accurate framing, from which I've drawn conclusions very different to those expressed in your list, such as:
... (read more)
- Human values are not as fragile as they introspectively appear. The felt sense of value fragility is, in large part, due to a type mismatch between the cognitive processes which form, implement, and store our values on the one hand and the cognitive processes by which we introspect on our current values on the other.
- The processes by which we humans form/reflect on/generalize our values are not particularly weird among the space of processes able to form/reflect on/generalize values. Evolution pretty much grabbed the most accessible such process and minimally modified it in ways that are mostly irrelevant to alignment. E.g., I think we're more inclined to…
I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I'd have actively agreed with the claims.
Things that don't meet that bar:
General: Lots of these points make claims about what Eliezer is thinking, how his reasoning works, and what evidence it is based on. I don't necessarily have the same views, primarily because I've engaged much less with Eliezer and so don't have confident Eliezer-models. (They all seem plausible to me, except where I've specifically noted disagreements below.)
Agreement 14: Not sure exactly what this is saying. If it's "the AI will probably always be able to seize control of the physical process implementing the reward calculation and have it output the maximum value" I agree.
Agreement 16: I agree with the general point but I would want to know more about the AI system and how it was trained before evaluating whether it would learn world models + action consequences instead of "just being nice", and even with the details I expect I'd feel pretty uncertain which was more likely.
Agreement 17: It seems totally fine to focus your attention on a specific subset of "easy-alignment" worlds and ensuri... (read more)
On 22, I agree that my claim is incorrect. I think such systems probably won't obsolete human contributions to alignment while being subhuman in many ways. (I do think their expected contribution to alignment may be large relative to human contributions; but that's compatible with significant room for humans to add value / to have made contributions that AIs productively build on, since we have different strengths.)
Broadly agree with this in most points of disagreement with Eliezer, and also agree with many points of agreement.
A few points where I sort of disagree with both, although this is sometimes unclear:
I literally agree with this, but at the same time, in contrast to Eliezer's original point, I also think there is a decent chance the world would respond in a somewhat productive way, and this is a major point of leverage.
For people who doubt this, I'd point to the variance in initial governmental-level responses to COVID-19, which ranged from "highly incompetent" (e.g. the early US) to "quite competent" (e.g. Taiwan). (I also have some intuitions around this based on non-trivial amounts of first-hand experience with how governments actually work internally and make decisions - which you certainly don't need to trust, but if you are high... (read more)
It sounds like we are broadly on the same page about 1 and 2 (presumably partly because my list doesn't focus on my spiciest takes, which might have generated more disagreement).
Here are some extremely rambling thoughts on point 3.
I agree that the interaction between AI and existing conflict is a very important consideration for understanding or shaping policy responses to AI, and that you should be thinking a lot about how to navigate (and potentially leverage) those dynamics if you want to improve how well we handle any aspect of AI. I was trying to mostly point to differences in "which problems related to AI are we trying to solve?" We could think about technical or institutional or economic approaches/aspects of any problem.
With respect to "which problem are we trying to solve?": I also think potential undesirable effects of AI on the balance of power are real and important, both because it affects our long term future and because it will affect humanity's ability to cope with problems during the transition to AI. I think that problem is at least somewhat less important than alignment, but will probably get much more attention by default. I think this is especially true from a ... (read more)
Not very coherent response to #3. Roughly
... (read more)
- Caring about visible power is a very human motivation, and I'd expect it will draw many people to care about "who are the AI principals", "what are the AIs actually doing", and a few other topics, which have significant technical components
- Somewhat wild datapoints in this space: nuclear weapons, the space race. In each case, salient motivations such as "war" led some of the best technical people to work on hard technical problems. In my view, the problems the technical people ended up working on were often "vs. nature" and distant from the original social motivations
- Another take on this is, some people want to technically interesting and import problems, but some of them want to work on "legibly important" or "legibly high-status" problems
- I do believe there are some opportunities in steering some fraction of this attention toward some of the core technical problems (not toward all of them, at this moment).
- This can often depend on framing; while my guess is that, e.g., you probably shouldn't work on this, my guess is that some people who understand the technical alignment problems should
- This can also depend on social dynamics; your "naive guess" seems a good sta
Seems worth noting that Taiwan is an outlier in terms of the average IQ of its population. Given this, I find it pretty unlikely that a typical governmental response to AI would be more akin to Taiwan's than the US's.
I absolutely agree. Australia has done substantially better than most other nations regarding COVID from economic, health, and lifestyle points of view alike. The two largest cities did somewhat worse in lifestyle terms for some periods, but most other places had far fewer and less onerous restrictions than most other countries for nearly 2 years. I personally was very happy to have lived with essentially zero risk of COVID and essentially zero restrictions, both personal and economic, for more than a year and a half.
A conservative worst-case estimate for costs of an uncontrolled COVID outbreak in Australia was on the order of 300,000 deaths and about $600 billion direct economic loss over 2 years, along with even larger economic impacts from higher-order effects.
We did very much better than that, especially in health outcomes. We had 2,000 deaths up until giving up on elimination in December last year, which was about 0.08 deaths per thousand. Even after giving up on local elimination, we still have only 0.37 deaths per thousand, compared with 3.0 per thousand in the United States.
Economic losses are also substantially smaller than the US's relative to the pre-pandemic economy, but the attribution of causes there is much more contentious, as with everything to do with economics.
This is a thread for anyone who wants to give a high-level take or reaction that isn't contributing much to the discussion (and thus isn't worth a top-level comment).
I broadly agree with this much more than with Eliezer's list, and I think this did a good job of articulating a bunch of my fuzzy "this seems off" reactions. Most notably: Eliezer underrates the importance and tractability of interpretability, and overrates the discontinuity of AI progress.
I found it really helpful to have a list of places where Eliezer and Paul agree. It's interesting to see that there is a lot of similarity on big picture stuff like AI being extremely dangerous.
Do you think that some of my disagreements should change if I had shorter timelines?
(As mentioned last time we talked, but readers might not have seen: I'm guessing ~15% on singularity by 2030 and ~40% on singularity by 2040.)
I think most of your disagreements on this list would not change.
However, I think if you conditioned on 50% chance of singularity by 2030 instead of 15%, you'd update towards faster takeoff, less government/societal competence (and thus things more likely to fail at an earlier, less dignified point), more unipolar/local takeoff, lower effectiveness of coordination/policy/politics-style strategies, less interpretability and other useful alignment progress, less chance of really useful warning shots... and of course, significantly higher p(doom).
To put it another way, when I imagine what (I think) your median future looks like, it's got humans still in control in 2035, sitting on top of giant bureaucracies of really cheap, really smart proto-AGIs that fortunately aren't good enough at certain key skills (like learning-to-learn, or concept formation, or long-horizon goal-directedness) to be an existential threat yet, but are definitely really impressive in a bunch of ways and are reshaping the world economy and political landscape and causing various minor disasters here and there that serve as warning shots. So the whole human world is super interested in AI stuff and policymakers ar... (read more)
Curated. Eliezer's List of Lethalities post has received an immense amount of attention, rightly so given the content, and I am extremely glad to see this response go live since Eliezer's views do not reflect a consensus, and it would be sad to have only one set of views be getting all the attention when I do think many of the questions are non-obvious.
I am very pleased to see public back-and-forth on questions of not just "how and whether we are doomed," but the specific gears behind them (where things will work vs. where they cannot). These questions bear on the enormous resources being poured into AI safety work right now. Ensuring those resources get allocated in a way that actually improves the odds of our success is key.
I hope that others continue to share and debate their models of the world, alignment, strategy, etc. in a way that is both on record and easily findable by others. Hopefully, we can look back in 10, 20, or 50 years and reflect on how well we reasoned in these cloudy times.
RE discussion of gradual-ness, continuity, early practice, etc.:
FWIW, here’s how I currently envision AGI developing, which seems to be in a similar ballpark as Eliezer’s picture, or at least closer than most people I think? (Mostly presented without argument.)
There’s a possible R&D path that leads to a model-based RL AGI. It would be very agent-y, have some resemblance to human brain algorithms (I claim), and be able to “figure things out” and “mull things over” and have ideas and execute on them, and understand the world and itself, etc., akin to how humans do all those things.
Large language models (LLMs) trained mainly by self-supervised learning (SSL), as built today, are not that path (although they might include some ingredients which would overlap with that path). In my view, those SSL systems are almost definitely safer, and almost definitely much less capable, than the agent-y model-based RL path. For example, I don’t think that the current SSL-LLM path is pointing towards “The last invention that man need ever make”. I won’t defend that claim here.
But meanwhile, like it or not, lots of other people are as we speak racing down the road towards the more brain-like, mor... (read more)
My expectation is that people will turn SSL models into agentic reasoners. I think this will happen through refinements to “chain of thought”-style reasoning approaches. See here. Such approaches absolutely do let LLMs “mull things over” to a limited degree, even with current very crude methods to do chain of thought with current LLMs. I also think future RL advancements will be more easily used to get better chain of thought reasoners, rather than accelerating a new approach to the SOTA.
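FWIW, a minimal toy sketch of the dynamic being described (my own illustration, not the commenter's code): `reasoning_step` below is a hypothetical stand-in for a single model forward pass, not any real LLM API. The point is only that looping a single weak step lets a system "mull over" a multi-step problem that one step alone can't finish:

```python
# Toy illustration of chain-of-thought-style iterative reasoning.
# `reasoning_step` stands in for one forward pass of a model; here it
# is a hypothetical stub that folds the leftmost addition in an
# arithmetic expression, one "thought" at a time.

def reasoning_step(scratchpad: str) -> str:
    """One 'thought': rewrite the leftmost `a+b` as its sum."""
    parts = scratchpad.split("+")
    if len(parts) == 1:
        return scratchpad  # nothing left to reduce
    first = int(parts[0]) + int(parts[1])
    return "+".join([str(first)] + parts[2:])

def chain_of_thought(problem: str, max_steps: int = 10) -> str:
    """Let the 'model' mull the problem over across several steps."""
    scratchpad = problem
    for _ in range(max_steps):
        nxt = reasoning_step(scratchpad)
        if nxt == scratchpad:  # fixed point: final answer reached
            break
        scratchpad = nxt
    return scratchpad

print(chain_of_thought("2+3+4+5"))  # reduces step by step to "14"
```

No single call to `reasoning_step` can evaluate the whole expression, but the loop can; the analogy to crude chain-of-thought prompting is loose but, I think, suggestive.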
Liked this post a lot. In particular I think I strongly agree with "Eliezer raises many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument" as the general vibe of how I feel about Eliezer's arguments.
A few comments on the disagreements:
An in-between position would be to argue that even if we're maximally competent at the institutional problem, and can extract all the information we possibly can through experimentation before the first critical try, that just prevents the really embarrassing failures. Irrecoverable failures could still pop up every once in a while after entering the critical regime that we just could not have been prepared for, unless we have a full True Name of alignment. I think the crux here depends on your view on the Murphy-constant of the world (i.e how likely we are to get unknown unknown failures), and how long you think we need to spend in the critical reg... (read more)
Yup. You can definitely train powerful systems on imitation of human thoughts, and in the limit this just gets you a powerful mesa-optimizer that figures out how to imitate them.
The question is when you get a misaligned mesa-optimizer relative to when you get superhuman behavior.
I think it's pretty clear that you can get an optimizer which is upstream of the imitation (i.e. whose optimization gives rise to the imitation), or you can get an optimizer which is downstream of the imitation (i.e. which optimizes in virtue of its imitation). Of course most outcomes are messier than those two extremes, but the qualitative distinction still seems really central to these arguments.
I don't think you've made much argument about when the transition occurs. Existing language models strongly appear to be "imitation upstream of optimization." For example, it is much easier to get optimization out of them by having them imitate human optimization, than by setting up a situation where solving a hard problem is necessary to predict human behavior.
I don't know when you expect this situation to change; if you want to make some predictions then you could use empirical data to help support your view. By default I would interpret each stronger system with "imitation upstream of optimization" to be weak evidence that the transition will be later than you would have thought. I'm no... (read more)
Epistemic status: some of these ideas only crystallized today, normally I would take at least a few days to process before posting to make sure there are no glaring holes in the reasoning, but I saw this thread and decided to reply since it's topical.
Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think this is true of transformers). In order for Bayesian inference to converge to exact imitation, you usually need realizability. Obviously today we don't have realizability, because the ANNs currently in use are not big enough to contain a brain, but we're gradually getting there.
More precisely, as ANNs grow in size we're approaching a regime I dubbed "pseudorealizability": on the one hand, the brain is in the prior; on the other hand, its description complexity is pretty high and therefore its prior probability is pretty low. Moreover, a more sophisticated agent (e.g. infra-Bayesian RL / Turing RL / infra-Bayesian physicalist) would be able to use the rest of the world as useful evidence to predict some features of the human brain (i.e. even though human brains are complex, they are not random, there are rea... (read more)
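For readers less familiar with the realizability condition being invoked here, the standard Bayesian bound it rests on can be stated as follows (my gloss and notation, not the commenter's own formalism): if the true hypothesis $h^*$ has nonzero prior mass, the Bayes mixture's cumulative prediction regret is bounded by $\log 1/\pi(h^*)$, so under a simplicity prior a high-complexity truth like "a human brain" has low prior mass and converges slowly:

```latex
% Posterior update and the classical realizability bound (standard
% Bayesian asymptotics; notation is mine, not the commenter's).
\[
\pi(h \mid x_{1:n}) \;\propto\; \pi(h)\,\prod_{i=1}^{n} P_h(x_i \mid x_{<i}),
\qquad
\sum_{i=1}^{n}\mathbb{E}\!\left[\log \frac{P_{h^*}(x_i \mid x_{<i})}{P_{\mathrm{Bayes}}(x_i \mid x_{<i})}\right]
\;\le\; \log \frac{1}{\pi(h^*)}.
\]
% With a simplicity prior $\pi(h) \propto 2^{-K(h)}$, the bound is
% $O(K(h^*))$: the higher the description complexity of the true
% hypothesis, the lower its prior and the more data needed before
% the imitator converges.
```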
I notice that, as someone without domain-specific knowledge of this area, Paul's article seems to fill my model of a reality-shaped hole better than Eliezer's. This may just be an artifact of the specific use of language and detail that Paul provides and Eliezer does not, and Eliezer may have specific things he could say about all of these points but is choosing not to. Paul's response at least makes it clear to me that people like me, without domain-specific knowledge, are prone to being pulled psychologically in various directions by use of language, and should be very careful about making important life decisions based on concerns of AI safety without first educating themselves much further on the topic, especially since giving attention and funding to the issue at least has the capacity to cause harm.
I skimmed through the report and didn't find anything that looked like a centralized bullet point list of difficulties. I think it's valuable in general if people say what the problems are that they're trying to solve, and then collect them into a place so people can look them over simultaneously. I realize I haven't done enough of this myself, but if you've already written up the component pieces, that can make it easier to collect the bullet list.
I'm not sure if you are saying that you skimmed the report right now and couldn't find the list, or that you think that it was a mistake for the report not to contain a "centralized bullet point list of difficulties."
If you are currently looking for the list of difficulties: see the long footnote.
If you think the ELK report should have contained such a list: I definitely don't think we wrote this report optimally, but we tried our best and I'm not convinced this would be an improvement. The report is about one central problem that we attempt to state at the very top. Then there are a series of sections organized around possible solutions and the problems with those solutions, which highlight many of the general difficulties. I don't intuitively feel like a bulleted list of difficulties would have been a better way to describe the difficulties.
> The difference is that reality doesn’t force us to solve the problem, or tell us clearly which analogies are the right ones,
> does not have such a large effect on the scientific problem.
Another major difference is that we're forced to solve the problem using only analogies (and reasoning), as opposed to also getting to study the actual objects in question. And, there's a big boundary between AIs that would lose vs. win a fight with humanity, which causes big disanalogies between AIs, and how alignment strategies apply to AIs, before and after that boundary. (Presumably there's major disagreement about how important these disanalogies are / how difficult they are to circumvent with other analogies.)
> AI is accelerating the timetable for both alignment and capabilities
AI accelerates the timetable for things we know how to point AI at (which shades into convergently instrumental things that we point at just by training an AI to do anything). We know how to point AI at things that can be triangulated with clear metrics, like "how well does the sub-AI you programmed perform at such and such tasks". We much less know how to point AI at alignment, or at more general things like... (read more)
This seems like a crux for the Paul-Eliezer disagreement which can explain many of the other disagreements (it's certainly my crux). In particular, conditional on taking Eliezer's side on this point, a number of Eliezer's other points all seem much more plausible e.g. nanotech, advanced deception/treacherous turns, and pessimism regarding the pace of alignment research.
There's been a lot of debate on this point, and some of it was distilled by Rohin. Seems to me that the most productive way to move forward on this disagreement would be to distill the rest of the relevant MIRI conversations, and solicit arguments on the relevant cruxes.
RE Disagreement 5: Some examples where the aligned AIs will not consume the “free energy” of an out-of-control unaligned AI are:
1. Exploiting the free energy requires humans trusting the AIs more than they actually do. For example, humans with a (supposedly) aligned AGI may not trust the AGI to secure their own nuclear weapons systems, or to hack into its enemies’ nuclear weapons systems, or to do recursive self-improvement, or to launch von Neumann probes that can never be called back. But an out-of-control AGI would presumably be willing to do all those things.
2. Exploiting the free energy requires violating human laws, norms, Overton Windows, etc., or getting implausibly large numbers of human actors to agree with each other, or suffering large immediate costs for uncertain benefits, etc., such that humans don’t actually let their aligned AGIs do that. For example, maybe the only viable gray goo defense system consists of defensive nanobots that go proliferate in the biosphere, harming wildlife and violating national boundaries. Would people + aligned AGIs actually go and deploy that system? I’m skeptical. Likewise, if there’s a neat trick to melt all the non-whitelisted GP... (read more)
I, personally, would like 5 or 10 examples, from disparate fields, of verification being easier than generation.
And also counterexamples, if anyone has any.
I'm just going to name random examples of fields. I think it's true essentially all the time, but I only have personal experience in a small number of domains where I've actually worked:
... (read more)
- It's easier to recognize a good paper in computer science or ML than to write one. I'm most familiar with theoretical computer science, where this is equally true in domains that are not yet formalized, e.g. a mediocre person in the field is still able to recognize important new conceptual ideas without being able to generate them. In ML it requires more data than is typically present in a paper (but e.g. can be obtained by independent replications or by being able to inspect code).
- Verifying that someone has done a good job writing software is easier than writing it yourself, if you are able to e.g. interact with the software, get clear explanations of what they did and why, and have them also write good tests.
- Verifying a theory in physics is easier than generating it. Both in the sense that it's much easier to verify that QM or the standard model or general relativity is a good explanation of existing phenomena than it is to come up with those models from scratch, and in the sense that e.g. verifyin
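To make the asymmetry concrete with a toy computational example of my own (not one from the thread): checking a proposed subset-sum certificate takes time linear in the certificate, while finding one by brute force takes exponential time in the worst case.

```python
from itertools import combinations

def verify(numbers, subset, target):
    """Verification: a linear scan of the certificate."""
    return subset is not None and set(subset) <= set(numbers) \
        and sum(subset) == target

def generate(numbers, target):
    """Generation: brute-force search over all 2^n subsets."""
    for r in range(len(numbers) + 1):
        for combo in combinations(numbers, r):
            if sum(combo) == target:
                return list(combo)
    return None

nums = [3, 34, 4, 12, 5, 2]
cert = generate(nums, 9)      # exponential-time search finds [4, 5]
print(verify(nums, cert, 9))  # True, checked in linear time
```

The same gap shows up for any NP problem with short certificates; the contested question in the thread is how far the analogy stretches to messy domains like research taste.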
I expect there will probably be a whole debate on this at some point, but as counterexamples I would give basically all the examples in When Money is Abundant, Knowledge is the Real Wealth and What Money Cannot Buy. The basic idea in both of these is that expertise, in most fields, is not easier to verify than to generate, because most of the difficulty is in figuring out what questions to ask and what to pay attention to, which itself require expertise.
More generally, I expect that verification is not much easier than generation in any domain where figuring out what questions to ask and what to pay attention to is itself the bulk of the problem. Unfortunately, this is very highly correlated with illegibility, so legible examples are rare.
One particularly difficult case is when the thing you're trying to verify has a subtle flaw.
Consider Kempe's proof of the four colour theorem, which was generally accepted for eleven years before being refuted. (It is in fact a proof of the five-colour theorem)
And of course, subtle flaws are much more likely in things that someone has designed to deceive you.
Against an intelligent adversary, verification might be much harder than generation. I'd cite Marx and Freud as world-sweeping obviously correct theories that eventually turned out to be completely worthless. I can remember a time when both were taken very seriously in academic circles.
One different way I've been thinking about this issue recently is that humans have fundamental cognitive limits (e.g. brain size) that AGI wouldn't have. There are possible biotech interventions to fix these, but even the easiest ones (e.g. just increasing skull size) still require decades to start up. AI, meanwhile, could be improved (by humans and AIs) on much faster timescales. (How important something like brain size is depends on how much intellectual progress is explained by max intelligence rather than total intelligence; a naive reading of intellectual history would say max intelligence is important g... (read more)
My sense is that we are on broadly the same page here. I agree that "AI improving AI over time" will look very different from "humans improving humans over time" or even "biology improving humans over time." But I think that it will look a lot like "humans improving AI over time," and that's what I'd use to estimate timescales (months or years, most likely years) for further AI improvements.
This seems wrong to me, could you elaborate? Prompt: Presumably you think we do have a plan, it just doesn't meet Eliezer's standards. What is that plan?... (read more)
I think most worlds, surviving or not, don't have a plan in the sense that Eliezer is asking about.
I do agree that in the best worlds, there are quite a lot of very good plans and extensive analysis of how they would play out (even if it's not the biggest input into decision-making). Indeed, I think there are a lot of things that the best possible world would be doing that we aren't, and I'd give that world a very low probability of doom even if alignment was literally impossible-in-principle.
ETA: this is closely related to Richard's point in the sibling.
I think it's less about how many holes there are in a given plan, and more like "how much detail does it need before it counts as a plan?" If someone says that their plan is "Keep doing alignment research until the problem is solved", then whether or not there's a hole in that plan is downstream of all the other disagreements about how easy the alignment problem is. But it seems like, separate from the other disagreements, Eliezer tends to think that having detailed plans is very useful for making progress.
Analogy for why I don't buy this: I don't think that the Wright brothers' plan to solve the flying problem would count as a "plan" by Eliezer's standards. But it did work.
I think we don't know whether various obvious-to-us-now things will work with effort. I think we don't really have a plan that would work with an acceptably high probability and stand up to scrutiny / mildly pessimistic assumptions.
I would guess that if alignment is hard, then whatever we do ultimately won't follow any existing plan very closely (whether we succeed or not). I do think it's reasonably likely to agree with some plan at a very high level. I think that's also true even in the much better worlds that do have tons of plans.
I wouldn't say there is "a plan" to do that.
Many people have that hope, and have thought some about how we might establish sufficient consensus about risk to delay AGI deployment for 0.5-2 years if things look risky, and how to overcome various difficulties with implementing that kind of delay, or what kind of more difficult moves might be able to delay significantly longer than that.
I was just thinking about this. The central example that's often used here is "evolution optimized humans for inclusive genetic fitness, nonetheless humans do not try to actually maximize the amount of their surviving offspring, such... (read more)
Thanks for writing this!
Typo: "I see this kind of thinking from Eliezer a lot but it seems misleading or long" should be "...or wrong"
Regarding disagreement (2), I think many of Yudkowsky's "doom stories" are more intuition pumps / minimum bounds for demonstrating properties of superintelligence.
E.g. nanotech isn't there because he necessarily thinks it's what an unaligned AGI would do. Instead, it's to demonstrate how high the relative tech capabilities are of the AGI.
His point (which he stresses in different ways), is "don't look at the surface details of the story, look instead at the implied capabilities of the system".
Similar with "imagine it self-improving in minutes". It may or ma... (read more)
Regarding disagreement (7): I'd like to see more people using AI to try and make useful contributions to alignment.
More broadly, I think the space of alignment working methods, literally the techniques researchers would use day-to-day, has been under-explored.
If the fate of the world is at stake, shouldn't someone at least try hokey idea-generation techniques lifted from corporations? Idea-combinations generators? Wacky proof-helper softwares? Weird physical-office setups like that 10-chambered linear room thing I saw somewhere but can't find now? I don't ... (read more)
Russian translation by me
Wouldn't demonstrating the risk increase motivation for capability gains for everyone else?
"I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain."
This seems like a completely absurd claim to me, unless by verification you mean some much weaker claim like that you can show something sometimes works.
Coming from the world of software, generating solutions that seem to work is almost always far easier than any sort of formal verification that they work. I think this will be doubly true in any sort of adversarial situation where any f... (read more)
I've long interpreted Eliezer, in terms of your disagreements [2-6], as offering deliberately exaggerated examples.
I do think you might be right about this [from disagreement 2]:
I do like your points overall for disagreements  and .
I feel like there's still something being 'lost in translation'. When I think of the Eliezer-AGI and why ... (read more)
Excellent post, thank you Paul. This is an important message that the community needs to hear right now.
Posting this comment to start some discussion about generalization and instrumental convergence (disagreements #8 and #9).
So my general thoughts here are that ML generalization is almost certainly not good enough for alignment. (At least in the paradigm of deep learning.) I think it's true with high confidence that if we're trying to train a neural net to imitate some value function, and that function takes a high-dimensional input, then it will be possible to find lots of inputs that cause the network to produce a high value when the value function produc... (read more)
What do you think these technical problems are?
Number 22:... (read more)
Magically given a very powerful, unaligned AI. (This 'the utility function is in code, in one place, and can be changed' assumption needs re-examination. Even if we assert it exists in there*, it might be hard to change in, say, a NN.)
* Maybe this is overgeneralizing from people, but what reason do we have to think an 'AI' will be really good at figuring out its utility function (so it can make changes without changing it, if it so desires). ... (read more)
I wonder what, if any, scientific/theoretical problems have been solved right "on the first try" in human history. I know MIRI and others have done studies of history to find examples of e.g. technological discontinuities. Perhaps a study could be made of this?
An example Yudkowsky brings up in the Sequences often, is Einstein's discovery of General Relativity. I think this is informative and helpful for alignment. Einstein did lots of thought experi... (read more)
This has overtaken the post it's responding to as the top-karma post of all time.
Yes, it's never an equilibrium state for Eliezer communicating key points about AI to be the highest karma post on LessWrong. There's too much free energy to be eaten by a thoughtful critique of his position. On LW 1.0 it was Holden's Thoughts on the Singularity Institute, and now on LW 2.0 it's Paul's list of agreements and disagreements with Eliezer.
Finally, nature is healing.
"By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research."
I think this assumption is unlikely. From what we know of human-led research, accelerating AI capabilities is much easier than accelerating progress in alignment. I don't see why it would be different for an AI.
I think there will be substantial technical hurdles along the lines of getting in-principle highly capable AI systems to reliably do what we want them to, th... (read more)
I found this post very useful! I went through the list and wrote down my thoughts on the points, posting them here in case they are of interest to others.
Some high-level comments first.
Disclaimer: I'm not senior enough to have consistent inside-views. I wrote up a similar list a few days ago in response to Yudkowsky's post, and some of my opinions have changed.
In particular, I note that I have been biased to agree with Yudkowsky for reasons unrelated to actual validity of arguments, such as "I have read more texts by him than any other single person".
So... (read more)
(26) I think by "a plan", Yudkowsky partially means "a default paradigm and relevant concrete problems". There's no consensus on the first one, and Yudkowsky would disagree on the second one (since he thinks most current concrete problems are irrelevant to the core/eventual problem).
Disagreement (4): I think Yudkowsky maybe expects AGI to recursively self-improve on the way to becoming human-level.
Mostly just here to say "I agree", especially regarding
A lot of EY's points follow naturally if you think that the first AGI will be a recursively self improving maximally Bayesian rei... (read more)
I hope you're right.