This is mostly in response to stuff written by Richard, but I'm interested in everyone's read of the situation.
While I don't find Eliezer's core intuitions about intelligence too implausible, they don't seem compelling enough to do as much work as Eliezer argues they do. As in the Foom debate, I think that our object-level discussions were constrained by our different underlying attitudes towards high-level abstractions, which are hard to pin down (let alone resolve).
Given this, I think that the most productive mode of intellectual engagement with Eliezer's worldview going forward is probably not to continue debating it (since that would likely hit those same underlying disagreements), but rather to try to inhabit it deeply enough to rederive his conclusions and find new explanations of them which then lead to clearer object-level cruxes.
I'm not sure yet how to word this as a question without some introductory paragraphs. When I read Eliezer, I often feel like he has a coherent worldview that sees lots of deep connections and explains lots of things, and that he's actively trying to be coherent / explain everything. [This is what I think you're pointing to with his 'attitude toward...
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict"). At a high level I don't think "mainline" is a great concept for describing probability distributions over the future except in certain exceptional cases (though I may not understand what "mainline" means), and that neat stories that fit everything usually don't work well (unless, or often even if, generated in hindsight).
In answer to your "why is this," I think it's a combination of moderate differences in functioning and large differences in communication style. I think Eliezer has a way of thinking about the future that is quite different from mine and I'm somewhat skeptical of and feel like Eliezer is overselling (which is what got me into this discussion), but that's probably smaller than a large difference in communication style (driven partly by different skills, different aesthetics, and different ideas about what kinds of standards discourse should aspire to).
I think I may not understand well the basic lesson / broader point, so will probably be more helpful on object level points and will mostly go answer those in the time I have.
I feel like I have a broad distribution over worlds and usually answer questions with probability distributions, that I have a complete mental universe (which feels to me like it outputs answers to a much broader set of questions than Eliezer's, albeit probabilistic ones, rather than bailing with "the future is hard to predict").
Sometimes I'll be tracking a finite number of "concrete hypotheses", where every hypothesis is 'fully fleshed out', and be doing a particle-filtering style updating process, where sometimes hypotheses gain or lose weight, sometimes they get ruled out or need to split, or so on. In those cases, I'm moderately confident that every 'hypothesis' corresponds to a 'real world', constrained by how well I can get my imagination to correspond to reality. [A 'finite number' depends on the situation, but I think it's normally something like 2-5, unless it's an area I've built up a lot of cache about.]
Sometimes I'll be tracking a bunch of "surface-level features", where the distributions on the features don't always imply coherent underlying worlds, either on their own or in combination with other features. (For example, I might have guesses about the probability th...
I think my way of thinking about things is often a lot like "draw random samples," more like drawing N random samples than particle filtering (I guess since we aren't making observations as we go; if I notice an inconsistency, the thing I do is more like backtracking and starting over with N fresh samples, having updated on the logical fact).
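To make the contrast between the two styles concrete, here is a minimal Python sketch. It is only a toy under stated assumptions: the hypothesis labels, the likelihood numbers, the two boolean features, and the consistency rule are all hypothetical and are not taken from anyone's actual models. The first function reweights a small fixed set of hypotheses after an observation (the particle-filtering style); the second throws everything away and draws N fresh worlds that pass a consistency check (the "start over with N fresh samples" style).

```python
import random


def particle_filter_update(weights, likelihoods):
    """Reweight a fixed set of hypotheses by how well each predicted an observation.

    Hypotheses whose likelihood is zero are ruled out and dropped.
    """
    unnormalized = {h: weights[h] * likelihoods[h] for h in weights}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the observation rules out every tracked hypothesis")
    return {h: w / total for h, w in unnormalized.items() if w > 0}


def fresh_samples(sample_world, is_consistent, n, rng, max_tries=10_000):
    """Draw n worlds from scratch, keeping only those that pass the consistency check."""
    kept = []
    for _ in range(max_tries):
        world = sample_world(rng)
        if is_consistent(world):
            kept.append(world)
            if len(kept) == n:
                return kept
    raise RuntimeError("could not find n consistent worlds within max_tries")


# Particle-filtering style: three hypothetical hypotheses, one observation.
prior = {"A": 0.5, "B": 0.3, "C": 0.2}
posterior = particle_filter_update(prior, {"A": 0.1, "B": 0.6, "C": 0.0})


# Fresh-sample style: toy worlds are two boolean surface features, and the
# (hypothetical) logical fact we just noticed is that both cannot hold at once.
def sample_world(rng):
    return {"feature_a": rng.random() < 0.5, "feature_b": rng.random() < 0.5}


def is_consistent(world):
    return not (world["feature_a"] and world["feature_b"])


worlds = fresh_samples(sample_world, is_consistent, n=5, rng=random.Random(0))
```

In the first style hypothesis "C" is ruled out and the remaining weight shifts toward "B"; in the second style there is no persistent set of hypotheses at all, only a new batch of worlds that respect the updated constraint.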
The main complexity feels like the thing you point out where it's impossible to make them fully fleshed out, so you build a bunch of intuitions about what is consistent (and could be fleshed out given enough time) and then refine those intuitions only periodically when you actually try to flesh something out and see if it makes sense. And often you go even further and just talk about relationships amongst surface level features using intuitions refined from a bunch of samples.
I feel like a distinctive feature of Eliezer's dialog w.r.t. foom / alignment difficulty is that he has a lot of views about strong regularities that should hold across all of these worlds. And then disputes about whether worlds are plausible often turn on things like "is this property of the described world likely?" which is tough because obviously everyone agrees that ev...
EDIT: I wrote this before seeing Paul's response; hence a significant amount of repetition.
They often seem to emit sentences that are 'not absurd', instead of 'on their mainline', because they're mostly trying to generate sentences that pass some shallow checks instead of 'coming from their complete mental universe.'
Why is this?
Well, there are many boring cases that are explained by pedagogy / argument structure. When I say things like "in the limit of infinite oversight capacity, we could just understand everything about the AI system and reengineer it to be safe", I'm obviously not claiming that this is a realistic thing that I expect to happen, so it's not coming from my "complete mental universe"; I'm just using this as an intuition pump for the listener to establish that a sufficiently powerful oversight process would solve AI alignment.
That being said, I think there is a more interesting difference here, but that your description of it is inaccurate (at least for me).
From my perspective I am implicitly representing a probability distribution over possible futures in my head. When I say "maybe X happens", or "X is not absurd", I'm saying that my probability distribution assign...
In response to your last couple paragraphs: the critique, afaict, is not "a real human cannot keep multiple concrete scenarios in mind and speak probabilistically about those", but rather "a common method for representing lots of hypotheses at once, is to decompose the hypotheses into component properties that can be used to describe lots of concrete hypotheses. (toy model: instead of imagining all numbers, you note that some numbers are odd and some numbers are even, and then think of evenness and oddness). A common failure mode when attempting this is that you lose track of which properties are incompatible (toy model: you claim you can visualize a number that is both even and odd). A way to avert this failure mode is to regularly exhibit at least one concrete hypothesis that simultaneously possesses whatever collection of properties you say you can simultaneously visualize (toy model: demonstrating that 14 is even and 7 is odd does not in fact convince me that you are correct to imagine a number that is both even and odd)."
On my understanding of Eliezer's picture (and on my own personal picture), almost nobody ever visibly tries to do this (never mind succeeding), when it comes to hopeful AGI scenarios.
Insofar as you have thought about at least one specific hopeful world in great detail, I strongly recommend spelling it out, in all its great detail, to Eliezer, next time you two chat. In fact, I personally request that you do this! It sounds great, and I expect it to constitute some progress in the debate.
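The even/odd toy model above can be written down as a small witness search. This is a hedged sketch, not anyone's actual method: `find_witness` and the property functions are names made up for illustration. The point it shows is the difference between checking each property on a separate example (14 is even, 7 is odd) and exhibiting a single example that has all the claimed properties at once.

```python
from typing import Callable, Iterable, Optional

Property = Callable[[int], bool]


def find_witness(properties: Iterable[Property], candidates: Iterable[int]) -> Optional[int]:
    """Return the first candidate that satisfies every property, or None if none does."""
    props = list(properties)
    for candidate in candidates:
        if all(prop(candidate) for prop in props):
            return candidate
    return None


def is_even(n: int) -> bool:
    return n % 2 == 0


def is_odd(n: int) -> bool:
    return n % 2 == 1


def is_multiple_of_7(n: int) -> bool:
    return n % 7 == 0


# Each property is satisfiable on its own...
assert is_even(14) and is_odd(7)

# ...but only a single shared witness shows that a combination is coherent.
even_and_multiple_of_7 = find_witness([is_even, is_multiple_of_7], range(1, 1000))
even_and_odd = find_witness([is_even, is_odd], range(1, 1000))
```

Here `even_and_multiple_of_7` is 14 and `even_and_odd` is `None`. Note that in general a failed search over a finite candidate set is only weak evidence of incompatibility; the useful direction is the positive one, where a found witness settles that the properties can coexist.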
Relevant Feynman quote:
I had a scheme, which I still use today when somebody is explaining something that I’m trying to understand: I keep making up examples.
For instance, the mathematicians would come in with a terrific theorem, and they’re all excited. As they’re telling me the conditions of the theorem, I construct something which fits all the conditions. You know, you have a set (one ball)-- disjoint (two balls). Then the balls turn colors, grow hairs, or whatever, in my head as they put more conditions on.
Finally they state the theorem, which is some dumb thing about the ball which isn’t true for my hairy green ball thing, so I say “False!” [and] point out my counterexample.
As I understand it, when you "talk about the mainline", you're supposed to have some low-entropy (i.e. confident) view on how the future goes, such that you can answer very different questions X, Y and Z about that particular future, that are all correlated with each other, and all get (say) > 50% probability. (Idk, as I write this down, it seems so obviously a bad way to reason that I feel like I must not be understanding it correctly.)
But to the extent this is right, I'm actually quite confused why anyone thinks "talk about the mainline" is an ideal to which to aspire. What makes you expect that?
I'll try to explain the technique and why it's useful. I'll start with a non-probabilistic version of the idea, since it's a little simpler conceptually, then talk about the corresponding idea in the presence of uncertainty.
Suppose I'm building a mathematical model of some system or class of systems. As part of the modelling process, I write down some conditions which I expect the system to satisfy - think energy conservation, or Newton's Laws, or market efficiency, depending on what kind of systems we're talking about. My hope/plan is to derive (i.e. prove) some predictions from these...
The most recent post has a related exchange between Eliezer and Rohin:
Eliezer: I think the critical insight - though it has a format that basically nobody except me ever visibly invokes in those terms, and I worry maybe it can only be taught by a kind of life experience that's very hard to obtain - is the realization that any consistent reasonable story about underlying mechanisms will give you less optimistic forecasts than the ones you get by freely combining surface desiderata
Rohin: Yeah, I think I do not in fact understand why that is true for any consistent reasonable story.
If I'm being locally nitpicky, I argue that Eliezer's thing is a very mild overstatement (it should be "≤" instead of "<") but given that we're talking about forecasts, we're talking about uncertainty, and so we should expect "less" optimism instead of just "not more" optimism, and so I think Eliezer's statement stands as a general principle about engineering design.
This also feels to me like the sort of thing that I somehow want to direct attention towards. Either this principle is right and relevant (and it would be good for the field if all the AI safety thinkers held it!), or there's some deep confusion of mine that I'd like cleared up.
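One way to read the "≤" version of the principle is as a statement about maximizing over a feasible set. The sketch below is my own toy rendering, with hypothetical design names, made-up scores, and an assumed scoring rule (the product of two desiderata); none of it comes from the original dialogue. It compares the best score achievable by any single consistent design against the score you get by taking the best value of each desideratum separately and pretending they can be combined.

```python
# Hypothetical candidate designs, each a consistent bundle of two desiderata.
designs = {
    "design_a": {"safety": 0.9, "capability": 0.2},
    "design_b": {"safety": 0.5, "capability": 0.6},
    "design_c": {"safety": 0.1, "capability": 0.95},
}

DESIDERATA = ("safety", "capability")


def score(bundle):
    """Assumed scoring rule: monotone increasing in every desideratum."""
    return bundle["safety"] * bundle["capability"]


# Forecast from consistent stories: pick the best design that actually exists.
best_consistent = max(score(d) for d in designs.values())

# Forecast from freely combined surface desiderata: take the best value of
# each feature separately, whether or not any single design achieves both.
free_combination = {k: max(d[k] for d in designs.values()) for k in DESIDERATA}
free_score = score(free_combination)

# Because the score is monotone in each desideratum, every real design is
# dominated feature-by-feature by the free combination, so it cannot score higher.
assert best_consistent <= free_score
```

With these numbers the best consistent design scores about 0.30 while the free combination scores about 0.855. Equality holds only when some single design is simultaneously best on every desideratum, which matches the nitpick that the general statement is "≤" and becomes "<" whenever the desiderata trade off against each other.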
Sorry, I probably should have been more clear about the "this is a quote from a longer dialogue, the missing context is important." I do think that the disagreement about "how relevant is this to 'actual disagreement'?" is basically the live thing, not whether or not you agree with the basic abstract point.
My current sense is that you're right that the thing you're doing is more specific than the general case (and one of the ways you can tell is the line of argumentation you give about chance of doom), and also Eliezer can still be correctly observing that you have too many free parameters (even if the number of free parameters is two instead of arbitrarily large). I think arguments about what you're selecting for either cash out in mechanistic algorithms, or they can deceive you in this particular way.
Or, to put this somewhat differently, in my view the basic abstract point implies that having one extra free parameter allows you to believe in a 5% chance of doom when in fact there's 100% chance of doom, and so in order to get estimations like that right this needs to be one of the basic principles shaping your thoughts, tho ofc your prior should come from many examples instead of ...
[I think there's a thing Eliezer does a lot, which I have mixed feelings about, which is matching people's statements to patterns and then responding to the generator of the pattern in Eliezer's head, which only sometimes corresponds to the generator in the other person's head.]
I want to add an additional meta-pattern – there was once a person who thought I had a particular bias. They'd go around telling me "Ray, you're exhibiting that bias right now. Whatever rationalization you're coming up with right now, it's not the real reason you're arguing X." And I was like "c'mon man. I have a ton of introspective access to myself and I can tell that this 'rationalization' is actually a pretty good reason to believe X and I trust that my reasoning process is real."
But... eventually I realized I just actually had two motivations going on. When I introspected, I was running a check for a positive result on "is Ray displaying rational thought?". When they extrospected me (i.e. reading my facial expressions), they were checking for a positive result on "does Ray seem biased in this particular way?".
And both checks totally returned 'true', and that was an accurate assessment.
The partic...
(For object-level responses, see comments on parallel threads.)
I want to push back on an implicit framing in lines like:
there's some value to more people thinking thru / shooting down their own edge cases [...], instead of pushing the work to Eliezer.
people aren't updating on the meta-level point and continue to attempt 'rolling their own crypto', asking if Eliezer can poke the hole in this new procedure
This makes it sound like the rest of us don't try to break our proposals, push the work to Eliezer, agree with Eliezer when he finds a problem, and then not update that maybe future proposals will have problems.
Whereas in reality, I try to break my proposals, don't agree with Eliezer's diagnoses of the problems, and usually don't ask Eliezer because I don't expect his answer to be useful to me (and previously didn't expect him to respond). I expect this is true of others (like Paul and Richard) as well.
But also my sense is that there's some deep benefit from "having mainlines" and conversations that are mostly 'sentences-on-mainline'?
I agree with this. Or, if you feel ~evenly split between two options, have two mainlines and focus a bunch on those (including picking at cruxes and revising your mainline view over time).
But:
Like, it feels to me like Eliezer was generating sentences on his mainline, and Richard was responding with 'since you're being overly pessimistic, I will be overly optimistic to balance', with no attempt to have his response match his own mainline.
I do note that there are some situations where rushing to tell a 'mainline story' might be the wrong move:
These conversations are great and I really admire the transparency. It's really nice to see discussions that normally happen in private happen instead in public where everyone can reflect, give feedback, and improve their own thoughts. On the other hand, the conversations combined come to a decent-sized novel - LW says 198,846 words! Is anyone considering investing heavily in summarizing the content so that people can get involved without having to read all of it?
Echoing that I loved these conversations and I'm super grateful to everyone who participated — especially Richard, Paul, Eliezer, Nate, Ajeya, Carl, Rohin, and Jaan, who contributed a lot.
I don't plan to try to summarize the discussions or distill key take-aways myself (other than the extremely cursory job I did on https://intelligence.org/late-2021-miri-conversations/), but I'm very keen on seeing others attempt that, especially as part of a process to figure out their own models and do some evaluative work.
I think I'd rather see partial summaries/responses that go deep, instead of a more exhaustive but shallow summary; and I'd rather see summaries that center the author's own view (what's your personal take-away? what are your objections? which things were small versus large updates? etc.) over something that tries to be maximally objective and impersonal. But all the options seem good to me.
Question for Richard, Paul, and/or Rohin: What's a story, full of implausibly concrete details but nevertheless a member of some largish plausible-to-you cluster of possible outcomes, in which things go well? (Paying particular attention to how early AGI systems are deployed and to what purposes, or how catastrophic deployments are otherwise forestalled.)
I wrote this doc a couple of years ago (while I was at CHAI). It's got many rough edges (I think I wrote it in one sitting and never bothered to rewrite it to make it better), but I still endorse the general gist, if we're talking about what systems are being deployed to do and what happens amongst organizations. It doesn't totally answer your question (it's more focused on what happens before we get systems that could kill everyone), but it seems pretty related.
(I haven't brought it up before because it seems to me like the disagreement is much more in the "mechanisms underlying intelligence", which that doc barely talks about, and the stuff it does say feels pretty outdated; I'd say different things now.)
Eliezer and Nate, my guess is that most of your perspective on the alignment problem for the past several years has come from the thinking and explorations you've personally done, rather than reading work done by others.
But, if you have read interesting work by others that's changed your mind or given you helpful insights, what has it been? Some old CS textbook? Random Gwern articles? An economics textbook? Playing around yourself with ML systems?
One thing in the posts I found surprising was Eliezer's assertion that you needed a dangerous superintelligence to get nanotech. If the AI is expected to do everything itself, including inventing the concept of nanotech, I agree that this is dangerously superintelligent.
However, suppose Alpha Quantum can reliably approximate the behaviour of almost any particle configuration. Not literally any, it can't run a quantum computer factorizing large numbers better than factoring algorithms, but enough to design a nanomachine. (It has been trained to approximate the ground truth of quantum mechanics equations, and it does this very well.)
For example, you could use IDA, start training to imitate a simulation of a handful of particles, then compose several smaller nets into one large one.
Add a nice user interface and we can drag and drop atoms.
You can add optimization, gradient descent trying to maximize the efficiency of a motor, or minimize the size of a logic gate. All of this is optimised to fit a simple equation, so assuming you don't have smart general mesaoptimizers forming, and deducing how to manipulate humans based on very little info about humans, you shoul...
I wrote Consequentialism & Corrigibility shortly after and partly in response to the first (Ngo-Yudkowsky) discussion. If anyone has an argument or belief that the general architecture / approach I have in mind (see the “My corrigibility proposal sketch” section) is fundamentally doomed as a path to corrigibility and capability—as opposed to merely “reliant on solving lots of hard-but-not-necessarily-impossible open problems”—I'd be interested to hear it. Thanks in advance. :)
After reading some of the newer MIRI dialogues, I'm less convinced than I once was that I know what "corrigibility" actually is. Could you say a few words about what kind of behavior you concretely expect to see from a "corrigible" agent, followed by how [you expect] those behaviors [to] fit into the "trajectory-constraining" framework you propose in your post?
EDIT: This is not purely a question for Steven, incidentally (or at least, the first half isn't); anyone else who wants to take a shot at answering should feel free to do so. In particular I'd be interested in hearing answers from Eliezer or anyone else historically involved in the invention of the term.
Question for anyone, but particularly interested in hearing from Christiano, Shah, or Ngo: any thoughts on what happens when alignment schemes that worked in lower-capability regimes fail to generalize to higher-capability regimes?
For example, you could imagine a spectrum of outcomes from "no generalization" (illustrative example: galaxies tiled with paperclips) to "some generalization" (illustrative example: galaxies tiled with "hedonium" human-ish happiness-brainware) to "enough generalization that existing humans recognizably survive, but something still went wrong from our current perspective" (illustrative examples: "Failed Utopia #4-2", Friendship Is Optimal, "With Folded Hands"). Given that not every biological civilization solves the problem, what does the rest of the multiverse look like? (How is measure distributed on something like my example spectrum, or whatever I should have typed instead?)
(Previous work: Yudkowsky 2009 "Value Is Fragile", Christiano 2018 "When Is Unaligned AI Morally Valuable?", Grace 2019 "But Exactly How Complex and Fragile?".)
When alignment schemes fail to scale, I think it typically means that they work while the system is unable to overpower/outsmart the oversight process, and then break down when the system becomes able to do so. I think that this usually results in the AI shifting from behavior that is mostly constrained by the training process to behavior that is mostly unconstrained (once they effectively disempower humans).
I think the results are relatively unlikely to be good in virtue of "the AI internalized something about our values, just not everything", and I'm pretty skeptical of recognizable "near miss" scenarios rather than AI gradually careening in very hard-to-predict directions with minimal connection with the surface features of the training process.
Overall I think that the most likely outcome is a universe that is orthogonal to anything we directly care about, maybe with a vaguely similar flavor owing to convergence depending on how AI motivations shake out. (But likely not close enough to feel great, and quite plausibly with almost no visible relation. Probably much more different from us than we are from aliens.)
I think it's fairly plausible that the results are OK just beca...
Basically agree with Paul, and I especially want to note that I've barely thought about it and so this would likely change a ton with more information. To put some numbers of my own:
These are from my own perspective of what these categories mean, which I expect are pretty different from yours -- e.g. maybe I'm at ~2% that upon reflection I'd decide that hedonium is great and so that's actually perfect generalization; in the last category I include lots of worlds that I wouldn't describe as "existing humans recognizably survive", e.g. we decide to become digital uploads, then get lots of cognitive enhancements, throw away a bunch of evolutionary baggage, but also we never expand to the stars because AI has taken control of it and given us only Earth.
I think the biggest avenues for improving the answers would be to reflect more on the kindness + cooperation and acausal trade stories Paul mentions, as well as the possibility that a few AIs end up generalizing close to correctly and working ...
I finished reading all the conversations a few hours ago. I have no follow-up questions (except maybe "now what?"), I'm still updating from all those words.
One excerpt in particular, from the latest post, jumped out at me (from Eliezer Yudkowsky, emphasis mine):
This is not aimed particularly at you, but I hope the reader may understand something of why Eliezer Yudkowsky goes about sounding so gloomy all the time about other people's prospects for noticing what will kill them, by themselves, without Eliezer constantly hovering over their shoulder every minute prompting them with almost all of the answer.
The past years of reading about alignment have left me with an intense initial distrust of any alignment research agenda. Maybe it's ordinary paranoia, maybe something more. I've not come up with any new ideas myself, and I'm not particularly confident in my ability to find flaws in someone else's proposal (what if I'm not smart enough to understand them properly? What if I make things even more confused and waste everyone's time?)
After thousands and thousands of lengthy conversations where it takes everyone ages to understand where threat models disagree, why some avenue of research is p...
Not sure if this is the right place to ask, instead of just googling it, but anyway: does anyone know what the current state of AI security practices is at DeepMind, OpenAI and other such places? Like, did they estimate the probability of GPT-3 killing everyone before turning it on, do they have procedures for not turning something on, did they test these procedures by having someone impersonate an unaligned GPT and try to manipulate researchers, things like that?
Questions about the standard-university-textbook from the future that tells us how to build an AGI. I'll take answers on any of these!
I'm going to try and write a table of contents for the textbook, just because it seems like a fun exercise.
Epistemic status: unbridled speculation
Volume I: Foundation
Part I: Statistical Learning Theory
Part II: Computational Learning Theory
Part III: Universal Priors
I don't think there is an "AGI textbook" any more than there is an "industrialization textbook." There are lots of books about general principles and useful kinds of machines. That said, if I had to make wild guesses about roughly what that future understanding would look like:
Eliezer, do you have any advice for someone wanting to enter this research space at (from your perspective) the eleventh hour? I’ve just finished a BS in math and am starting a PhD in CS, but I still don’t feel like I have the technical skills to grapple with these issues, and probably won’t for a few years. What are the most plausible routes for someone like me to make a difference in alignment, if any?
I don't have any such advice at the moment. It's not clear to me what makes a difference at this point.
We'd absolutely pay him if he showed up and said he wanted to work on the problem. Every time I've asked about trying anything like this, all the advisors claim that you cannot pay people at the Terry Tao level to work on problems that don't interest them. We have already extensively verified that it doesn't particularly work for eg university professors.
Every time I've asked about trying anything like this, all the advisors claim that you cannot pay people at the Terry Tao level to work on problems that don't interest them.
As I am sure you would agree, von Neumann/Tao-level people are a very different breed from even very, very, very good professors. It is plausible they are significantly more sane than the average genius.
Given the enormous glut of money in EA trying to help here and the terrifying thing where a lot of the people who matter have really short timelines, I think it is worth testing this empirically with Tao himself and Tao-level people.
It is worth noting that von Neumann occasionally did contract work for extraordinary sums.
I'm not sure whether the unspoken context of this comment is "We tried to hire Terry Tao and he declined, citing lack of interest in AI alignment" vs "we assume, based on not having been contacted by Terry Tao, that he is not interested in AI alignment."
If the latter: the implicit assumption seems to be that if Terry Tao would find AI alignment to be an interesting project, we should strongly expect him to both know about it and have approached MIRI regarding it, neither of which seems particularly likely given the low public profile of both AI alignment in general and MIRI in particular.
If the former: bummer.
With the release of Rohin Shah and Eliezer Yudkowsky's conversation, the Late 2021 MIRI Conversations sequence is now complete.
This post is intended as a generalized comment section for discussing the whole sequence, now that it's finished. Feel free to:
In particular, Eliezer Yudkowsky, Richard Ngo, Paul Christiano, Nate Soares, and Rohin Shah expressed active interest in receiving follow-up questions here. The Schelling time when they're likeliest to be answering questions is Wednesday March 2, though they may participate on other days too.