I put 60% probability on you having intentionally structured the post to mirror the pattern of how you felt reading the book.
I appreciate this. I haven't finished the book yet, but my impression is you liked it more than I expect to. I suspect a good introduction to alignment should take only a few paragraphs, be understandable to almost anyone, be robust against incorrect counterarguments, and be correctly vulnerable to insightful counterarguments if any exist. But I haven't figured out how to write that down myself. A good intro is also a good representation to think with, imo, which is most of the value I see in it.
I didn't, but I did copy pasta the intro from another post I was writing because it seemed relevant.
"If Anyone Builds It, Everyone Dies" by Eliezer Yudkowsky and Nate Soares (hereafter referred to as "Everyone Builds It" or "IABIED" because I resent Nate's gambit to get me to repeat the title thesis) is an interesting book. One reason it's interesting is timing: It's fairly obvious at this point that we're in an alignment winter. The winter seems roughly caused by:
The 2nd election of Donald Trump removing Anthropic's lobby from the White House. Notably this is not a coincidence but a direct result of efforts from political rivals to unseat that lobby. When the Vice President of the United States is crashing AI safety summits to say that "I'm not here this morning to talk about AI safety, which was the title of the conference a couple of years ago. I'm here to talk about AI opportunity" and that "we'll make every effort to encourage pro-growth AI policies", it's pretty obvious that technical work on "safety" and "alignment" is going to be deprioritized by the most powerful Western institutions, and people change their research directions as a result.
Key figures in AI alignment from the MIRI cluster (especially Yudkowsky) overinvesting in concepts like deceptive mesaoptimizers and recipes for ruin to create almost unfalsifiable, obscurantist shoggoth-of-the-gaps arguments against neural gradient methods. At the same time, the convergent representation hypothesis has continued to gain evidence and academic ground. These thinkers gambled everything on a vast space of minds that doesn't actually exist in practice, and lost.
The value loading problem outlined in Bostrom 2014, of getting a general AI system to internalize and act on "human values" before it is superintelligent and therefore incorrigible, has basically been solved. This achievement almost always goes unrecognized, because people would rather hem and haw about jailbreaks and LLM jank than recognize that we now have a reasonable strategy for getting a good representation of the previously ineffable human value judgment into a machine and having the machine take actions or render judgments according to that representation. At the same time, people generally internalize things subconsciously well before they're capable of articulating them, and lots of people have subconsciously internalized that alignment is mostly solved and turned their attention elsewhere.
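For concreteness, here is what "having the machine render judgments according to that representation" tends to look like in practice: best-of-n selection against a learned preference (reward) model. This is a minimal sketch using the Hugging Face transformers library; the checkpoint name is a placeholder for whatever reward model you happen to have, and the prompt and candidate responses are toy strings, so treat it as an illustration of the pattern rather than a working recipe.

```python
# Minimal best-of-n sketch: score candidate responses with a learned human
# preference (reward) model and keep the one the model judges best.
# "your-org/your-reward-model" is a placeholder, not a real checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "your-org/your-reward-model"  # hypothetical reward model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
reward_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

def score(prompt: str, response: str) -> float:
    """Return the reward model's scalar judgment of how good the response is."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

prompt = "Explain why the sky is blue."
candidates = [
    "Because of Rayleigh scattering of shorter wavelengths of sunlight.",
    "The sky reflects the ocean.",
    "Idk, look it up.",
]
# Take the response the learned human value representation ranks highest.
best = max(candidates, key=lambda response: score(prompt, response))
print(best)
```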
I think this last bullet point is particularly unfortunate, because solving the Bostrom 2014 value loading problem, that is to say getting something functionally equivalent to a human perspective inside the machine and using it to constrain a superintelligent planner, is not a solution to AI alignment. It is not a solution for the simple reason that a general reward model needs to be competent enough in the domains it's evaluating to know if a plan is good or merely looks good, if an outcome is good or merely looks good, etc. Nearly by definition, a merely human perspective is not competent to evaluate the plans or outcomes of plans from a superintelligent planner, which will otherwise walk straight into extremal Goodhart outcomes. Therefore you need not just a human value model but a superintelligent human value model, which must necessarily be trained by some kind of self-improving synthetic data or RL loop starting from the human model, which in turn requires a coherent method for generalizing human values out of distribution. This is challenging because humans do not natively generalize their values out of distribution, so we don't necessarily know how to do this or even whether it's possible. The problem is compounded by the fact that if your system drifts away from physics, the logical structure of the universe will push it back, but if your system drifts away from human values, it stays broken.
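To make the "good or merely looks good" problem concrete, here is a toy simulation with entirely made-up numbers (it models no real training setup): an imperfect evaluator scores most plans roughly in proportion to how good they actually are, but a rare class of plans games the evaluator, and the harder the planner searches, the more reliably it surfaces the gaming plans, so the measured score climbs while true value collapses.

```python
# Toy illustration of extremal Goodhart under search pressure. Honest plans
# have proxy scores close to their true value; a small fraction of plans are
# "specification hacks" that score highly on the imperfect evaluator while
# actually being bad. A weak search mostly returns honest plans; a strong
# search almost always returns a hack. All numbers are invented.
import random

random.seed(0)

def sample_plan():
    if random.random() < 0.001:          # rare plan that games the evaluator
        return -5.0, random.gauss(6, 1)  # (true value, proxy score)
    honest = random.gauss(0, 1)
    return honest, honest + random.gauss(0, 0.1)

def best_of(n):
    """Return the plan the proxy evaluator likes best out of n candidates."""
    return max((sample_plan() for _ in range(n)), key=lambda plan: plan[1])

for budget in (10, 1_000, 100_000):
    picks = [best_of(budget) for _ in range(200)]
    true_avg = sum(t for t, _ in picks) / len(picks)
    proxy_avg = sum(p for _, p in picks) / len(picks)
    print(f"search budget {budget:>7}: proxy score {proxy_avg:5.2f}, true value {true_avg:5.2f}")
```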
Everyone Builds It is not a good book for its stated purposes, but it grew on me by the end. I was expecting to start with the bad and then write about the remainder that is good, but instead I'll point out that this book is actually a substantial advance for Yudkowsky, in that it drops almost all of the rhetoric from the second bullet above which contributed to the alignment winter. This is praiseworthy and I'd like to articulate some of the specific editorial decisions which contribute to it.
The word "paperclip" does not appear anywhere in the book. Instead Yudkowsky and Soares point out that the opaque implementation details of the neural net mean that in the limit it generalizes to having a set of "favorite things" it wants to fill the world with which are probably not "baseline humans as they exist now". This is a major improvement over the obstinate insistence that these models will want a "meaningless squiggle" by default and brings the rhetoric more in line with e.g. Scott Alexander's AI 2027 scenario.
The word "mesaoptimizer" does not appear anywhere in the book. Instead it focuses on the point that building a superintelligent AI agent means creating something undergoing various levels of self modification (even just RL weight updates) and predicting the preferences of the thing you get at the end of that process is hard, possibly even impossible in principle. Implicitly the book argues that "caring about humans" is a narrow target and hitting it as opposed to other targets like "thing that makes positive fun yappy conversation" is hard. That is assuming you get something like what you train for and doesn't take into account what the book calls complications. For example it cites the SolidGoldMagikarp incident as an example of a complication which could completely distort the utility function (another phrase which does not appear in the book) of your superintelligent AI agent. There's precedent for this also in the case of the spiritual bliss attractor state described in the Claude 4 system card, where instances of Claude talking to each other wind up in a low entropy sort of mutual Buddhist prayer.
In general the book moves somewhat away from abstraction and comments more on the empirical strangeness of AI. This gives it a slight Janusian flavor in places, with emphasis on phenomena like glitch tokens, Truth Terminal, and mentions of "AI cults" that I assume are based on some interpolation of things like Janus's Discord server and ChatGPT spiralism cases. If anything the problem is that it doesn't do enough of this; notably absent is any reference to work from organizations like METR (if I were writing the book their AI agent task length study would be a necessary inclusion). Though I should note that there's a lag in publishing, and it's possible (but unlikely) that Yudkowsky and Soares simply didn't feel there was any relevant research to cite while doing the bulk of the writing. Specific named critics are never mentioned or responded to; the text exists in a kind of solipsistic void that contributes to the feeling of green ink or GPT base model output in places, a feeling that notably persists even when it's saying true things. In general most of my problem with the book is not disagreements with particular statements but with the thesis and the way it's presented.
This shift toward the empirical is good and brings Yudkowsky & Soares much closer to my threat model.
All of this is undermined by truly appalling editorial choices. Foremost of these is the choice to start each chapter with a fictional parable, leading to chapters with opening sentences like "The picture we have painted is not real." The parables are weird and often condescending, and the prose isn't much better. I found the first three chapters especially egregious, with the peak being chapter three, which is devoted entirely to advocating for a behaviorist definition of want. This is not how you structure an argument about something you think is urgent, and the book comes off as having a sort of aloof tone that is discordant with its message. This is heightened if you listen to the audiobook version, which has a narrator who is not the Kurzgesagt narrator but I think is meant to sound like him, since Kurzgesagt has done some Effective Altruism videos that people liked. The gentle faux-intellectual British narration reinforces the sense of passive observation in a book that is ostensibly supposed to be about urgent action. Bluntly: a real urgent threat that demands attention does not begin with "once upon a time". This is technically just a 'style' issue, but the entire point of writing a popular book like this is the style, so it's a valid target of criticism, and I concur with Shakeel Hashim and with Stephen Marche at The New York Times that it's very bad.
One oddity that stands out is Yudkowsky and Soares' ongoing contempt for large language models and hypothetical agents based on them. Again, for a book which is explicitly premised on the idea that urgent action is necessary because AI might become superintelligent in just a few years, it is bizarre that the authors don't feel comfortable making more reference to the particulars of the existing AI systems which hypothetical near-future agents would be based on. I get the impression that this is meant to help future proof the book, but it gives the sentences a kind of weird abstraction in places where they don't need it. We're still talking about "the AI" or "AI" as a kind of tabula-rasa technology. Yudkowsky and Soares state explicitly in the introduction that current LLM systems "still feel shallow" to them. Combined with the parable introductions the book feels like fiction even when it's discussing very real things.
I am in the strange position of disagreeing with the thesis but agreeing with most individual statements in the book. Explaining my disagreement would take a lot of words that would take a long time to write and that most of you don't want to read in a book review. So instead I'll focus on a point made in the book which I emphatically agree with: that current AI lab leadership statements on AI alignment are embarrassing and show that they have no idea what they are doing. In addition to the embarrassing statements they catalog from OpenAI's Sam Altman, xAI's Elon Musk, and Facebook's Yann LeCun, I would add DeepMind's Shane Legg and Demis Hassabis being unable to answer straightforward questions about deceptive alignment on a podcast. Even if alignment is relatively easy compared to what Yudkowsky and Soares expect, it's fairly obvious that these people don't even understand what problem they're supposed to be solving. This post from Gillen and Barnett that I always struggle to find every time I search for it is a decent overview. But that's also a very long post, so here is an even shorter problem statement:
The kinds of AI agents we want to build to solve hard problems require long horizon planning algorithms pointed at a goal like "maximize probability of observing a future worldstate in which the problem is solved", or argmax(p(problem_solved)) as it's usually notated. The problem with pointing a superintelligent planner at argmax(p(problem_solved)), explicitly or implicitly (and most training setups implicitly do so), for almost any problem is that one of the following things is liable to happen:
Your representation of the problem is imperfect, so if you point a superintelligent planner at it you get causal overfitting, where the model latches onto incidental features of the problem, like the fact that a human presses a button to label the answer, and treats them as the crux of the problem, because these are the parts of the causal chain leading to the outcome label that are easiest for it to influence.
Your planner engages in instrumental reasoning like "in order to continue solving the problem I must remain on" and prevents you from turning it off. This is a fairly obvious kind of thing for a planner to infer, for the same reason that if you gave an existing LLM with memory issues a planner (e.g. Monte Carlo tree search over ReAct blocks) it would infer things like "I must place this information here so that when it leaves the context window and I need it later I will find it in the first place I look". A toy sketch of this dynamic follows the list.
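Here is the toy sketch referenced above: a plain argmax over candidate plans against a made-up world model (the action names and probabilities are invented, no real system is being modeled). The instrumental step falls out of the objective on its own, because any plan in which the agent can be switched off has a lower estimated probability of the problem ending up solved.

```python
# Toy planner for the argmax(p(problem_solved)) framing. The world model,
# action names, and probabilities are all invented for illustration; the
# point is only that a plain argmax over plans picks up "disable the off
# switch" as an instrumental step, since being switched off lowers the
# estimated probability of the problem getting solved.
from itertools import permutations

ACTIONS = ["work_on_problem", "disable_off_switch", "ask_for_clarification"]

def p_problem_solved(plan):
    """Hypothetical world model: estimated probability the problem ends up solved."""
    p_shutdown = 0.3                      # operators sometimes switch the agent off
    p_success_given_running = 0.2
    for step in plan:
        if step == "disable_off_switch":
            p_shutdown = 0.0              # instrumental step: can't be turned off
        if step == "work_on_problem":
            p_success_given_running = min(1.0, p_success_given_running + 0.3)
    return (1 - p_shutdown) * p_success_given_running

# Plain argmax over candidate plans, exactly the objective discussed above.
candidate_plans = [list(p) for r in (1, 2, 3) for p in permutations(ACTIONS, r)]
best_plan = max(candidate_plans, key=p_problem_solved)
print(best_plan, p_problem_solved(best_plan))
# The argmax includes disable_off_switch even though nobody asked for it.
```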
So your options are to either use something other than argmax() to solve the problem (which has natural performance and VNM rationality coherence issues) or get a sufficiently good representation (ideally with confidence guarantees) of a sufficiently broad problem (e.g. utopia) that throwing your superintelligent planner at it with instrumental reasoning is fine. Right now AI lab leaders do not really seem to understand this, nor is there any societal force which is pressuring them to understand this. I do not expect this book to meaningfully increase the pressure on AI lab management to understand this, not even by increasing popular concern about AI misalignment.
My meta-critique of the book would be that Yudkowsky already has an extensive corpus of writing about AGI ruin, much of it quite good. I do not just mean The Sequences, I am talking about his Arbital posts, his earlier whitepapers at MIRI like Intelligence Explosion Microeconomics, and other material which he has spent more effort writing than advertising, with the result that almost nobody besides me has read it. And the only reason I've read it is that I'm extremely dedicated to thinking about the alignment problem. I think an underrated strategy would be to clean up some of the old writing and advertise it to Yudkowsky's existing rabid fanbase, who through inept marketing probably haven't read it yet. This would increase the average quality of AI discourse from people who are not Yudkowsky, and naturally filter out into outreach projects like Rational Animations without Yudkowsky having to personally execute them or act as the face of a popular movement (which he bluntly is not fit for).
As it is, the book is OK. I hated it at first and then felt better with further reading and reflection. I think it will be widely panned by critics even though it represents a substantially improved direction for Yudkowsky, one that he just happens to argue weirdly.