Eliezer Yudkowsky

Comments

I am denying that superintelligences play this game in a way that looks like "Pick an ordinal to be your level of sophistication, and whoever picks the higher ordinal gets $9."  I expect sufficiently smart agents to play this game in a way that doesn't incentivize attempts by the opponent to be more sophisticated than you, nor will you find yourself incentivized to try to exploit an opponent by being more sophisticated than them, provided that both parties have the minimum level of sophistication to be that smart.

If faced with an opponent stupid enough to play the ordinal game, of course, you just refuse all offers less than $9, and they find that there's no ordinal level of sophistication they can pick which makes you behave otherwise.  Sucks to be them!

You have misunderstood (1) the point this post was trying to communicate and (2) the structure of the larger argument where that point appears, as follows:

First, let's talk about (2), the larger argument that this post's point was supposed to be relevant to.

Is the larger argument that superintelligences will misunderstand what we really meant, due to a lack of knowledge about humans?

It is incredibly unlikely that Eliezer Yudkowsky in particular would have constructed an argument like this, whether in 2007, 2017, or even 1997.  At all of these points in my life, I visibly held quite a lot of respect for the epistemic prowess of superintelligences.  They were always going to know everything relevant about the complexities of human preference and desire.  The larger argument is about whether it's easy to make superintelligences end up caring.

This post isn't about the distinction between knowing and caring, to be clear; that's something I tried to cover elsewhere.  The relevant central divide falls in roughly the same conceptual place as Hume's Guillotine between 'is' and 'ought', or the difference between the belief function and the utility function.

(I don't see myself as having managed to reliably communicate this concept (though the central idea is old indeed within philosophy) to the field that now sometimes calls itself "AI alignment"; so if you understand this distinction yourself, you should not assume that any particular commentary within "AI alignment" is written from a place of understanding it too.)

What this post is about is the amount of information-theoretic complexity that you need to get into the system's preferences, in order to have that system, given unlimited or rather extremely large amounts of power, deliver to you what you want.

It doesn't argue that superintelligences will not know this information.  You'll note that the central technology in the parable isn't an AI; it's an Outcome Pump.

What it says, rather, is that there might be, say, a few tens of thousands of bits -- the exact number is not easy to estimate, we just need to know that it's more than a hundred bits and less than a billion bits and anything in that range is approximately the same problem from our standpoint -- that you need to get into the steering function.  If you understand the Central Divide that Hume's Guillotine points to, the distinction between probability and preference, etcetera, the post is trying to establish the idea that we need to get 13,333 bits or whatever into the second side of this divide.

In terms of where this point falls within the larger argument, this post is not saying that it's particularly difficult to get those 13,333 bits into the preference function; for all this post tries to say, locally, maybe that's as easy as having humans manually enter 13,333 yes-or-no answers into the system.  It's not talking about the difficulty of doing the work but rather the amount and nature of a kind of work that needs to be done somehow.

Definitely, the post does not say that it's hard to get those 13,333 bits into the belief function or knowledge of a superintelligence.

Separately from understanding correctly what this post is trying to communicate, at all, in 2007, there's the question of whether modern LLMs have anything to say about -- obviously not the post's original point -- but rather, other steps of the larger question in which this post's point appears.

Modern LLMs, if you present them with a text-based story like the one in this parable, are able to answer at least some text-based questions about whether you'd prefer your grandmother to be outside the building or be safely outside the building.  Let's admit this premised observation at face value.  Have we learned thereby the conclusion that it's easy to get all of that information into a superintelligence's preference function?

And if we say "No", is this Eliezer making up post-hoc excuses?

What exactly we learn from the evidence of how AI has played out in 2024 so far, is the sort of thing that deserves its own post.  But I observe that if you'd asked Eliezer-2007 whether an (Earth-originating) superintelligence could correctly predict the human response pattern about what to do with the grandmother -- solve the same task LLMs are solving, to at least the LLM's performance level -- Eliezer-2007 would have unhesitatingly answered "yes" and indeed "OBVIOUSLY yes".

How is this coherent?  Because the post's point is about how much information needs to get into the preference function.  To predict a human response pattern you need (only) epistemic knowledge.  This is part of why the post is about needing to give specifications to an Outcome Pump, rather than it depicting an AI being surprised by its continually incorrect predictions about a human response pattern.
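
A toy sketch of this distinction, under invented assumptions: the `Outcome` type, the three outcomes, their probabilities, and every function name below are hypothetical illustrations and do not come from the original post.  It only shows that an accurate predictor of the human response can exist in the system without affecting what the pump selects, until that information is placed in the steering criterion.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Outcome:
    description: str
    probability: float          # invented number, for illustration only
    grandmother_outside: bool
    grandmother_safe: bool


# Hypothetical outcome space; the probabilities are made up.
OUTCOMES: List[Outcome] = [
    Outcome("she stays inside the burning building", 0.60, False, False),
    Outcome("an explosion throws her body out of the building", 0.30, True, False),
    Outcome("a firefighter carries her out", 0.10, True, True),
]


def predict_human_approval(o: Outcome) -> bool:
    """Belief side: an accurate prediction of what the wisher would endorse."""
    return o.grandmother_outside and o.grandmother_safe


def literal_wish(o: Outcome) -> bool:
    """Preference side: the criterion actually handed to the pump."""
    return o.grandmother_outside


def outcome_pump(outcomes: List[Outcome],
                 steering: Callable[[Outcome], bool]) -> Outcome:
    """Return the most probable outcome that satisfies the steering criterion."""
    allowed = [o for o in outcomes if steering(o)]
    return max(allowed, key=lambda o: o.probability)


# The pump never consults the predictor unless it is wired into `steering`.
chosen = outcome_pump(OUTCOMES, literal_wish)
print(chosen.description)              # an explosion throws her body out of the building
print(predict_human_approval(chosen))  # False: the prediction is correct and changes nothing

# Only when the same information sits on the preference side does it steer.
steered = outcome_pump(OUTCOMES, predict_human_approval)
print(steered.description)             # a firefighter carries her out
```

The sketch makes no claim about how hard it is to move the content of `predict_human_approval` into the steering criterion of a real system; it only marks which side of the divide the information has to end up on.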

If you don't see any important distinction between the two, then of course you'll think that it's incoherent to talk about that distinction.  But even if you think that Hume was mistaken about there existing any sort of interesting gap between 'is' and 'ought', you might by some act of empathy be able to imagine that other people think there's an interesting subject matter there, and they are trying to talk about it with you; otherwise you will just flatly misunderstand what they were trying to say, and mispredict their future utterances.  There's a difference between disagreeing with a point, and just flatly failing to get it, and hopefully you aspire to the first state of mind rather than the second.

Have we learned anything stunningly hopeful from modern pre-AGIs getting down part of the epistemic part of the problem at their current ability levels, to the kind of resolution that this post talked about in 2007?  Or from it being possible to cajole pre-AGIs with loss functions into willingly using that knowledge to predict human text outputs?  Some people think that this teaches us that alignment is hugely easy.  I think they are mistaken, but that would take its own post to talk about.

But people who point to "The Hidden Complexity of Wishes" and say of it that it shows that I had a view which the current evidence already falsifies -- that I predicted that no AGI would ever be able to predict human response patterns about getting grandmothers out of burning buildings -- have simply: misunderstood what the post is about, not understood in particular why the post is about an Outcome Pump rather than an AI stupidly mispredicting human responses, and failed to pick up on the central point that Eliezer expects superintelligences to be smart in the sense of making excellent purely epistemic predictions.

This deserves a longer answer than I have time to allocate to it, but I quickly remark that I don't recognize the philosophy or paradigm of updatelessness as refusing to learn things or being terrified of information; a rational agent should never end up in that circumstance, unless some perverse other agent is specifically punishing them for having learned the information (and will lose some of their own value thereby; it shouldn't be possible for them to gain value by behaving "perversely" in that way, for then of course it's not "perverse").  Updatelessness is, indeed, exactly that sort of thinking which prevents you from being harmed by information, because your updateless exposure to information doesn't cause you to lose coordination with your counterfactual other selves or exhibit dynamic inconsistency with your past self.

From an updateless standpoint, "learning" is just the process of reacting to new information the way your past self would want you to do in that branch of possibility-space; you should never need to remain ignorant of anything.  Maybe that involves not doing the thing that would then be optimal when considering only the branch of reality you turned out to be inside, but the updateless mind denies that this was ever the principle of rational choice, and so feels no need to stay ignorant in order to maintain dynamic consistency.

They can solve it however they like, once they're past the point of expecting things to work that sometimes don't work.  I have guesses but any group that still needs my hints should wait and augment harder.

I disagree with being characterized as thinking that the problems can be solved on paper, and with the name "Poet".  I think the problems can't be solved by twiddling systems weak enough to be passively safe, and hoping their behavior generalizes up to dangerous levels.  I don't think paper solutions will work either, and humanity needs to back off and augment intelligence before proceeding.  I take the position that we need a global shutdown of this research field not because I think that guessing stuff without trying it is easy, but because guessing it even with some safe weak lesser tries is still impossibly hard.  My message to humanity is "back off and augment" not "back off and solve it with a clever theory".

Not what comes up for me, when I go incognito and google AI risk lesswrong.

I rather expect that existing robotic machinery could be controlled by ASI rather than "moderately smart intelligence" into picking up the pieces of a world economy after it collapses, or that if for some weird reason it was trying to play around with static-cling spaghetti it could pick up the pieces of the economy that way too.

It's false that currently existing robotic machinery controlled by moderately smart intelligence can pick up the pieces of a world economy after it collapses.  One well-directed algae cell could, but not existing robots controlled by moderate intelligence.

What does this operationalize as?  Presumably not that if we load a bone and a diamond rod under equal pressures, the diamond rod breaks first?  Is it more about if we drop sudden sharp weights onto a bone rod and a diamond rod, the diamond rod breaks first?  I admit I hadn't expected that, despite a general notion that diamond is crystal and crystals are unexpectedly fragile against particular kinds of hits, and if so that modifies my sense of what's a valid metaphor to use.

"Pandemics" aren't a locally valid substitute step in my own larger argument, because an ASI needs its own manufacturing infrastructure before it makes sense for the ASI to kill the humans currently keeping its computers turned on.  So things that kill a bunch of humans are not a valid substitute for being able to, eg, take over and repurpose the existing solar-powered micron-diameter self-replicating factory systems, aka algae, and those repurposed algae being able to build enough computing substrate to go on running the ASI after the humans die.

It's possible this argument can and should be carried without talking about the level above biology, but I'm nervous that this causes people to start thinking in terms of Hollywood movie plots about defeating pandemics and hunting down the AI's hidden cave of shoggoths, rather than hearing, "And this is a lower bound but actually in real life you just fall over dead."
