Eliezer Yudkowsky


Comments

I think that the AI's internal ontology is liable to have some noticeable alignments to human ontology w/r/t the purely predictive aspects of the natural world; it wouldn't surprise me to find distinct thoughts in there about electrons.  As the internal ontology becomes more about affordances and actions, I expect to find increasing disalignment.  As the internal ontology takes on any reflective aspects -- parts of the representation that mix with facts about the AI's internals -- I expect to find much larger differences: not just that the AI has a different concept boundary around "easy to understand", say, but that it maybe doesn't have any such internal notion as "easy to understand" at all, because easiness isn't in the environment and the AI doesn't have any such thing as "effort".  Maybe it's got categories around yieldingness to seven different categories of methods, and/or some general notion of "can predict at all / can't predict at all", but no general notion that maps onto human "easy to understand" -- though "easy to understand" is plausibly general enough that I wouldn't be surprised to find a mapping after all.

Corrigibility and actual human values are both heavily reflective concepts.  If you master a requisite level of the prerequisite skill of noticing when a concept definition has a step where its boundary depends on your own internals rather than pure facts about the environment -- which of course most people can't do, because they project the category boundary onto the environment, though I have some credence that John Wentworth might be able to do it somewhat -- and then you start mapping out concept definitions for corrigibility or values or, god help you, CEV, that might help highlight where some of my concern about unnatural abstractions comes in.

 

Entirely separately, I have concerns about the ability of ML-based technology to robustly point the AI in any builder-intended direction whatsoever, even if there exists some not-too-large adequate mapping from that intended direction onto the AI's internal ontology at training time.  My guess is that more of the disagreement lies here.


What the main post is responding to is the argument:  "We're just training AIs to imitate human text, right, so that process can't make them get any smarter than the text they're imitating, right?  So AIs shouldn't learn abilities that humans don't have; because why would you need those abilities to learn to imitate humans?"  And to this the main post says, "Nope."

The main post is not arguing:  "If you abstract away the tasks humans evolved to solve, from human levels of performance at those tasks, the tasks AIs are being trained to solve are harder than those tasks in principle even if they were being solved perfectly."  I agree this is just false, and did not think my post said otherwise.

Unless I'm greatly misremembering, you did pick out what you said was your strongest item from Lethalities, separately from this, and I responded to it.  You'd just straightforwardly misunderstood my argument in that case, so it wasn't a long response, but I responded.  Asking for a second try is one thing, but I don't think it's cool to act like you never picked out any one item or I never responded to it.

EDIT: I'm misremembering, it was Quintin's strongest point about the Bankless podcast.  https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky?commentId=cr54ivfjndn6dxraD

If Quintin hasn't yelled "Empiricism!" then it's not about him.  This is more about (some) e/accs.

Wow, that's fucked up.

I am denying that superintelligences play this game in a way that looks like "Pick an ordinal to be your level of sophistication, and whoever picks the higher ordinal gets $9."  I expect sufficiently smart agents to play this game in a way that doesn't incentivize attempts by the opponent to be more sophisticated than you, nor will you find yourself incentivized to try to exploit an opponent by being more sophisticated than them, provided that both parties have the minimum level of sophistication to be that smart.

If faced with an opponent stupid enough to play the ordinal game, of course, you just refuse all offers less than $9, and they find that there's no ordinal level of sophistication they can pick which makes you behave otherwise.  Sucks to be them!
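As a concrete toy sketch (my own illustration; the $10 pot, the payoffs, and all names here are assumptions, not anything stated in the comment): a responder running a fixed threshold policy in a $10 ultimatum game, against which no amount of proposer "sophistication" earns more than $1.

```python
# Toy $10 ultimatum game: the responder runs a fixed threshold policy,
# so the proposer's chosen "level of sophistication" never enters the
# computation at all.

def responder_accepts(offer_to_responder: int, threshold: int = 9) -> bool:
    """Accept iff the offer meets the fixed $9 demand."""
    return offer_to_responder >= threshold

def proposer_payoff(offer_to_responder: int) -> int:
    """Proposer keeps the rest of the $10 pot if accepted, else gets $0."""
    return 10 - offer_to_responder if responder_accepts(offer_to_responder) else 0

# Whatever reasoning the proposer brings, its best response to this fixed
# policy is to offer exactly $9 and keep $1.
best_offer = max(range(11), key=proposer_payoff)
```

The design point is that the responder's policy is a constant function of the offer alone, so there is no ordinal the proposer can pick that changes the responder's behavior.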

You have misunderstood (1) the point this post was trying to communicate and (2) the structure of the larger argument where that point appears, as follows:

First, let's talk about (2), the larger argument that this post's point was supposed to be relevant to.

Is the larger argument that superintelligences will misunderstand what we really meant, due to a lack of knowledge about humans?

It is incredibly unlikely that Eliezer Yudkowsky in particular would have constructed an argument like this, whether in 2007, 2017, or even 1997.  At all of these points in my life, I visibly held quite a lot of respect for the epistemic prowess of superintelligences.  They were always going to know everything relevant about the complexities of human preference and desire.  The larger argument is about whether it's easy to make superintelligences end up caring.

This post isn't about the distinction between knowing and caring, to be clear; that's something I tried to cover elsewhere.  The relevant central divide falls in roughly the same conceptual place as Hume's Guillotine between 'is' and 'ought', or the difference between the belief function and the utility function.

(I don't see myself as having managed to reliably communicate this concept (though the central idea is old indeed within philosophy) to the field that now sometimes calls itself "AI alignment"; so if you understand this distinction yourself, you should not assume that any particular commentary within "AI alignment" is written from a place of understanding it too.)

What this post is about is the amount of information-theoretic complexity that you need to get into the system's preferences, in order to have that system, given unlimited or rather extremely large amounts of power, deliver to you what you want.

It doesn't argue that superintelligences will not know this information.  You'll note that the central technology in the parable isn't an AI; it's an Outcome Pump.

What it says, rather, is that there might be, say, a few tens of thousands of bits -- the exact number is not easy to estimate; we just need to know that it's more than a hundred bits and less than a billion bits, and anything in that range is approximately the same problem from our standpoint -- that you need to get into the steering function.  If you understand the Central Divide that Hume's Guillotine points to, the distinction between probability and preference, etcetera, the post is trying to establish the idea that we need to get 13,333 bits or whatever into the second side of this divide.

In terms of where this point falls within the larger argument, this post is not saying that it's particularly difficult to get those 13,333 bits into the preference function; for all this post tries to say, locally, maybe that's as easy as having humans manually enter 13,333 yes-or-no answers into the system.  It's not talking about the difficulty of doing the work but rather the amount and nature of a kind of work that needs to be done somehow.

Definitely, the post does not say that it's hard to get those 13,333 bits into the belief function or knowledge of a superintelligence.
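To make the arithmetic concrete, a back-of-envelope sketch (my own illustration; the candidate counts are assumptions chosen to echo the post's rough figures): k independent yes-or-no answers can single out one preference specification among 2^k candidates.

```python
# Minimal sketch: minimum whole bits (yes/no answers) needed to single out
# one option among num_candidates equally-likely candidates, i.e. ceil(log2).

def bits_to_single_out(num_candidates: int) -> int:
    """Number of yes/no answers needed to pick one option among many."""
    return (num_candidates - 1).bit_length()

# The post's figure: 13,333 yes/no answers distinguish among 2**13_333
# candidate preference specifications.
assert bits_to_single_out(2 ** 13_333) == 13_333
```

Using `int.bit_length` keeps everything in exact integer arithmetic even at these sizes; the post's stated range of ~1e2 to ~1e9 bits corresponds to candidate spaces from 2**100 up to 2**1_000_000_000 specifications.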

Separately from understanding correctly what this post is trying to communicate, at all, in 2007, there's the question of whether modern LLMs have anything to say about -- obviously not the post's original point -- but rather, other steps of the larger question in which this post's point appears.

Modern LLMs, if you present them with a text-based story like the one in this parable, are able to answer at least some text-based questions about whether you'd prefer your grandmother to be outside the building or be safely outside the building.  Let's admit this premised observation at face value.  Have we learned thereby the conclusion that it's easy to get all of that information into a superintelligence's preference function?

And if we say "No", is this Eliezer making up post-hoc excuses?

What exactly we learn from the evidence of how AI has played out in 2024 so far, is the sort of thing that deserves its own post.  But I observe that if you'd asked Eliezer-2007 whether an (Earth-originating) superintelligence could correctly predict the human response pattern about what to do with the grandmother -- solve the same task LLMs are solving, to at least the LLM's performance level -- Eliezer-2007 would have unhesitatingly answered "yes" and indeed "OBVIOUSLY yes".

How is this coherent?  Because the post's point is about how much information needs to get into the preference function.  To predict a human response pattern you need (only) epistemic knowledge.  This is part of why the post is about needing to give specifications to an Outcome Pump, rather than it depicting an AI being surprised by its continually incorrect predictions about a human response pattern.

If you don't see any important distinction between the two, then of course you'll think that it's incoherent to talk about that distinction.  But even if you think that Hume was mistaken about there existing any sort of interesting gap between 'is' and 'ought', you might by some act of empathy be able to imagine that other people think there's an interesting subject matter there, and they are trying to talk about it with you; otherwise you will just flatly misunderstand what they were trying to say, and mispredict their future utterances.  There's a difference between disagreeing with a point, and just flatly failing to get it, and hopefully you aspire to the first state of mind rather than the second.

Have we learned anything stunningly hopeful from modern pre-AGIs getting down part of the epistemic part of the problem at their current ability levels, to the kind of resolution that this post talked about in 2007?  Or from it being possible to cajole pre-AGIs with loss functions into willingly using that knowledge to predict human text outputs?  Some people think that this teaches us that alignment is hugely easy.  I think they are mistaken, but that would take its own post to talk about.

But people who point to "The Hidden Complexity of Wishes" and say of it that it shows that I had a view which the current evidence already falsifies -- that I predicted that no AGI would ever be able to predict human response patterns about getting grandmothers out of burning buildings -- have simply: misunderstood what the post is about, not understood in particular why the post is about an Outcome Pump rather than an AI stupidly mispredicting human responses, and failed to pick up on the central point that Eliezer expects superintelligences to be smart in the sense of making excellent purely epistemic predictions.

This deserves a longer answer than I have time to allocate to it, but I quickly remark that I don't recognize the philosophy or paradigm of updatelessness as refusing to learn things or being terrified of information.  A rational agent should never end up in that circumstance, unless some perverse other agent is specifically punishing them for having learned the information (and will lose some of their own value thereby; it shouldn't be possible for them to gain value by behaving "perversely" in that way, for then of course it's not "perverse").  Updatelessness is, indeed, exactly the sort of thinking which prevents you from being harmed by information, because your updateless exposure to information doesn't cause you to lose coordination with your counterfactual other selves or exhibit dynamic inconsistency with your past self.

From an updateless standpoint, "learning" is just the process of reacting to new information the way your past self would want you to do in that branch of possibility-space; you should never need to remain ignorant of anything.  Maybe that involves not doing the thing that would then be optimal when considering only the branch of reality you turned out to be inside, but the updateless mind denies that this was ever the principle of rational choice, and so feels no need to stay ignorant in order to maintain dynamic consistency.
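A minimal sketch of that framing (my own toy construction, not Yudkowsky's formalism; the coin, the $10/$1 payoffs, and all names are assumptions): the agent fixes a whole observation-to-action map in advance, maximizing expected value across both branches, and "learning" is just executing the pre-chosen reaction to whichever observation arrives.

```python
from itertools import product

# Toy two-branch world: a fair coin is flipped. Paying $1 in the "tails"
# branch sends $10 to your counterpart in the "heads" branch.

OBSERVATIONS = ("heads", "tails")
ACTIONS = ("pay", "refuse")

def branch_value(policy: dict, observation: str) -> float:
    """Payoff realized in one branch, given the full observation->action map."""
    pays_in_tails = policy["tails"] == "pay"
    if observation == "heads":
        return 10.0 if pays_in_tails else 0.0
    return -1.0 if pays_in_tails else 0.0

def ex_ante_value(policy: dict) -> float:
    """Expected value over the fair coin, evaluated before any observation."""
    return 0.5 * branch_value(policy, "heads") + 0.5 * branch_value(policy, "tails")

# The updateless move: choose the entire observation->action map up front.
policies = [dict(zip(OBSERVATIONS, acts)) for acts in product(ACTIONS, repeat=2)]
best = max(policies, key=ex_ante_value)
```

The ex-ante-best map pays in the tails branch (expected value 4.5 rather than 0), even though paying is locally worse once you know you're in tails; per-branch optimality was never the criterion this agent used.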

They can solve it however they like, once they're past the point of expecting things to work that sometimes don't work.  I have guesses but any group that still needs my hints should wait and augment harder.

I disagree with my characterization as thinking problems can be solved on paper, and with the name "Poet".  I think the problems can't be solved by twiddling systems weak enough to be passively safe, and hoping their behavior generalizes up to dangerous levels.  I don't think paper solutions will work either, and humanity needs to back off and augment intelligence before proceeding.  I do not take the position that we need a global shutdown of this research field because I think that guessing stuff without trying it is easy, but because guessing it even with some safe weak lesser tries is still impossibly hard.  My message to humanity is "back off and augment" not "back off and solve it with a clever theory".
