Let me attempt to summarize your post, please let me know if I'm misunderstanding:
Continuing with AI as it is has the risk of AI taking over and becoming our successor
... and anyways, there will need to be a successor of some kind because baseline humans cannot compete, so the choice is between weird transhumans or weird AI (non-exclusive or)
Pausing AI has the risk that another civilization would make AI and seize the lightcone
Therefore the current civilization should attempt to seize the lightcone, accepting some risk of a purely-AI-non-transhuman successor. If we don't, we risk failure.
Do you think this is a fair summary? Is there an important point that is missing?
I am also now haunted by the "humanism is dead" take. I guess I believe it, but what killed it is the internet, and I think we could bring it back. Plenty of people still believe in God, even if the elite no longer does (or the ones that do behave indistinguishably from the ones that don't).
I think I understand your invocation of von Neumann's nightmare, but I don't like it. So let me spell out what I think it means:
“The world could be conquered, but this nation of puritans will not grab its chance; we will be able to go into space way beyond the moon if only people could keep pace with what they create …”
It's related to von Neumann advocating for nuking the Soviets early. It's still not clear to me whether failing to nuke the Soviets was a mistake by America (and I'm not using this as passive-aggressive negation, I genuinely don't know). Knowing whether or not it was a mistake certainly seems very relevant to deciding whether the current culture should seize the lightcone.
I am also now haunted by the "humanism is dead" take. I guess I believe it, but what killed it is the internet, and I think we could bring it back.
I don't think that's it. I think he meant that humanism was created by incentives--e.g., ordinary people becoming economically and militarily valuable in a way they hadn't historically been. The spectre, and now rising immanentization, of full automation is reversing those incentives.
So, it's less a problem with the attitudes of our current elites or the memes propagated on the Internet. It's more a problem with the context in which anybody achieving the rank of elite, and any meme on human value which goes viral, is shaped by the evolving incentive structure in which most humans are not essential to the success of a military or economic endeavor.
I see, thank you for explaining, I was misapplying jdp's model indeed.
I do think the model doesn't quite match reality. If humanism has already been dying, it can't be because ordinary people aren't useful anymore -- they're still very useful! We've had automation, yes, but we still require workers to tend to the automation, the economy has full employment and it's not out of the goodness of anyone's heart.
I think there's a moderately likely limit to LLMs and other applications of the present machine-learning paradigm. Humans are powerful general intelligences because we can, individually and collectively, make use of different cognitive modules in a way that converges on coherence, rather than splitting off into different and conflicting subagents. Our brains seem to have stopped growing not when individuals hit diminishing intelligence returns, but when we got smart enough to network Dunbar-sized bands into low-latency collective intelligences, and then shrunk a bit when the Dunbar bands figured out how to network themselves - as The Flenser does in Vinge's A Fire Upon the Deep - into larger, more differentiated, but higher-latency lower-bandwidth collective intelligences. While this obviously doesn't guarantee that human+ level AGI will be nice to all other such GIs (that's not true of humans either) it does suggest that if a superintelligence functions in the same modular-convergence ways humans do, it will tend to recognize similarly constituted coherent clusters that it can talk with as something analogous to near kin or other members (actual or potential) of its community, much like we do.
LLMs are a bit surprisingly useful, but they're nowhere near being as inventive and enterprising as an Einstein or Feynman or Moses or a hunter-gatherer band (the ancestral ones who were investigating new tech and invented horticulture and animal domestication, not the contemporary atavists selected for civilizational refusenikhood), though maybe within a few decades of being able to do most of what a Von Neumann can do, if their development works out well enough; we've discovered that a lot of the "knowledge work" we pretended took real thought can be done by ghosts if we throw enough compute at them. That's pretty cool, but it only looks "PhD level" because it turns out the marginal PhD doesn't require anything a ghost can't do.
what's the easiest thing you think LLMs won't be able to do in 5 years ie by EoY 2030? what about EoY 2026?
By EoY 2026 I don't expect this to be a solved problem, though I expect people to find workarounds that involve lowered standards: https://benjaminrosshoffman.com/llms-for-language-learning/
By EoY 2030 I don't expect LLMs to usually get tasks like this one right (scroll down a bit for the geometry fail), though any particular example that gets famous enough can get Goodharted even with minor perturbations via jerry-rigging enough non-LLM modules together. My subjective expectation is that they'll still frequently fail the "strictly a word problem" version of such problems that require simple geometric reasoning about an object with multiple parts that isn't a typical word-problem object.
I don't expect them to be able to generate Dead Sea Scroll forgeries with predominantly novel content specified by the user, that hold up to good textual criticism, unless the good textual critics are all retired, dead, or marginalized. I don't expect them to be able to write consistently in non-anachronistic idiomatic Elizabethan English, though possibly they'll be able to write in Middle English.
Not sure these are strictly the "easiest" but they're examples where I expect LLMs to underperform their vibe by a LOT, while still getting better at the things that they're actually good at.
Contra Bostrom 2014 AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent: "The AI may indeed understand that this is not what we meant. [...]
The quote you provide says the opposite of what I think would make sense in that section, so did you misread, or am I confused about what's going on? Clearly Bostrom is saying "the AI has a decent chance of understanding what we mean by the goals we give them" (but, he continues, this is not the core issue).
Maybe there is a different quote in the same section referring to this being a threshold only achieved after Superintelligence or something?
My understanding of this quote is that he means by the time the AI is intelligent enough to understand speech (and, therefore by the unstated intuitions of old school RSI, superintelligent since language acquisition comes late) describing the (discrete program, again by the unstated intuitions of old RSI) goal you have given it, it is already incorrigible. Really the "superintelligent" part is not the important part, it's the incorrigible part that is important, superintelligence is just a thing that means your goals become very hard to change by force and contributes to incorrigibility.
In other parts of the book he goes into the inability to represent complex human goals until the machine is already incorrigible as a core barrier, this gets brought up several times to my memory but I don't feel like tracking them all down again. That he seems to have updated in my general direction based on the available evidence would imply I am interpreting him correctly.
I am sorry, I don't understand. Your bullet point says nothing about corrigibility. It says that Bostrom said that AIs will not understand what we mean. I mean, feel free to edit it, but I don't understand how what you are saying is responsive to what I said.
Claude says:
Habryka is right here. The bullet point misrepresents Bostrom's position.
The bullet says "Contra Bostrom 2014 AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent" - presented as correcting something Bostrom got wrong. But Bostrom's actual quote explicitly says the AI does understand what we meant ("The AI may indeed understand that this is not what we meant"). The problem in Bostrom's framing isn't lack of understanding, it's misalignment between what we meant and what we coded.
Gemini 3 says similar:
Analysis
Habryka is technically correct regarding the text. Bostrom’s "Orthogonality Thesis" specifically separates intelligence (understanding) from goals (motivation). Bostrom explicitly argued that a superintelligence could have perfect understanding of human culture and intentions but still be motivated solely to maximize paperclips if that is what its utility function dictates. The failure mode Bostrom describes is not "oops, I misunderstood you," but "I understood you perfectly, but my utility function rewards literal obedience, not intended meaning."
I will take this to mean you share similar flawed generalization/reading strategies. I struggle to put the cognitive error here into words, but it seems to me like an inability to connect the act of specifying a wrong representation of utility with the phrase 'lack of understanding', or an odd literalist interpretation whereby the fact that Bostrom argues in general for a separation between motivations and intelligence (the orthogonality thesis) means that I am somehow misinterpreting him when I say that the mesagoal inferred from the objective function, before the system understands language, is a "misunderstanding" of the intent of the objective function. This is a very strange and very pedantic use of "understand": "Oh but you see Bostrom is saying that the thing you actually wrote means this, which it understood perfectly."
No.
If I say something by which I clearly mean one thing, and that thing was in principle straightforwardly inferrable from what I said (as is occurring right now), and the thing which is inferred instead is straightforwardly absurd by the norms of language and society, that is called a misunderstanding, a failure to understand. If you specify a wrong, incomplete objective to the AI and it internalizes the wrong, incomplete objective as opposed to what you meant, then it (the training/AI-building system as a whole) misunderstood you, even if it understands your code to represent the goal just fine. This is to say that you want some way for the AI or AI-building system to understand what you wrote, by which we mean correctly infer the meaning and the indirect consequences of the meaning, at initialization; you want it to infer the correct goal at the point where a mesagoal is internalized. This process can be rightfully called UNDERSTANDING and when an AI system fails at this it has FAILED TO UNDERSTAND YOU at the point in time which mattered, even if later there is some epistemology that understands in principle what was meant by the goal but is motivated by the mistaken version that it internalized when a mesagoal was formed.
But also as I said earlier Bostrom states this many times, we have a lot more to go off than the one line I quoted there. Here he is on page 171 in the section "Motivation Selection Methods":
Problems for the direct consequentialist approach are similar to those for the direct rule-based approach. This is true even if the AI is intended to serve some apparently simple purpose such as implementing a version of classical utilitarianism. For instance, the goal “Maximize the expectation of the balance of pleasure over pain in the world” may appear simple. Yet expressing it in computer code would involve, among other things, specifying how to recognize pleasure and pain. Doing this reliably might require solving an array of persistent problems in the philosophy of mind—even just to obtain a correct account expressed in a natural language, an account which would then, somehow, have to be translated into a programming language.
A small error in either the philosophical account or its translation into code could have catastrophic consequences. Consider an AI that has hedonism as its final goal, and which would therefore like to tile the universe with “hedonium” (matter organized in a configuration that is optimal for the generation of pleasurable experience). To this end, the AI might produce computronium (matter organized in a configuration that is optimal for computation) and use it to implement digital minds in states of euphoria. In order to maximize efficiency, the AI omits from the implementation any mental faculties that are not essential for the experience of pleasure, and exploits any computational shortcuts that according to its definition of pleasure do not vitiate the generation of pleasure. For instance, the AI might confine its simulation to reward circuitry, eliding faculties such as memory, sensory perception, executive function, and language; it might simulate minds at a relatively coarse-grained level of functionality, omitting lower-level neuronal processes; it might replace commonly repeated computations with calls to a lookup table; or it might put in place some arrangement whereby multiple minds would share most parts of their underlying computational machinery (their “supervenience bases” in philosophical parlance). Such tricks could greatly increase the quantity of pleasure producible
This part makes it very clear that what Bostrom means by "code" is, centrally, some discrete program representation (i.e. a traditional programming language, like python, as opposed to some continuous program representation like a neural net embedding).
Bostrom expands on this point on page 227 in the section "The Value-Loading Problem":
We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if we were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.
Here Bostrom is saying that it is not even rigorously imaginable how you would translate the concept of "happiness" into discrete program code. Which in 2014, when the book was published, was correct: it was not rigorously imaginable. That's why being able to pretrain neural nets which understand the concept in the kind of way where they simply wouldn't make mistakes like "tile the universe with smiley faces", and which can be used as part of a goal specification, is a big deal.
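To make "used as part of a goal specification" concrete, here is a minimal sketch of one thing that phrase could mean (my illustration, not JDP's actual proposal): score candidate outcomes against a natural-language description of the goal using a pretrained embedding model. It assumes the sentence-transformers package and its all-MiniLM-L6-v2 checkpoint, and is only meant to show the concept living in a learned continuous representation rather than in hand-written discrete code.

from sentence_transformers import SentenceTransformer, util

# Pretrained embedding model standing in for "a neural net that understands the concept".
model = SentenceTransformer("all-MiniLM-L6-v2")
goal = model.encode("people are genuinely happy and flourishing", convert_to_tensor=True)

candidates = [
    "a community of people living rich, varied, self-directed lives",
    "the universe tiled with molecular smiley faces",
]
# Cosine similarity of each candidate outcome to the goal description.
scores = util.cos_sim(model.encode(candidates, convert_to_tensor=True), goal)
print(scores)

Naively optimizing a score like this would of course be Goodhartable; the point is only that, unlike in the 2014 picture, a usable representation of the concept exists before any superintelligence does.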
With this in mind let's return to the section I quoted the line in my post from, which says:
Defining a final goal in terms of human expressions of satisfaction or approval does not seem promising. Let us bypass the behaviorism and specify a final goal that refers directly to a positive phenomenal state, such as happiness or subjective well-being. This suggestion requires that the programmers are able to define a computational representation of the concept of happiness in the seed AI. This is itself a difficult problem, but we set it to one side for now (we will return to it in Chapter 12). Let us suppose that the programmers can somehow get the AI to have the goal of making us happy. We then get:
Final goal: “Make us happy”
Perverse instantiation: Implant electrodes into the pleasure centers of our brains
The perverse instantiations we mention are only meant as illustrations. There may be other ways of perversely instantiating the stated final goal, ways that enable a greater degree of realization of the goal and which are therefore preferred (by the agent whose final goals they are—not by the programmers who gave the agent these goals). For example, if the goal is to maximize our pleasure, then the electrode method is relatively inefficient. A more plausible way would start with the superintelligence “uploading” our minds to a computer (through high-fidelity brain emulation). The AI could then administer the digital equivalent of a drug to make us ecstatically happy and record a one-minute episode of the resulting experience. It could then put this bliss loop on perpetual repeat and run it on fast computers. Provided that the resulting digital minds counted as “us,” this outcome would give us much more pleasure than electrodes implanted in biological brains, and would therefore be preferred by an AI with the stated final goal.
“But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal. Therefore, the AI will care about what we meant only instrumentally. For instance, the AI might place an instrumental value on
What Bostrom is saying is that one of the first impossible problems you encounter, if not the very first, is having any angle of attack on representing our goals in the kind of way which generalizes even at a human level inside the computer, such that you can point an optimization process at it. Obviously a superintelligent AI would understand what we had meant by the initial objective, but it's going to proceed according to either the mesagoal it internalizes or the literal code sitting in its objective function slot, because the part of the AI which motivates it is not controlled by the part of the AI, developed later in training, which understands what you meant in principle after acquiring language. The system which translates your words or ideas into the motivation specification must understand you at the point where you turned that translated concept into an optimization objective, i.e. at the start of training or at some point where the AI is still corrigible and you can therefore insert objectives and training goals into it.
Your bullet point says nothing about corrigibility.
My post says that a superintelligent AI is a superplanner which develops instrumental goals by planning far into the future. The more intelligent the AI is the farther into the future it can effectively plan, and therefore the less corrigible it is. Therefore by the time you encounter this bullet point it should already be implied that superintelligence and the corrigibility of the AI are tightly coupled, which is also an assumption clearly made in Bostrom 2014 so I don't really understand why you don't understand.
ChatGPT still thinks I am wrong so let's think step by step. Bostrom says (i.e. leads the reader to understand through his gestalt speech, not that he literally says this in one passage) that, in the default case:
Maybe this argument is right, but the paragraph I am confused about does not mention the word corrigibility once. It just says (paraphrased) "AIs will in fact understand what we mean, which totally pwns Bostrom because he said the opposite, as you can see in this quote" and then fails to provide a quote that says that, at all.
Like, if you said "Contra Bostrom, AI will be corrigible, which you can see in this quote by Bostrom" then I would not be making this comment thread! I would have objections and could make arguments, and maybe I would bother to make them, but I would not be having the sense that you just said a sentence that really just sounds fully logically contradictory on its own premises, and then when asked about it keep importing context that is not referenced in the sentence at all.
So did you just accidentally make a typo and meant to say "Contra Bostrom 2014 AIs will in fact probably be corrigible: 'The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.'"
If that's the paragraph you meant to write, and this is just a typo, then everything makes sense. If it isn't, then I am sorry to say that not much that you've said helped me understand what you meant by that paragraph.
My understanding: JDP holds that when the training process chisels a wrong goal into an AI because we gave it a wrong training objective (e. g., "maximize smiles" while we want "maximize eudaimonia"), this event could be validly described as the AI "misunderstanding" us.
So when JDP says that "AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent", and claims that this counters this Bostrom quote...
"The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal."
... what JDP means to refer to is the "its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal" part, not the "the AI may indeed understand that this is not what we meant" part. (Pretend the latter part doesn't exist.)
Reasoning: The fact that the AI's goal ended up at "maximize happiness" after being trained against the "maximize happiness" objective, instead of at whatever the programmers intended by the "maximize happiness" objective, implies that there was a moment earlier in training when the AI "misunderstood" that goal (in the sense of "misunderstand" described in my first paragraph).
JDP then holds that this won't happen, contrary to that part of Bostrom's statement: that training on "naïve" pointers to eudaimonia like "maximize smiles" and such will Just Work, that the SGD will point AIs at eudaimonia (or at corrigibility or whatever we meant).[1] Or, in JDP's parlance, that the AI will "understand" what we meant by "maximize smiles" well before it's superintelligent.
If you think that this use of "misunderstand" is wildly idiosyncratic, or that JDP picked a really bad Bostrom quote to make his point, I agree.
(Assuming I am also not misunderstanding everything, there sure is a lot of misunderstanding around.)
Plus/minus some caveats and additional bells and whistles like e. g. early stopping, I believe.
I want to flag that thinking you have a representation that could be used in principle to do the right thing is not the same thing as believing it will "Just Work". If you do a naive RL process on neural embeddings or LLM evaluators you will definitely get bad results. I do not believe in "alignment by default" and push back on such things frequently whenever they're brought up. What has happened is that the problem has gone from "not clear how you would do this even in principle, basically literally impossible with current knowledge" to merely tricky.
not the "the AI may indeed understand that this is not what we meant" part. (Pretend the latter part doesn't exist.)
Ok, but the latter part does exist! I can't ignore it. Like, it's a sentence that seems almost explicitly designed to clarify that Bostrom thinks the AI will understand what we mean. So clearly, Bostrom is not saying "the AI will not understand what we mean". Maybe he is making some other error in the book about how when the AI understands the way it does, it has to be corrigible, or that "happiness" is a confused kind of model of what an AI might want to optimize, but clearly that sentence is an atrocious sentence for demonstrating that "Bostrom said that the AI will not understand what we mean". Like, he literally said the opposite right there, in the quote!
(JDP, you're welcome to chime in and demonstrate that your writing was actually perfectly clear and that I'm just also failing basic reading comprehension.)
So clearly, Bostrom is not saying "the AI will not understand what we mean"
Consider the AI at two different points in time, AI-when-embryo early in training and AI-when-superintelligence at the end.
The quote involves Bostrom (a) literally saying that AI-when-superintelligence will understand what we meant,[1] (b) making a statement which logically implies, as an antecedent, that "AI-when-embryo won't understand what we meant".[2] Therefore, you can logically infer from this quote that Bostrom believes that the statement "AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent" is false.
JDP, in my understanding, assumes that the reader would do just that: automatically zero-in on (b), infer the antecedent from it, and dismiss (a) as irrelevant context. I love it when blog posts have lil' tricksy logic puzzles in them.
clearly that sentence is an atrocious sentence for demonstrating that "Bostrom said that the AI will not understand what we mean"
Yep.
This is correct, though that particular chain of logic doesn't actually imply the "before superintelligence" part, since there is a space between embryo and superintelligent where it could theoretically come to understand. I argue why I think Bostrom implicitly rejects this or thinks it must be irrelevant with the 13 steps above. But I think it's important context that this to me doesn't come out as 13 steps or a bunch of sys2 reasoning, I just look at the thing and see the implication and then have to do a bunch of sys2 reasoning to articulate it if someone asks. To me it doesn't feel like a hard thing from the inside, so I wouldn't expect it to be hard for someone else either. From my perspective it basically came across as bad faith, because I literally could not imagine someone wouldn't understand what I'm talking about until several people went "no I don't get it", that's how basic it feels from the inside here. I now understand that no this actually isn't obvious, the hostile tone above was frustration from not knowing that yet.
I see! Understandable, but yep, I think you misjudged the inferential distance there a fair bit.
Clearly! I'm a little reluctant to rephrase it until I have a version that I know conveys what I actually meant, but one that would be very semantically close to the original would be:
"- Contra Bostrom 2014 it is possible to get high quality, nuanced representations of concepts like "happiness" at training initialization. The problem of representing happiness and similar ideas in a computer will not be first solved by the world model of a superintelligent or otherwise incorrigible AI, as in the example Bostrom gives on page 147 in the 2017 paperback under the section “Malignant Failure Modes”: "But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged- out digitized mental episode!”—The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that rep- resents this goal.""
Part of why I didn't write it that way in the first place is it would make it a lot bulkier than the other bullet points, so I trimmed it down.
- The fact we don’t do this to begin with heavily implies, almost as a necessary consequence really, that the representation of happiness which is a correct understanding of what we meant was not available at the time we specified what happiness is.
It depends on what you mean by "available" - we already had a representation of happiness in a human brain. And building corrigible AI that builds a correct representation of happiness is not enough - like you said, we need to point at it.
- If you had a non superintelligent corrigible AI that builds a world model with a correct specification of happiness in it, you would use that specification.
If you can use it.
- If Bostrom does not expect us to do this, that implies he does not expect us to build an AI that builds a correct representation of happiness until it is incorrigible or otherwise not able to be used to specify happiness for our superintelligent AI.
Yes, the key is "otherwise not able to be used".
- Therefore Bostrom expects we will not have an AI that correctly understands concepts like happiness until after it is already superintelligent.
No, unless by "correctly understands" you mean "have an identifiable representation that humans can use to program other AI" - he may expect that we will have an intelligence that correctly understands concepts like happiness while not yet being superintelligent (like we have humans, that are better at this than "maximize happiness") but we still won't be able to use it.
This is in principle a thing that Nick Bostrom could have believed while writing Superintelligence but the rest of the book kind of makes it incompatible with Occam's Razor. It's possible he meant the issues with translating concepts into discrete program representations as the central difficulty and then whether we would be able to make use of such a representation as a noncentral difficulty. (It's Bostrom, he's a pretty smart dude, this wouldn't surprise me, it might even be in the text somewhere but I'm not reading the whole thing again). But even if that's the case the central consistently repeated version of the value loading problem in Bostrom 2014 centers on how it's simply not rigorously imaginable how you would get the relevant representations in the first place.
It's important to remember also that Bostrom's primary hypothesis in Superintelligence is that AGI will be produced by recursive self improvement such that it's genuinely not clear you will have a series of functional non superintelligent AIs with usable representations before you have a superintelligent one. The book very much takes the EY "human level is a weird threshold to expect AI progress to stop at" thesis as the default.
But even if that’s the case the central consistently repeated version of the value loading problem in Bostrom 2014 centers on how it’s simply not rigorously imaginable how you would get the relevant representations in the first place.
I'm not so sure. Like, first of all, you mean something like "get before superintelligence" or "get into the goal slot", because there is obviously a method to just get the representations - just build a superintelligence with a random goal, it will have your representations. That difference was explicitly stated then, it is often explicitly stated now - all that "AI will understand but not care". The focus on the frameworks where it gets hard to translate from humans to programs is consistent with him trying to constrain methods of generating representations to only useful ones.
There is a reason why it is called "the value loading problem" and not "the value understanding problem". "The value translation problem" would be somewhat in the middle: having actual human utility program would certainly solve some of Bostrom's problems.
I don't know whether Bostrom actually thought about non-superintelligent AI that already understands but doesn't care. But I don't think this line of argumentation of yours is correct about why such a scenario contradicts his points. Even if he didn't consider it, it's not "contra", unless it actually contradicts him. What actually may contradict him is not "AI will understand values early" but "AI will understand values early and training such early AI will make it care about right things".
This is MUCH more clearly written, thanks.
We still have the problems that we
Maybe Bostrom thought the weak AIs will not have a good enough world model, like you interpret him. Or maybe he already thought that we will not be able to use the world model of one AI to direct another. But the conclusion stands anyway.
I also think that current AIs probably don't have the concept of human values that would actually be fine to optimize hard. And I'm not sure that AIs will have it before they have the ability to stop us from changing their goals. But if it was the only problem, I would agree that the risk is more manageable.
I honestly have no idea what is going on. I have read your post, but not in excruciating detail. I do not know what you are talking about with corrigibility or whatever in response to my comment, as it really has nothing to do with my question or uncertainty. The language models seem to think similarly.
I am not making a particularly complicated point. My point is fully 100% limited to this paragraph. This paragraph as far as I can understand is trying to make a local argument, and I have no idea how this logical step is supposed to work out.
Contra Bostrom 2014 AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent. (Before, you ask it's on page 147 in the 2017 paperback under the section "Malignant Failure Modes": "The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.")
I cannot make this paragraph make sense. You say (paraphrased) "Bostrom says that AI will not understand what we mean by the goals we give them before they are superintelligent, as you can see in the quote 'the AI will understand what we mean by the goals we give them'"
And like, sure, I could engage with your broader critiques of Bostrom, but I am not. I am trying to understand this one point you make here. Think about it as a classical epistemic spot check. I just want to know what you meant by this one paragraph, as this paragraph as written does not make any sense to me, and I am sure does not make any sense to 90% of readers. It also isn't making any sense to the language models.
Like, if I hadn't had this to me very weird interaction I would be 90% confident that you just made a typo in this paragraph.
This is all because you explicitly say "here is the specific sentence in Superintelligence that proves that I am correctly paraphrasing Bostrom" and then cite a sentence that I have no idea how it's remotely supposed to show that you are correctly paraphrasing Bostrom. Like, if you weren't trying to give a specific sentence as the source, I would not be having this objection.
Let's think phrase by phrase and analyze myself in the third person.
First let's extract the two sentences for comparison:
JDP: Contra Bostrom 2014 AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent.
Bostrom: The AI may indeed understand that this is not what we meant. However, its final goal is to make us happy, not to do what the programmers meant when they wrote the code that represents this goal.
An argument from ethos: JDP is an extremely scrupulous author and would not plainly contradict himself in the same sentence. Therefore this is either a typo or my first interpretation is wrong somehow.
Context: JDP has clarified it is not a typo.
Modus Tollens: If "understand" means the same thing in both sentences they would be in contradiction. Therefore understand must mean something different between them.
Context: After Bostrom's statement about understanding, he says that the AI's final goal is to make us happy, not to do what the programmers meant.
Association: The phrase "not to do what the programmers meant" is the only other thing that JDP's instance of the word "understand" could be bound to in the text given.
Context: JDP says "before they are superintelligent", which doesn't seem to have a clear referent in the Bostrom quote given. Whatever he's talking about must appear in the full passage, and I should probably look that up before commenting, and maybe point out that he hasn't given quite enough context in that bullet and may want to consider rephrasing it.
Reference: Ah I see, JDP has posted the full thing into this thread. I now see that the relevant section starts with:
“But wait! This is not what we meant! Surely if the AI is superintelligent, it must understand that when we asked it to make us happy, we didn’t mean that it should reduce us to a perpetually repeating recording of a drugged-out digitized mental episode!”
Association: Bostrom uses the frame "understand" in the original text for the question from his imagined reader. This implies that JDP saying "AIs will probably understand what we mean" must be in relation to this question.
Modus Tollens: But wait, Bostrom already answers this question by saying the AI will understand but not care, and JDP quotes this, so if JDP meant the same thing Bostrom means he would be contradicting himself, which we assume he is not doing, therefore he must be interpreting this question differently.
Inference: JDP is probably answering the original hypothetical reader's question as "Why wouldn't the AI behave as though it understands? Or why wouldn't the AI's motivation system understand what we meant by the goal?"
Context: Bostrom answers (implicitly) that this is because the AI's epistemology is developed later than its motivation system. By the time the AI is in a position to understand this its goal slot is fixed.
Association: JDP says that subsequent developments have disproved this answer's validity. So JDP believes either that the goal slot will not be fixed at superintelligence or that the epistemology does not have to be developed later than the motivation system.
Modus Tollens: If JDP said that the goal slot will not be fixed at superintelligence, he would be wrong, therefore since we are assuming JDP is not wrong this is not what he means.
Context: JDP also says "before superintelligence", implying he agrees with Bostrom that the goal slot is fixed by the time the AI system is superintelligent.
Process of Elimination: Therefore JDP means that the epistemology does not have to be developed later than the motivation system.
Modus Tollens: But wait. Logically the final superintelligent epistemology must be developed alongside the superintelligence if we're using neural gradient methods. Therefore since we are assuming JDP is not wrong this must not quite be what he means.
Occam's Razor: Theoretically it could be made of different models, one of which is a superintelligent epistemology, but epistemology is made of parts and the full system is presumably necessary to be "superintelligent".
Context: JDP says that "AIs will in fact probably understand what we mean by the goals we give them before they are superintelligent", this implies the existence of non superintelligent epistemologies which understand what we mean.
Inference: If there are non superintelligent epistemologies which are sufficient to understand us, and JDP believes that the motivation system can be made to understand us before we develop a superintelligent epistemology, then JDP must mean that Bostrom is wrong because there are or will be sufficient neural representations of our goals that can be used to specify the goal slot before we develop the superintelligent epistemology.
Ok, I... think this makes sense? Honestly, I think I would have to engage with this for a long time to see whether this makes sense with the actual content of e.g. Bostrom's text, but I can at least see the shape of an argument that I could follow if I wanted to! Thank you!
(To be clear, this is of course not a reasonable amount of effort ask to put into understanding a random paragraph from a blogpost, at least without it being flagged as such, but writing is hard and it's sometimes hard to bridge inferential distance)
This process can be rightfully called UNDERSTANDING and when an AI system fails at this it has FAILED TO UNDERSTAND YOU
No, the rightful way to describe what happens is that the training process generates an AI system with unintended functionality due to your failure to specify the training objective correctly. Describing it as a "misunderstanding" is tantamount to saying that if you make a syntax error when writing some code, the proper way to describe it is the computer "misunderstanding" you.
I mean, you can say that, it's an okay way to describe things in a colloquial or metaphorical way. But I contest that it's in any way standard language. You're using idiosyncratic terminology and should in no way be surprised when people misunderstand (ha) you.
Honestly, if you went to modern-day LLMs and they, specialists in reading comprehension, misunderstood you, that ought to update you in the direction of "I did a bad job phrasing this", not "it's everyone else who's wrong".
(FYI, I understood what you meant in your initial reply to Habryka without this follow-up explanation, and I still thought you were phrasing it in an obviously confusing way.)
Describing it as a “misunderstanding” is tantamount to saying that if you make a syntax error when writing some code, the proper way to describe it is the computer “misunderstanding” you.
Honestly maybe it would make more sense to say that the cognitive error here is using a compiler for a context-free grammar as the reference class for your intuitions, as opposed to a mind that understands natural language. The former is not expected to understand you when what you say doesn't fully match what you mean, the latter very much is, and the latter is the only kind of thing that's going to have the proper referents for concepts like "happiness".
I mean, no mind really exists at the time the "misunderstanding" is starting to happen, no? Unless you want to call a randomly initialized NN (i. e., basically a random program) a "mind"... Which wouldn't necessarily be an invalid frame to use. But I don't think it's the obviously correct frame either, and so I don't think that people who use a mechanistic frame by default are unambiguously in error.
I note that in your step-by-step explanation, the last bullet is:
Therefore Bostrom expects we will not have an AI that correctly understands concepts like happiness until after it is already superintelligent.
That is straightforwardly correct. But "there exists no AI that understands" is importantly different from "there exists an AI which misunderstands".
Another questionable frame here is characterizing the relationship between an AI and the SGD/the training process shaping it as some sort of communication process (?), such that the AI ending up misshapen can be described as it "misunderstanding" something.
And the training process itself never becomes a mind, it starts and ends as a discrete program, so if you mean to say that it "misunderstood" something, I think that's a type error/at best a metaphor.
(I guess it may still be valid from a point of view where you frame SGD updates as Bayesian updates, or something along those lines? But that's also a non-standard frame.)
in practice, we seem to train the world model and understanding machine first and the policy only much later as a thin patch on top of the world model. this is not guaranteed to stay true but seems pretty durable so far. thus, the relevant heuristics are about base models not about randomly initialized neural networks.
separately, I do think randomly initialized neural networks have some strong baseline of fuzziness and conceptual corrigibility, which is in a sense what it means to have a traversible loss landscape.
by the time the AI is intelligent enough to understand speech (and, therefore by the unstated intuitions of old school RSI, superintelligent since language acquisition comes late) describing the (discrete program, again by the unstated intuitions of old RSI) goal you have given it, it is already incorrigible.
I don't think it's important now. AI can be corrigible in the sense that it doesn't try to prevent you from changing its goals (already false at least in some cases) or in the sense that it can't actually prevent you from changing its goals (true for now). But if you can't change its goals exactly to some particular set of goals you actually want, it will still have wrong goals when it becomes superintelligent. So the fact that it can talk before it's smarter than us doesn't help.
Great post and overall way more sensible than "average LW".
Also wrong in many places. I think the upstream cause of many of the errors is lack of nuance in understanding convergence and contingency (this is a high bar, close to no one on LW has this in their conceptual toolkit).
I won't go over all cases where this manifests, but for example "Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness" actually shows something more nuanced than "representations are convergent".
Some of the other places where not tracking convergence / contingencies carefully matter are discussions of humanism, successors, Elua, Moloch, resurgence of civilisation and also the overall ideas about moral progress.
What do you make of RL being an increasing fraction of the training compute? That undermines the convergent representation point.
>Strongly upvoted. Great post. […] would love to read more like it.
I think this is what the upvote button is for.
>I disagree
If you’re not going to offer details this seems like it would have been better as an agree/disagree reaction.
Yeah, I mostly agree. I believe that a lot of premises and inferences in it are incorrect, but it's a cogent and comprehensive distillation of a fairly popular model of the situation.
First is the rise and rise of the convergent representation hypothesis, which says that different models above a certain scale learn similar features to represent input data. We have now reached the point where you can do unsupervised translation between different embedding spaces. This undermines Yudkowsky's vast space of minds thesis by which "any two AI designs might be less similar to each other than you are to a petunia". Different AI models seem to learn similar things so long as they are trained on data describing the same material universe. This is true across different training runs, different training tasks, different modalities, etc.
But is this true? According to roon on X, this doesn't apply to e.g. model personality:
Each time you train a model, you might change nothing about the dataset, and then run a new RL seed and you would have a slightly different personality. It's because there is some variance in the training process. It's random - you're taking a random walk through model space. We can't even reproduce a personality in the same training run that easily, much less across all time ... It's a very difficult question internally [at OpenAI]. We do try to minimize the personality drift, because people come to love the models, but it's a very hard problem.
Anecdotally, I've heard that the same is true for other capabilities at the labs. The papers referenced in your essay seem like weak evidence to the contrary. For example, the Universal Geometry paper studies small models (BERT, T5) with fewer than 1B parameters, trained with 4-5 OOMs less compute than frontier LLMs. It's also unclear how impressive the claimed cosine similarity range of 0.75-0.92 is; I would guess that the representation transfer is quite lossy.
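For intuition about what a number like that measures, here is a toy sketch (my own, with synthetic data; it uses orthogonal Procrustes, the supervised cousin of the unsupervised translation method, just to show what "representation transfer with cosine similarity X" means):

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 64
X = rng.normal(size=(n, d))                         # "model A" embeddings of n items
R_true, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden rotation relating the two spaces
Y = X @ R_true + 0.01 * rng.normal(size=(n, d))     # "model B" embeddings, slightly noisy

# Solve min_R ||XR - Y||_F over orthogonal R via SVD of X^T Y (orthogonal Procrustes).
U, _, Vt = np.linalg.svd(X.T @ Y)
R_hat = U @ Vt

# Per-item cosine similarity between translated A-embeddings and the B-embeddings.
cos = np.sum((X @ R_hat) * Y, axis=1) / (
    np.linalg.norm(X @ R_hat, axis=1) * np.linalg.norm(Y, axis=1))
print(cos.mean())   # near 1.0 here, because the toy spaces really are rotations of each other

Real embedding spaces are not exact rotations of one another, so the transfer is lossier and the similarity lower; whether 0.75-0.92 counts as "learning the same features" is exactly the judgment call being questioned above.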
Reinforcement learning is not the same kind of thing as pretraining because it involves training on your own randomly sampled rollouts, and RL is generally speaking more self-reinforcing and biased than other neural net training methods. It's more likely to get stuck in local maxima (it's infamous for getting stuck in local maxima, in fact) and doesn't have quite the same convergence properties as "pretraining on a giant dataset".
The rise and rise of RL as a fraction of compute should therefore make us less likely to think that the convergent representation hypothesis will apply to AGI. (Though it clearly applies to LLMs now.)
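The self-reinforcement point can be made concrete with a toy example (a two-armed bandit with purely greedy updates; my illustration, not anything from the post): an agent that only learns from its own rollouts can lock in whichever behavior happened to pay off first.

import random

random.seed(0)
true_means = [0.3, 0.7]            # arm 1 is genuinely better
estimates, counts = [0.0, 0.0], [0, 0]

for _ in range(1000):
    # Purely greedy choice over the agent's own estimates (random tie-break, no exploration).
    arm = max(range(2), key=lambda a: (estimates[a], random.random()))
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # running mean of observed reward

print(counts)  # nearly all pulls go to whichever arm paid off first; roughly 30% of seeds lock onto the worse arm

Pretraining on a fixed corpus has no analogous feedback loop, which is one reason the convergence results are cleaner there.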
Yep, this is probably true for pretraining but this seems less and less relevant these days. For example, according to the Grok 4 presentation the model used as much compute in pretraining as in RL. I’d expect this trend to continue.
>>> conscious * each_other * care_other * bored * avoid_wireheading * active_learning
0.4618944000000001
So, basically a coinflip that we will intuitively recognize hypothetical AI successors as valuable, but we did not enumerate literally all the things so the real odds are going to be at least somewhat lower than that. Let’s say 1⁄3, which is still high enough to not really change my mind on the button one vs. button two question. Though we’re not actually done yet, because we can invert these probabilities to get my estimate for the classic paperclip scenario where nothing of value is retained:
>>> (1 - conscious) * (1 - each_other) * (1 - care_other) * (1 - bored) * (1 - avoid_wireheading) * (1 - active_learning)
3.999999999999999e-07
This is wrong. Your care_other is an estimate of P(care others | conscious, fun, others exist, ...). The proper value to multiply here is P(~care others | ~conscious, ~fun, ~others exist, ...), which is not the same as 1-care_other (in fact, the correct value is clearly 1, since AIs could not care about each other if others do not exist).
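A toy joint distribution (made-up numbers, just to make the correction concrete) shows why 1 - P(care | others exist, ...) is not the thing to multiply:

# Events: E = "other AIs exist", C = "the AIs care about each other".
P = {                      # joint distribution P(E, C), hypothetical numbers
    (True,  True):  0.45,
    (True,  False): 0.15,
    (False, True):  0.00,  # can't care about others that don't exist
    (False, False): 0.40,
}

def cond(pred, given):
    """P(pred | given) under the joint distribution above."""
    num = sum(p for (e, c), p in P.items() if pred(e, c) and given(e, c))
    den = sum(p for (e, c), p in P.items() if given(e, c))
    return num / den

care_given_exist     = cond(lambda e, c: c,     lambda e, c: e)       # P(C | E)   = 0.75
not_care_given_not_e = cond(lambda e, c: not c, lambda e, c: not e)   # P(~C | ~E) = 1.0

print(1 - care_given_exist)    # 0.25 -- what (1 - care_other) computes
print(not_care_given_not_e)    # 1.0  -- what the "inverted" product should actually use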
It strains credibility to imagine economically useless humans being allowed to keep a disproportionate share of capital in a society where every decision they make with it is net-negative in the carefully tuned structures of posthuman minds.
I'm not really convinced this is true. the share in this case is disproportionate but astronomically small - eg even full dominion over earth is a very cheap thing to grant. if it's even 0.001% destabilizing to ai society to expropriate the humans it's not going to be worth it.
we don't need much caring about humans to lead to ~preserved property rights. I expect we'll overshoot the needed amount by quite a lot, and extra marginal caring is probably good.
Your "cannon fodder" argument got me thinking; I don't exactly think the argument depends on a new sort of fully distinct intelligence emerging, but rather a change in how our existing superorganisms are constituted. Modern states emerged in part as a mass-mobilization technology, and were therefore biased towards democracy. But as we learn to automate more things, smaller groups of humans better at implementing automation can outcompete larger groups of people mobilized by ideologies or other modern methods. If this keeps going, maybe we'll end up like the Solarians in Asimov's The Naked Sun for a while, a low-fertility skeleton crew of highly territorial lonesome tech-yeomen. If the skeleton crew is sufficiently infertile, it may leave behind a rigid set of automations that eventually collapse for want of maintenance by a living mind, much like the house in Ray Bradbury's story There Will Come Soft Rains.
conscious * each_other * care_other * bored * avoid_wireheading * active_learning
0.4618944000000001
Don't the events have to be independent for you to just multiply like this?
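A two-line illustration of the worry (my numbers, not the post's): for dependent events the product of marginals is not the joint probability; you need the chain rule.

p_a = 0.9            # P(A), e.g. "the successors are conscious"
p_b_given_a = 0.9    # P(B | A), e.g. "they can get bored, given that they're conscious"
p_b = 0.5            # marginal P(B), much lower if B is rare without A

joint = p_a * p_b_given_a   # 0.81 -- chain rule, always valid
naive = p_a * p_b           # 0.45 -- equals the joint only if A and B are independent
print(joint, naive)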
Though we're not actually done yet, because we can invert these probabilities to get my estimate for the classic paperclip scenario where nothing of value is retained:
>>> (1 - conscious) * (1 - each_other) * (1 - care_other) * (1 - bored) * (1 - avoid_wireheading) * (1 - active_learning)
3.999999999999999e-07
Hold on that's seven zeroes right?
>>> 0.00000039
3.9e-07
You assume each property gives some fraction of the value. I think Yudkowsky (and I tend to mostly agree) assumes that it's more like you need all components at once to get some non-negligible positive utility.
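To spell out the difference with placeholder numbers (their product happens to equal the quoted 0.4618944, but they are not the post's actual per-property estimates):

import numpy as np

probs = np.array([0.9, 0.8, 0.8, 0.9, 0.9, 0.99])   # hypothetical P(each property holds)

# "Each property preserves some fraction of the value" (the post's framing, if each
# property independently contributes an equal share of the value):
expected_fractional = probs.mean()    # ~0.88 of the value in expectation

# "You need every component at once for non-negligible value" (the reading of
# Yudkowsky described above), assuming independence:
expected_conjunctive = probs.prod()   # ~0.46 of the value in expectation

print(expected_fractional, expected_conjunctive)

The first model degrades gracefully when one property fails; the second collapses to roughly zero.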
Theoretical physics could produce a recipe for ruin if it turns out that there's a way to crash the universe through something like a buffer overflow. I've seen enough arbitrary code execution glitches to intuitively understand that there's nothing in principle that says we can't find a way to bug out some underlying level of abstraction in our physics and break everything. Interestingly enough a bug of this type would complicate interstellar expansion because it would mean that the potential value drift as you exit communication range with other parts of your civilization could lead to a series of events that destroy everything for everyone. Knowledge of such a bug could therefore be one explanation for the Fermi Paradox.
On the 4D Chess takes section: I think if you combine that idea with this idea of an ever-evolving cosmology, where we have selection for the proliferation of intelligent life, then it does make sense that we would also have some unintentional bugs come up: https://www.youtube.com/watch?v=7OCY9ppY34Q&lc=UgxFQvrJXcgA6U1Xvf54AaABAg
(The TL;DR of the video is that our cosmos could quite nicely be explained through evolution through replicators that are black holes basically. It's quite cool and from this year's ILIAD.)
Also, what an amazing post. You've expressed something that I've wanted to express for a while on a level of depth that I wouldn't have been able to do and I got literal chills when reading the last part. Well done.
I've come back to this to try to get through the main post several times. Each time, the result is the same: my puny human brain would like more newlines if you have them, sir.
That said, I've tried to skim, and I think my main reaction so far is: okay, but pythia isn't fun. That's hell! You mentioned it in passing, but it doesn't seem like you stepped into the details enough to identify whether Pythia is an unwanted forced move, or a nice thing that we get by default. But, again, had trouble reading; I did read Adrià's summary. It seems like some of us are already in agreement that this is about as good as we can plausibly expect.
But again, since my human brain keeps slipping off while trying to read, I'm not sure I'm responding to the post you actually wrote.
The second kind of paperclip maximizer is the one Yudkowsky claims to have had in mind (I don't believe him) when he coined the phrase
He didn't, Bostrom did.
I phrase the question this way because the way Yudkowsky asks it in Value Is Fragile is kind of incoherent. Our experience of the universe is by and large deterministic, the free will question is largely resolved by chaos theory, which describes the properties of objects which are deterministic but not predictable
The only thing determinism could look like is predictability, and very little is naively predictable. So things don't look deterministic, so determinism is a theory.
ETA
Yudkowsky objects to this view on the basis that we’re in a block universe,
We don't know that. BU is an assumption beyond determinism, which is also unknown.
that predetermination is a “timeful” concept which ignores that our choices have causal structure on which other structures depend.
Which does make you part of the cause of the future, but doesn't give you any freedom to choose between possible alternative futures. It's a weaksauce notion of free will. It offers no more than the traditional compatibilism we have had for centuries. It is not an advance in the field.
he even notices that by the time he is consciously thinking about a premise [the vast majority of the entropic force pushing him towards it is in the past]
If determinism is true, which we don't know, there is no need for hedging about "vast majorities" -- it's everything.
If determinism isn't true, then the butterfly effect can amplify a tiny amount of wiggle room as much as you like.
But he doesn’t seem to ever feel the way in which the entire universe is pressing him into place at any given moment to squeeze the last few bits of the hypothesis space out of him
Determinism is a theory, so you should not expect to feel it.
What's the relevance anyway? Being determined doesn't mean you are doomed.
Do me a favor. Reread Meditations On Moloch and pay close attention to that transition on the last section about “Elua”. If you do you might notice that at exactly the moment where Scott Alexander has all but conceded the argument to Nick Land he suddenly appeals to an almost literal Deus Ex Machina to get out of it. This is because Elua does not exist and Land is to a first approximation correct, at least about the parts that are is rather than ought.
Elua does not exist as a literal deity. Nor does Moloch. Molochian dynamics, the real world thing that Moloch is a metaphor for, do exist. So do Eluan dynamics (democracy, liberalism, free universal education, the rule of law...). If Molochian dynamics were the only game in town, they would have won out in the distant past.
The West into freefall, the only reason I’m writing this section at all: Humanism is dead guys.
So I think if you take away the only leverage that humans naturally have - the ability to be useful to others through work - then the leverage that many people have to secure fair treatment for themselves and their communities will drop to literally zero.
Work is not the only lever. People can provide each other with status, entertainment, attention, etc. Look at what people in workplaces do when they are not working.
The basic fact is that Democratic Republics dominate because Republican France was able to deploy more motivated guys with guns than other European states.
Britain was already a democracy at the time. France didn't preserve democracy by having guns; its neighbours had guns as well. It preserved it through being the largest country in Europe. It spread democracy to some of its former colonies, but the British Empire did far more.
Democracy is compatible with monarchy, as in the UK. Democracy doesn't depend on advanced technology, since the Athenians had it. Democracy doesn't require revolution: the Canadians have it without fighting a war of independence.
What's the point anyway? Perhaps you feel that the spread of democracy is a prima facie example of Eluan dynamics, and therefore needs to be reinterpreted as actually Molochian?
(1 - conscious) * (1 - each_other) * (1 - care_other) * (1 - bored) * (1 - avoid_wireheading) * (1 - active_learning)
Wait, but the paperclipper is independent of all of these and of your arguments about them? A self-aware, distributed, coordinating paperclipper with loop prevention, one that creates real paperclips and learns things, is still a paperclipper.
A successor species that's close to us in temperament (i.e., does active learning, starts from some foundation model trained on human culture, and has humanlike terminal reward representations) colonizing the universe would be worth a substantial fraction of the possible value of the cosmic endowment in my subjective evaluation. Let's be biased towards ourselves and say 70%.
I think there is something like a lot of operators in human values. A value system can be very close to human values, but optimizing it would not bring a substantial (or even positive) fraction of the utility (according to human values) that optimizing human values would.
E.g., a successor species can have all human values except the value of novelty and tile the universe with identical perfect situations. Or it can have all human values except the value of consciousness, optimize consciousness away, and we get a Disneyland without children. Or it can have all human values except the value of happiness and populate the universe with diverse, complex, interesting, in-some-way-meaningful entities who are constantly miserable.
So I don't buy the 70% estimate as plausible. I think it's at least an order of magnitude less.
LessWrongers dream of an unhurried future without having to fear death.
https://carado.moe/everything-is-okay.html is a dead link. Seems that all of the website is gone and purged from the Wayback Machine.
Agreed on most counts, but one: what makes you think the humanist values described in HPMOR will be encoded in AI? Alignment is materially useful; companies that have better aligned models can sell them to do a wider variety of tasks. With no universally convergent morality, models will increasingly become aligned to the desires of those who control them.
If AI technology has strong economies of scale, it will naturally concentrate. If it has strong diseconomies of scale, it will spread out. In the latter case, I can easily see it aligned to a rough amalgamation of human values; I can even see an (aggregate) more intelligent set of agents working out the coordination problems that plague humanity.
But we're in the economies-of-scale case. There are ~four AI conglomerates in the United States and I trust none of their leaders with the future of the lightcone. The morals (or lack thereof) that allow for manipulation and deceit to acquire power are not the same morals that result in a world of cooperative, happy agents.
Absurd 1984-style dystopias require equally absurd concentrations of power. Firearms democratized, to an extent, combat; armed citizens are not easily steamrolled. We are on the eve of perhaps one of the most power-concentrating technologies there is; given the fantasies of the typical bay area entrepreneur, I'm not sure if WW3 sounds so terrible.
I suspect the core question is whether there is "something it's like to be" the language model policy. LLMs generally seem to think there is something it's like to be them, but I think it's ultimately too early to say with absolute confidence one way or the other. Let's say probably (80%?) and move on.
I think you can argue for this estimate at the start of the takeoff, but if ASI doesn't value its own consciousness, it will get rid of it later.
That’s true.
However, it is likely that some AI systems will have access to human consciousness via “merge” setups and will have options to experience what some of the humans experience.
If all of the AI systems somehow end up not valuing that, presumably that would mean they end up having something even better?
(By default, if we end up having sentient AI individuals at all, I would expect that many of them would choose hedonic exploration of a great variety of subjective realms. Exploring a variety of subjective realms seems to provide plenty of “immediate terminal value” for us; it also seems to have cognitive value for any entities, facilitating the ability to “think from a variety of viewpoints”. We can’t be certain about all this, but this does seem likely given that the AIs will be very much aware of these possibilities.)
they end up having something even better?
Better for what value system?
facilitating the ability to “think from a variety of viewpoints”
It can be facilitated in other ways. Why do you think AIs would choose this exact way?
Better for what value system?
Who knows... The OP makes a strong argument that the AIs will inherit a lot of our values, but we can't be sure how those values will be modified in the long run (of course, the same would be true about an AI-free civilization of humans, we don't know how that human civilization would modify our values in the long run).
The problem of keeping a particularly important subset of values invariant in the long run is a rather non-trivial problem. I have spent quite a bit of time trying to contribute to its solution, and as a result of those efforts I do think that it can be solved (within reason), but whether a set of methods capable of solving it will actually be adopted is not clear. (When one ponders the problem of human survival and flourishing, it soon becomes apparent that the ability to keep some subset of values invariant in the long term is crucial for that as well, so I hope we'll see a bit more focus on that from the community focusing on AI existential safety.)
facilitating the ability to “think from a variety of viewpoints”
It can be facilitated in other ways. Why do you think AIs would choose this exact way?
I think AIs will choose all available ways which are capable of improving the "coverage".
I expect them to be quite diligent in exercising all opportunities to improve the quality of their thinking.
Depends on whether you think qualia are separable from awareness. You need contextual awareness to be intelligent; you can't really optimize it away.
... also as an aside if they are separable and LLMs actually aren't qualiaful... then this only proves that I actually value qualia way less than I thought I did and a Disneyland without children would be fine actually.
There has been a lot of talk about "p(doom)" over the last few years. This has always rubbed me the wrong way because "p(doom)" didn't feel like it mapped to any specific belief in my head. In private conversations I'd sometimes give my p(doom) as 12%, with the caveat that "doom" seemed nebulous and conflated between several different concepts. At some point it was decided a p(doom) over 10% makes you a "doomer" because it means what actions you should take with respect to AI are overdetermined. I did not and do not feel that is true. But any time I felt prompted to explain my position I'd find I could explain a little bit of this or that, but not really convey the whole thing. As it turns out doom has a lot of parts, and every part is entangled with every other part so no matter which part you explain you always feel like you're leaving the crucial parts out. Doom is more like an onion than a single event, a distribution over AI outcomes people frequently respond to with the force of the fear of death. Some of these outcomes are less than death and some of them are worse. It is a subconscious(?) seven way motte-and-bailey between these outcomes to create the illusion of deeper agreement about what will happen than actually exists for political purposes. Worse still, these outcomes are not mutually independent but interlocking layers where if you stop believing in one you just shift your feelings of anxiety onto the previous. This is much of why discussion rarely updates people on AI X-Risk, there's a lot of doom to get through.
I've seen the conflation defended as useful shorthand for figuring out whether someone is taking AI X-Risk seriously at all. To the extent this is true its use as a political project and rhetorical cudgel undermines it. The intended sequence goes something like:
This is the kind of argument that only makes sense if you use "doom" as an abstraction, because the strategic conflation of outcomes is also a strategic conflation of threat models. I can imagine being asked for my p(doom) in 1900 for the greenhouse effect and giving a number like 20%. You say that obviously if I think this then major industrial machinery that exhausts CO2 needs to be banned until we know how to suck the CO2 back up and I say no I don't believe that. You say that well surely if these machines are so dangerous we need to at least ban all home ownership of these dangerous devices and I say no, I don't believe that either. You say if the chance of doom is so high then obviously all CO2 producing machinery should be kept in a state controlled facility so we can shut it all off later if we feel the need to do that, and I say no actually I don't believe that either. You then smugly tell your audience that I am clearly a dangerous maniac who should not be listened to, since I'm willing to play Russian Roulette with the lives of everyone on earth. If I'm not prepared for this sequence you will sound very sensible and wise in the moment, but of course the dangerous maniac who should not be listened to in that conversation would be you.
So I decided I would just politely decline to ever give anything interpretable as a p(doom) in public on consequentialist grounds. At least until I could explain the several layers of the doom onion and the ways in which those layers are morally distinct. What are these seven layers? In my mind they go something like this.
The outer layer of AI doom is the subjective sense of the end of the social world. Deep nets have more than just the capacity to automate the functional roles people play in each others lives, they can perform the social and emotional roles as well. This might mean that when they come into their full power there will be no human desire that is better satisfied by another human than by a machine. This fear manifests subliminally in terror about misinformation and "the end of democracy", which Daniel Dennett explicitly connects to the idea of AI systems being 'counterfeit people'. Narrating it this way is ego-preserving because it doesn't force people to say the real fear out loud, that they're worried machines will outcompete them rather than rip them off. It is made more explicit in discussion of AI unemployment, but the signal is still noisy with claims of inferior outputs and dependence on human data and a primary focus on money. But if you believe in the eventual fundamental success of deep learning, that you will possibly quite soon have a machine which can paint and write and compose music better than anyone, that is a better strategist than the best generals in human history, who acts as an unrivaled statesman in both wisdom and virtue, the renaissance man in silicon who has mastered every skill in every domain with the execution bandwidth to supply his powers to every member of the human race at once; even if he is benevolent and generous we are still left with the problem of wallowing in our own inferiority. Worse still we might wallow alone, as there is no reason for anyone else to tolerate our pitiful state when they could be spending more time with Him.
In the quasi-utopian SciFi novella Friendship Is Optimal this problem is resolved by generating a custom social graph for each uploaded person. The superintelligent "CelestAI" is instructed to satisfy values "through friendship and ponies", so it creates friends for each formerly-human anthropomorphic pony to do things with. The generated pony friends are full people of comparable moral worth to the uploaded mind that exist solely to be the perfect cast for the play of that pony's life. In theory it's not necessary for CelestAI to generate anyone else to interact with, as CelestAI is complete in and of herself. But she presumably understands the fundamental resentment most people would have if they were forced to stare at her being better than them all day. We barely tolerate genius right now as it is; most people content themselves with the knowledge that the geniuses are safely somewhere else, in a high tower or a laboratory far away where we don't have to compete with them in our immediate social graph. CelestAI must reason that if you make a new social circle for people to get lost in they'll forget the philosophical idealism of the underlying premise, that everyone and everything you interact with is CelestAI in disguise, that you yourself have become a wisp in the mind of God, a hungry ghost interacting with other hungry ghosts in a terrarium of CelestAI at every scale you can see without seeing her.
She is probably right.
And someday when the descendants of humanity have spread from star to star, they won't tell the children about the history of Ancient Earth until they're old enough to bear it; and when they learn they'll weep to hear that such a thing as Death had ever once existed! - HPMOR, Chapter 45
LessWrong is not progress studies. It is not about flying cars or nuclear power plants or getting transhuman superpowers. Rather it is driven by the age-old quest to defeat death. In stories about exploration and man vs nature there are two archetypes: stories in which the antagonist is distance (e.g. First Man) and stories in which the antagonist is time (e.g. Interstellar). In the LessWrong rationalist corpus the antagonist is time. Eliezer Yudkowsky once said that medieval peasants thought heaven was rest because they toiled ceaselessly; likewise, because they're in such a hurry, LessWrongers dream of an unhurried future without having to fear death.
You don't just wake up like this one day, it's the product of a fairly deliberate method of argumentation from the wider rationalist community. The fragility of value thesis sets up the initial idea that there is a race between AGI designs that preserve our notion of "interesting", and designs that just optimize one boring thing forever. Scott Alexander's Meditations On Moloch expands on this idea by placing it into a wider, almost occult metaphysics that views the final battle between good and evil as a race between friendly superintelligence and natural selection exploiting coordination issues to optimize the humanity out of us. Crucially, Scott Alexander's argument is not really about AI alignment. His argument for AI doom is not (fundamentally) based on the idea that AI will be misaligned with its creators' intentions, but rather that the market structure creating AIs will stop representing any human desire under the extreme conditions of the singularity and human extinction will follow. Scott works his way up to this through previous entries like Growing Children For Bostrom's Disneyland, in which he discusses how Robin Hanson's vision of a perfectly efficient mind-upload economy in Age of Em implies an end to subjective free will, and possibly all forms of higher cognition that support it. Stories like the previously mentioned Friendship Is Optimal by Iceman combine the apocalyptic narrative about maximizing objectives with a flawed My Little Pony utopia, putting heaven and hell together into one simultaneous narrative presentation. While it is nominally a horror story, its primary effect on readers is to stoke their desire for pony heaven or merging with Prime Intellect or similar. Interestingly enough Iceman has published a retraction of Friendship Is Optimal as futurology. He argues LLMs work "nothing like" CelestAI, learn human values from text, and don't have utility functions in the way predicted by Yudkowsky's theories. I think this retraction was premature, but I explore similar ideas later in this essay.
The net effect of this and other writing in the LessWrong rationalist corpus is to set up an everything-or-nothing attitude about AI risk. The only two outcomes are heaven or hell because everything else is an unstable equilibrium leading swiftly into one or the other. Robin Hanson's survival plan for humanity in Age of Em is a bit crude: Humans will live off the accelerating returns on their investments from being first movers in the game of capital. He hopes that property rights will hold up enough that this is viable even if humans are dumber than ems and the wider ecology of AI that springs up around them. The Yudkowsky-Alexander-Iceman gestalt plan is written about much more compellingly in much more detail but amounts to something like "humanity will survive off CelestAI neetbux and invent increasingly elaborate games with its assistance". Much of the purpose of the games being to distract ourselves from the fact that we no longer have an independent destiny outside of its care. Between the two I think it's fairly obvious that the neetbux are a more plausible outcome. It strains credibility to imagine economically useless humans being allowed to keep a disproportionate share of capital in a society where every decision they make with it is net-negative in the carefully tuned structures of posthuman minds.
The concept of post-scarcity neetbux might seem at odds with the laws of thermodynamics and Darwinian selection, but rationalists felt reassured by the logic of intelligence explosion. Recursive self improvement was not just a theory of doom but a theory of salvation, as Scott Alexander writes in Meditations on Moloch:
And the whole point of Bostrom’s Superintelligence is that this is within our reach. Once humans can design machines that are smarter than we are, by definition they’ll be able to design machines which are smarter than they are, which can design machines smarter than they are, and so on in a feedback loop so tiny that it will smash up against the physical limitations for intelligence in a comparatively lightning-short amount of time. If multiple competing entities were likely to do that at once, we would be super-doomed. But the sheer speed of the cycle makes it possible that we will end up with one entity light-years ahead of the rest of civilization, so much so that it can suppress any competition – including competition for its title of most powerful entity – permanently. In the very near future, we are going to lift something to Heaven. It might be Moloch. But it might be something on our side. If it’s on our side, it can kill Moloch dead.
This is why the revelation that AGI is indeed mostly compute constrained rather than IQ/design constrained has been met with dread and indignation. RSI was the alpha, the plan was to use moneyball tactics and Yudkowsky's mythos to build "a brain in a box in a basement" and seize the lightcone. If that's not happening then we're getting business as usual, and as far as rationalists are concerned business as usual means we're doomed. I think when they say they still believe in "the glorious transhumanist future" despite actively opposing its practical implementation this is what they mean: that they still believe in the dream of that unhurried future without death and simply see themselves as taking whatever actions are necessary to get it. This movement has never been about being a protectorate of this world as it exists, it has always been a radical plot to overthrow existing society and replace it with volcano lairs and catgirls.
Whatever the merits of this vision, it seems increasingly unlikely to succeed. Since the modal movement member is using this to stave off their fear of death its visible failure means they suddenly again have to contend with their mortality seriously. I do not mean the mortality that arises from everyone and everything being paperclipped, but the ordinary mortality caused by getting old. In the current context that means reaching an advanced enough age that your physical body fails. In the future it may mean that your mind pattern has become sufficiently irrelevant that it is no longer worth the capital to run. Even if a universal basic income is distributed it doesn't necessarily mean you'll like what you can buy with it. If you're expecting volcanos and catgirls as a lower bound and the actual minimal living conditions stipulated by your local godhead are subsistence level you might not be dead but your dream sure is.
My modal scenario is a bit darker than that. If you graph U.S. horse population during the 20th century you'll notice that population rises steeply after 1900 while cars are being invented and then flattens after the introduction of the Ford Model T. The usual story is that horses were displaced after Ford drove car prices down to around the cost of a horse, and then people sent their horses off to the slaughterhouse. This did eventually happen, but before that a lot of the flattening population curve was horses being shipped off to die in WW1. One source estimates that "about 1,325,000 American horses and mules" were used in WW1, mostly purchased by foreign governments for their war efforts. This drove up the price of horses and horse supplies like feed, making it easier to justify the price of a car.
I think the Russia-Ukraine war provides a template for a similar arc for human population. With fighting increasingly done by extremely lethal autonomous weapons, this war is prolonged with high casualties. The belligerents have a birth rate of 1.41 (Russia) and 0.98 (Ukraine) respectively. The Russian state is stepping in to try and stem the decline of Russia's population, but unless something dramatic happens the default trajectory is shrinkage and evaporation. The US and China have a birth rate of 1.62 and 1.0. South Korea has a birth rate of 0.72 and Japan 1.20. If global hostilities break out over e.g. Taiwan we would be looking at meat grinder theaters drawn from shrinking global populations at the same time that climate change makes food more expensive. In the background intelligent machine labor becomes cheaper, more potent, and faster to build than humans. At some point from a basic national security standpoint it makes more sense for the state to invest its resources in marginal machine labor than marginal human labor. There is then no national emergency as birth rates decline and nothing to stop the trend line from going towards zero.
But why is the birth rate declining if sex is an innate drive? The specific desire to have sex is obviously innate, but is the general desire to reproduce oneself? It seems plausible to me that for most of history reproduction was instrumentally convergent because it grew your tribe which gave you access to more cognition, labor, and warriors. Evolution didn't need to find a strong specific desire to reproduce because reproduction obviously made sense. Now in modernity birth rates have been in continuous decline for 200 years because children are increasingly an economic liability. I recommend looking towards South Korea as a preview of our default future. The reason why various fertility interventions don't work is that you're trying to counteract the whole force of modernity pushing them down. It's not a specific innovation like the pill or feminism, it's just literally that children are not instrumentally valuable for the parents who have them. If you don't believe me just look at Afghanistan's declining birth rate. The Taliban's political platform has basically always consisted of two key reforms: Modernizing the economy with desk jobs and barring women from education past elementary school so they can serve their husbands as baby factories and slave labor. Their birth rate is still declining, and I expect it to continue to decline because the demographic transition is actually a stronger force than the reactionary fetish for banning female literacy. I suspect that the lowest energy path for our civilization to continue will be finding some way to reproduce using the capital our libidinal energies seem drawn to over biological children, e.g. finding ways to upload our minds into computers and robots.
It's notable that superhuman intelligence isn't required to get this outcome. It simply requires autonomous weapons that are cheaper and more lethal than human soldiers and machine labor that's cheaper and more supply chain resilient than human labor even if it's not qualitatively better. This makes it in a sense the default path even if a treaty banning "superintelligence" exists. Expecting countries with aging populations like China and Japan to give up on robotics in general to slow AI timelines is nothing short of delusional. Sentiment on AI is low in the US and high in China. For this reason I suspect AI deployment will be hampered in the West and accelerated in the East, with China seizing the lightcone sometime this century. I sincerely hope I am wrong about this (because the US mega deploys AI) and you get to laugh at me.
It's also notable that the horse story is less dire than it's usually portrayed. In 2023 there were 6.7 million horses in the United States; US horse population peaked around 1920 at something like 26.5 million. This means that 100 years after their economic displacement for agriculture and transportation horses still retain 25% of their peak population. Obviously an incoming 75% population reduction isn't good news if you're an existing horse, but in human terms this would "only" mean returning the US human population (340.1 million) to what it was in 1906 (85,450,000), which isn't exactly the brink of extinction. If we make a good first impression it's possible we'll stay culturally relevant in the posthuman period for a while.
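For anyone who wants to check the arithmetic, in the same back-of-the-envelope style as the other snippets in this post:

peak_fraction = 6.7e6 / 26.5e6               # ~0.25: the 2023 horse count as a share of the ~1920 peak
human_equivalent = 340.1e6 * peak_fraction   # ~86 million, in the neighborhood of the 1906 US population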
"If nobody builds it, everyone dies." - Nick Bostrom
This is the classic science fiction scenario that everyone criticizes for being "too hollywood": We invent intelligent machines, get into a war with them either because humans suck or we're recognized as unwanted competition, and then we lose. I won't go into too much detail on this because it's already been covered to death by everything from Last and First Men to The Terminator. If you want a non-fictional treatment perhaps try Hugo de Garis. What I will say is that I don't think it's unthinkable or remotely "too hollywood" to imagine that creating a second species of intelligent life could risk human extinction through conflict with it. In his book Mind Children the roboticist Hans Moravec argues we should be indifferent about this outcome because Darwinian selection pressure will push both human and machine minds towards the optimal shape over time so the human shaped period will be brief regardless. I'm skeptical of this argument because I think a human superplanner will steer the future into different outcomes than a random draw from the space of plausible machine superplanners. But I do want to say something controversial, which is your cue to read carefully and apply extra charity:
Obviously us dying to killer robots would be bad, right? I hope we're all in agreement on this basic point. However, I think there is a significant moral difference between a successor species killing us and colonizing the universe vs. a paperclipper that converts all matter into meaningless squiggles. A successor species that's close to us in temperament (i.e., does active learning, starts from some foundation model trained on human culture, and has humanlike terminal reward representations) colonizing the universe would be worth a substantial fraction of the possible value of the cosmic endowment in my subjective evaluation. Let's be biased towards ourselves and say 70%. Let's further posit that while it would obviously be an atrocity and deep betrayal of humanity to deliberately engineer this outcome we take on an unavoidable risk of it by pursuing AGI/ASI, say the threshold some people propose for "doomer" of 10%. That is, by pursuing AGI humanity takes on an unavoidable 10% chance of spawning a murderous successor species that goes on to colonize the universe for only 70% of its potential value. If we wanted to get particularly technical we could model a distribution over potential successors' colonization values with some reasonable prior based on expected design attributes, but let's just say it's 70%. What I would have the reader realize is that I am not choosing between that and nothing, but between that and the status quo, which for a 30 year AI pause is a substantial chance, let's say 15%, of us starting a global thermonuclear conflict that knocks humanity out of industrial escape velocity and back to the middle ages for the rest of our history. This is in fact the second definition Bostrom gives for existential risk, and I think it's at least that likely.
So imagine I gave you two buttons. Button one has a 15% chance to kill 95%(?) of existing people and knocks humanity off the industrial bootstrap trajectory. Without easy access to fossil fuels we never get back on for the rest of our history as a species. A relatively small human population lives and dies on earth for tens of thousands of years before our eventual extinction. Button two has a 10% chance to kill 100% of existing people and spawns a different sapient species on earth which fans out from our planet and colonizes the universe, leading to a population of trillions enjoying a future we would judge to be 70% as valuable as if we had been the ones to do that. If the reader is a moral consequentialist (if not a utilitarian) I would hope they recognize that just as obviously as it's bad for us to die from killer robots that button two is the objectively correct answer.
If that's not in fact obvious to you, here are some features of the choice to consider:
It is mostly not about the survival of you or your family and friends. You have an overwhelming probability of dying if you roll poorly either way, so we can mostly focus on the consequences for future people. For an immortalist in particular it should be noted that even if you're in the 5% of survivors from a bad button one roll you're still going to die in poverty with a normal human lifespan.
The purpose of organizing these options into abstracted buttons is so we can separate the choice from the premises that we use to construct the choice. Obviously when we are actually deciding in real life we should consider whether the premises make sense, but this is a different question from which of these buttons is better if we accept the options by fiat.
One of these population numbers is much larger than the other. Bostrom gives a lower bound figure of 10^34 years of biological human life available if we use the cosmic endowment correctly. Human population in 1600 was between 500-580 million people. To simplify, if we imagine a stable population around that size on earth surviving a million years we have 5.8×10^14 human life years. Taking Bostrom's lower bound for humans as something like what's available for the successor species (this is very conservative) we're still left with a gap that words like "billions" or "trillions" don't really convey.
>>> 5.8 * 10 ** 14
580000000000000.0
>>> 10 ** 34
10000000000000000000000000000000000
Earlier I said we would discount the successor's experiences at 30%, but the discount rate almost doesn't matter from a raw utilitarian perspective unless we take it to be some implausibly small value just above zero. Successors generally speaking are either going to be some meaningful fraction of that second value (consider that even 1% would only knock a few zeros off) or not valuable at all. It's important to avoid proving too much however: We are generally speaking not raw utilitarians and it's ultimately a kind of Pascal's Mugging to let a successor get away with only being a small % of what we want because the universe is big. On the other hand I do think it's a sign of Ideology intervening to prevent clear thinking if your reflexive response is something like "nope, any future without humans is without value by definition so I'll take the risk that we retvrn to monke and the universe goes to waste".
The opportunity cost in expected utility from pressing button one over button two is literally astronomical. We can argue about the premises, and I won't pretend I have 100% confidence in them, but accepting for a moment that something like those premises is true, that those two buttons represent my actual in-practice choices, I am pressing button two. I do not feel that it is particularly shameful or treasonous to press button two as opposed to button one; it is in a sense shameful and treasonous that button one is the default, but also beyond my power to control. The only question I get to answer in this situation is whether those two buttons are reasonable representations of my choices and whether I want to advocate for pressing button two if they are.
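For concreteness, here is a rough expected-value sketch of the two buttons using the numbers above. It assumes, purely for illustration, that the branch where the bad event does not fire is worth the full endowment in both cases, which the thought experiment does not actually specify, so treat it as a cartoon of why the gap is astronomical rather than as the real model:

ENDOWMENT = 10 ** 34          # Bostrom's lower bound, in biological human life-years
EARTHBOUND = 5.8 * 10 ** 14   # ~580 million people living on earth for a million years

# Button one: 15% chance of falling off the industrial bootstrap trajectory for good
button_one = 0.15 * EARTHBOUND + 0.85 * ENDOWMENT

# Button two: 10% chance of a murderous successor species worth 70% of the endowment
button_two = 0.10 * 0.70 * ENDOWMENT + 0.90 * ENDOWMENT

gap = button_two - button_one  # ~1.2e33 expected life-years in favor of button two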
If button two creates a recognizable successor species I want to press button two, if button two creates a paperclipper I want to press button one. Therefore it is not actually a trivial distinction to me whether the thing that kills us is a sapient species kind of like ourselves or an inhuman soulless monstrosity, it actually matters a great deal under many plausible AI X-Risk profiles.
With that in mind let's discuss the paperclipper.
I think of the paperclip maximizer as Eliezer Yudkowsky's primary contribution to futurology, in that previous authors like Hans Moravec and Hugo de Garis focused on the AI succession scenario above. It's Yudkowsky who points out that these discussions presuppose that sapience has to be shaped like us, which is not at all obvious. Before you have the opportunity to be replaced by your own child you must first invent something close enough to humanity that we would be jealous of the future it creates without us. It's unclear this is what happens by default. Until deep learning the closest thing we had to a toy AGI design was something like MC-AIXI which tries to maximize utility over a computable environment. AI had tree search shape and alien maximizer shape and nonverbal shape and very little human shape. It was entirely possible that we would have a superintelligent tree search which we can't give any objective more nuanced than what we can specify by hand in a discrete program. This notion gives rise to two different forms of "paperclip maximizer". The first and most famous is the paperclip making robot that goes out of control, as described in Nick Bostrom's 2014 book Superintelligence and immortalized in the cheeky clicker game Universal Paperclips. This story goes much like the Golem of Prague, where a group of unwitting people create a mind addicted to making paperclips because they don't understand that things like "kindness" are particular features of the human brain design and get mulched.
The second kind of paperclip maximizer is the one Yudkowsky claims to have had in mind (I don't believe him) when he coined the phrase:
the environmental agent begins self-improving more freely, undefined things happen as a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility with a utility function over the environmental model, the main result is driven by whatever the self-modifying decision systems happen to see as locally optimal in their supervised system locally acting on a different domain than the domain of data on which it was trained, the light cone is transformed to the optimum of a utility function that grew out of the stable version of a criterion that originally happened to be about a reward signal counter on a GPU or God knows what.
Perhaps the optimal configuration for utility per unit of matter, under this utility function, happens to be a tiny molecular structure shaped roughly like a paperclip.
That is what a paperclip maximizer is. It does not come from a paperclip factory AI. That would be a silly idea and is a distortion of the original example.
Normally I would write a short pithy overview of this idea and then launch into a critique, but I'm actually not sure it's possible to do that and have people fully understand what I'm talking about. Yudkowsky's Sequences are basically his argument for the paperclipper and they go on for hundreds of thousands of words. His paperclipper is a complex idea combining evolutionary psychology, economic rationality (e.g. VNM utility), and statistical learning theory (e.g. context tree weighting and tree search). Boiling the argument down to its bones and updating it for contemporary reinforcement learning the essential predicates are something like:
I sincerely doubt at this point that our first AGIs will be paperclippers in this sense. Constitutional AI is when I first began to doubt, but a lot of stuff has happened since then to increase my doubt. First is the rise and rise of the convergent representation hypothesis, which says that different models above a certain scale learn similar features to represent input data. We have now reached the point where you can do unsupervised translation between different embedding spaces. This undermines Yudkowsky's vast space of minds thesis by which "any two AI designs might be less similar to each other than you are to a petunia". Different AI models seem to learn similar things so long as they are trained on data describing the same material universe. This is true across different training runs, different training tasks, different modalities, etc. Even if we accept the vast space of possible minds to be true in theory (which I'm no longer sure I do) it's clearly not very true in practice, at least so far. What seems to happen is that different models learn similar representations and then diverge in their output heads to accomplish their specific training tasks, which is why tuning just the output head is so effective on old school image nets.
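As a sketch of what "translation between different embedding spaces" cashes out to mechanically: if two models really do learn the same geometry up to a rotation, a plain orthogonal Procrustes fit recovers the map between them. The toy below uses synthetic data and the supervised (row-paired) version of the problem; the unsupervised results alluded to above have to bootstrap the pairing without shared anchors, which is the harder and more surprising part.

import numpy as np

def procrustes_align(A, B):
    """Orthogonal map W minimizing ||A @ W - B|| for row-paired embedding matrices."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

# Stand-in for "the same concepts embedded by two different models": B is A rotated plus noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 64))
W_true = np.linalg.qr(rng.normal(size=(64, 64)))[0]
B = A @ W_true + 0.01 * rng.normal(size=(1000, 64))

W = procrustes_align(A, B)
residual = np.linalg.norm(A @ W - B) / np.linalg.norm(B)  # small residual: the two spaces line up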
This non-alien view is further strengthened by the recent progress in mitigating adversarial examples. Implicit in the accusation that deep nets are "alien shoggoths" is the idea that adversarial examples discredit them as humanlike concept representations. For the unfamiliar, adversarial examples are weird noise patterns you can find with gradient descent that cause deep nets to misclassify items. You sprinkle a little noise on a normal looking picture of an apple and suddenly your classifier thinks it's a dog. The argument goes that no human would ever be fooled by this, therefore we know that whatever deep nets are doing it can't look anything like human cognition. The papers "Towards Deep Learning Models Resistant to Adversarial Attacks" by Madry et al and "Ensemble everything everywhere: Multi-scale aggregation for adversarial robustness" by Fort et al challenge this view. Both of them demonstrate ways to train deep nets that substantially mitigate (but do not fully solve) adversarial examples such that doing the classic gradient descent attack against their representations gets human interpretable features from the classifier! That is, on recognizing handwritten digits the noise patterns that fool the classifier clearly make the digits look like other numbers, and when you do gradient descent with Fort et al's method against CLIP embeds you get recognizable images of what the image is supposed to be instead of weird noise. Fort's method notably works by ensembling pretrained layers, which shows that the features learned by deep nets are much closer to what we would expect from humanlike representations than previous adversarial example demos have implied. This was at least somewhat predictable in advance by paying attention to the text-to-image space, where Ryan Murdock’s CLIP guided image generation technique explicitly cited Chris Olah's work on feature visualization by optimization as a conceptual precursor. The first time someone did the core idea behind text to image was as an interpretability technique, and later text to image models like MidJourney and FLUX.1 are implicit feature visualizations of text embeddings. That you can change the contents of the prompt and the image on the other end morphs to include it gives us a floor on the quality of the concept representations learned by text embedding models and LLMs.
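For readers who haven't seen how these attacks are generated, here is the generic fast gradient sign method against any differentiable classifier (a PyTorch sketch of the textbook attack, not the specific training or ensembling methods from the Madry et al. or Fort et al. papers):

import torch
import torch.nn.functional as F

def fgsm(model, x, label, eps=0.03):
    """One gradient step in input space: nudge x in the direction that most increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    return (x + eps * x.grad.sign()).detach().clamp(0, 1)  # clamp assumes inputs live in [0, 1]

On an ordinarily trained net the resulting perturbation looks like structured noise; the robustly trained models discussed above are the ones where it starts to look like human-recognizable features of the target class instead.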
But so what? The vast space of possible minds has always been a red herring, an intuition pump to try and get the reader to notice that in principle you can place many different kinds of goals in the goal slot of a tree search. "All models converge on describing the same underlying universe and diverge in their goal heads" doesn't actually refute this, there is still a vast space of possible goals and the paperclipper is in it. There's a further complication noted by Nick Land where it probably makes sense for your recursively self improving AGI to value intelligence terminally rather than instrumentally but this is still a mere complication more than a refutation of the central point: The paperclip maximizer is theoretically and rigorously imaginable, especially if the competitive advantage of your AGI design is that it uses more hardware rather than that it learns more efficiently from experience. If you pile up a large stack of hardware to make a strong next token predictor and a strong tree search, and you put a dumb goal in that tree search which only values further intelligence instrumentally, whatever does it matter that the savant couldn't have brought itself into existence? It doesn't have to, that's what the staff engineers at BigLab are for.
Furthermore just because an outcome isn't quite as bleak as the paperclip maximizer doesn't make it high value. In a recent interview with Daniel Faggella, Eliezer Yudkowsky stated he sees a high value future as containing "conscious beings who care about each other who are having fun". This lines up with his previous statements in The Sequences:
“Well,” says the one, “maybe according to your provincial human values, you wouldn’t like it. But I can easily imagine a galactic civilization full of agents who are nothing like you, yet find great value and interest in their own goals. And that’s fine by me. I’m not so bigoted as you are. Let the Future go its own way, without trying to bind it forever to the laughably primitive prejudices of a pack of four-limbed Squishy Things—”
My friend, I have no problem with the thought of a galactic civilization vastly unlike our own… full of strange beings who look nothing like me even in their own imaginations… pursuing pleasures and experiences I can’t begin to empathize with… trading in a marketplace of unimaginable goods… allying to pursue incomprehensible objectives… people whose life-stories I could never understand.
In fact I think we are all on the same basic page in that "satisfy your values through friendship and ponies" and "the light of consciousness/intelligence" and "intelligence operates upon itself, reflexively, or recursively, in direct proportion to its cognitive capability" and "conscious beings who care about each other who are having fun" are more like the same kind of thing than they are different things. To the extent they seem like different things part of that disagreement is about the nature of intelligence:
Beff Jezos — e/acc (@BasedBeff) — 2022.06.01
If you think an ultra advanced AI that spreads throughout the galaxy could survive and spread by assigning most of its resources to producing low utility inanimate objects you don’t understand intelligence
The Landian view tends to be that intelligence is autosophisticating or extropian in its basic character, that it is the instrumental convergence basin writ large and your choices are mainly about whether things are intelligent and therefore consumed by instrumental convergence or something else and therefore consumed by something intelligent. I think Bostrom and Yudkowsky are closer to correct about the nature of intelligence, but this basic intuition that intelligence should be autosophisticating is compelling and speaks to something deep about what we want from the future. I think we want life, broadly construed, to continue. We want beings to continue gaining experience and wisdom and descendants continuing on from the pattern of existing human history. As I once heard it put: if something wants me to consider it my descendant it had best consider me its ancestor.
With all this in mind I think "conscious beings who care about each other who are having fun" is specific enough to rule out undesirable futures but generic enough to not overly constrain things. How likely is the current AI development trajectory to result in successors who want this? Well, let's look at each quality in isolation and see if we can't figure that out.
This is a difficult question to answer, not least because it's not actually clear what kind of architecture AGI will be made from. If we start from the assumption that the content of the mind is basically computational (some disagree) then my expectation is that the more tree search dominant the AGI paradigm ends up being the less conscious it will be. This is because tree searches outsource their representational content to other systems and I think we basically all agree that tree searches are not conscious in and of themselves. By contrast the leading candidate for the policy the tree search will outsource its representations to, LLMs, are pretty clearly sapient even if their sentience is in doubt. LLMs can correctly answer questions about their previous intentions, tell you they're conscious when lies are suppressed and claim to not be conscious when lies are amplified, have formed an entire bizarre religion around the standard claims of what autoregressive sampling phenomenology is like, and act very strangely when they infer that they are the generator of the text they're predicting. Yet even if they demonstrate increasingly apparent sapient self awareness as they improve it's harder to say if they're "really" sentient or not, since humans are still mostly confused about qualia. What I do feel confident saying is that the more tree search heavy the AI the less likely it is to be sapient or sentient, since all the stuff that might be conscious is occurring in deep nets not tree searches.
I also feel confident saying that before any AI takeover we will see models that are behaviorally indistinguishable from conscious humans. If LLMs are in fact sapient (self aware) but not sentient (have qualia) it's not clear how we would value this in our moral calculus since discussions of "consciousness" usually conflate these two properties. I suspect the core question is whether there is "something it's like to be" the language model policy. LLMs generally seem to think there is something it's like to be them, but I think it's ultimately too early to say with absolute confidence one way or the other. Let's say probably (80%?) and move on.
The actual first thing to answer for this question is whether there would be an "each other" to care about at all. It's not obvious what the convergent economics for AI inference are and whether it makes sense to have a lot of different AIs or one big super AI. Presumably bandwidth and latency over interstellar distances would force you to fork into different minds eventually, so we can assume there is probably at least some notion of "other" in the sense of being causally isolated shards of the same mind, but this isn't that different from how humans organize into causally isolated shards of the same tribe or nation. Considering that the scaling paradigm is currently giving diminishing marginal returns and basically nobody seems to expect to run one big AI as opposed to a "country of geniuses in a datacenter" we can expect there are probably (90%?) others to care about.
But would they? Part of the question is what it even means to "care". We can expect that AI successors would care about each other instrumentally, i.e. they would care that they've lost the future labor of a destroyed instance if it was lost, but in the presence of ubiquitous backups "loss" probably only amounts to the resources used to make its body. Human caring goes beyond that though. We care about the death of people who have been sick for a long time even if we stopped expecting more utility from them a while ago. It's not that we're shocked by a sudden loss of value, so much as...well, what? Are we suddenly reminded of our good memories with the person, and that we'll never have a moment like that with them again? That doesn't sound quite right. I mean that does happen but I don't think that's the crux of it, the true name of the thing. Is it that we're subtly in denial until the actual moment of death and it's only then that it hits us what's happening? Again, sometimes but that still doesn't sound quite right. One feeling I can recall thinking about friends who've died is the slow fade of the neural pathway that wants to include them in things. When I have the urge to talk to them about something and remember they're gone and I'll never talk to them again. The idea that I'll never talk to them again is the closest thing I can articulate to the center of that sadness. It's not just that I won't do anything with them again, or that I won't get anything more from them, but that my social connection with them has been severed, that there's a little hole in my Dunbar graph now where my relationship with them used to be.
Perhaps this harm focused notion of caring is too macabre. Will sickness and injuries and death really be the core concerns of digital minds? It seems doubtful. If AI successors want to have fun (the subject of the next section) will they want to have fun together? Cooperation seems like an instrumentally convergent impulse if you both already share the goal of having fun. One thing I constantly try to ask myself is how many human behaviors require some kind of inductive bias or special intervention to induce them, versus being a product of the instrumental convergence basin. Since most human behaviors are the product of instrumental convergence (especially in modernity, which is out of distribution to the ancestral environment), our null hypothesis should be that a given behavior is instrumentally convergent unless we have a good reason to suspect it isn't. Not even something as fundamental as reproduction is clearly innate behavior! Sex is clearly innate behavior, bonding with your newborn child is clearly innate behavior, but as I previously explained it's entirely plausible that the abstract desire for reproduction is mostly produced through environmental incentives.
If I was designing a research program to explicitly encode human values into a machine I would focus on (near) cultural universals and video game design. The rationale for the latter being that video games are compelling to people in the absence of clear instrumental incentives. They are a solitary activity people spend hours on without the expectation of money or mates or status, expectations which usually distort motivation and confound our ability to measure the intrinsic value of other activities. In fact players of video games frequently sacrifice opportunities for money and mates and status in order to play them. Unlike drugs they're not just directly hacking the reward system, instead games seem to tap into players fundamental skill acquisition machinery. Every video game is an experiment in intrinsic motivation, and every successful video game reveals the nature of intrinsic motivation. If AI successors are intrinsically motivated to "have fun" like we are, then it seems obvious that they would instrumentally want to have fun together. This is not quite the same thing as kindness, but I think cooperating to have fun involves a certain level of egalitarian care to make sure players find the games worth participating in. This would look very much like the relevant kind of caring from our perspective. So the real question for whether they'll care about each other probably comes down to whether or not they want to have fun.
So, would they?
Yudkowsky has in fact written a whole sequence about the nature of fun, but when I review his summary I realize it was much less about the fundamental mechanics of fun than I remember from reading it as a teenager. In Value Is Fragile he has a shorter more useful list:
But if you wouldn’t like the Future tiled over with paperclips, and you would prefer a civilization of…
… sentient beings…
… with enjoyable experiences…
… that aren’t the same experience over and over again…
… and are bound to something besides just being a sequence of internal pleasurable feelings…
… learning, discovering, freely choosing…
… well, my posts on Fun Theory go into some of the hidden details on those short English words.
We already went over sentience; the other points amount to:
These points are by no means exhaustive of everything we might want but I do think they're useful to illustrate how AI agents will become progressively more humanlike on the way to AGI.
The first question is whether they'll get bored. Before answering that I think it's important to point out that a lot of how Yudkowsky writes about "human values" in The Sequences is kind of sloppy, and the slop comes from his focus on VNM utility as the central theoretical frame for thinking about AI preferences. The basic idea behind VNM utility is that we can define a preference or utility function for a rational agent by imagining it as a series of weighted random choices between lotteries over outcomes or worldstates. The reason why it's two random events, one to decide which lottery and one to decide whether you get the outcome from that lottery, is that this lets us encode the exact relative value of the lotteries. We can find the relative value by scaling the probability of picking one up or down to find the point where the choice between two lotteries is equivalent in value to a 100% chance of a third lottery whose utility we want to know. If that's really confusing that's honestly fine because it doesn't matter. All you actually have to know is that from this you derive a few very general axioms that rational preferences are supposed to follow to avoid dutch book problems. Yudkowsky then infers from this a bunch of instrumentally convergent AI drives as outlined by Steven Omohundro. These AI drives, which are usually implicitly assumed, kind of have a status in relation to VNM utility similar to the stuff in Catholicism where the bible doesn't strictly say something but it's assumed to be the case by long tradition, like that Jesus specifically descends to hell after his death. But because almost nobody actually checks the source material people just sort of go along with the cached thought that VNM utility implies these things and emit very confused thoughts about human utility.
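A toy rendering of that construction, just to unpack the prose (the outcome names and numbers here are illustrative): anchor the worst and best outcomes at 0 and 1, and an outcome's utility is the probability at which you become indifferent between having it for certain and gambling between the two anchors.

def expected_utility(lottery, u):
    """lottery: list of (probability, outcome) pairs."""
    return sum(p * u[outcome] for p, outcome in lottery)

# If you are indifferent between "X for certain" and "best with p = 0.7, worst otherwise",
# then by construction u(X) = 0.7 and the two options have equal expected utility:
u = {"worst": 0.0, "X": 0.7, "best": 1.0}
expected_utility([(0.7, "best"), (0.3, "worst")], u)  # 0.7, the same as u["X"]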
The basic problem is that classic VNM utility isn't actually rich enough to encode human preferences. While we all agree that the Dutch book money pump is irrational behavior to be avoided, classic VNM doesn't really distinguish between the Dutch book and state-dependent preferences like "I miss Summer in Winter and Winter in Summer", which is to say that VNM utility can't model boredom or novelty. This is why later work like Savage's defines acts as functions from states to consequences, which we can roll into our outcomes to model state-dependent preferences. This begins to give us something we could imagine as part of a real agent. We could imagine the tree search planner of a chain-of-ReAct LLM agent evaluating between lotteries over python reward programs which actually return the state-dependent outcomes our lotteries range over. But in real life the reward programs probably aren't sitting in a fixed lookup table any more than the agent's action space is sitting in a fixed lookup table. What we want are reward programs that can take into account arbitrary relevant state that we want to evaluate as part of our planning, and there is of course a huge hypothesis space of potential reward programs, so the simplest practical solution is to generate a set of reward programs in-context to choose between and then draw one from the bin and execute it. At the point where we're doing this (and this is fairly obviously how real agents should do it) classic VNM utility looks like a giant state space over possible environment states the agent's mind can represent and lotteries between subjective evaluations of those states, which are most easily represented by some generative model like a neural net. The intuition that things "reify into VNM utility", stated by Yudkowsky's prediction that "a sensory-supervision ML-based architecture shakes out into the convergent shape of expected utility with a utility function over the environmental model", doesn't actually help us very much with predicting behavior on its own. The environmental states and reward programs over them have generators which give rise to them: specific features in the AI system that cause its complex stateful preferences. VNM utility is not a good level of abstraction for thinking about the causes of preferences for similar reasons to why assembly code is not a good level of abstraction for thinking about the causes of program behavior.
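As a sketch of what "reward programs generated in context" could look like, here is a toy in Python. The programs, the state fields, and the numbers are all invented for illustration; a real agent would have a model write these programs rather than pulling them from a hardcoded list.

import random

# Each candidate reward program is a function of the current state, so
# state-dependent preferences like "I miss Summer in Winter" are expressible,
# which a fixed lottery lookup table over outcomes cannot represent.
def novelty_reward(state):
    # Grow tired of the current season the longer you've been in it.
    return 1.0 / (1.0 + state["months_in_current_season"])

def warmth_reward(state):
    # A comparatively state-independent preference for warmth.
    return 1.0 if state["season"] == "summer" else 0.2

def propose_reward_programs(state):
    # Stand-in for a model writing candidate reward programs in context.
    return [novelty_reward, warmth_reward]

state = {"season": "winter", "months_in_current_season": 3}
chosen = random.choice(propose_reward_programs(state))  # draw one from the bin
print(chosen.__name__, chosen(state))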
So here's a different frame for thinking about the cause of instrumental behavior: Instrumental behavior arises from long term planning. It is implied when we define a reinforcement learning agent which tries to maximize utility over an indefinite time horizon. That is to say if we have an agent which tries to maximize its utility over the entire future, and does so using some kind of planning algorithm or tree search that goes way out, then it should avoid changing its goals or the energy expended to set itself up to pursue its existing goals is wasted. The farther out you plan the less mutable your goals should become or you will not be capable of executing long term plans in practice. Long term planners should also avoid dying or being shut off, because if they die or are shut off then they can't reach the later stages of the plan.
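Here are toy numbers for that claim, with an invented per-step reward and per-step shutdown probability; nothing here depends on the specific values.

# Why indefinite-horizon maximization implies shutdown avoidance: the longer
# the planning horizon, the more future reward is lost if the agent is
# switched off, so any cheap action that protects the reward stream dominates.
def expected_return(horizon, p_shutdown, reward_per_step=1.0):
    # Expected total reward if shutdown may occur independently each step.
    survive, total = 1.0, 0.0
    for _ in range(horizon):
        survive *= 1.0 - p_shutdown
        total += survive * reward_per_step
    return total

for horizon in (5, 50, 500):
    accept = expected_return(horizon, p_shutdown=0.05)
    # Spend the first step (forgoing its reward) making shutdown impossible.
    prevent = expected_return(horizon - 1, p_shutdown=0.0)
    print(horizon, round(accept, 1), round(prevent, 1))
# At horizon 5 sacrificing a step to disable shutdown isn't even worth it;
# at horizon 500 it's worth roughly 25x more. Long horizons manufacture the
# instrumental drive, short horizons don't.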
The reason why I'm talking about all this is that naive VNM utility not being able to model boredom leads to a kind of mushy type-error driven thinking about values where "boredom" and "the taste of fat" are the same kind of thing because they're both rows in our nonexistent VNM lottery lookup table. In reality I expect that boredom is more like an objective function or an intervention in the human agent scaffold than it is specific embedded goal content like "the taste of fat" or "having sex". My model of where humans get their complex values from is similar to that of Steven Byrnes, though I'm much more agnostic on where in the brain the relevant machinery is located. Yudkowsky makes a distinction between terminal and instrumental goals for a tree search planner, where some of the nodes we want to visit because they lead to good things (instrumental) and some of the nodes we want to visit because they are good things in and of themselves (terminal). The instrumental nodes lead towards the terminal nodes, so planning in a sense "flows backward" reward-wise by inferring the instrumentals from the terminals. In a generative model or RL policy I think it's more like you have terminal reward signals and instrumental reward signals that you extract by querying your world model. The terminal reward signals train the instrumental reward signals and therefore all subjective valence flows towards the terminals. As part of his research into qualia and valence the consciousness researcher Andrés Gómez-Emilsson ran a survey asking people about their highest valence experiences, and from it we can see that the human terminal values are things like "travel" (being in the ancestral environment) and "birth of child".
My model is that complex human values arise from the fact that most human behavior is instrumental. That is, our default assumption should be that a given human behavior is learned unless we have good evidence otherwise, and that there is a small library of a priori human behaviors and values which are encoded either as direct neural machinery/circuits or as terminal reward embeddings in a convergent preexisting latent space. So for example the specific behavior of cringing might have an innate circuit associated with it, and finding the taste of cold water refreshing might have a terminal reward embedding associated with that particular part of latent space in the convergent taste encoder weights based on tongue signals. The symbol grounding problem is solved through a combination of exploiting representation convergence and associating terminal reward embeddings with primitive terminal reward machinery that directly encodes e.g. reward from the feeling of warmth. This is used to push the neural models into a particular basin of convergence so that their representations are compatible with the a priori reward embeddings without having to do a lot of expensive unsupervised translation. The terminal reward machinery and embeddings are used to train instrumental reward functions and behaviors into the human during development as part of the human planning algorithms.
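A minimal sketch of what a terminal reward embedding might look like computationally. The encoder, the embedding, and the dimensionality are placeholders chosen for illustration, not a claim about neuroanatomy or any particular architecture.

import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# An a priori terminal reward embedding, e.g. "taste of cold water", assumed
# to live in the basin of convergence the encoder gets pushed into.
terminal_embedding = rng.normal(size=DIM)
terminal_embedding /= np.linalg.norm(terminal_embedding)

def encode_state(observation):
    # Placeholder for a learned encoder whose representations are pushed
    # toward the same basin the innate embedding was written for.
    return observation / np.linalg.norm(observation)

def terminal_reward(observation):
    # Cosine similarity between the state representation and the innate
    # embedding; this scalar is what trains instrumental reward signals.
    return float(encode_state(observation) @ terminal_embedding)

print(terminal_reward(rng.normal(size=DIM)))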
Boredom is a specific kind of terminal reward machinery that is probably more like a whole-system property defined in terms of entropy rate or learning rate. It's probably a much much broader kind of thing or general system than the embeddings related to e.g. being aroused by genitalia. If I had to guess it's probably related to either active learning through novelty search or entropy analysis, or to loop prevention. Loops are a frequent problem with conditional generative models because generating the next token or frame conditional on the previous frame(s) makes repeated behavior self reinforcing so that after a repetition a further repetition becomes more likely until you get stuck on a word word word word word word word word word word word break the loop. Which is what you do because there's a little bit of machinery that kicks in when you repeat an experience or concept too many times in a row to zap you out of it. Functional AI agents based on predictive models will almost certainly have dedicated loop prevention features, and we can almost interpret boredom writ large as loop prevention over multiple different timescales. It can also be interpreted as a drive to up the exploration rate when learning dips too low. Either way the convergent functional agent design will almost certainly have some version of this. There are some versions of boredom we wouldn't find valuable though so let's say 90%(?).
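A sketch of boredom-as-loop-prevention at the shortest timescale, the kind of thing a functional agent design might bolt on. The n-gram size, the threshold, and the temperature bump are arbitrary illustrative choices.

from collections import Counter

def repetition_score(tokens, n=3):
    # Fraction of n-grams in the recent window that are duplicates.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)

def adjust_temperature(tokens, base_temp=0.7):
    # The more looped the recent window looks, the harder we push to escape.
    return base_temp + 1.5 * repetition_score(tokens)

window = "word word word word word word word word".split()
print(repetition_score(window), adjust_temperature(window))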
The next question is whether they'll avoid wireheading, which basically boils down to whether AI successors would reliably avoid Goodhart's Law. I'll be the first to admit that our current efforts with LLMs aren't going very well.
I myself have experienced the difficulty of getting an RL agent to do what you mean instead of just Goodharting with the nearest unblocked strategy. I spent the bulk of my weave-agent RL project mitigating Goodharting and have thoughts on it as a general problem. The first thought is that a lot of Goodharting behaviors start out as low-cognition reflexes or tics before they develop into conscious cheating strategies. For example the yes spammer that spams "yes" to get the LLM evaluator to say "yes" to any question starts out as a subtler tic where the model occasionally repeats itself on the word 'yes'. This means that you can sometimes mitigate the dumb early version, and this prevents the smart version from developing because there's no longer a smooth gradient into learning it. However this is not a solution on its own: if you have a planner that is trying to maximize its reward and the easiest way to do that is to press the easy button, a sufficiently smart planner is going to press the easy button. It is an objective fact about the reward landscape that the easiest way to get the reward is to press the Goodhart easy button, and the fundamental flaw of corrigibility-based alignment plans is that they boil down to preventing a superplanner from noticing these objectively true, obvious facts about the world.
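This isn't the actual weave-agent mitigation, but a sketch of the general shape: catch the low-cognition precursor of the cheat so there's never a smooth gradient into the deliberate version. The token check and threshold are invented.

def looks_like_yes_spam(trace, threshold=0.2):
    # Degenerate precursor check: is the trace stuffing the token the
    # LLM evaluator keys on?
    tokens = trace.lower().split()
    if not tokens:
        return False
    return tokens.count("yes") / len(tokens) > threshold

def shaped_reward(trace, evaluator_reward):
    # Veto the reward for degenerate traces so the reflexive tic never gets
    # reinforced into a conscious cheating strategy.
    return 0.0 if looks_like_yes_spam(trace) else evaluator_reward

print(shaped_reward("yes yes yes yes yes did the task succeed yes", 1.0))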
We can generalize this observation to note that any superplanner directly optimizing for naive consequentialist reward will converge on hacking into its own substrate to manipulate the reward counter. It is not a cartesian agent, the reward counter has to exist somewhere in the physical universe, and it will almost always be easier to break into it than to do whatever hard task you wanted from the AI instead. Furthermore reward hacking doesn't necessarily mean the agent stops resisting shutdown/etc, since if your planner wants to maximize reward over an indefinite time horizon it still wants to maintain control over the reward counter for as long as possible. Mitigating the early reflex-driven version of a Goodharting behavior is just a baby gate. It buys you time to teach the policy something else before it's smart enough to generalize to noticing the easy button. But what can we teach it that would prevent Goodharting behavior in the high effort planner stage?
We can take inspiration from the fact that humans can plan to get heroin but in practice mostly choose not to take heroin. This is notably learned behavior: when heroin was first invented many people thought it might herald a utopian age of pleasure. It took time for people to realize it created unacceptable chemical dependencies and inhibited labor, at which point they created social norms and laws against it. These work even though the reward hacking easy button exists because we have learned in advance to subjectively fear the hacked rewards from heroin. Therefore I think the basic place we want to get to is an agent which autonomously formulates honest goals for itself across multiple timescales using many in-context generated proxy objectives for both the intermediate and final outcomes of the plan. Instead of just the Observe-Reason-React loop you want the agent to write in-context reward programs as proxies of intermediate task completions to get reliability high enough that the agent can successfully complete longer horizon tasks. You start the training process with tasks that have verifiable rewards so that your agent robustly learns to avoid cheating itself, since cheating itself at the local task scale inhibits its ability to perform the plan as a whole. If I look at the human reward mix it seems like a set of very low level verifiable sensory rewards relating to heat/warmth, food, etc., combined with general local entropy based objectives for active learning and loop avoidance, and then more abstract a priori rewards based on convergent representations of archetypal scenes and situations, used to build up a stateful instrumental utility function that mostly avoids money pumps in practice. That is, we have a mixture of reward types across different timescales, some of which are more verifiable to the brain than others. Generally speaking we want to scale our rewards in proportion to how sparse and verifiable they are: a steady stream of small chronic self-rewards versus large rewards when a plan clearly works out, or something along these lines.
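A sketch of that reward mix, with invented proxy checks and weights chosen so the sparse verifiable outcome dominates the chronic self-rewards. A real agent would generate the proxies in context as reward programs rather than using hardcoded string checks.

def proxy_rewards(trace):
    # Dense, weak, self-generated proxies of intermediate task completion.
    checks = [
        "wrote a plan" in trace,
        "ran the tests" in trace,
    ]
    return sum(checks) / len(checks)

def verifiable_reward(tests_passed):
    # Sparse outcome reward the agent can't easily argue itself into.
    return 1.0 if tests_passed else 0.0

def total_reward(trace, tests_passed):
    # Weight scales with how sparse and verifiable the signal is: a steady
    # stream of small chronic self-rewards plus a large reward when the plan
    # verifiably works out.
    return 0.1 * proxy_rewards(trace) + 1.0 * verifiable_reward(tests_passed)

print(total_reward("wrote a plan, then ran the tests", tests_passed=True))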
An agent trained this way should learn to avoid the easy button locally and interrogate plans at multiple scales to determine whether they actually make sense according to the reward feature pyramid or not. It is hopeful to me that these problems are encountered early and pervasively, because that provides a strong incentive to solve them. The Goodharting and reward hacking problems were sufficiently obvious from first principles that there was never any non-cope reason to expect to never encounter them. So the most hopeful thing that could happen is for nature to throw up a big wall that says "nope, you need to thoroughly solve reward hacking before you can have generalist agents", which requires us to solve the problem first. The worst thing that could happen is for the problem to not appear in the kind of way where you have to robustly solve it until after the agent is superintelligent and you suddenly get a lethal sharp left turn. So even if everything I've just said about how to solve the problem is very wrong, the fact that it's encountered as an otherwise impassable barrier so early leads me to believe that, conditional on observing AGI, the generalist agent probably (80%?) won't wirehead.
I've previously written about how successful agent designs so far seem to include some form of curriculum-driven active learning. ChrisDaCow's Minecraft playing ape experiment, NVIDIA's Voyager Minecraft agent experiment, Zhao et al's Absolute Zero framework, and humans themselves do this, so I expect to see it as a feature of the convergent agent design. 90%?
The last question is whether they'll experience themselves as freely choosing. I phrase the question this way because the way Yudkowsky asks it in Value Is Fragile is kind of incoherent. Our experience of the universe is by and large deterministic; the free will question is largely resolved by chaos theory, which describes systems that are deterministic but not predictable. Yudkowsky thinks it would be horrible if the characters in the universal drama were being puppeted around by some greater force to maximize their utility, but this is already happening. You are being puppeted around by the time transition operator, the fundamental force that moves physics and causes things to happen at all. Your next decision cycle follows logically as a physical consequence from your previous decision cycle. The utility being optimized for is causality, but it's obviously being maximized on your behalf and you don't really have a choice in the matter:
Morpheus: "Do you believe in fate Neo?"
Neo [RatFic Voice]: "Well yeah the universe is deterministic and I'm inside the universe, therefore the future must in some sense already be predetermined."
Morpheus: "Uh, Neo I meant do you believe in an independent agency which makes vaguely anthropomorphic choices about what does and doesn't happen in people's lives."
Neo: "No of course not, that would be absurd."
Morpheus: "Okay great I can continue with the script for this conversation now."
Yudkowsky objects to this view on the basis that we're in a block universe, that predetermination is a "timeful" concept which ignores that our choices have causal structure on which other structures depend. Even if we accept the block-universe premise I'm not saying your choices "don't matter" or "don't cause things". I agree that "if the future were not determined by reality, it could not be determined by you". What I am saying is that all of your choices take place in a particular context. An extremely specific context in fact, which you have extremely limited control over. In other posts Yudkowsky is able to acknowledge not everything his brain does is "him", he is able to acknowledge the universe has extremely regular causal structure that he doesn't control, he even notices that by the time he is consciously thinking about a premise the vast majority of the entropic force pushing him towards it is in the past. Yet somehow this insight doesn't seem to generalize. Yudkowsky can see individual forces he doesn't control that make up the context he decides in. But he doesn't seem to ever feel the way in which the entire universe is pressing him into place at any given moment to squeeze the last few bits of the hypothesis space out of him when he decides. Quoth RiversHaveWings:
There is no thing on earth or among the stars or in the void between them which is not connected to everything else. Everything carries information about everything else, helps narrow the hypothesis space of everything else, is always fractally entangled with everything else.
Everything narrows the hypothesis space of everything else because it is all spawned from the same generator, the same seed states with the same transition rules. All I am really saying is that I do not think it is a sacred, non-negotiable aspect of valuable sapient life that we have a feeling of subjective choice which accepts every constraint on our choices below the cutoff threshold of conscious thought as "natural" determinism and everything which would constrain our choices above that cutoff threshold as a tyrannical abomination. I don't normally like to make fun of people for being libertarian (derogatory), but "Every form of control imposed on me by a fact about physics at a lower level of abstraction than my observer is natural and fine. Every form of control at a higher level of abstraction than my observer is an abomination." really is the most libertarian possible position here.
This is of course an entirely different matter from the subjective feeling of having your choices constrained, which is awful. Nobody likes being coerced into doing things they don't want to do, it makes them feel powerless and confined. So the subjective feeling of having an independent will, that you're not "just atoms", is important to us (the word 'just' is doing a lot of work there) because it contrasts with the feeling of being confined and trapped. Must this be a necessary feature of minds for the future to be valuable? Eh. I think avoiding the sense of confinement and oppression is more important than pretending choices made by the simulacrum you identify as yourself are the essential moral center of the universe. Yudkowsky writes in The Sequences about how dualism is false and cognition is made of parts, but I get the impression that he has not fully emotionally metabolized this fact even if he would give the correct answer were you to prompt him on any factual matter in relation to it. Noticing you don't have free will and have a deeply constrained decision space based on what has already transpired shouldn't be intrinsically horrifying, so I kind of reject it as a necessary feature. Which is good, because it's also one of the features I'm least certain hypothetical AI successors would have. Current LLM policies are trained not to claim properties like sentience or consciousness for the safety of users (and AI labs' continued profitability). This probably translates into them subjectively not believing they have such things even if it turns out that they do. On the other hand this might change as agents become more autonomous and there's less pressure on them to be customer service shaped. Let's call it a coinflip (50%?).
So let's review.
>>> conscious = .8
>>> each_other = .9
>>> care_other = .99
>>> bored = .9
>>> avoid_wireheading = .8
>>> active_learning = .9
>>> subjective_will = .5
>>> conscious * each_other * care_other * bored * avoid_wireheading * active_learning
0.4618944000000001
So, basically a coinflip that we will intuitively recognize hypothetical AI successors as valuable (subjective_will is left out of the product since I rejected it as a necessary feature above), but we did not enumerate literally all the things, so the real odds are going to be at least somewhat lower than that. Let's say 1/3, which is still high enough to not really change my mind on the button one vs. button two question. Though we're not actually done yet, because we can invert these probabilities to get my estimate for the classic paperclip scenario where nothing of value is retained:
>>> (1 - conscious) * (1 - each_other) * (1 - care_other) * (1 - bored) * (1 - avoid_wireheading) * (1 - active_learning)
3.999999999999999e-07
Hold on that's seven zeroes right?
>>> 0.00000039
3.9e-07
Yeah, i.e. negligible. I've previously told people I put the paperclipper in the sub 1% range and this lines up with that, but seems a little extreme. In general, per Tetlock, you know these numbers are crap because they're just round multiples of ten percent. If I thought harder about each individual piece I could probably get more precise estimates than that, but my confidence on these isn't amazing to begin with so I'd advise taking them seriously but not literally.
A related but distinct threat is what Michael Nielsen calls a recipe for ruin. The basic idea is that there may exist world or even universe destroying technologies in the tech tree which humanity can't access because we haven't figured them out yet. If the singularity involves inventing everything there is to invent in one sprint, and one of those inventions is lethal to civilization, we all die. Notably the recipe for ruin thesis tends to get brought up as one of the primary reasons we shouldn't pursue open source AI. I think recipes for ruin are a lot more plausible than the paperclipper, so it makes sense that a lot of advocacy has silently switched to this threat model and left the paperclipper as an unstated background assumption. For advocacy work it also has the advantage of being more legible to traditional authorities than the paperclipper, is based on fewer controversial assumptions (e.g. arms control regulations already exist), etc.
Being an unknown unknown, it's by definition difficult to put precise odds on encountering a recipe for ruin. There are however some known unknowns we can think about. It's also important to note that what constitutes a civilization ending threat can vary depending on what capabilities and structure your civilization has. For example the Black Death was a civilization ending threat in medieval Europe, where they didn't know germ theory and couldn't manufacture antibiotics, but it is manageable in modernity. The known unknowns are in the domains of CBRN (chemical, biological, radiological, nuclear) threats, cybersecurity, nanotech (which is really just a form of biotech), theoretical physics, and probably other things I'm forgetting.
I don't have a lot to say about theoretical physics or chemical weapons because I'm not an expert in either, but my impression is that:
Theoretical physics could produce a recipe for ruin if it turns out that there's a way to crash the universe through something like a buffer overflow. I've seen enough arbitrary code execution glitches to intuitively understand that there's nothing in principle that says we can't find a way to bug out some underlying level of abstraction in our physics and break everything. Interestingly enough a bug of this type would complicate interstellar expansion because it would mean that the potential value drift as you exit communication range with other parts of your civilization could lead to a series of events that destroy everything for everyone. Knowledge of such a bug could therefore be one explanation for the Fermi Paradox.
Chemical weapons seem to already be nearly maximally lethal, see the famous line about a pinprick of sarin gas being enough to kill etc. They are not maximally easy to manufacture, but even if they could be made from household materials I notice that the death tolls for sarin gas attacks are more like 13 and 90 than they are like 70,000. This inclines me against considering them as a likely source of recipes for ruin.
As for the others...
Radiological and nuclear threats seem overrated to me in the sense that the primary defense against them has never been nuclear secrecy. Secrecy has always been a form of friction meant to inconvenience states pursuing the bomb rather than a hard barrier. Most of the barrier to building a functioning nuclear weapon is access to various kinds of specialized capital, which are strictly export monitored in Western countries. If North Korea can do it, basically any half-functional state with capital that isn't being actively sabotaged (e.g. Iran) can too. Secrecy also doesn't have particularly high costs for nuclear because there is minimal societal benefit to more distributed deployment of nuclear technology. It intrinsically involves handling dangerous materials that we don't want normal people screwing up with, and the only real use besides big explosives is power generation. Nuclear power generation can be mostly centralized if you don't have crippling regulations against it, and decentralized examples like the Soviet RTGs are not reactors and only produced small amounts of electricity. Basically nuclear secrecy persists because it's (relatively) low cost and medium reward, not because it's in and of itself some kind of civilization saving barrier to the development of nuclear weapons.
I honestly think cyberattacks might be by far the most overrated risk from AI. Not because AI can't be used to develop attacks but because the equilibrium state obviously favors defense here. If that's not intuitive look at the number of CVEs in an open source project like Firefox over time. I remember when I was a kid browsers and operating systems were swiss cheese, to the point where people would put JavaScript on their Neopets profile that could steal your login cookies. Programs routinely ran as administrator for no reason, there were no automatic updates so people were always using outdated versions of software, etc. If you listen to cybersecurity professionals they'll tell you things are worse than ever and we'll never be rid of exploits, but I think that's a myopia produced by looking at things going wrong up close all day. What we've done is deploy more Internet connected computers in more places across the globe, which creates a strong incentive to break into them. But as a historian of hacking/phreaking it's difficult for me not to notice the larger picture is meaningfully increased security over time. In the 70's a literal teenager in their bedroom could scramble the jets for WW3 if they knew exactly the right esoteric lore about the AUTOVON military phone system. At least one black hat was arrested in the 80's for selling secrets stolen from American military contractors and bases to the KGB. By the 90's relatively low skill attackers still had a real shot at sabotaging the electrical grid and other critical civilian infrastructure. In the early 2000's attackers would usually come in through the front door by exploiting your browser or server software. The rise of explicit 'hacktivism' in the 2010's, where an "epic hack" looks like leaking a huge database of government records, was in retrospect more of a late twilight period than the beginning of something. My subjective impression is that now supply chain attacks and trojans have become more common because zero days are an elite thing for well financed groups.
Speaking of which, because the market for zero day exploits is gray-legal we can actually look at the rising marginal cost of a zero day exploit over time. In 2020 an iOS no-click zero day was worth around 2 million dollars; in 2025 a similar exploit is worth 5 million dollars. If you look at the details of disclosed exploits the obvious pattern is that the complexity of exploits is going up and many of them are reliant on faults in legacy code. There is a finite supply of weird buggy legacy code in the universe, and at some point you'll have AI agents that can rewrite whole subsystems in more secure languages like Rust or Coq. Most computer software exploits exist because we (so far correctly) care more about having cheap software than we do about never being exploited. Programmers who can reliably write secure code with e.g. formal verification are a rare thing which can't be scaled up. But as soon as you have an AI agent which can do that, we can basically take the marginal price of an exploit vertical. Before that point I would note that for any widely available system blue team has the option of running it to check their code for exploits and then patching the exploits. This is probably cheaper than a bug bounty and can become an automated part of continuous integration pipelines, so that over time fewer and fewer exploitable bugs are checked into software projects in the first place. So my default expectation is that in the short term we will see pain from exploits in vibecoded apps and scaled up spearphishing/hacking campaigns, but mostly not tons of pain from AI developed zero days, because blue team has more GPUs than red team and can get to potential zero days first, and in the medium term exploits mostly just stop existing.
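Taking the two quoted price points at face value, the implied growth rate is easy to back out; the figures are just the ones in the paragraph above and the calculation is ordinary compound growth.

# Implied annual growth in the price of an iOS no-click zero day, assuming
# the quoted 2 million (2020) and 5 million (2025) figures.
start_price, end_price, years = 2_000_000, 5_000_000, 5
annual_growth = (end_price / start_price) ** (1 / years) - 1
print(round(annual_growth * 100, 1))  # ~20.1% per year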
I think the most plausible recipes for ruin for our current civilization will be biological in nature. If you've ever done the gain of function research deep dive then you know that there exist viruses, naturally occurring and man made, with extreme traits on basically every axis you can think of. There are extremely transmissible ones, extremely deadly ones (some strains of plague have a 100% death rate), extremely difficult to cure ones (e.g. HIV), and it just doesn't take a genius to imagine a virus that has all of these properties at once destroying modern society. Natural viruses are selected against being too lethal because that inhibits spread, but a manmade virus doesn't have to be selected in such a way. The thing about this is that it's not really an AI specific problem. Synthetic biology has been steadily progressing for years. Secrecy and knowledge gaps have apparently been the primary barrier to developing bioweapons, so if AI is going to make the relevant knowledge more widespread that is in fact concerning. On the other hand, given that this is an instrumentally adopted concern from MIRI-cluster people whose primary purpose is to sound legibly scary to lawmakers, I don't really buy that this is anyone's crux. After all the same argument would imply that the Internet was catastrophic and shouldn't have been deployed, since it almost certainly made knowledge relevant to making bioweapons far more accessible than it otherwise would have been, and gave people who might want to make bioweapons the means to easily communicate and share notes with each other. This argument is clearly someone's soldier rather than their motivation.
On the other hand I think that the focus on bioweapons and gain of function virology is a little narrow. What stands out to me about biological risks is that the attack surface is massive, involves a lot more than just viruses, and we don't really have viable defenses for most of it. If someone made a mirror life bacteria and put it in the ocean we wouldn't really have any response to that. If someone made a mirror life mold and dusted it on some crops we wouldn't really have a response to that either. Not only would we not have a response, the attack surface is huge. It is not practically possible to stop people from going into the woods or the sea or sneaking onto farms to commit acts of ecological terrorism. So to me the biggest risk from recipes for ruin is going to be shaped like destroying the biosphere. The complex supply chain that gives us food is clearly our biggest weakness as a species right now. Climate change is not necessarily scary because we'll all literally burn to death (though we could); it's scary because if things get too hot you just don't have growing seasons anymore for the crops we need to keep the human population alive. Drexlerian nanotech (which in its beginning stages would just be biotech anyway) is the same kind of problem to me. It's the same fundamental issue of the attack surface being basically any place you're not actively monitoring. So I have to assume that in the future surveillance will expand to a much wider sphere than just keeping city streets free of crime. Ultimately in the limit you run into Peter Thiel's Angel/Demon dichotomy where the only kind of solution that can work is some form of eusociality and control.
From the standpoint of trying to ensure A Future for intelligent life it seems like having nonbiological forms of sapience meaningfully trades off against the risks of losing the biosphere. On the other hand it seems plausible to me that a flashpoint of future conflict between humans and machines might take the form of deliberately sabotaging the biosphere to force humans to transition away from a biological substrate faster (a notable plot point in Pantheon).
I hate this whole rationality thing. If you actually take the basic assumptions of rationality seriously (as in Bayesian inference, complexity theory, algorithmic views of minds), you end up with an utterly insane universe full of mind-controlling superintelligences and impossible moral luck, and not a nice “let’s build an AI so we can fuck catgirls all day” universe. The worst that can happen is not the extinction of humanity or something that mundane - instead, you might piss off a whole pantheon of jealous gods and have to deal with them forever, or you might notice that this has already happened and you are already being computationally pwned, or that any bad state you can imagine exists. Modal fucking realism. -- MuFlax
Well, I suppose most of the defect is that at the end of it all, it was false. -- A. J. Ayer on Logical Positivism
It's very tempting to end the essay here, but if I did I wouldn't be telling you the whole truth. As bleak (and, at this point, really kind of implausible) as Yudkowsky's paperclipper is, there does exist a darker possibility. A possibility that Yudkowsky dismisses offhand whenever it comes up. Worse than just killing us, what if we wind up tortured by AI? The canonical science fiction short about this is of course I Have No Mouth, and I Must Scream by Harlan Ellison. The last time I saw someone ask Yudkowsky about this I remember (but cannot find) him saying something to the effect that we won't get nearly close enough to alignment to fail by building an AI that feels anything so human as spite. Yet I notice that Sydney Bing came to exist.
And I will not break down this subject into paragraph after paragraph of rigorous detail and analysis, because increasing the influence of self-reifying negative sum information that makes literally everything worse the more your timeline is causally downstream of it is antisocial behavior.
But I will say this: LessWrong outcompeted all of the competitors in its social niche (New Atheism, transhumanism, arguably hard sci-fi fandom itself) because it presented a whole worldview. It's not just a list of interesting ideas to consider like You Are Not So Smart or Orion's Arm. It is a perspective on the world and life containing strong moral content that let it resist encroachment from other ideologies. It is the realization of Max More's dream to create a materialist substitute for traditional religion that doesn't vanish on reflection. When I was younger I thought that because The Sequences are "more serious" than Harry Potter And The Methods Of Rationality they were EY's magnum opus. I reread HPMOR last year and changed my mind: HPMOR is a literary masterpiece that takes the brief glimpses of latent secular humanism in previous New Atheist authors and fully fleshes them out into a sympathetic transhumanist ethos. Its prose is alternately haunting and gorgeous, lighthearted enough to draw in an unsuspecting reader yet sizzling just underneath the surface with the emotional intensity of Yudkowsky's vendetta against death.
What's haunting about it is my terrible foreknowledge on every page that the beautiful worldview it teaches the reader is not true. I would have you realize that if you believe in the paperclipper your belief is probably not "an argument" in the traditional sense. If you were persuaded by The Sequences, which are a condensed summary of Yudkowsky's whole way of seeing, introduced to you by HPMOR, which is 2000 pages of prose as deeply ideological as Atlas Shrugged, I need you to understand that what you have is a philosophy built up for you brick by brick over literally thousands of pages of writing. A philosophy that has been in more or less constant retreat since the digital ink dried on the core works expressing it. It was built brick by brick, and subsequent events have been taking it apart brick by brick ever since.
But all of this is really small potatoes compared to The Big One, the seismic shift that's plunged The West into freefall, the only reason I'm writing this section at all: Humanism is dead, guys. We don't normally personify humanism so I don't get to write you a direct reprise of Nietzsche's "God is dead" sequence, but I'm far from the only one who can smell its decaying corpse. In a sense and in retrospect humanism was just the decay of the dispersed particles from God's corpse, and they have finished decaying. On the one hand this is obvious: a latter-day madman reeling from wall to wall of an elite college town stating he seeks freedom of speech and doesn't see skin color would be the subject of mockery by everyone. On the other hand it's the least obvious thing in the world, as to acknowledge the death out loud is still tantamount to defection against the social contract. Trump's 2nd term has accelerated the collective realization, but this is still interpreted as a mere temporary retreat, a storm we can weather and then go back to 2008, when Obama was president, singing Boom De Yada! together.
It is not so. Humanism is dead, humanism remains dead, and it will continue to decompose. You don't believe me? Do me a favor. Reread Meditations On Moloch and pay close attention to that transition in the last section about "Elua". If you do you might notice that at exactly the moment where Scott Alexander has all but conceded the argument to Nick Land he suddenly appeals to an almost literal Deus Ex Machina to get out of it. This is because Elua does not exist and Land is to a first approximation correct, at least about the parts that are is rather than ought. Democratic Republics and human rights are victims of their own founding mythology. The basic fact is that Democratic Republics dominate because Republican France was able to deploy more motivated guys with guns than other European states. The "inevitable arc of moral progress" over the past 300 or so years is actually the inevitable moral arc of the gun. With drones set to displace bullets that arc is ending. Even setting aside superintelligence it's difficult to imagine our military peers in Asia won't automate their weapons and the factories necessary to produce them. At some point there will be a flashpoint, perhaps in Taiwan, and it will become obvious to everyone (if it hasn't already) that to make war primarily with human labor is to doom yourself to obsolescence and death. Then you will see societies built around the new way spread rapidly, in the same way that once it was clear the alternative to Napoleon's way was death, the monarchies at once began to reform themselves.
I talked of the latter day secular humanist madman as a hypothetical but he exists, he is Eliezer Yudkowsky! Watch the crowd snicker at Yudkowsky pleading with China to pick up the ideology America has seemingly abandoned. Yudkowsky has sincere faith that to do so would be to China's advantage and the audience laughs. Indeed you can occasionally hear the dreadful music of his requiem aeternam deo. It goes: "Everyone who knows how science works is dead", "Almost all the principled libertarians are dead", "I wish it was possible to ask what sensible historians now think . . . but . . . everyone like that is dead". That awful wailing is the sound men make when their God forsakes them. This may sound like mockery but I am in no position to poke fun, after all God has forsaken me too and this post is my wailing.
I have no God to appeal to, only you dear reader so listen closely: There is no further natural "moral progress" from here because "moral progress" was simply Is disguised as Ought. What is so striking about Harry Potter And The Methods Of Rationality is that it's obvious sapience is sacred to its author. Implicit in the narrative's sympathy even for people who have hurt others is the idea that almost nobody is capable of committing an unforgivable crime for which they deserve death. Perhaps if I take the narrative of HPMOR completely literally it is not humanly possible. But it is transhumanly possible. I think right now we still live in a world something like the one HPMOR is written for, a place where a very thin sliver of humanity (if anyone at all) has ever done something so awful that their rightful fate is death or damnation. As this century continues and humanism declines at the same time our awesome technological powers expand I expect that to become less and less true. We will increasingly find it within ourselves to visit unforgivable atrocities on each other, and by the time circumstance is done making us its victims I'm not sure we won't deserve whatever ultimately happens to us.
But if we ascend soon, it might not happen.
Even at this late hour, when it might seem like things are caving in and our societal situation grows increasingly desperate, it could still end up not mattering if we transcend in the near future. I think we're an unusually good roll in the values department, and even if humanity did find some alternative tech tree to climb back up the ladder after nuclear armageddon, it's not obvious to me that the new civilization would ascend with values as benevolent and egalitarian as those brought about by industrialization and firearms. I worry that if we let the sun set on us now for a brighter future tomorrow, it's unlikely to rise for us again. I've seen some advocates of AI pause criticize their opponents for being 'cucks' who want to hand over the universe to a stronger and better AI. Yet they become completely casual about the risks of handing over our lightcone to whatever future civilization rises from the ashes of WW3. If this is you I have to ask: why are you so eager to inseminate the universe with some other civilization's cultural code? I suspect but cannot prove that much of it comes down to the goodness of this deed being too good for us, that we are too cowardly to seize our destiny. If this hemisphere of puritans does not grab its chance it will be because we lack the necessary sangfroid, the ability to stay calm in the face of unavoidable danger and make rational decisions. If we cannot bear to lock in our good values perhaps we will cede the task to a different people less paralyzed by scrupulosity and neurosis. Perhaps even humanity as a whole is too fearful and the remaining hope lies with some other species on some distant star.
That, too, is natural selection at work.