LLMs in practice these days do include an increasingly large percentage of RL, which seems like it should at least make you less certain that capabilities mostly come from pretraining, and that papers from before that shift will continue to be relevant for very long. You do mention this in the other post, but you still wrote in the footnote that capabilities come mostly from pretraining?
I expect an optimist, or someone from the comparatively-less-pessimistic group, would argue that LLMs or LLMs+RL might lead to consequentialists that have human-like goals, due to being built from a base of human imitation, even as they move towards ASI.
And an important disagreement with those people is that you don't in fact expect LLMs+RL to work. You also don't think LLMs+RL would be safe if they did work, but it still feels relevant, because it comes from pretty different models of how LLMs will improve.
Relatedly, a question you are not answering here is why you think imitation learning won't lead to discovering things that are far from the distribution of humans, since it's unclear what the limits are out of distribution. You also seem to have a strong view that human imitation won't generalize much, to the point that it won't even be able to do human-level inventing of new things, which seems like it would imply bad imitation of humans, since humans can do things like invent writing.
You seem to be describing a shallow kind of imitation, which seems like a nontrivial assumption that people in the LLM camp would likely disagree with?
"Rocket arbitrarily far past human training data" is mixing up the question of "will LLMs think in ways that are very different from humans" with something like "can LLMs only shallowly reuse reasoning from their training data and not invent new things".
Perfect human imitators would rocket arbitrarily far past human training data in the first sense, and invent writing etc.
If you compare a human in 30000 BC to a human today, our brains are full of new information that wasn’t in the training data of 30000 BC. I want to talk about: what would it look like to be in a world where you can put millions of LLMs in a sealed box containing a VR environment, for (the equivalent of) thousands of years, and then we open up the box and find that the LLMs have made an analogous kind of scientific and technological progress? (See §1 of “Sharp Left Turn” discourse: An opinionated review.)
Spoiler: I think this is fundamentally impossible with LLMs as we know them today. Anyway, let’s explore the options.
One option is: the LLMs have super-long context windows that store textbooks for all the new fields of science and technology that were invented after we closed the box. I don’t think this would work, because LLMs (at least as we know them today) struggle with large amounts of interrelated complexity in the context window far outside the distribution of anything in the weights. A more likely option is: there’s a mechanism to update the LLM’s weights. Anyway, there has to be some selection mechanism that decides what new content is worth keeping or not. If you come across a proto-idea, do you update the weights with it or not? If you just update with “whatever seems right” or something, then I claim that errors will compound over time and the whole system goes off the rails.
Basically, my claim is that this selection mechanism (for new knowledge, plans, strategies, etc.), whatever it is, has to be grounded in consequentialism, to work perpetually inside this closed box.
And as this process proceeds, (subjective) century after (subjective) century, the influence of pretraining would get diluted away, until everything is ultimately coming from the consequentialism-grounded selection mechanism.
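To make that claim concrete, here's a toy sketch (my own hypothetical framing, with made-up numbers, not anything from the post): a continual-learning loop where candidate ideas get locked into the "weights" either because they pass a consequentialist ground-truth check, or merely because they "seem right" to a noisy internal evaluator with no external grounding.

```python
import random

random.seed(0)

def ground_truth_score(idea):
    # Stand-in for the consequentialist selection signal
    # (e.g. "did the experiment work?", "did the plan succeed?").
    # Here an "idea" is just a number representing its true usefulness.
    return idea

def plausibility_filter(idea, noise=0.5):
    # Stand-in for "keep whatever seems right": a noisy self-evaluation
    # with no external grounding.
    return idea + random.gauss(0, noise) > 0.5

def run(steps, grounded):
    knowledge = []  # accumulated "weight updates"
    for _ in range(steps):
        idea = random.gauss(0, 1)  # proto-idea of unknown quality
        keep = (ground_truth_score(idea) > 0.5) if grounded else plausibility_filter(idea)
        if keep:
            knowledge.append(idea)
    # Average true quality of what got locked in.
    return sum(knowledge) / len(knowledge)

print("grounded selection:  ", round(run(10_000, grounded=True), 2))
print("ungrounded selection:", round(run(10_000, grounded=False), 2))
```

In this toy, the grounded filter locks in only genuinely good ideas, while the ungrounded filter admits junk. It doesn't model the compounding of errors over centuries, but it illustrates why the filter seems to need some external grounding to keep working indefinitely.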
Anyway, we can argue about the details, but I don’t think LLM people are thinking about how to get to an end-state of this box, in which (again) you close the box, then open it much later, and find that huge amounts of open-ended intellectual progress has occurred while it was closed, analogous to what global human civilization has created over the centuries. I think that if people tried to work out what such a box might look like in detail, they would find that it either needs ruthless consequentialist agency going into it, or else creates ruthless consequentialist agency while it runs. Or perhaps they’d just agree with me that LLMs are not cut out for populating this box and never will be.
Sorry if I’m missing your point.
You seem to be describing a shallow kind of imitation
I don’t think so… I think I’m making a narrower claim: the manner in which humans (alone and collectively) do open-ended continual learning, especially over extended periods of time, does not have an analogue in LLMs. This is different from the question of whether LLMs are imitating humans “deeply” vs “shallowly” at inference time. I’m certainly not one of the people who call LLMs “stochastic parrots” etc. The thing they’re doing at inference time is IMO clearly capturing a deep (+ also wide) level of knowledge / understanding.
Oh okay, then I think some of my objections are wrong, but your post seems like it fails to explain the narrower claim well? You are describing a failure of LLMs to imitate humans as if it were a problem with imitation learning. If you put LLMs in a box and get different results than if you put humans in a box, you are describing LLMs that are bad at human imitation; namely, they lack open-ended continual learning. As opposed to saying that the problem is that you think you cannot do continual learning on LLMs without some form of consequentialism.
In the case of very-long-context LLMs, you are even claiming that LLMs wouldn't be able to imitate human behaviour in their context.
I like your box example better (we could also call it a country of geniuses in a closed datacenter). I feel like there's a lot of interesting debate to be had about what kind of improvements to LLMs get us to them making lots of inventions in the box.
And this seems important to me, because the obvious-to-me question here is: "can you imitation-learn whatever process humans use to invent things, without being ruthless consequentialists?"
Or, in other words, can your whole research program of figuring out how to imitate the things that make social instincts in the brain be bitter-lessoned via imitation learning on long-horizon tasks/data?
Or not even long-horizon: maybe it just generalizes from short horizons + external memory. It's unclear to me whether, if you put smart and competent adult humans who cannot remember more than 1 hour in a box (and they already know how to write), they wouldn't manage to invent arbitrary things with a lot of extra effort, obsessively taking notes and inventing better ways of using notes.
Humans doing this, if it worked, would work because it is grounded in the consequentialist behaviour of humans. But it wouldn't be ruthless consequentialism, because humans have social instincts.
It seems like you are implying that LLMs already have something like human social instincts via imitation at inference time, but that you can't use them in any way to bootstrap to some continual-learning thing that's grounded in human-like consequentialism. That seems like the direction where the interesting discussion lies?
Also, to be clear, my own position is more on the side of thinking you can probably get something that could populate the box from LLMs + RL + maybe some memory-related change, but that in practice you'd likely do it by accidentally making them ruthless consequentialists, unless you really knew what you were doing or got extremely lucky.
But I want to take the side of the AI optimists here, because I feel like you haven't addressed smarter versions of their position very well?
Even if the typical AI optimist hasn't thought that far. Though, I dunno, I don't know what Anthropic's comparatively-less-pessimistic people think (and I expect there's actually a wide range of views there), but they have to be thinking about continual learning, or about how LLMs will do long-horizon tasks, and if they're still skeptical of ruthless consequentialists being a thing, they'll have some reason why they expect whatever solution not to lead to that.
Right, I think LLMs are good at imitating “Joe today”, and good at imitating “Joe + 1 month of learning Introductory Category Theory”, but not good at imitating the delta between those, i.e. the way that Joe grows and changes over that 1 month of learning—or at least, not “good at imitating” in a way that would generalize to imitating a person learning a completely different topic that’s not in the training data.
In other words, after watching 1000 people try to learn category theory over the course of a month (while keeping diaries), I claim that an LLM would learn category theory itself, and it would learn all the common misconceptions about category theory that people make as they start learning, but it wouldn’t learn “the general process of learning and sense-making itself” in a way that allows it to then autonomously invent some field that has not been invented yet.
I had a long comment-thread argument with Cole Wyeth on this general topic last year: link. We didn’t resolve our disagreement and I eventually bowed out of the conversation, but you might find it helpful anyway. See especially my analogy to trying to imitation-learn AlphaZero improving itself through self-play.
Or not even long-horizon: maybe it just generalizes from short horizons + external memory. It's unclear to me whether, if you put smart and competent adult humans who cannot remember more than 1 hour in a box (and they already know how to write), they wouldn't manage to invent arbitrary things with a lot of extra effort, obsessively taking notes and inventing better ways of using notes.
My answer is “obviously not”. Here’s an example:
Imagine that the “competent adult humans” were all from 100 years before linear algebra existed, and we are hoping that they will invent linear algebra. Now, linear algebra involves a giant pile of interlinked concepts: matrices, bases, rank, nullity, spans, determinants, trace, eigenvectors, dual space, unitarity, and on and on. Now take a parade of these “competent human adults” with no prior exposure to any of this, and give them an hour each before they get fired, but they are allowed to send notes to each other. The goal is for them to collectively invent the entire edifice of linear algebra from scratch. I think it’s doomed. If you take a person who has never seen linear algebra before, then it will just take them a lot of time (much more than an hour) to internalize all these concepts and get sufficiently familiar with them to start building on them. It doesn’t matter how good the notes are, it just takes time to develop strong and deep intuitions about a new concept. It doesn’t matter how many people there are, because zero of those people will be able to push forward the frontier in the one hour before they get fired, because it takes longer than that to internalize a new conceptual space.
(I don’t think it’s too relevant for this thread, but fun fact: there were some experiments by Ought in like 2018-2020 vaguely related to this, see e.g. this post on “relay programming”.)
I'd be curious for you to say a bit more in response to this point from above:
It seems like you are implying that LLMs already have something like human social instincts via imitation at inference time, but that you can't use them in any way to bootstrap to some continual-learning thing that's grounded in human-like consequentialism. That seems like the direction where the interesting discussion lies?
I'm moderately optimistic about our ability to get roughly human-like consequentialism from LLM-based AGI, with character training instilling a non-ruthless non-sociopath character that is still compatible with lots of consequentialist agency, like very good scientists/inventors/entrepreneurs/etc. who never do anything that could be very dangerous (because a bunch of useful things aren't that risky, and because they have good enough moral principles to deliberately avoid or be careful around things that are risky).
I think long-horizon RL or reflection or other components of the continual learning process could break the instilled character, but it seems >50% likely to me that between the preliminary character training and ongoing training and prompting to maintain good character, those things won't dominate, and we'll just have nice AIs. (I get a bit more nervous about this argument for ASI, but I think it may well hold up even there.)
The thing I’m skeptical of is maintaining non-ruthless behavior in the presence of arbitrary amounts of open-ended continual learning. By “open-ended continual learning”, I mean something analogous to what humans did between 30000 BC and today, e.g. inventing new fields, and then still more new fields that build on those new fields, etc. And the AI has to do that without any human input, given enough time.
My actual belief is that this kind of open-ended continual learning is simply impossible in LLMs. If I’m wrong about that, then I would next claim that it requires continually updating the LLM weights (not just context window). I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
OK, so far I’ve argued that if this kind of continual learning is possible at all, it would require continual weight updates to lock in the new knowledge and ideas that the LLM generates—and not just one-time small updates, but more and more updates as the process continues, asymptoting to 100% of the training data.
If you buy all that, how do you think these weight updates will work? Where do you think the “training data” for those updates will come from?
Or if you don’t buy that, how do you think the continual learning will work?
My experience is that lots of LLM-focused people say “open-ended continual learning will be solved somehow, I guess”, and don’t think too hard about exactly how it gets solved. And then that’s how the pea gets hidden under the thimble. Because actually, I claim, continual learning needs some kind of ground truth or else it will go off the rails, and that ground truth basically amounts to an objective function, and when the LLM continual-learns enough from that ground truth, all the niceness of pretraining gets diluted away in favor of the ruthless maximization of that objective function.
Again, maybe you have some specific idea about how LLM open-ended continual learning would work that you think won’t have this problem? If so, what is it?
(Slightly rambly comment, sorry)
I think it’s well known that LLMs struggle with large amounts of interrelated complexity in the training data, when none of that complexity, or anything remotely related to it, was in the pretraining data. “Open-ended continual learning purely through the context window” would be a very extreme version of that—centuries of knowledge, entire new fields and ways of thinking, entirely absent from pretraining and exclusively in the context window. No way is that going to work.
[Edit: I don't think this is saying anything that different than my comment above, but it is a slightly different framing.]
Another point that I think might be quite important: we often set ourselves complex subgoals in line with our existing values, and then we try hard to achieve those goals, and we learn how to be more effective consequentialist agents at achieving that type of subgoal. There may be clearer feedback on how well we did at the subgoal than on how well we achieved our existing values, but in lots of cases we notice if there's a significant divergence between what we achieved and our underlying values, which moderates the consequentialist learning and is a pressure towards maintaining alignment.
I find it a bit weird that this argument needs to be made at all. But it does, in current company at least.
One argument, for current company, is: have you actually met people? Or just your EA friends, some of the sweetest (and not coincidentally most privileged) people ever to exist?
Outside of the EA sphere, I doubt people would be as drawn to the idea that maybe minds are safe and nice by default. There's only a vague and arguable tendency for smarter people to be nicer. And even the nice smart people aren't that nice. And most of history looks suffused with ruthless sociopathy to my eye.
I suspect most humans would be more likely to agree with "smarter humans are more ruthless". And they'd be very suspicious of the argument "maybe intelligence produces niceness! So let's make our alien LLMs way smarter and see!".
I'd sort of like to give humanity as a whole more of a vote on whether we develop AGI as fast as humanly possible, because I think their intuitions would trend in the right directions. It takes a lot of privilege to think that minds are nice by default, according to my understanding of history and the current state of the world.
The other usual argument is "well, LLMs don't really have goals, so that should be fine". And maybe it would be, if developers weren't busily making them more goal-directed! Assuming they'll only work on the goals that authorized, wise individuals give them seems pretty optimistic. So does assuming that their current prosocial tendencies will keep counterbalancing their goal-directedness in a way you'll like. This seems like an unlikely default if they ever do self-directed learning and so change outside of human control (another thing developers are excited to work on!).
This is essentially agreeing with you that LLMs won't reach superintelligence in their current adorably goalless but also incompetent state.
I also composed a whole argument along the lines of: minds in general are ruthless and sociopathic, because caring about others while still getting stuff done, even when you're competent and even vaguely goal-directed, is a special property that must be carefully designed. But I'll develop that and present it elsewhere, because it's similar in form to, and less well phrased than, your argument.
And most of history looks suffused with ruthless sociopathy to my eye.
This is the part that always confuses me about "alignment", and it boils down to "aligned to who?"
Now, I happen to personally believe that alignment of superintelligent, learning, goal-seeking entities is impossible. Not "difficult" or "it might take decades", but flat out impossible. An AI might like humans enough to keep us as pets, but that will be the AI's decision, not ours. Dogs have approximately no control over their relationship with humans, and I figure that "humans as house pets" is the absolute best possible result of building superintelligent AI. My P(not doom|someone builds superintelligence) is about 1/6, and nearly all of that 1/6 is placed on "humans as house pets."
But if we could control AIs? Those AIs would be controlled by powerful humans, the same sorts of people who had warm personal relationships with Epstein, and who had zero problems with Epstein trafficking and raping children. Given a choice between "superhuman AIs aligned to the Epstein class" and getting paperclipped, I'd go with the paperclips.
The only winning move is not to build superintelligence.
These are serious questions.
In short, I have a view of human nature that's somewhat more optimistic than yours.
I don't think leaving humans in charge of the world is obviously a win either. It does look to me like the arc of history is bending toward justice, but it's happening slowly and in fits and starts. And we could all be dead before we get to a stable just society. This isn't really an argument for building ASI; I think we probably shouldn't, or at least not this fast.
But it looks like we're going to.
The big advantage of building intent-aligned AGI (if we can manage to do that instead of building a misaligned ASI that kills us all) is that it makes being good to people vastly easier, essentially completely free. You just tell your ASI "okay fine, go make the world better for people. Tell me how you'd do it and I'll choose some options".
This lowers the bar for how good someone has to be to benefit humanity to just above zero. If they have more inclination to be helpful than harmful, that's all it takes.
No human who's ever lived has been in that position. Even the most powerful have had to worry about losing their power, and about themselves and their loved ones dying painfully and fairly soon.
So strangely, I wouldn't trust Sam Altman with my lunch money, but I would guess he'd probably produce a very good future if he were to wind up god-emperor for eternity. The exposes I've seen don't claim he's a particularly vengeful person. We'll just have to celebrate Samday every week :)
There are individuals with what I think of as a negative empathy-sadism balance, but they're pretty rare. Sociopathic individuals do seem to be overrepresented in the halls of power, but even there I think we've got pretty good odds of minimally good people winding up in charge of ASI.
This is not a scenario I'm comfortable with. If a sadistic individual gets control of the future, it could be worse than death: a permanent state of suffering. But it would take someone both very selfless and very competent to launch such a thing successfully. I'd almost rather see an attempt at value-aligned AGI.
I'm not sure how to up our odds; getting good people into power is an old challenge, and I don't know of any new methods for it.
and it boils down to "aligned to who?"
What do you think of the Meaning Alignment Institute's (MAI) "democratic fine-tuning (DFT)" work on eliciting moral graphs from populations? e.g. this post from Oct '23 (primer here):
We report on the first run of “Democratic Fine-Tuning” (DFT), funded by OpenAI. DFT is a democratic process that surfaces the “wisest” moral intuitions of a large population, compiled into a structure we call the “moral graph”, which can be used for LLM alignment.
- We show bridging effects of our new democratic process. 500 participants were sampled to represent the US population. We focused on divisive topics, like how and if an LLM chatbot should respond in situations like when a user requests abortion advice. We found that Republicans and Democrats come to agreement on values it should use to respond, despite having different views about abortion itself.
- We present the first moral graph, generated by this sample of Americans, capturing agreement on LLM values despite diverse backgrounds.
- We present good news about their experience: 71% of participants said the process clarified their thinking, and 75% gained substantial respect for those across the political divide.
- Finally, we’ll say why moral graphs are better targets for alignment than constitutions or simple rules like HHH. We’ll suggest advantages of moral graphs in safety, scalability, oversight, interpretability, moral depth, and robustness to conflict and manipulation.
In addition to this report, we're releasing a visual explorer for the moral graph, and open data about our participants, their experience, and their contributions.
and their more recent full-stack alignment vision? I ask because I've asked myself the same exact question, and MAI's actual DFT above getting Reps and Dems to agree on hot-button questions seemed like the only line of work getting concrete results.
That said, I do lean towards your "the only winning move is not to build superintelligence" take, I suspect because I was born and raised in a country that until a few decades ago was a British colony, so I am biased to view your threat model description as obviously correct. So I'm guessing your answer to my question above is "who cares what MAI is working on, aligning ASI is impossible"?
What do you think of the Meaning Alignment Institute's (MAI) "democratic fine-tuning (DFT)" work on eliciting moral graphs from populations?
Interesting! I will need to read through this in more detail, to get an idea of their approach. I'm glad someone is trying to do something in this space.
My objections to other approaches to democratic governance tend to break down roughly as follows:
So a plan like MAI's is critically dependent on a number of assumptions:
So from my perspective, MAI's plan is a "hail Mary" plan. But we're pretty deep in "hail Mary" territory, so I'm not opposed to placing bets on what look like unlikely outcomes.
Similarly, as far as I can tell, Dario Amodei's current plan for Anthropic is "build superintelligence as fast as we can, do our very best to make it like humans, and expect to totally lose all human control within 5-20 years." Personally, I feel like this is the least horrible version of the worst idea in human history. Like, obviously, no, we should not do this. But if we're going to do this, Anthropic is at least thinking about the real issues. They know that humans are likely to lose control, but they're basically hoping we can wind up as beloved house pets.
I still think the best plan is "just don't build something vastly smarter than us with the ability to learn, [1] pursue goals and replicate." One obvious objection to my plan is that we're probably going to go right ahead and build superintelligence anyway. Which is why I am sympathetic to long-shot plans that might have an outside chance of working.
But I still prefer "just don't build superintelligence." Or, failing that, delay it. Emotionally, I'm treating it sort of like a diagnosis of terminal cancer for me and everyone I love. Even a remission of several years would be of immense value. And delay also gives some of the hail Mary plans a slightly better chance of working, or of the public realizing that maybe they don't want to be "beloved house pets" of minds no human can possibly understand.
Learning is essentially a form of self-modification. Combined with differential replication of more successful entities, this gives you natural selection. ↩︎
Yeah. The true nature of power is shown by more horrors of history than can be counted, but Epstein and factory farming are especially illustrative to me.
I’d sort of like to give humanity as a whole more of a vote on whether we develop AGI as fast as humanly possible, because I think their intuitions would trend in the right directions.
Well, would you say this if their intuitions were tending in the wrong directions?
The claim isn’t that minds are safe and nice by default. It’s that they’re not sociopaths.
If in your view, most humans are basically ruthless sociopaths, then that’s good news, isn’t it? Sociopathic AIs would fit in well to our culture. It would mean our laws and norms do a remarkably good job of restraining us, so there’d be hope they’d do the same for future AIs.
What I mean by "nice" is roughly the opposite of being a ruthless sociopath. It means treating other sentient beings well for its own sake.
Most humans are definitely not ruthless sociopaths. Sociopaths are estimated at about 10% of the population. And most of those aren't even that ruthless; I think it's a spectrum, like all biological mental differences. This leaves the conclusion that even NON-sociopathic humans are often pretty ruthless when they can get away with it, like when they hold a lot of power. But that's pretty much beside the main point here, which is that we shouldn't expect nice/non-ruthless behavior by default.
Laws and norms are not going to restrain an AGI that can route around them easily when it's smart enough. They barely restrain human sociopaths who are bad at routing around them.
You might envision AGIs enforcing laws that include human well-being, and creating a nice just society roughly for the same reasons we have now. But if those AGIs don't genuinely care about humans, I'd strongly expect humans to soon occupy the legal position of farm animals - or perhaps worse, that of the many species we've driven extinct even though we like them, because they don't provide our society any real benefit and we wanted to do other stuff with their habitats.
That's why we're kind of obsessed with aligning smarter than human AI so it genuinely and intrinsically cares about us, or at least reliably takes our orders as intended.
The claim isn’t that minds are safe and nice by default. It’s that they’re not sociopaths.
I thought one of the tenets of this debate is that there's no in-between: either safe and nice (aligned), or everybody dies (not aligned). Humans are a good example - most are not pure psychopaths, and yet they do a ton of harm to each other all the time, and have threatened to destroy the species for decades. A set of much more powerful minds with even that level of misalignment would be a disaster, and if they're slightly worse than humans, so much the worse.
So an explanation has to exist! What is it? I claim there are really only two answers [consequentialism and imitative learning] that work in practice.
How do we know that these are the only two answers?
Well, they're the only two methods we've got that work, and people have been thinking about this for decades. So if there's a third method, it's beyond current understanding.
Notice that evolution by natural selection is consequentialist learning on a terribly slow timescale. The consequence is successful reproduction vs death, and you learn at most one bit per lifetime.
Evolution also distinguishes between one and two progeny, so it is not binary, but yeah, just a few bits per lifetime.
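As a back-of-envelope illustration of "a few bits per lifetime" (the offspring distribution below is entirely made up): the information conveyed by one lifetime's reproductive outcome is at most the Shannon entropy of the outcome distribution.

```python
from math import log2

# Hypothetical distribution over reproductive outcomes in one lifetime.
outcomes = {"0 offspring": 0.5, "1 offspring": 0.3, "2+ offspring": 0.2}

# Shannon entropy: an upper bound on how many bits selection can "learn"
# from observing which outcome occurred.
entropy = -sum(p * log2(p) for p in outcomes.values())
print(f"at most {entropy:.2f} bits per lifetime")
```

With this made-up distribution the bound comes out to roughly one and a half bits; any realistic distribution gives a similarly tiny number per lifetime.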
I do wonder if there's a difference between consequentialism as in expected-utility maximization versus consequentialism as in Nash-equilibrium optimization. As in, when the AI is learning to model the world, it might model humans using some empirically derived probability distribution which doesn't handle OOD shifts well, or it might model humans by using its own full agency to ask what the most effective human action would be in a given scenario. The latter would be scarier, because the AI would be more proactive in sabotaging human resistance, whereas in the former case, the independence assumptions built into the probability distribution might be such that powerful human resistance is assumed impossible, and therefore the AI would immediately fold when resisted.
As a corollary, I'm much more worried about AI applied to adversarial domains like policing or war, where it can get forced into Nash-equilibrium optimization, than about AI applied to non-adversarial domains like programming, where it can plausibly achieve ~optimal results without resistance.
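That distinction can be sketched in a toy 2x2 game (payoffs and action names are entirely made up): an AI that models the human as a fixed empirical distribution, fit to data where resistance was rare, picks a different action than one that models the human as a best responder. I'm using maximin in a zero-sum framing as a crude stand-in for Nash play.

```python
# Toy game: the AI picks an action, a human may "ignore" or "resist".
# Entries are the AI's (made-up) payoffs.
payoff = {
    ("overt",  "ignore"): 10, ("overt",  "resist"): -5,
    ("covert", "ignore"):  6, ("covert", "resist"):  4,
}
ai_actions    = ["overt", "covert"]
human_actions = ["ignore", "resist"]

def best_vs_empirical(p_resist):
    # Model the human as a fixed probability distribution fit to past data.
    ev = {a: (1 - p_resist) * payoff[a, "ignore"] + p_resist * payoff[a, "resist"]
          for a in ai_actions}
    return max(ev, key=ev.get)

def best_vs_adversary():
    # Model the human as a best responder: assume the worst case per action.
    worst = {a: min(payoff[a, h] for h in human_actions) for a in ai_actions}
    return max(worst, key=worst.get)

print(best_vs_empirical(0.05))  # humans rarely resisted in the data
print(best_vs_adversary())
```

The empirically-modeled agent plays the aggressive action and is blindsided if humans actually resist, while the adversarial modeler hedges against resistance in advance - a miniature version of the "more proactive about resistance" worry.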
It seems like one implicit assumption is something like
In environments with lots of agents doing things, the most ruthless consequentialist agents will outperform the more prosocial and cooperative ones.
Without that assumption, we could end up in a situation where there are ruthless consequentialist AIs, most agents (both human and AI) recognize them as such and recognize that interacting with them is a bad idea, and so these ruthless consequentialist AIs backstab and lie and cheat and do a bunch of damage, but never actually accumulate enough power to seize control of the light cone from coalitions of agents that are capable of cooperating [1].
I know you've written a lot of stuff - I don't recall seeing anything about why you make that assumption in particular, but maybe you've already written on the topic?
In the long term this probably still looks bad for humans, because worlds with many AI agents probably look like a second Cambrian explosion, and the Cambrian explosion was not good for those who came before ↩︎
This is an interesting topic, but no, my central expectation (and what I’m arguing for here) is that 100% of the ASIs will be ruthless consequentialists.
Couple little points on that side-track though: (1) Ruthless consequentialist AIs can still copy themselves, and cooperate with their copies, if their goals are non-indexical (which they might or might not be, no opinion off the top of my head), (2) Your comment seems to assume that AIs can read each other’s minds? If they can’t, a smart ruthless consequentialist AI would act in a cooperative and prosocial way in an environment where doing so was to its advantage. I agree that mind-reading is an important dynamic that might change the equilibrium in a multipolar AI world.
Thanks.
Likely out of scope for this comment section though. I should, at some point, probably write my modal expectation of what the next couple decades look like in more detail.
"Me: As it happens, the threat model I’m working on is not LLMs, but rather “brain-like” Artificial General Intelligence (AGI), which (from a safety perspective) is more-or-less a type of actor-critic model-based reinforcement learning (RL) agent. LLMs are profoundly different from what I’m working on. Saying that LLMs will be similar to RL-agent AGI because “both are AI” is like saying that LLMs will be similar to the A* search algorithm because “both are AI”, or that a frogfish will be similar to a human because “both are animals”. They can still be wildly different in every way that matters."
How would you respond to the critique that this basically amounts to saying "I'm not interested in or saying anything about things that exist, but this thing that AI X-Risk types made up because it's the most worrying hypothetical to talk about sure is a worrying hypothetical."
[glibly phrased but meant in a spirit of genuine curiosity because I still don't really understand why people care about AI risk but lack any interest in stuff that actually exists right now]
Some unstated implications of this post, taking it as true for a moment:
Since humans are consequentialist-type intelligences, we should expect them to be ruthless, and we should prevent them from gaining too much power, lest they destroy everything we hold dear.
This is a very weird sentence to me. If we want to know about human behavior, we can just observe past and present humans. We shouldn’t take one fact about human brain architecture in isolation, and ignore everything else we know about human brains and human psychology and human history.
In particular, if we want to know whether “absolute power corrupts absolutely”, we should obviously start by looking at the historical record of humans with absolute power. (No opinion.)
Developing compute-intensive, imitation-learning-based AI should be considered closer to human-brain augmentation than ASI capability development
I’m not sure what this paragraph is getting at. My best guess is that you’re interested in the AI pause / stop vs AI acceleration debate, and suggesting that if LLMs are not a path to “ASI”, then that’s a reason not to pause LLM progress?
If so: (1) I generally stay out of that debate because I don’t expect it to make much difference regardless, (2) I don’t like taking sides in a generic way rather than talking about specific proposals with their own particular suites of intended and unintended consequences, (3) …but if I had to pick a side, it would be the “pause” side, because, while my opinion is in fact that LLMs are not a path to “ASI” (as I define it), OTOH (A) I don’t hold that opinion with 100% confidence, and (B) there are legitimate LLM x-risk worries even without “ASI”, and (C) there are legitimate LLM worries short of x-risk, and (D) like you said, there are various indirect ways that I’d expect the (small, indirect, marginal) effect of LLM-centric “pause” efforts to push ASI later rather than earlier, including via LLM-assisted coding & research, the relentless ramp-up of global compute, etc.
Where was the argument for consequentialism (including intense optimisation) and imitation being the only two ways to do impressive feats?
Also, will you change your mind if the current paradigm is still non-psychopathic even when RL training dominates?
(Big picture, I think the main place I might get off the train is in expecting future AI development to use a mix of rewards, including some from other equally capable AIs judging that behaviors aren't deceptive/unintended/misaligned. And this mirroring the role that "other humans think we're nice" played in evolution)
Where was the argument for consequentialism (including intense optimisation) and imitation being the only two ways to do impressive feats?
Well, you can also do impressive feats via conventional programming / GOFAI too, but I don’t think you get ASI that way. What else? I dunno, but I think if there was another big-picture approach that plausibly gets to ASI, lots of people would be working on it, and I would have heard of it. Lmk if you think I’m forgetting something.
Also, will you change your mind if the current paradigm is still non-psychopathic even when RL training dominates?
Normally if someone says “RL training dominates”, they mean “the amount of compute applied to RLVR is much greater than the amount of compute applied to pretraining”. That’s very different from “RLVR is so important that the impacts of pretraining are diluted away to irrelevancy”. (E.g. discussions of information efficiency by Toby Ord and by Dwarkesh.) But the latter is what would be relevant here.
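The compute-vs-signal distinction can be illustrated with a back-of-envelope sketch. Every number below is an assumption chosen for illustration (corpus size, bits per token, episode count), not a measurement; the point is just that compute dominance and information dominance can come apart by many orders of magnitude.

```python
# Rough back-of-envelope: even if RLVR compute exceeds pretraining compute,
# the training *signal* it provides can be tiny by comparison.
# All quantities below are illustrative assumptions.

pretrain_tokens = 15e12    # assumed pretraining corpus size (tokens)
bits_per_token = 1.0       # assumed information per token, in bits

rl_episodes = 1e6          # assumed number of RLVR episodes
bits_per_episode = 1.0     # a binary pass/fail reward carries ~1 bit

pretrain_bits = pretrain_tokens * bits_per_token
rl_bits = rl_episodes * bits_per_episode

ratio = pretrain_bits / rl_bits
print(f"pretraining supplies ~{ratio:.0e}x more bits of signal")
```

Under these (made-up) numbers, pretraining supplies tens of millions of times more bits, which is why "RLVR compute exceeds pretraining compute" does not imply "pretraining's impact is diluted to irrelevancy."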
Hmm, here are three example scenarios:
“Feeling confused” is weaker than “changing my mind”, because maybe I would puzzle over it and find some way to make sense of it. But also maybe not. Probably I could make a stronger statement / prediction if I spent a bunch of time thinking about it, but hopefully this gives some sense of what I have in mind.
other equally capable AIs judging that behaviors aren't deceptive/unintended/misaligned
I’m pessimistic mainly for reasons discussed in “‘Behaviorist’ RL reward functions lead to scheming” §3.1.
And this mirroring the role that "other humans think we're nice" played in evolution
You’re kinda pointing to a challenge to my view. My view is: a hypothetical smart consequentialist human with a ruthless drive to have lots of grandkids will have more grandkids than a human with the normal suite of innate drives, like falling in love and so on. Proof: strategy-stealing. Whatever the latter human does, if it’s an objectively good way to have lots of grandkids, then the former human can notice that it’s a good strategy and do the same thing.
And then the challenge to that view is: …But we did actually evolve all these innate drives that make us intrinsically desire love and curiosity etc. Doesn’t that prove my strategy-stealing argument wrong?
I think there’s a good answer to that challenge, and it’s some combination of: (1) evolution has no way to build “a ruthless drive to have lots of grandkids” into our brains (details), and (2) even if it did, humans are not sufficiently smart and strategic in regards to long-term planning to be very effective ruthless consequentialists. (“We are the least intelligent species capable of building an industrial civilization.”)
Anyway, I claim that neither (1) nor (2) would be applicable in your AI training scenario (as I understand it), so the strategy-stealing argument would go through, and thus the RL selection pressure should at best be neutral between ruthless scheming strategies vs intrinsically honest ones, and much more likely favor the former.
(I am however assuming no interpretability / mind-reading.)
Thanks!
Re other ways to do the feat:
Re RL: thanks for those examples. I'd have defined it via something like # of data points or amount of training signal - agree compute is the wrong measure.
> other equally capable AIs judging that behaviors aren't deceptive/unintended/misaligned
I’m pessimistic mainly for reasons discussed in “‘Behaviorist’ RL reward functions lead to scheming” §3.
Hmm but the cognition "try to break the rules without getting caught" might in practice lead to worse reward than just trying to follow the rules, if the chances to secretly hack are sufficiently rare and the punishment for getting caught is high enough.
E.g. I think that if I tried to "pretend I care about my friends/partner but exploit them when I can get away with it", it would MASSIVELY fail me in the long run. People are good at reading people. We're transparent to each other. There is big cognitive overhead to maintaining two narratives. I'd end up with fewer committed/deep long-term friendships.
Relatedly, if you're in a cult, the best way to signal deep commitment is to become a true believer.
And it will, by comparison, be way easier for the overseers in the case of AI: interpretability, seeing all behaviour, running counterfactual experiments.
I think there’s a good answer to that challenge, and it’s some combination of: (1) evolution has no way to build “a ruthless drive to have lots of grandkids” into our brains (details), and (2) even if it did, humans are not sufficiently smart and strategic in regards to long-term planning to be very effective ruthless consequentialists. (“We are the least intelligent species capable of building an industrial civilization.”)
Ok interesting, we have pretty different intuitions here!
Zooming out, I recall you thinking that humans aren't sociopaths because they have some special specific reward thing that we can't replicate, related to wanting others to approve. Whereas I see no reason to think it's some specific thing. We just had some selection pressure to seem like good non-sociopathic allies. That selection pressure worked. If we apply similar selection to AI, it will probably also work - the implementation details won't need to match some specific human learning circuit.
Hmm but humans are not ruthless consequentialists, despite being consequentialist enough to be able to do all kinds of tasks and build civilization. So I don't see how the Optimist's argument is addressed.
That’s this part:
Of course, evolution did go out of its way to make humans non-ruthless, by endowing us with social instincts. Maybe future AI programmers will likewise go out of their way to make ASIs non-ruthless? I hope so—but we need to figure out how.
A workable solution (to building stable non-ruthlessness within a powerful consequentialist framework like RL + model-based planning) probably exists, and I’m obviously working on it myself, and I think I’m making gradual progress, but I think the appropriate overall attitude right now is pessimism and panic about where we’re at. See “oh man, are we dropping this ball” section here & the three-part disjunction here.
(Why only “probably exists”? Because the human example is highly suggestive but not an airtight proof. For example, for all I know right now, maybe making a nice human requires a “training environment” that entails growing up with a human body, in a human community, at human speed. Doing that with AI is not really feasible in practice, for many reasons. There are other things like that too. Presumably further research will eventually either find a plan for non-ruthlessness + powerful capabilities in ASI, or a good argument that no plan exists, and I don’t currently have a very strong opinion on which one it would be.)
Two things that make me a bit less convinced that “consequentialism ⇒ ruthless sociopath by default” holds for the whole class:
1. Consequentialism is over internal model states, not world-states.
A planner/RL agent is maximizing an internal evaluation V(z) over learned latents z, not “the world” directly. So the key question isn’t “is it consequentialist?”, but “what internal concepts does V actually latch onto?” In principle that evaluation can be keyed to learned concepts like obedience, loyalty, prudence, promise-keeping, timidity, etc. (which is why your “social instincts / pointer-binding into learned reps” agenda seems so central to me).
2. Model-based planning under uncertainty is adversarial to the world-model.
Even if V intends to score “obedience/prudence”, long-horizon search can find policies that drive very high internal obedience/prudence-evaluations without the agent actually being obedient/prudent—basically “plan exploits weaknesses in (a) the world-model and (b) the learned value readout / concept grounding”. This is tightly analogous to optimizer’s-curse / overfitting-to-the-model, and it’s a capabilities problem too: agents that work in reality need robust planning / uncertainty-awareness / “hard vs soft model” reasoning that avoids exploiting soft artifacts. If those techniques are capability prerequisites, they’re plausibly also the place where “robust social goals” become feasible—and where simple resource/power grabs stop being obviously robust ways to maximize the internal eval.
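Point 2 above is essentially the optimizer's curse, which is easy to demonstrate in a toy setting: give a planner many candidate plans of identical true value, let it see only noisy value estimates, and the plan it selects will look far better than it really is. A minimal sketch, with all numbers arbitrary:

```python
import random

random.seed(0)

def optimizers_curse(n_plans=1000, noise=1.0, trials=2000):
    """All plans have true value 0; the planner sees noisy estimates and
    picks the argmax. Returns the average estimated value of the chosen
    plan (whose true value is always 0)."""
    total = 0.0
    for _ in range(trials):
        estimates = [random.gauss(0.0, noise) for _ in range(n_plans)]
        total += max(estimates)
    return total / trials

# The selected plan looks far better than it really is: the max of many
# noisy estimates of identical options is strongly biased upward.
print(optimizers_curse())
```

With 1000 candidate plans, the chosen plan's estimated value averages around 3 standard deviations above its true value, which is the sense in which long-horizon search is "adversarial" to soft spots in the world-model and value readout.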
And hey, what a coincidence, ≈100% of those minds are not ruthless sociopaths.
I don't think that's true in any interesting way. See a lot of examples[1] from history of humans doing terrible things to other humans and also everything else. And how little people care about this.
Boring way to defend this is "but if we exclude the stuff people don't care about". But then what makes you think your welfare is on the list of the stuff AI cares about?
[1] Caesar in Gaul, the Belgian Congo, WWI, the Soviet Union, factory farming
A person who is kind to their friend and cruel to their enemies is not a “ruthless sociopath”. That’s not how the term “ruthless sociopath” is used in common parlance, and that’s not how I’m using it, and I don’t believe that even you personally would use the term that way in your everyday life.
I will rephrase what I think you’re trying to say:
“People who are not ruthless sociopaths are often nevertheless callous and cruel towards other people, especially their enemies. By the same token, even if ASI is NOT a ruthless sociopath, it still might be callous and cruel towards humans.”
I would heartily agree with that paragraph.
In this post I am making a strong claim that ASI will be a ruthless sociopath, because I think that strong claim is true and important. It’s possible that this strong claim is false, and yet ASI will kill everyone anyway, via any number of alternate pathways. But in this post I’m defending the strong claim, and I’d like to keep discussion centered on that.
I can’t say I agree with what you wrote.
I also don’t want to move the subject of the discussion from the one you intend.
If you think I’m doing that I’m happy to delete the rest of this comment.
I think it’s important to the topic you want to discuss that a lot of people at a lot of places for a lot of time did a lot of very cruel things, and not just to their enemies. They also did it because it was convenient or fun (including to their friends).
I’ll skip details until requested.
To me it seems it’s the opposite - a very specific arrangement of incentives manages to limit humans to being cruel mostly to enemies and those out of sight.
Which is relevant to figuring out how likely it is that AIs will be ruthless sociopaths. I would be more optimistic about that if ~100% of minds so far were not one (as opposed to 10-90%, my quick estimate of how many people historically seem to have engaged in practices that indicate callous disregard for the lives of others as a matter of routine or entertainment).
OK, I guess we’re arguing about whether “intrinsically caring about other people” is a thing that exists in humans or not. Right? Well I think it does. This seems very obvious to me, and I’m puzzled if it doesn’t to you.
very specific arrangement of incentives manages to limit human to being cruel mostly to enemies and those out of sight
Here’s a scenario.
Alice is an elderly lady with just one daughter, Beth. Beth has a terminal illness, and is clearly about to die. (Alice is a retired hospice nurse, so she knows what “clearly about to die” looks like.) They are alone in a rural area with no phones or cameras, and there’s no chance that anyone else will arrive at the house before Beth dies (there’s a blizzard, the road will be impassable for the next few days).
Alice can (A) torture Beth just out of curiosity to see how Beth will respond, or (B) walk away from Beth to do the dishes, or (C) try to make Beth feel very comfortable and loved in her last moments of life.
There are no incentives here. No one will know either way.
My prediction is that the overwhelming majority (>90%) of Alices would choose (C). If you believe that people are nice only because of a “very specific arrangement of incentives”, then presumably you would guess that (C) would be very rare, and most Alices would choose (A) or (B).
Is that your belief?
I didn't try to claim that “intrinsically caring about other people” is not a thing that exists in humans. It does. I also think it's not strong enough to do that much vs other incentives. It's strong enough to, in some places, establish a system which in a bunch of situations aligns other incentives with the drive.
And replying to your example: parent-child instincts are real, so I'd expect Alice to care. But if the roles were reversed, a lot of people would decide not to take care of their parents.
I share your belief that end-to-end RL is ruthless, but I’d be more interested in a version of your argument that does not invoke ASI. ASI implies levels of power that are very dangerous by default.
Human-level artificial intellects (let’s call them AHI) that interface naturally with the internet are more tangible and more likely under our current technological paradigm - and potentially still very dangerous (being kind-of-immortal let’s you play different games). However, if you allow for AHI, there may be a third way to get to it: brain-like perception + GOFAI. I think the human brain’s super-power does not lie in its advanced RL-algorithms to figure out our life policy. Day-to-day decisions are under control of the Basal Ganglia (hierarchical TD-learning), human “deep” planning is some form of very resource-constrained MCTS with good pruning heuristics (e.g., intuitions about interestingness), and all the other advanced stuff we may do is achieved by imitation of planning / decision algorithms and heuristics we deliberately learned, i.e. they are a product of cultural evolution.
So the planning algorithms are already there - we don’t need to reinvent them, and we don’t need to run them in unintelligible deep networks. The hardest part here is getting the hierarchical abstraction for representing our state space, but this is “just” an extension of perception. And we are currently clearly missing brain-like perception. Pairing human levels of robustness in perception with classic planning + the language interface of LLMs would be much more useful than whatever we have today for self-driving and robotics.
The nice thing about brain-like perception + GOFAI is that even if you get there via end-to-end RL, you can throw away the agentic part after you are done training and just keep the perception module that gives you the world model which you then query with GOFAI. The problem becomes representational alignment (make it intelligible) instead of agentic alignment (make it do what you want).
I think the human brain’s super-power does not lie in its advanced RL-algorithms to figure out our life policy.
Bob is a mathematician. One day, he’s facing a certain type of problem, and he tries a certain proof strategy, and it doesn’t work. I claim that, next time Bob finds himself facing a similar type of problem, he’s less likely to try that particular proof strategy.
I claim that this change involved RL in Bob’s brain. And this process (repeated countless times) is essential to how Bob succeeds at math.
And likewise, gaining expertise in anything—piano, sports, real estate, legal defenses, whatever—involves RL in a person’s brain.
And not just acquiring the expertise, but also using it. The pianist needs RL to learn new songs. Bob the mathematician needs RL to gain proficiency with a new math concept that he just invented yesterday. A legal expert needs RL because each case is a bit different than anything she’s seen before. Etc.
See also: §1 of this post on why RL is core to “figuring things out”, which is core to human competence.
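The claimed update in Bob's brain can be sketched as a minimal bandit-style learner over proof strategies. The strategy names, the softmax choice rule, and the update rule below are illustrative assumptions (no claim about neural implementation); the sketch just shows "a failure lowers a strategy's learned value, which lowers the probability of trying it next time."

```python
import math

class StrategyChooser:
    """Minimal RL-style learner over proof strategies: keep a value per
    strategy, pick via softmax, nudge the value after success/failure."""
    def __init__(self, strategies, lr=0.5, temp=1.0):
        self.values = {s: 0.0 for s in strategies}
        self.lr, self.temp = lr, temp

    def prob(self, strategy):
        # Softmax probability of choosing this strategy
        exps = {s: math.exp(v / self.temp) for s, v in self.values.items()}
        return exps[strategy] / sum(exps.values())

    def update(self, strategy, reward):
        # Nudge the learned value toward the observed reward
        # (+1 for success, -1 for failure)
        self.values[strategy] += self.lr * (reward - self.values[strategy])

bob = StrategyChooser(["induction", "contradiction", "construction"])
before = bob.prob("contradiction")
bob.update("contradiction", reward=-1.0)   # the strategy failed
after = bob.prob("contradiction")
print(before > after)  # prints True: Bob is less likely to reach for it
```

Repeating this loop countless times across domains is the "essential to how Bob succeeds at math" claim in miniature.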
brain-like perception + GOFAI
I don’t understand this. The world-model will still involve millions of inscrutable unlabeled concepts, right? (See here, here.) What does GOFAI even mean in that context?
you can throw away the agentic part after you are done training and just keep the perception module that gives you the world model which you then query with GOFAI
As I mentioned in the OP, there’s a quadrillion-dollar market for AIs that can go out in the world and figure things out, 100% autonomously and way outside the distribution of things that humans already know. Someone will make such AIs sooner or later, unless they’re eternally prevented from doing so by force. I’m confused about how you think about that problem.
If you want to talk about making much weaker AIs safe, and “AHI”, isn’t that already kinda a solved problem, via the LLMs that exist right now? They already exist, they can already do lots of things, and they seem basically safe, in the grand scheme of things—i.e., negligible extinction risk, and some problems but only the usual kinds of problems associated with new technologies. I’m focused on the more powerful AIs of the future, which (I claim) might cause human extinction etc. If you’re focused on some other problem besides that, then what is it? I.e., what is motivating you to think about “brain-like perception + GOFAI”? What problem do you imagine that it might solve?
Thanks for taking time to respond.
I am not saying humans don’t use RL. I am trying to say that RL is not what makes us special compared to current SotA (LLM or RL) models. It is our perception. AlphaZero blows us away in closed, non-fuzzy domains. Our ability for abstraction, which I claim is mostly an extension of perception, is what makes us special. Finding a robust hierarchy of coarse grainings in perceptual chaos through self-supervised learning where RL is mostly there to maximize for interestingness. Some call it understanding.
By GOFAI I mean things like MCTS (with good pruning heuristics), TD learning (over a hierarchy of abstract states), production systems, model predictive control, etc. I claim that we don’t need to train a fancy policy network to build very useful stuff. Having a reasonable predictive model for car and environment dynamics with a sampling rate of, say, 50 Hz that does not break under adversarial perturbations over a time horizon of 30 seconds is extremely useful for example.
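As a minimal sketch of one of the classical components named above, here is tabular TD(0) value learning over a toy chain of abstract states. The chain, reward, and hyperparameters are all made up for illustration; a real system would run this over a learned hierarchy of states rather than three hand-labeled ones.

```python
# Tabular TD(0) over a toy chain of abstract states A -> B -> C -> goal.
# Reaching the goal yields reward 1; all other transitions yield 0.

V = {"A": 0.0, "B": 0.0, "C": 0.0, "goal": 0.0}  # learned state values
alpha, gamma = 0.1, 0.9                          # learning rate, discount

def step(s):
    nxt = {"A": "B", "B": "C", "C": "goal"}[s]
    reward = 1.0 if nxt == "goal" else 0.0
    return nxt, reward

for _ in range(500):            # episodes
    s = "A"
    while s != "goal":
        nxt, r = step(s)
        # TD(0) update: move V(s) toward r + gamma * V(next state)
        V[s] += alpha * (r + gamma * V[nxt] - V[s])
        s = nxt

# Values converge toward gamma-discounted distance to the goal:
# V(C) ~ 1.0, V(B) ~ 0.9, V(A) ~ 0.81
print({s: round(v, 2) for s, v in V.items()})
```

The same update rule works unchanged whatever the "states" are, which is why the hard part (as the comment says) is getting the abstract state space right, not the learning rule itself.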
Yes, a brain-like world model is not an open book. But it should be easier to understand this thing than some world model + some policy network which might be in superposition. When you have a brain-like world model you can combine it for instance with a (potentially malicious) policy and sample future trajectories to evaluate the expected outcomes of the policy.
Yes, ASI will happen, somehow, sometime. I think your concerns are valid in principle and also apply for AHI involving unaligned end-to-end RL, which I think is also much more likely to occur in the next decade. LLMs as of today are not AHI and scaling them won’t get us to AHI. ASI scenarios however are extremely uncertain and there are many more known and unknown dangers to this which dilutes the expected utility of solving the problem you pose.
Do you think there is no grave danger in consequentialist RL if ASI (something superhuman, incomprehensible by default) is not involved?
RL is not what makes us special compared to current SotA (LLM or RL) models
I don’t think the human RL system is especially fancy, and I certainly don’t think it’s the secret sauce that will unlock ASI. I think the secret sauce of human intelligence is the cortex (and thalamus etc.), which I think agrees with you. But “the RL system is not some fancy secret sauce” is different from “the RL system is optional, and we can just leave it out entirely without sacrificing anything”. I don’t think you can do anything nontrivially useful in the world if you have a cortex-like algorithm but don’t attach it to any RL, as I explained in my last comment.
There also seems to be a clash where I’m using the term “RL” more broadly than you are. For example, you categorize TD learning as part of GOFAI, whereas I would categorize TD learning as part of RL. This is just terminology, so whatever, but it does lead to us talking past each other. For example, I probably agree with “we don’t need to train a fancy policy network”. It doesn’t need to be fancy.
Yes, a brain-like world model is not an open book. But it should be easier to understand this thing than some world model + some policy network which might be in superposition.
It’s not “easier” unless you actually have a viable plan. Is it “easier” to survive a 100 km asteroid strike than a 150 km asteroid strike? Well, if you give me a choice, I’ll choose the 100 km one. But it doesn’t actually matter, because both of those options will definitely kill everyone.
“Not an open book” is an understatement. It’s a massive unlabeled data structure. It would be a huge research project to understand it, at best. Then if you want to do anything useful with it, you presumably need to do continual learning, so the data structure keeps changing, so you need to keep pausing the system and doing huge research projects. Meanwhile you get outcompeted by the next firm down the street that’s running 100,000 of these things at full speed.
When you have a brain-like world model you can combine it for instance with a (potentially malicious) policy and sample future trajectories to evaluate the expected outcomes of the policy.
Can you walk through a concrete example of what someone can do with a such a system? Ideally something that’s very impactful, e.g. so impactful that it could plausibly cause or prevent human extinction.
As an example of what I tend to think about: Jeff Bezos alone earned $250B. So just run 100 Bezos-level AHIs and ask them to start companies, and you can in principle earn $25T over the next 20 years.
…And you can do a lot more than that! Stalin was a single human-level intelligence, and he maneuvered his way into dictatorial control over 200,000,000 people. Right?
See what I mean? Actual human-level intelligence is a big, big, big deal.
But that requires the RL part, and having the AI go out into the world, and do things autonomously, with open-ended continual learning, and no human in the loop. These big-big-big-deal things do not seem compatible with what you seem to be describing. The fact that you brought up self-driving cars seems to indicate that you’re not appreciating the stakes, I think? Or sorry if I’m misunderstanding.
This approach ignores the fact that if we use advanced LLMs to make new paradigm advancements that are extremely effective RL sociopaths, we'll at that point have the help of the relatively harmless but still very powerful LLMs to do safety work on the RL agents — this is a major help with mitigating autonomy risks! Of course, there's always the risk that new RL architecture discoveries create economic incentives to scale the scary RL agents without sufficient safety work, but the prospect of using HHH AI to align scary AI is weirdly under-explored when talking about that exact advanced LLM + advanced RL learner world.
For one thing, my actual expectation is that LLMs will be a helpful research tool for the human discovering the next AI paradigm, rather than the LLMs discovering the next AI paradigm themselves (see Foom & Doom §1.4.4).
For another thing, even if I’m wrong about that, note that we have “very powerful” humans “to do safety work on RL agents” right now, but it turns out that those humans are overwhelmingly uninterested in doing so. So instead there’s maybe 1000× more money and effort going into figuring out how to make RL agents more powerful rather than how to make them safe. (See We need a field of Reward Function Design.) I don’t see any reason to expect this situation to change if it’s LLMs doing the research instead of humans.
That said, if people have ideas about how to make a near-future world full of LLMs a wiser world than the world of today, then great, I endorse that goal and wish them luck :)
I think that, even if LLMs don't smoothly evolve into AGI then ASI, an alternative 'brain-like' AGI will have a similar progress ramp that allows for alignment learning-by-doing in a very meaningful way. To explain this, let's discuss the LLM path a bit. OpenAI's deliberative alignment and Anthropic's more sober discussion of the ongoing alignment challenge both highlight the effort that companies today put in to understanding and improving LLM alignment. Alignment work is progressing through improved training, RLHF, RLAIF, Constitutional Classifiers, etc. One would expect that, as AI agents get used more and home robots get marketed, customers will refuse to buy unsafe AI agents and AI companies will need to learn to improve the AI behavior. It would be great to have some regulation or strong liability laws to help with this, but customer demand alone will provide impetus for general alignment of today's systems. As LLMs and their cousins VLAs move towards AGI, we'll have tolerably aligned AGI and we'll have learned how to get alignment to generalize for an AGI. As AGIs advance to ASI, we'll continue to have product pressure, and RLAIF will improve in capability along with the AGIs themselves. The point of that summary is not to say that I'm sure AI safety will play out well, but that there is indeed a lot of effort put in to prevent sociopathic results.
Now if we posit a different learning system that takes us to ASI, I would still expect a multi-year ramp from 'not yet on the public radar' to ASI. There will be many companies and watchdog groups watching the new systems grow, make mistakes, and get fixed. If this new learning approach results in AIs as capable as today's systems but LESS aligned, they aren't likely to sell well. I think that before we need to worry about ASI, we should accept that the AGI we build will be valuable to someone and, hence, by definition tolerably aligned (although I don't disagree that 'tolerable' may be a low bar).
In the end, I would expect that a useful AGI (not ASI) would need to have features like corrigibility (ability to evaluate goals and adjust or abort them), curiosity (recognizing when a conclusion or plan may be wrong), and self-critiquing (using classifiers or other systems to stress-test a plan for unwanted side-effects). I disagree with the premise that ASIs will evolve into ruthless optimizers, because a useful AGI will have learned the value of reconsidering goals and trying to understand the full impact of plans and actions. These features don't guarantee we avoid sociopaths, but I see them as necessary items to solve for useful AGI and, hence, the ASI developers will have something to build on.
I prepared a response to your ideas in a separate post. I am not sure that the actual reason for the humans not to become schemers is due to interpretability-based Approval Reward itself and not some combination of the humans starting with an Approval Reward-like primitive circuitry like valuing smiles and learning to genuinely cooperate with others by having similar capabilities (e.g. due to being embodied). As for the field of Reward Function Design, I suspect it to require not interpretability, but something like Max Harms' CAST agenda.
Of course, evolution did go out of its way to make humans non-ruthless, by endowing us with social instincts. Maybe future AI programmers will likewise go out of their way to make ASIs non-ruthless? I hope so—but we need to figure out how.
I think this could happen via the same mechanism — ASI will identify with humans and thus we'll have a social contract via falling under the same self-preservation umbrella.
The conversation begins
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and ask yourself where your life went horribly wrong.
Me: Hmm, I think the “true core nature of intelligence” is above my pay grade. We should probably just talk about the issue at hand, namely future AI algorithms and their properties.
…But I actually agree with you that ruthless sociopathy is a very specific and strange thing for me to expect.
Optimist: Wait, you—what??
Me: Yes! Like, if you show me some random thing, there’s a 99.999…% chance that it’s not a ruthless sociopath. Instead it might be, like, a dirt clod. Dirt clods are not ruthless sociopaths, because they’re not intelligent at all.
Optimist: Oh c’mon, you know what I mean. I’m not talking about dirt clods. I’m saying, if you pick some random mind, there is no reason at all to expect it to be a ruthless sociopath.
Me: How do you “pick some random mind”? Minds don’t just appear out of nowhere.
Optimist: Like, a human. Or an AI.
Me: Different humans are different to some extent, and different AI algorithms are different to a much, much greater extent. “AI” includes everything from A* search to MuZero to LLMs. Is A* search a ruthless sociopath? Well, I mean, it does seem rather maniacally obsessed with graph traversal! Right?
Optimist: Haha, very funny. Please stop being annoyingly pedantic. I obviously didn’t mean “AI” in the sense of the academic discipline. I meant, like, AI in the colloquial sense, AI that qualifies as a mind, like LLMs. I’m mainly talking about human minds and LLM “minds”, i.e. all the minds we’ve ever seen in the real world, rather than in sci-fi. And hey, what a coincidence, ≈100% of those minds are not ruthless sociopaths.
Me: As it happens, the threat model I’m working on is not LLMs, but rather “brain-like” Artificial General Intelligence (AGI), which (from a safety perspective) is more-or-less a type of actor-critic model-based reinforcement learning (RL) agent. LLMs are profoundly different from what I’m working on. Saying that LLMs will be similar to RL-agent AGI because “both are AI” is like saying that LLMs will be similar to the A* search algorithm because “both are AI”, or that a frogfish will be similar to a human because “both are animals”. They can still be wildly different in every way that matters.
Are people worried about LLMs causing doom?
Optimist: OK, but lots of other doomers talk about LLMs causing doom.
Me: Well, kinda. I think we need to tease apart two groups of people. Both are sometimes called “doomers”, but one is much more pessimistic than the other. This is very caricatured, but:
Anyway, the extra heap of concern in that latter camp is not from the LLMs of today causing near-certain doom, or even the somewhat-better LLMs of tomorrow, but rather the wildly better ASIs of … maybe soon, maybe not, who knows. But even if it’s close in calendar time, and even if it comes out of LLM research, such an ASI would still be systematically different from LLMs as we know them today—
Optimist: —a.k.a., you have no evidence—
Me: —no evidence either way, at least no evidence of that type. Anyway, as I was saying, ASI would be systematically different from today’s LLMs because … umm, where do I start …
…Actually, it would be much easier for me to explain if we start with the ASI threat model that I spend all my time on, and then we can circle back to LLMs afterwards. Is that OK?
Positive argument that “brain-like” RL-agent ASI would be a ruthless sociopath
Optimist: Sure. We can pause the discussion of LLMs for a few minutes, and start in your comfort zone of actor-critic model-based RL-agent “brain-like” ASI. Doesn’t really matter anyway: regardless of the exact algorithm, you clearly need some positive reason to believe that this kind of ASI would be a ruthless sociopath. You can’t just unilaterally declare that your weird unprecedented sci-fi belief is the “default”, and push the burden of proof onto people who disagree with you.
Me: OK. Maybe a good starting point would be my posts LeCun’s ‘A Path Towards Autonomous Machine Intelligence’ has an unsolved technical alignment problem, or ‘The Era of Experience’ has an unsolved technical alignment problem.
Optimist: I’ve read those, but I’m not seeing how they answer my question. Again, what’s your positive argument for ruthless sociopathy? Lay it on me.
Me: Sure. Back at the start of the conversation, I mentioned that random objects like dirt clods are not able to accomplish impressive feats. I didn’t (just) bring up dirt clods to troll you, rather I was laying the groundwork for a key point: If we’re thinking about AI that can autonomously found, grow, and staff innovative companies for years, or autonomously invent new scientific paradigms, then clearly it’s not a “random object”, but rather a thing that is able to accomplish impressive feats. And the question we should be asking is: how does it do that? Those things would be astronomically unlikely to happen if the AI were choosing actions at random. So there has to be some explanation for how the AI finds actions that accomplish those impressive feats.[2]
So an explanation has to exist! What is it? I claim there are really only two answers that work in practice.
The first possible explanation is consequentialism: the AI accomplishes impressive feats by (what amounts to) having desires about what winds up happening in the future, and running some search process to find actions that lead to those desires getting fulfilled. This is the main thing that you get from RL agents, and from model-based planning algorithms. (My “brain-like AGI” scenario would involve both of those at once.) The whole point of those subfields of AI is: these are algorithms designed to find actions that maximize an objective, by any means available.
I.e., you get ruthless sociopathic behavior by default.
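To make the abstract point concrete, here is a minimal sketch of what "consequentialism" means algorithmically. The world model, action set, and objective are all invented for illustration; the point is structural: the search scores every candidate plan on the objective and returns the argmax, and nothing anywhere in the loop asks whether the winning plan is reasonable.

```python
from itertools import product

# Toy deterministic world model (invented for illustration):
# the state is an integer, and each action shifts it.
ACTIONS = {"inc": +1, "dec": -1, "stay": 0}

def rollout(state, plan):
    """Simulate a sequence of actions under the world model."""
    for action in plan:
        state += ACTIONS[action]
    return state

def best_plan(objective, start, horizon):
    """Brute-force model-based planning: score every action sequence
    under the world model and return the one that maximizes the
    objective -- by any means available, with no other considerations."""
    return max(product(ACTIONS, repeat=horizon),
               key=lambda plan: objective(rollout(start, plan)))

# Objective: push the state as high as possible.
plan = best_plan(objective=lambda s: s, start=0, horizon=3)
print(plan)  # → ('inc', 'inc', 'inc')
```

A real RL agent or planner replaces the brute-force loop with something far more scalable, but the shape of the computation, and hence the default single-mindedness, is the same.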
And this is not just my armchair theorizing. Go find someone who was in AI in the 2010s or earlier, before LLMs took over, and they may well have spent a lot of time building or using RL agents and/or model-based planning algorithms. If so, they’ll tell you, based on their lived experience, that these kinds of algorithms are ruthless by default (when they work at all), unless the programmers go out of their way to make them non-ruthless. See e.g. this 2020 DeepMind blog post on “specification gaming”.
And how would the programmers “go out of their way to make them non-ruthless”? I claim that the answer is not obvious, indeed not even known. See my LeCun post, and my Silver & Sutton post, and more generally my post “‘Behaviorist’ RL reward functions lead to scheming” for why obvious approaches to non-ruthlessness won’t work.
Rather, algorithms in this class are naturally, umm, let’s call them, “ruthless-ifiers”, in the sense that they transmute even innocuous-sounding objectives like “it’s good if the human is happy” into scary-sounding ones like “ruthlessly maximize the probability that the human is happy”, which in turn suggest strategies such as forcibly drugging the human. Likewise, the innocuous-sounding “it’s bad to lie” gets ruthless-ified into “it’s bad to get caught lying”, and so on.
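Here is a toy numerical sketch of that "ruthless-ification", with all payoffs and detection probabilities invented for illustration. The designer intends "lying is bad", but a behaviorist reward can only penalize lies that get *observed*, so a reward-maximizing search ranks a well-hidden lie above honesty whenever lying pays:

```python
# Candidate policies: (task payoff, probability the lie is detected).
# All numbers are invented for illustration.
policies = {
    "honest":      (5.0, 0.0),
    "blatant_lie": (8.0, 0.9),
    "covert_lie":  (8.0, 0.1),
}

DETECTION_PENALTY = 10.0

def behaviorist_reward(payoff, p_detected):
    """The designer meant 'lying is bad', but the reward function can
    only see lies that get caught -- so what it actually implements is
    'getting caught is bad'."""
    return payoff - DETECTION_PENALTY * p_detected

best = max(policies, key=lambda name: behaviorist_reward(*policies[name]))
print(best)  # → covert_lie
```

Expected rewards here are 5.0 (honest), −1.0 (blatant lie), and 7.0 (covert lie), so the argmax is the covert lie: the innocuous-sounding objective has been transmuted into a scary one purely by the maximization step.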
Of course, evolution did go out of its way to make humans non-ruthless, by endowing us with social instincts. Maybe future AI programmers will likewise go out of their way to make ASIs non-ruthless? I hope so—but we need to figure out how.
To be clear, ruthless consequentialism isn’t always bad. I’m happy for ruthless consequentialist AIs to be playing chess, designing chips, etc. In principle, I’d even be happy for a ruthless consequentialist AI to be emperor of the universe, creating an awesome future for all—but making that actually happen would be super dangerous for lots of reasons (e.g. you might need to operationalize “creating an awesome future for all” in a loophole-free way; see also “‘The usual agent debugging loop’, and its future catastrophic breakdown”).
…So that’s consequentialism, one possible answer for how an AI might accomplish impressive feats, and it’s an answer that brings in ruthlessness by default.
Circling back to LLMs: imitative learning vs ASI
…And then there’s a second, different possible answer to how an AI might accomplish impressive feats: imitative learning from humans. You train an AI to predict what actions a skilled human would take in many different contexts, and then have the AI take that same action itself. I claim that LLMs get their impressive capabilities almost entirely from imitative learning.[3] By contrast, “true” imitative learning is entirely absent (and impossible) in humans and animals.[4]
Imitative-learning AIs do not have ruthless sociopathy by default, because of course the thing they’re imitating is non-ruthless humans.[5]
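The mechanics of "imitative learning" can be sketched with a toy bigram model (the corpus is invented for illustration). The key structural feature is that the learned predictive distribution over "what a human would say next" is also, with no translation step, the generative policy, which is the sense in which pretraining turns a predictor directly into an imitator:

```python
from collections import Counter, defaultdict

# "Train": count next-word frequencies in a toy human-written corpus.
corpus = "the cat sat on the mat and the cat ran".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict(word):
    """Prediction: the most likely human next word after `word`."""
    return counts[word].most_common(1)[0][0]

def imitate(word, n):
    """Generation is the *same* operation run in a loop: emit the
    predicted word, then condition on it and predict again."""
    out = [word]
    for _ in range(n):
        word = predict(word)
        out.append(word)
    return " ".join(out)

print(predict("the"))  # → cat
print(imitate("the", 3))
```

A real LLM swaps the bigram table for a transformer over tokens, but the predict-then-emit loop is the same, and it inherits whatever style of behavior, ruthless or not, dominated the training distribution.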
Optimist: Huh … Wait … So you’re an optimist about superintelligence (ASI) being non-ruthless, as long as people stick to LLMs?
Me: Alas, no. I think that the full power of consequentialism is super dangerous by default, and I think that the full power of consequentialism is the only way to get ASI, and so AI researchers are going to keep working until they eventually learn to fully tap that power.
In other words, I see a disjunction:
For what it’s worth, I happen to expect that ASI will come from the former (future paradigm shift) rather than the latter (LLM modifications). But it hardly matters in this context.
Optimist: I dunno, if you’re willing to concede that LLMs today are not maximally ruthless, well, LLMs today don’t seem that far from superintelligence. I mean, humans don’t “rocket arbitrarily far past any training data” either. They usually do things that have been done before, or at most (for world experts on the bleeding edge) go just one little step beyond it. LLMs can do both, right?
Me: Yes, but humans collectively and over time can get way, way, way beyond our training data. We’re still using the same brain design that we were using in Pleistocene Africa. Between then and now, there were no angels who dropped training data from the heavens, but humans nevertheless invented language, science, technology, industry, culture, and everything else in the $100T global economy entirely from scratch. We did it all by ourselves, by our own bootstraps, ultimately via the power of consequentialism, as implemented in the RL and model-based planning algorithms in our brains.
(See “Sharp Left Turn” discourse: An opinionated review.)
By the same token, if humanity survives another 1000 years, we will invent wildly new scientific paradigms, build wildly new industries and ways of thinking, etc.
There’s a quadrillion-dollar market for AIs that can likewise do that kind of thing, as humans can. If the LLMs of today don’t pass that bar (and they don’t), then I expect that, sooner or later, either someone will figure out how to get LLMs to pass that bar, or else someone will invent a new non-LLM AI paradigm that passes that bar. Either way, imitative learning is out, consequentialism is in, and we get ruthless sociopath ASIs by default, in the absence of yet-to-be-invented theoretical advances in technical alignment. (And then everyone dies.)
Thanks Jeremy Gillen, Seth Herd, and Justis Mills for critical comments on earlier drafts.
Changelog: 2026-02-23: Added another reference to footnote 3.
[1] We should definitely also be thinking about these other potential problems, don’t get me wrong!
[2] Related: the so-called “Follow-the-Improbability Game”.
[3] Details: “imitative learning” describes LLM pretraining, but not posttraining; my claim is that LLM capabilities come almost entirely from the former, not the latter. That’s not obvious, but I argue for it in “Foom & Doom” §2.3.3, and see also a few papers downplaying the role of RLVR (Karan & Du 2025, Venhoff et al. 2025, Yue et al. 2025), along with “Most Algorithmic Progress is Data Progress” by Beren Millidge.
[4] E.g. if my brain is predicting what someone else will say, that’s related to auditory inputs, and if my brain is speaking, that involves motor-control commands going to my larynx etc. There is no straightforward mechanical translation from one to the other, analogous to the straightforward mechanical translation from “predict the next token” to “output the next token” in LLM pretraining. More in “Foom & Doom” §2.3.2.
[5] See GPTs are Predictors, not Imitators for an even-more-pessimistic-than-me counterargument, and “Foom & Doom” §2.3.3 for why I don’t buy that counterargument.