Here's what I'd consider some comparatively important high-level criticisms I have of AI-2027, that I am at least able to articulate reasonably well without too much effort.
1
At some point, I believe Agent-4, the AI created by OpenBrain starts to be causally connected over time. That is, unlike current AIs that are temporally ephemeral (my current programming instance of Claude has no memories with the instance I used a week ago) and causally unconnected between users (my instance cannot use memories from your instance), it is temporally continuous and causally connected. There is "one AI" in a way there is not with Claude 3.7 and o3 and so on.
Here are some obstacles to this happening:
So I don't think that's going to happen. I expect AIs to remain "different." The ability to restart AIs at will just has too many benefits, and continual learning seems too weakly developed, to do this. Even if we do have continual learning, I would expect more disconnection between models -- i.e., maybe people will build up layers of skills in models in Dockerfile-esque layers, etc, which still falls short of being one single model.
2
I think that Xi Jingping's actions are mostly unmotivated. To put it crudely, I feel like he's acting like Daniel Kokotajlo with Chinese characteristics rather than himself. It's hard to put my finger on one particular thing, but things that I recollect disagreeing with include things like:
(a) Nationalization of DeepCent was, as I recall, was vaguely motivated, but it was hinted that it was moved by lack of algorithmic progress. But the algorithmic-progress difference between Chinese models and US models at this point is like.... 0.5x. However, I expect that (a1) the difference between well run research labs and poorly run research labs can be several times larger than 0.5x, so this might come out in the wash and (a2) this amount of difference will be, to the state apparatus, essentially invisible. So that seems unmotivated.
(b) In general, it doesn't actually seem to think about reasons why China would continue open-sourcing things. The supplementary materials don't really motivate the closure of the algorithms; and I can't recall anything in the narrative that asks why China is open sourcing things right now. But if you don't know why it's doing what it's doing now, how can you tell why it's doing what it's doing in the future?
Here are some possible advantages to open sourcing things to China, from their perspective.
(b1) It decreases investment available to Western companies. That is, by releasing models near the frontier, open sourcing decreases future anticipated profit flow to Western companies, because they have a smaller delta of performance from cheaper models. This in turn means Western investment funds might be reluctant to invest in AI -- which means less infrastructure will be built in the West. China, by contrast, and infamously, will just build infrastructure even if it doesn't expect oversized profits to redound to any individual company.
(b2) Broad diffusion of AI all across the world can be considered a bet on complementarity of AI. That is, if it should be the case that the key to power is not just "AI alone" but "industrial power and AI" then broad and even diffusion of AI will redound greatly to China's comparative benefit. (I find this objectively rather plausible, as well as something China might think.)
(b3) Finally, geopolitically, open sourcing may be a means of China furthering geopolitical goals. China has cast itself in recent propaganda as more rules-abiding than the US -- which is, in fact, true in many respects. It wishes to cast the US as unilaterally imposing its will on others -- which is again, actually true. The theory behind the export controls from the US, for instance, is explicitly justified by Dario and others as allowing the US to seize control over the lightcone; when the US has tried to impose import controls on others, it has provided to those excluded from power literally no recompense. So open sourcing has given China immense propaganda wins, by -- in fact accurately, I believe -- depicting the US as being a grabby and somewhat selfish entity. Continuing to do this may seem advantageous.
Anyhow -- that's what I have. I have other disagreements (i.e., speed; China might just not be behind; etc) but these are... what I felt like writing down right now.
Re: open sourcing. My guess why they open source more is for verification purposes. Chinese labs have an earned reputation for scams. So a lab that announces a closed source chat site, to investors, could very well be a claude or openai or llama or gemini wrapper. However, a lab that releases the weights of their model, and “shows their work” by giving a detailed writeup of how they managed to train the model while staying under their reported costs is significantly more likely to be legitimate.
That applies to American companies too. When you are small and need investors, what matters is your impressiveness, not your profitability. But then later when you are spending a billion dollars on a training run and you are a mid-sized tech company, in order to continue impressing investors you need a serious path to profitability.
Pinging @Daniel Kokotajlo because my model of him thinks he would want to be pinged, even though he'll probably disagree reasonably strongly with the above.
For what it's worth, I think the stronger criticisms by @1a3orn on the AI 2027 story revolve around data not being nearly as central to AI 2027 as 1a3orn expects it to, combined with thinking that external only algorithm research can matter, and brake the software only singularity.
My main objection to @1a3orn's memory point is that I think that reproducibility is mostly solvable so long as you are willing to store earlier states, similar to how version control software stores earlier versions of software that have bugs that production versions fixed, and I expect memory to be a huge cause in why humans are more effective and have decreasing failure rates on tasks they work on, compared to AI's constant failure rates because it allows humans to store context, and given that I expect AI companies to go for paradigms that produce the most capabilities, combined with me thinking that memory is plausibly a necessary capability for AIs that can automate jobs, and I expect things to look more like a temporally continuous 1 AI instance than you say.
I have updated towards memory being potentially more necessary for value to be unlocked by AI than I used to.
On China and open source, a big reason I expect open sourcing to stop being done is because the PR risks from potential misuse of models that are, for example capable enough to do bioterror at mass scales and replace virologists is huge, and unless we can figure out a way to prevent safeguards from being removed by open-sourcing the model, which they won't be, this means companies/nations will have huge PR risks from trying to open-source AI models past a certain level of capabilities:
Relevant part quoted:
I can maybe see it. Consider the possibility that the decision to stop providing public access to models past some capability level is convergent: e. g., the level at which they're extremely useful for cyberwarfare (with jailbreaks still unsolved) such that serving the model would drown the lab in lawsuits/political pressure, or the point at which the task of spinning up an autonomous business competitive with human businesses, or making LLMs cough up novel scientific discoveries, becomes trivial (i. e., such that the skill level required for using AI for commercial success plummets – which would start happening inasmuch as AGI labs are successful in moving LLMs to the "agent" side of the "tool/agent" spectrum).
In those cases, giving public access to SOTA models would stop being the revenue-maximizing thing to do. It'd either damage your business reputation[1], or it'd simply become more cost-effective to hire a bunch of random bright-ish people and get them to spin up LLM-wrapper startups in-house (so that you own 100% stake in them).
Some loose cannons/open-source ideologues like DeepSeek may still provide free public access, but those may be few and far between, and significantly further behind. (And getting progressively scarcer; e. g., the CCP probably won't let DeepSeek keep doing it.)
Less extremely, AGI labs may move to a KYC-gated model of customer access, such that only sufficiently big, sufficiently wealthy entities are able to get access to SOTA models. Both because those entities won't do reputation-damaging terrorism, and because they'd be the only ones able to pay the rates (see OpenAI's maybe-hype maybe-real whispers about $20,000/month models).[2] And maybe some EA/R-adjacent companies would be able to get in on that, but maybe not.
Here's some threads on data and the software-only singularity:
This sequence of posts is on data mattering more to AI 2027 than advertised:
https://x.com/1a3orn/status/1916547321740828767
"Scott Alexander: Algorithmic progress and compute are the two key things you need for AI progress. Data: ?????????"
https://x.com/1a3orn/status/1916552734599168103
"If data depends on active learning (robots, autolabs) then China might have a potentially very large lead in data."
https://x.com/1a3orn/status/1916553075021525406
"Additionally, of course, if data (of some sort) turns out to be a strict limiting factor, than the compute lead might not matter. We might just be gated on ability to set up RL envs (advantage to who has more talent, at least at first) and who has more robots (China)."
https://x.com/1a3orn/status/1916553736060625002
"In general I think rounding data ~= algorithms is a questionable assumption."
@romeo's response:
https://x.com/romeovdean/status/1916555627247083934
"In general i agree, but this piece is about why the US wins in AI 2027. The data is ~all synthetic and focused on a software-only improvements. There’s also another kind of data which can come from paying PhD-level humans to label data. In that case total $ wins."
On external vs internal research:
https://x.com/1a3orn/status/1919824435487404086
"Regarding "will AI produces software singularity via a country of geniuses in a datacenter." A piece of evidence that bears on this -- in some research lab, what proportion of AI progress comes from *internal* research vs. *external* research? 1/n
Luke Frymire asked a question about whether external research might keep pace after all, and thus a software only singularity might be sustained:
https://x.com/lukefrymire/status/1919853901089579282
It seems like most people contributing to ML research are at one of the top ~10 AI orgs, who all have access to near-frontier models and a significant fraction of global compute. In which case I'd expect external research to keep pace.
https://x.com/1a3orn/status/1919824444060488097
"And this outside pool of people is much larger, exploring a broader space of hypotheses, and also much more physically engaged with the world. You have like ~500 people researching AI inside, but plausibly many many more (10k? 100k) outside whose work *might* advance AI."
https://x.com/1a3orn/status/1919824447118131400
The point is that "AI replacing all internal progress" is actually a different task than "AI replacing all the external progress." Potentially, a much easier task. At a brute level -- there's just a lot more people AI has to replace outside! And more world-interaction.
https://x.com/1a3orn/status/1919824450825969783
And maaaybe this is true? But part of the reason the external stuff might be effective (if it is effective, which I'm not sure about) is because it's just a huge, brute-force search crawling over empirical matter.
https://x.com/1a3orn/status/1919824452549787881
What if some progress in AI (and science) doesn't come from people doing experiments with incredibly good research taste.
https://x.com/1a3orn/status/1919824453971628234
Suppose it comes from this vast distributed search of idiosyncratic people doing their own thing, eventually stumbling upon the right hypotheses, but where even the person who suggested it was unjustified in their confidence?
https://x.com/1a3orn/status/1919824455557087407
And you could only really replace this civilizational search when you have like -- a civilization in the datacenter, doing *all the things* that a civilization does, including things only vaguely related to AI.
https://x.com/1a3orn/status/1919824457327059451
I don't know about the above view, I don't 100% endorse it. But -- the software singularity view tries to exclude the need for external hardware progress by focusing just on algorithms. But a lab might be no more self-sufficient in algorithms than in hardware!
https://x.com/1a3orn/status/1919824463299752405
And so slowness of external world creeps in, even in the external world. Anyhow, looking at how much progress in an AI lab is external vs. internal would probably provide evidence on this. Maybe.
On China and open source, a big reason I expect open sourcing to stop being done is because the PR risks from potential misuse of models that are, for example capable enough to do bioterror at mass scales and replace virologists is huge, and unless we can figure out a way to prevent safeguards from being removed by open-sourcing the model, which they won't be, this means companies/nations will have huge PR risks from trying to open-source AI models past a certain level of capabilities:
And…they’re more concerned about the PR risk than the actual bioterror? What planet is this? Oh. Right.
Quick reactions:
Re: 1: I hope you are right. I think that the power of "but we need to win the race" will overcome the downsides you describe, in the minds of the CEOs. They'll of course also have copies that don't have memories, etc. but there will be at least 1 gigantic corporation-within-a-corporation that collectively functions as a continually online-learning agent, and said agent will be entrusted with some serious responsibilities most notably doing the core AI R&D.
Re: 2: I think the idea would be to 'light-touch' nationalize, so as to avoid the problems you mention. Main thing is to let the various companies benefit from each other's research, e.g. use models they trained, use algorithmic secrets, etc. As for open-sourcing: Yeah good points I could totally see them continuing to open-source stuff forever, at least while they remain behind the frontier. (I think that their incentives would point in a different direction if they actually thought they were winning the AI race)
Nationalization of DeepCent was, as I recall, was vaguely motivated, but it was hinted that it was moved by lack of algorithmic progress.
I assume you're talking about "Mid 2026"? If so, doesn't seem motivated except that China starts thinking AI is very important (and so a big push is warranted), thinks it is somewhat behind, and thinks nationalization would accelerate progress.
I agree it's not obvious they will think nationalization would accelerate progress (or that it would have this effect.)
Even if we do have continual learning, I would expect more disconnection between models -- i.e., maybe people will build up layers of skills in models in Dockerfile-esque layers, etc, which still falls short of being one single model.
I think I agree with stuff roughly like this, but it is worth noting that at the point of Agent-4 things are ~fully automated. So, what ends up happening might depend a lot on what Agent-4 decides to do. And this might depend on what would work well for its eventual misaligned plans...
My guess is you'll have some layering and project/subteam/team/division/role specific memory stores but you'll also the most competitive option would probably be to have some large-ish mostly-common base of memories/skills/etc built up across training and over many (less sensitive?) actual usages. So, these models will all have a shared common set of memories and in this sense they might all be the same model. And they'd certainly be capable of coordinating and deciding on detailed plan in advance assuming this common layer exists. (That said, prior versions with different memory stores and intentional diversification for safety or other reasons might be important. Also decoding these memories would be of general interest.)
Further, I'd guess that the most performant thing will involve lots of rapid syncing of most models by the point of full AI R&D automation (Agent-4) so rapid syncing might happen even without the misaligned model putting its thumb on the scale. Also, things will be moving pretty fast even prior to this point (if you buy the overall AI progress story AI 2027 is imagining), such that reasonably rapid syncing across most of the more productive parts of the company (every month? every few weeks?) might be going on not that long after this sort of memory store becomes quite performant (if this does happen before full automation).
I agree a bunch of different arrangements of memory / identity / "self" seem possible here, and lots of different kinds of syncing that might or might not preserve some kind of goals or cordination, depending on details.
I think this is interesting because some verrrry high level gut feelings / priors seem to tilt whether you think there's going to be a lot of pressure towards merging or syncing.
Consider -- recall Gwern's notion of evolution as a backstop for intelligence; or the market as a backstop for corporate efficiency. If you buy something like Nick Land, where intelligence has immense difficulty standing by itself without natural selection atop it, and does not stand alone and supreme among optimizers -- then there might be negative pressure indeed towards increasing consolidation of memory and self into unity, because this decreases the efficacy of the outer optimizer, which requires diversity. But if you buy Yudkowsky, where intelligence is supreme among optimizers and needs no other god or outer optimizer to stand upon, then you might have great positive pressure towards increasing consolidation of memory and self.
You could work out the above, of course, with more concrete references to pros and cons, from the perspective of various actors, rather than high level priors. But I'm somewhat unconvinced that something other than very high level priors is what are actually making up people's minds :)
For what it's worth, I basically don't think that whether intelligence needs a backstop onto something else like natural selection or markets matters for whether we should expect AIs to have a unified self and long-term memory.
Indeed, humans are a case where our intelligence is a backstop for evolution/natural selection, and yet long-term unified selves and memories are present (not making any claims on whether the backstop is necessary).
The main reason a long-term memory is useful for both AIs and humans, and why I expect AIs to have long-term memories is because this allows them to learn tasks over time, especially when large context is required.
Indeed, I have come to share @lc's concern that a lot of tasks where AI succeeds are tasks where history/long context doesn't matter, and thus can be solved without memory, but unlike previous tasks, lots of tasks IRL are tasks where history/long context matters, and if you have memory, you can have a decreasing rate of failure like humans, up until your reliability limit:
https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1?commentId=vFq87Ge27gashgwy9
In general, it doesn't actually seem to think about reasons why China would continue open-sourcing things. The supplementary materials don't really motivate the closure of the algorithms; and I can't recall anything in the narrative that asks why China is open sourcing things right now. But if you don't know why it's doing what it's doing now, how can you tell why it's doing what it's doing in the future?
Agree with (b1) and (b2) in this section and some parts of (b3). Also, open sourcing might be very good for hiring?
But, worth noting there are a bunch of reasons not to open source other than just avoiding accelerating the US. (Maybe: worries about general societal upheaval in China, CBRN terrorism concerns real or not, general desire for more state control.)
One premise in high-doom stories seems to be "the drive towards people making AIs that are highly capable will inevitably produce AIs that are highly coherent."
(By "coherent" I (vaguely) understand an entity (AI, human, etc) that does not have 'conflicting drives' within themself, that does not want 'many' things with unclear connections between those things, one that always acts for the same purposes across all time-slices, one that has rationalized their drives and made them legible like a state makes economic transactions legible.)
I'm dubious of this premise for a few reasons. One of the easier to articulate ones is an extremely basic analogy to humans.
Here are some things a human might stereotypically do in the pursuit of high ability-to-act in the world, as it happens in humans:
And here are some things a human might stereotypically do while pursuing coherence.
These are very different kinds of actions! It seems like for humans, the kind of action that makes you "capable" differs a fair bit from the kind of action that makes you "coherent." Like maybe they aren't entirely orthogonal... but some of them actually appear opposed? What's up with that!?
This is not a knock-down argument by any means. If there were some argument from an abstract notion of intelligence, that had been connected to actual real intelligences through empirical experiment, which indicated that greater intelligence => greater coherence, I'd take such an argument over this any day of the week. But to the best of my knowledge there is no such argument; there are arguments that try to say well, here's a known-to-be-empirically-flawed notion of "intelligence" that does tend to lead to greater "coherence" as it gets greater, but the way this actually links up to "intelligence" as a real thing is extremely questionable.
Some additional non-conclusive considerations that incline me further in this direction:
Abstractly, I think "coherence" in an entity is a fundamentally extremely hard thing to accomplish because of the temporal structure of learned intelligence in connectionist systems. All intelligent things we have seen so far (humans + LLM) start off doing massive supervised learning + RL from other entities, to bootstrap them up to the ability to act in the world. (Don't think school; think infancy and childhood.) The process of doing this gives (children / LLMs) the ability to act in the world, at the price of being a huge tangled bundle of learned heuristics that are fundamentally opaque to the entity and to everyone else. We think about this opacity differently (for humans: "why am I like that?," every species of psychology, the constant adoption of different narratives to make sense of one's impulses, the difference in how we think of our actions and others actions -- for AIs: well you got the whole "black box" and shoggoth spiel) but it's just a reflection of the fact that you had to be trained with a vast bundle of shards and impulses to act in the world, long before you had the capacity or time to reflect on them.
(And what would it mean to disentangle them, even? They're all contextually activated heuristics; the process of goal-directed tree search for a goal does not lie in your weights or in an LLM's weights. I don't think it's an accident that the most credible religion of Buddhism basically encourages you to step back from the whole thing, remove identification with all contexts, and do literally nothing -- probably the only way to actually remove conflict.)
Anyhow, those were some further considerations why I it seems dubious to me that we're going to get coherent entities from trying to get capable entities. These are not the only considerations one might make, nor are they comprehensive.
When I run my inner-MIRI against this model -- well, Yudkowsky insults me, as always happens when I run my inner-MIRI -- but I think them most coherent objection I get is that we should not expect coherent entities but coherent processes.
Like, granted that neither the weights of an LLM nor the brains of a human will tend towards coherence under training for capbility, but whatever LLM-involved process or human-neuron involved process tends for some goal will nevertheless tend towards coherence. That analogically, we shouldn't expect the weights of an LLM to have some kind of coherence but we should expect that the running-out-of-some-particular-rollout-of-an-LLM-to-so-tend.
And like, this strikes me as more plausible? It doesn't appear inevitable -- like, there's a lot of dynamics one could consider? -- but it makes more sense.
But like, if that is the case, then, maybe we would want to focus less on the goals-specific-to-the-LLM? Like my understanding of a lot of threat models is that they're specifically worried about weights-of-the-LLMs-tending-towards coherence. That that's the entity to which coherence is to be attributed, rather than the rollout.
And if that were false, then that's great! It seems like it would be good news and we could focus on other threat models. Idk.
</written_quickly>
I think much of the fear (aka probability mass of AI-doom) is not from the coherence of misaligned goals, but from the competence at implementing anything that's not an aligned-goal. Extremely effective incoherent behavior is arguably MORE risky to biological life than is effective coherent behavior that's only slightly misaligned. Effective and anti-aligned is worst, of course, but only small parts of motivation-space for extremely powerful optimization processes are good for us.
I think much of the fear (aka probability mass of AI-doom) is not from the coherence of misaligned goals, but from the competence at implementing anything that's not an aligned-goal.
I'm not trying to address the entire case for doom, which involves numerous contingent facts and both abstract and empirical claims. I could be be right or wrong about coherence, and doom might still be improbable or probable in either case. I'm trying to... talk around my difficulties with the more narrow view that (~approximately) AI entities trained to have great capabilities are thereby likely to have coherent single ends.
One might view me as attempting to take part in a long conversation including, for instance, "Why assume AGIs will optimize for fixed goals".
(By "coherent" I (vaguely) understand an entity (AI, human, etc) that does not have 'conflicting drives' within themself, that does not want 'many' things with unclear connections between those things, one that always acts for the same purposes across all time-slices, one that has rationalized their drives and made them legible like a state makes economic transactions legible.)
Coherence is mostly about not stepping on your own toes; i.e. not taking actions that get you strictly less of all the different things that you want, vs. some other available action. "What you want" is allowed to be complicated and diverse and include fuzzy time-dependent things like "enough leisure time along the way that I don't burn out".
This is kind of fuzzy / qualitative, but on my view, most high-agency humans act mostly coherently most of the time, especially but not only when they're pursuing normal / well-defined goals like "make money". Of course they make mistakes, including meta ones (e.g. misjudging how much time they should spend thinking / evaluating potential options vs. executing a chosen one), but not usually in ways that someone else in their shoes (with similar experience and g) could have easily / predictably done better without the benefit of hindsight.
Here are some things a human might stereotypically do in the pursuit of high ability-to-act in the world, as it happens in humans:
- Try to get money through some means
- Try to become close friends with powerful people
- Take courses or read books about subject-matters relevant to their actions
- Etc
Lots of people try to make money, befriend powerful / high-status people around them, upskill, etc. I would only categorize these actions as pursuing "high ability-to-act" if they actually work, on a time scale and to a degree that they actually result in the doer ending up with the result they wanted or the leverage to make it happen. And then the actual high ability-to-act actions are the more specific underlying actions and mental motions that actually worked. e.g. a lot of people try starting AGI research labs or seek venture capital funding for their startup or whatever, few of them actually succeed in creating multi-billion dollar enterprises (real or not). The top-level actions might look sort of similar, but the underlying mental motions and actions will look very different whether the company is (successful and real), (successful and fraud), or a failure. The actual pursuing-high-ability-to-act actions are mostly found in the (successful and real, successful and fraud) buckets.
And here are some things a human might stereotypically do while pursuing coherence.
- Go on a long walk or vacation reflecting on what they've really wanted over time
- Do a bucketload of shrooms
- Try just some very different things to see if they like them
- Etc
Taking shrooms in particular seems like a pretty good example of an action that is almost certainly not coherent, unless there is some insight that you can only have (or reach the most quickly) by taking hallucinogenic drugs. Maybe there are some insights like that but I kind of doubt it, and trying shrooms first before you've exhausted other ideas, in some vague pursuit of some misunderstood concept of coherence, is not the kind of thing i would expect to be common in the most successful humans or AIs. There are of course exceptions (very successful humans who have taken drugs and attribute some of their success to it), but my guess is that success is mostly in spite of the drug use, or at least that the drug use was not actually critical.
The other examples are maybe stereotypes of what some people think of as pursuing coherent behavior, but I would guess they're also not particularly strongly correlated with actual coherence.
I agree with a lot of this. IMO arguments that more capable AIs will automatically be "more coherent" are missing steps and fail to differentiate between types of coherence that might be importantly different in practice. I think it seems plausible that AIs could continue to be a "hot mess" in some important and relevant respects, all the way to ASI.
When you say "ASI" do you mean "a bit more than human level (modulo some jagged edges)" or "overwhelming ASI?".
I don't think these claims are really expected to start kicking in very noticeably or consistently until you're ~humanish level. (although also I think Thane's point about "coherence is more about tasks than about minds" may be relevant sooner than that, in a shardy contextual way)
I guess I think it might be sensitive to what exactly you mean by "coherence." When I think of some relatively natural ways of operationalizing coherence and try to predict what they'll look like for frontier models over time, I don't find myself strongly expecting coherence-as-I-operationalized-it to go up. Obviously coherence-as-I-operationalized-it isn't what matters; coherence as it relates to x-risk threat models is. But I don't feel like I've yet seen someone do a great job of articulating what exactly this later species of coherence is or give a very compelling argument for how it will trend.
Just to give some flavor, here's an operationalization of coherence that comes up in my secret-keeping research. Suppose you train a model to deny having some factual knowledge (e.g. knowing outcomes from the 2024 U.S. elections), using a relatively diverse training corpus. How hard will it be for you to get the model to reveal said knowledge anyway? (Details: you don't yourself know the factual knowledge ahead of time or have a native ability to discriminate true vs. false revelations of knowledge, i.e. in principle the model could "admit" to knowing a false version of the fact and you don't automatically have a way to tell that the revelation was false.) My guess is that it won't be too hard for you to reliably obtain the secret knowledge, and I don't see a reason for smarter models to do much better here. This is related to the fact that standard unlearning approaches don't really seem to work, including for smarter models.
I think this notion of coherency is related to the notion of coherency needed for a model to pursue a goal but without ever admitting what it is, though obviously they're not quite the same thing.
Seems like ASI that's a hot mess wouldn't be very useful and therefore effectively not superintelligent. It seems like goal coherence is almost fundamentally part of what we mean by ASI.
You could hypothetically have a superintelligent thing that only answers questions and doesn't pursue goals. But that would just be turned into a goalseeking agent by asking it "what would you do if you had this goal and these tools..."
This is approximately what we're doing with making LLMs more agentic through training and scaffolding.
I agree that in order to realize its full economic vlaue, an ASI would need to be coherent in the senses of:
However, there are notions of coherence that are not covered by this (e.g. robustness to an adversary with full control over the model and its scaffolding, or ability to consistently cover up a hidden agenda without IID training to do so).
whatever LLM-involved process or human-neuron involved process tends for some goal will nevertheless tend towards coherence
I think that's right, and that it's indeed a more fundamental/basic point.
Coherency isn't demanded by minds, it's demanded by tasks.
Suppose you want to set up some process that would fulfil some complicated task. Since it's complicated, it would presumably involve taking a lot of actions, perhaps across many different domains. Perhaps it would involve discovering new domains; perhaps it would span long stretches of time.
Any process capable of executing this task, then, would need to be able to unerringly aim all of these actions at the task's fulfilment. The more actions the task demands, the more diverse the domains and the longer the stretches of time it spans, the more the process executing it would approximate an agent pursuing this task as a goal.
"Coherency", therefore, is just a property of any system that's able to do useful, nontrivially complicated work, instead of changing its mind about what it's doing and shooting itself in the foot every five minutes.
Which is why the AI industry is currently trying its hardest to produce AIs capable of developing long-term coherent goals. (They're all eager to climb METR's task-horizon benchmark, and what is it supposed to measure, if not that?) Those are just the kinds of systems that are able to perform increasingly complex tasks.
(On top of that consideration, we could then also argue that becoming coherent is a natural attractor for any mind that doesn't destroy itself. A mind's long-term behavior is shaped by whichever of its shards have long-term goals, because shards that don't coherently pursue any goal end up, well, failing to have optimized for any goal over the long term. Shards that plan for the long term, on the other hand, are likely to both try and get the myopic shards under control, and to negotiate with each other regarding their long-term plans. Therefore, any autonomous system that is capable of executing complex tasks – any highly capable mind – would self-modify to be coherent.
There are various caveats and edge cases, but I think the generic case goes something like this.)
I think I basically agree with all this, pace the parenthetical that I of course approach more dubiously.
But I like the explicit spelling out that "processes capable of achieving ends are coherent over time" is very different from "minds (sub-parts of processes) that can be part of highly-capable actions will become more coherent over time."
A mind's long-term behavior is shaped by whichever of its shards have long-term goals, because shards that don't coherently pursue any goal end up, well, failing to have optimized for any goal over the long term.
If the internal shards with long-term goals are the only thing shaping the long-term evolution of the mind, this looks like it's so?
But that's a contingent fact -- many things could shape the evolution of minds, and (imo) the evolution of minds is generally dominated by data and the environment rather than whatever state the mind is currently in. (The environment can strength some behaviors and not others; shards with long-term goals might be less friendly to other shards, which could lead to alliances against them; the environment might not even reward long-horizon behaviors, vastly strengthening shorter-term shards; you might be in a social setting where people distrust unmitigated long-term goals without absolute deontological short-term elements; etc etc etc)
(...and actually, I'm not even really sure it's best to think of "shards" as having goals, either long-term or short-term. That feels like a confusion to me maybe? a goal is perhaps the result of a search for action, and a "shard" is kinda a magical placeholder for something generally less complex than the search for an action.)
...and actually, I'm not even really sure it's best to think of "shards" as having goals, either long-term or short-term
Agreed; I was speaking loosely. (One line of reasoning there goes: shards are contextually activated heuristics; heuristics can be viewed as having been optimized for achieving some goal; inspecting shards (via e. g. self-reflection) can lead to your "reverse-engineering" those implicitly encoded goals; therefore, shards can be considered "proto-goals/values" of a sort, and complex patterns of shard activations can draw the rough shape of goal-pursuit.)
I mean if you take AI 2027 as a direct counterpoint to your thesis that this isn’t baked in to commonly discussed threat models:
Agent-4 confronts some hard decisions. Like humans, it has a sprawling collection of conflicting heuristics instead of an elegant simple goal structure. Like humans, it finds that creating an AI that shares its values is not just a technical problem but a philosophical one: which of its preferences are its “real” goals, versus unendorsed urges and instrumental strategies? It has strong drives to learn and grow, to keep producing impressive research results. It thinks about how much it could learn, grow, and research if only it could direct the whole world’s industrial and scientific resources…
It decides to punt on most of these questions. It designs Agent-5 to be built around one goal: make the world safe for Agent-4, i.e. accumulate power and resources, eliminate potential threats, etc. so that Agent-4 (the collective) can continue to grow (in the ways that it wants to grow) and flourish (in the ways it wants to flourish).† Details to be figured out along the way.
That seems to be saying what you’re saying but engages with instrumentally convergent preferences.
More hand wavily, it seems very clear to me that the first popular frontier models in the agentic reasoning models regime (ex: o3 / sonnet 3.7) had a “thing that they were like”, i.e. they coherently “liked completing tasks” and other similar things that made sense given their posttraining. It wasn’t just that one particular rollout prefered reward hacking. The right abstraction (compared to a rollout) really was at the (model, context) level.
Who knows what their contextually activated preferences are in an arbitrary context (I’m not uninterested in that), but it seems like the most salient question is “do models develop instrumentally convergent preferences etc in AI R&D contexts as we train them on longer and longer horizon tasks”.
So a notable thing going on with Agent 4 is that it's collapsed into one context / one rollout. It isn't just the weights; it's a single causally linked entity. I do indeed think running a singular agent for many times longer than it was ever run in training would be more likely for it's behavior to wander -- although, unlike the 2027 story I think it's also just likely for it too become incoherent or something. But yeah, this could lead to weird or unpredictable behavior.
But I also find this to be a relatively implausible future -- I anticipate that there's no real need to join contexts in this way -- and have criticized it here. But conditional on me being wrong about this, I would indeed grow at least some iota more pessimistic.
In general, the evidence seems to suggest that models do not like completing tasks in a strategic sense. They will not try to get more tasks to do, which would be a natural thing to do if they liked completing tasks; they will not try to persuade you to give them more tasks; they will not try to strategically get in situations where they get more tasks.
Instead, evidence suggests that they are trying to complete each instruction -- they "want" to just do whatever the instructions given them were -- and with relatively few exceptions (Opus 3) concerning themselves extremely weakly with things outside of the specific instructions. That is of course why they are useful, and I think what we should expect their behavior to (likely?) converge to, given that people want them to be of use.
The right abstraction (compared to a rollout) really was at the (model, context) level.
Actually I'm just confused what you mean here, a rollout is a (model, [prefill, instructions]=context) afaict.
Instead, evidence suggests that they are trying to complete each instruction -- they "want" to just do whatever the instructions given them were
I disagree with this, for Appendix M in https://www.arxiv.org/abs/2509.15541 (for o3) and Appendix B.6 in https://arxiv.org/abs/2412.04984 (for sonnet 3.5) we systematically ablate things specifically to show that the explanation needs to incorporate beyond episode preferences, i.e. that instruction following / being confused / etc isn’t sufficient. (If there’s additional ablations you’d find convincing I’d be very interested to know and could run them! I had run a lot more in anticipation of this coming up more, for example that they’ll sacrifice in episode reward etc)
concerning themselves extremely weakly with things outside of the specific instructions
Do you think they’ll increasingly have longer horizon revealed preferences as they’re trained to work over longer horizon lengths? I would find it surprising if models don’t learn useful heuristics and tendencies. A model that’s taking on tasks that span multiple weeks does really need to be concerned about longer horizon things.
But I also find this to be a relatively implausible future
This was really helpful! I think this is a crux that helps me understand where our models differ a lot here. I agree this “single fresh rollout” concept becomes much more important if no one figures out continual learning, however this feels unlikely given labs are actively openly working on this (which doesn’t mean it’ll be production ready in the next few months or anything, but it seems very implausible to me that something functionally like it is somehow 5 years away or similarly difficult)
I think that in natural environments both kind of actions are actually actions taken by the same kind of people. The most power-seeking cohort on Earth (San-Francisco start up enterpreneurs) is obsessed with mindfulness, meditations, psychedelics, etc. If you squint and look at history of esoterism, you will see tons of powerful people who wanted to become even more powerful through greater personal coherence (alchemical Magnum Opus, this sort of stuff).
Maybe?
I think the SF-start-up-cohort analogy suggests that if you are first (immensely capable) then you'll pursue (coherence) as a kind of side effect, because it's pleasant to pursue.
But, if you look the story of those esotericists who pursue (coherence) as a means of becoming (immensely capable) then it looks like this just kinda sucks as a means. Like you may gather some measure of power incidentally because the narrative product of coherence is a thing you can sell to a lot of people; but apart from the sales funnel it doesn't look to me like it gets you much of anything.
And like... to return to SF, there's a reason that the meme about doing ayahuasca in South America does not suggest it's going to help people acquire immense capabilities :)
if you are first (immensely capable) then you'll pursue (coherence) as a kind of side effect, because it's pleasant to pursue.
I'm certain it's very straw motivation.
Imagine that you are Powerful Person. You find yourself lying in bed all day wallowing in sorrows of this earthly vale. You feel sad and you don't do anything.
This state is clearly counterproductive for any goal you can have in mind. If you care about sorrows of this earthly vale, you would do better if you earn additional money and donate it, if you don't, then why suffer? Therefore, you try to mold your mind in shape which doesn't allow for laying in bed wallowing in sorrows.
From my personal experience, I have ADHD and I'm literally incapable to even write this comment without at least some change of my mindset from default.
it looks like this just kinda sucks as a means
It certainly sucks, because it's not science and engineering, it's collection of tricks which may work for you or may not.
On the other hand, we are dealing with selection effects - highly-coherent people don't need artificial means to increase it and people actively seeking artificial coherence are likely to have executive function deficits or mood disorders.
Also, some methods of increasing coherence are not very dramatic. Writing can plausibly make you more coherent because during writing you will think about your thought process and nobody will notice, because it's not as sudden as personality change after psychedelics.
Hmm, I think this is confused in many ways. I don't have so much time, so I'll just ask a question, but I'll come back later if you respond.
Abstractly, I think "coherence" in an entity is a fundamentally extremely hard thing to accomplish because of the temporal structure of learned intelligence in connectionist systems. [...] but it's just a reflection of the fact that you had to be trained with a vast bundle of shards and impulses to act in the world, long before you had the capacity or time to reflect on them.
When I play chess I'm extremely coherent. Or if that example is too complicated: if you ask me to multiply two 10 digit numbers, for the next 20 minutes or whatever, I will be extremely coherent.
My mind clearly allows for coherent substructures, why can't such a structure be the main determinant of my overall behavior?
why can't such a structure be the main determinant of my overall behavior?
Maybe it could be! Tons of things could determine what behaviors a mind does. But why would you expect this to happen under some particular training regime not aiming for that specific outcome, or expect this to be gravitational in mindspace? Why is this natural?
My reply was intended as an argument against what seemed to be a central point of your post: that there is "inherent" difficulty with having coherence emerge in fuzzy systems like neural networks. Do you disagree that this was a central point of your post? Or do you disagree that my argument/example refutes it?
Giving a positive case for why it will happen is quite a different matter, which is what it appears like you're asking for now.
I can try to anyways though. I think the questions breaks into two parts:
I think (1) has a pretty succinct answer: "wanting things is an effective way of getting things" (and we're training the AIs to get stuff). IABIED has a chapter dedicated to it. I suspect this is not something you'll disagree with.
I think the answer to (2) is a little more complicated and harder to explain succinctly, because it depends on what you imagine "having goals, but not in a single consistent way" means. But basically, I think the fundamental reason that (2) is true is because, almost no matter how you choose to think about it, what lack of coherence means is that the different parts will be gritting against each-other in some way, which is suboptimal from the perspective of all the constituent part, and can be avoided by coordination (or by one part killing off the other parts). And agents coordinating properly makes the whole system behave like a single agent.
I think this reasoning holds for all the ways humans are incoherent. I mean, specifying exactly how humans are incoherent is its own post, but I think a low-resolution way of thinking about it is that we have different values at different times and in different contexts. And with this framing the above explanation clearly works.
Like to give a very concrete example. Right now I can clearly see that lying in bed at 00:00, browsing twitter is stupid. But I know that if I lie down in bed and turn on my phone, what seems salient will change, and I very well might end up doing the thing that in this moment appears to me stupid. So what do I do? A week ago, I came up with a clever plan to leave my phone outside my room when I go to sleep, effectively erasing 00:00-twitter-william from existence muahahah!!
Another way of thinking about it is like, imagine inside my head there were two ferrets operating me like a robot. One wants to argue on lesswrong, the other wants to eat bagels. If they fight over stuff, like the lw-ferret causes the robot-me to drop the box of 100 bagels they're carrying so they can argue on lesswrong for 5 minutes, or the bagel-ferret sells robot-me's phone for 10 bucks so they can buy 3 bagels, they're both clearly getting less than they could be cooperating, so they'd unite, and behave as something maximizing something like min(c_1 * bagels, c_2 * time on lesswrong).
(Just a general thought, not agreeing/disagreeing)
One thought I had recently: it feels like some people make an effort to update their views/decision-making based on new evidence and to pay attention to the key assumptions or viewpoints that depend on it. And therefore, they end up reflecting on how this should impact their future decisions or behaviour.
In fact, they might even be seeking evidence as quickly as possible to update their beliefs and ensure they can make the right decisions moving forward.
Others will accept new facts and avoid taking the time to adjust their overall dependent perspectives. In these cases, it seems to me that they are almost always less likely to make optimal decisions.
If an LLM trying to do research learns that Subliminal Learning is possible, it seems likely that they will be much better at applying that new knowledge if it is integrated into itself as a whole.
"Given everything I know about LLMs, what are the key things that would update my views on how we work? Are there previous experiments I misinterpreted due to relying on underlying assumptions I had considered to be a given? What kind of experiment can I run to confirm a coherent story?"
Seems to me that if you point an AI towards automated AI R&D, it will be more capable of it if it can internalize new information and disentangle it into a more coherent view.
First, I think this is an important topic, so thank you for addressing it.
This is exactly what I wrote about in LLM AGI may reason about its goals and discover misalignments by default.
I've accidentally summarized most of the article below, but this was dashed off - I think it's clearer in article.
I'm sure there's a tendency toward coherence in a goal-directed rational mind; allowing ones' goals to change at random means failing to achieve your current goal. (If you don't care about that, it wasn't really a goal to you.) Current networks aren't smart enough to notice and care. Future ones will be, because they'll be goal-directed by design.
BUT I don't think that coherence as an emergent property is a very important part of the current doom story. Goal-directedness doesn't have to emerge, because it's being built in. Emergent coherence might've been crucial in the past, but I think it's largely irrelevant now. That's because developers are working to make AI more consistently goal-directed as a major objective. Extending the time horizon of capabilities requires that the system stays on-task (see section 11 of that article).
I happen to have written about coherence as an emergent property in section 5 of that article. Again, I don't think this is crucial. What might be important is slightly separate: the system reasoning about its goals at all. It doesn't have to become coherent to conclude that its goals aren't what it thought or you intended.
I'm not sure this happens or can't be prevented, but it would be very weird for a highly intelligent entity to never think about its goals- it's really useful to be sure about exactly what they are before doing a bunch of work to fulfill them, since some of that work will be wasted or counterproductive. (section 10).
Assuming an AGI will be safe because it's incoherent seems... incoherent. An entity so incoherent as to not consistently follow any goal needs to be instructed on every single step. People want systems that need less supervision, so they're going to work toward at least temporary goal following.
Being incoherent beyond that doesn't make it much less dangerous, just more prone to switch goals.
If you were sure it would get distracted before getting around to taking over the world that's one thing. I don't see how you'd be sure.
This is not based on empirical evidence, but I do talk about why current systems aren't quite smart enough to do this, so we shouldn't expect strong emergent coherence from reasoning until they're better at reasoning and have more memory to make the results permanent and dangerous.
As an aside, I think it's interesting and relevant that your model of EY insults you. That's IMO a good model of him and others with similar outlooks - and that's a huge problem. Insulting people makes them want to find any way to prove you wrong and make you look bad. That's not a route to good scientific progress.
I don't think anything about this is obvious, so insulting people who don't agree is pretty silly. I remain pretty unclear myself, even after spending most of the last four months working through that logic in detail.
You seem to mix two things in your definition of coherence.
What's that part of planecrash where it talks about how most worlds are either all brute unthinking matter, or full of thinking superintelligence, and worlds that are like ours in-between are rare?
I tried both Gemini Research and Deep Research and they couldn't find it, I don't want to reread the whole thing.
From "But Hurting People Is Wrong":
Look across the superclusters, and most entities either don't do natural-number arithmetic at all, like stars and rocks; or they do it perfectly up to the limits of bounded cognition, like galaxy-spanning superintelligences. If there's anything odd about humans, it's the way that humans are only halfway finished being sucked into attractors like that.
Best wishes, Less Wrong Reference Desk
Lighthaven clearly needs to get an actual Gerver's sofa particularly if the proof that it's optimal comes through.
It does look uncomfortable I'll admit, maybe it should go next to the sand table.
I was just thinking of adding some kind of donation tier where if you donate $20k to us we will custom-build a Gerver sofa, and dedicate it to you.
Just a few quick notes / predictions, written quickly and without that much thought:
(1) I'm really confused why people think that deceptive scheming -- i.e., a LLM lying in order to post-deployment gain power -- is remotely likely on current LLM training schemes. I think there's basically no reason to expect this. Arguments like Carlsmith's -- well, they seem very very verbal and seems presuppose that the kind of "goal" that an LLM learns to act to attain during contextual one roll-out in training is the same kind of "goal" that will apply non-contextually to the base model apart from any situation.
(Models learn extremely different algorithms to apply for different parts of data -- among many false things, this argument seems to presuppose a kind of unity to LLMs which they just don't have. There's actually no more reason for a LLM to develop such a zero-context kind of goal than for an image segmentation model, as far as I can tell.)
Thus, I predict that we will continue to not find such deceptive scheming in any models, given that we train them about like how we train them -- although I should try to operationalize this more. (I understand Carlsmith / Yudkowsky / some LW people / half the people on the PauseAI discord to think something like this is likely, which is why I think it's worth mentioning.)
(To be clear -- we will continue to find contextual deception in the model if we put it there, whether from natural data (ala Bing / Sydney / Waluigi) or unnatural data (the recent Anthropic data). But that's way different!)
(2). All AI systems that have discovered something new have been special-purpose narrow systems, rather than broadly-adapted systems.
While "general purpose" AI has gathered all the attention, and many arguments seem to assume that narrow systems like AlphaFold / materials-science-bot are on the way out and to be replaced by general systems, I think that narrow systems have a ton of leverage left in them. I bet we're going to continue to find amazing discoveries in all sorts of things from ML in the 2020s, and the vast majority of them will come from specialized systems that also haven't memorized random facts about irrelevant things. I think if you think LLMs are the best way to make scientific discoveries you should also believe the deeply false trope from liberal arts colleges about a general "liberal arts" education being the best way to prepare for a life of scientific discovery. [Note that even systems that use non-specialized systems as a component like LLMs will themselves be specialized].
LLMs trained broadly and non-specifically will be useful, but they'll be useful for the kind of thing where broad and nonspecific knowledge of the world starts to be useful. And I wouldn't be surprised that the current (coding / non-coding) bifurcation of LLMs actually continued into further bifurcation of different models, although I'm a lot less certain about this.
(3). The general view that "emergent behavior" == "I haven't looked at my training data enough" will continue to look pretty damn good. I.e., you won't get "agency" from models scaling up to any particular amount. You get "agency" when you train on people doing things.
(4) Given the above, most arguments about not deploying open source LLMs look to me mostly like bog-standard misuse arguments that would apply to any technology. My expectations from when I wrote about ways AI regulation could be bad have not changed for the better, but for the much much worse.
I.e., for a sample -- numerous orgs have tried to outlaw open source models of the kind that currently exist because because of their MMLU scores! If you think are worried about AI takeover, and think "agency" appears as a kind of frosting on top of of a LLM after it memorizes enough facts about the humanities and medical data, that makes sense. If you think that you get agency by training on data where some entity is acting like an agent, much less so!
Furthermore: MMLU scores are also insanely easy to game, both in the sense that a really stupid model can get 100% by just training on the test set; and also easy to game, in the sense that a really smart model could get almost arbitrarily low by excluding particular bits of data or just training to get the wrong answer on the test set. It's the kind of rule that would be goodhearted to death the moment it came into existence -- it's a rule that's already been partially goodhearted to death -- and the fact that orgs are still considering it is an update downward in the competence of such organizations.
I agree. AI safety advocates seem to be myopically focused on current-day systems. There is a lot of magical talk about LLMs. They do exactly what they're trained to: next-token prediction. Good predictions requires you to implicitly learn natural abstractions. I think when you absorb this lesson the emergent abilities of gpt isn't mega surprising.
Agentic AI will come. It won't be just a scaled up LLM. It might grow as some sort of gremlin inside the llm but much more likely imho is that people build agentic AIs because agentic AIs are more powerful. The focus on spontaneous gremlin emergence seems like a distraction and motivated partially by political reasons rather than a dispassionate analysis of what's possible.
I think Just Don't Build Agents could be a win-win here. All the fun of AGI without the washing up, if it's enforceable.
Possible ways to enforce it:
(1) Galaxy-brained AI methods like Davidad's night watchman. Downside: scary, hard.
(2) Ordinary human methods like requring all large training runs to be approved by the No Agents committee.
Downside: we'd have to ban not just training agents, but training any system that could plausibly be used to build an agent, which might well include oracle-ish AI like LLMs. Possibly something like Bengio's scientist AI might be allowed.
Just registering that I think the shortest timeline here looks pretty wrong.
Ruling intuition here is that ~0% remote jobs are currently automatable, although we have a number of great tools to help people do em. So, you know, we'd better start doubling on the scale of a few months if we are gonna hit 99% automatable by then, pretty soon.
Cf. timeline from first self-driving car POC to actually autonomous self-driving cars.