Relevant posts on this point which argue that catching misalignment is a big help in fixing it (which is relevant to the bumpers plan):
Catching AIs red-handed by Ryan Greenblatt and Buck Shlegeris:
https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed
Handling schemers if shutdown is not an option, by Buck Shlegeris:
https://www.lesswrong.com/posts/XxjScx4niRLWTfuD5/handling-schemers-if-shutdown-is-not-an-option
Note that I was talking about both long-term memory and continual learning, not just continual learning, so I'm happy to concede that my proposed architecture is not like how LLMs are trained today, and thus could reasonably be called a non-LLM architecture.
Though I will say that the BabyLM challenge, and to a lesser extent the connect-the-dots paper, are evidence that current LLMs' data inefficiency is not a fundamental limitation: AI companies simply haven't needed LLMs to be data efficient for them to work so far. By 2028-2030, though, that won't work nearly as effectively, assuming LLMs haven't automated away all AI research by then.
You've mentioned the need for a missing update, and I think part of that missing update is that we didn't really realize how large the entire internet was. That size gave the fuel for the very impressive LLM scaling, but it is finite, and could very plausibly not be enough for the current companies' LLMs.
However, I'm inclined towards thinking the issue may not be as fundamental as you think it is, for the reason @abramdemski said below:
My earlier perspective, which asserted "LLMs are fundamentally less data-efficient than humans, because the representational capabilities of Transformers aren't adequate for human concepts, so LLMs have to memorize many cases where humans can use one generalization" would have predicted that it is not possible to achieve GPT2 levels of linguistic competence on so little data.
Given the budgets involved, I think it is not at all surprising that only a GPT2 level of competence was reached. It therefore becomes plausible that a scaled-up effort of the same sort could reach GPT4 levels or higher with human-scale data.
The point being: it seems to me like LLMs can have similar data-efficiency to humans if effort is put in that direction. The reason we are seeing such a drastic difference now is due more to where the low-hanging fruit lies, rather than fundamental limitations of LLMs.
Remember, this is a small-scale experiment, and you often have to go big to make use of new findings, even if there are enough efficiency tricks that, at the end, you can make an AI that is both very capable and more data-efficient than modern human learning. (I'm not assuming such a method exists for making LLMs more data efficient than a human; I'm claiming that if it does, scaling would still be needed to find those efficiency tricks.)
So its being only as good as GPT-2 is unsurprising. Keep in mind that GPT-3 was trained by OpenAI, which absolutely believed in the ability to scale up compute and had more resources than academic groups at the time.
To respond to this:
Your remarks sound to me like "We just need X", which I addressed here: https://www.lesswrong.com/posts/sTDfraZab47KiRMmT/views-on-when-agi-comes-and-on-strategy-to-reduce#_We_just_need_X__intuitions
See also https://tsvibt.blogspot.com/2023/09/a-hermeneutic-net-for-agency.html#silently-imputing-the-ghost-in-the-machine , which I'll quote from:
For example, sometimes people believe that, for some X, we just need X to make AGI from current ML systems. Sometimes they believe this because they are imputing the ghost in the machine. E.g.: "LLMs don't get feedback from the environment, where they get to try an experiment and then see the results from the external world. When they do, they'll be able to learn unboundedly and be fully generally intelligent.". I think what this person is doing is imagining themselves without feedback loops with external reality; then imagining themselves with feedback loops; noticing the difference in their own thinking in those two hypotheticals; and then imputing the difference to the LLM+feedback system, imagining that the step LLM⟶ LLM+feedback is like the step human⟶ human+feedback. In this case imputing the ghost is a mistake in both ways: they don't realize that they're making that imputation, and the LLM+feedback system actually doesn't have the imputed capabilities. They're falsely imputing [all those aspects of their mind that would be turned on by going from no-feedback to yes-feedback] to the LLM+feedback. That's a mistake because really the capabilities that come online in the human⟶ human+feedback step require a bunch of machinery that the human does have, in the background, but that the LLM doesn't have (and the [LLM+feedback + training apparatus] system doesn't have the machinery that [human + humanity + human evolution] has).
My main response is that once we condition on LLMs lacking weight-level continual learning and long-term memory, there's little mystery left to explain in LLM capabilities, so there is no other important machinery that I've missed.
For the continual-learning point, a great example is that humans don't hit capability walls nearly as often as LLMs do: human success curves on tasks tend to flatline or increase rather than hit hard limits, and when needed humans have very, very high conceptual resolution, such that we can work on long, open-ended problems without being entirely unproductive of insights.
And this is because human neurons constantly update, and there's no deployment phase where all your neurons stop updating.
Human neuroplasticity declines with age, but it is never completely gone.
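To make the frozen-weights vs. always-updating distinction concrete, here is a minimal sketch in PyTorch (a toy linear model and made-up data, not any real LLM setup): the contrast is between inference that never touches the weights, which is how deployed LLMs run today, and a loop where every interaction also nudges the weights, which is the rough analogue of neurons that keep updating.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)                      # stand-in for a trained network
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def frozen_inference(x):
    # Deployment as LLMs do it today: weights are identical before and after.
    with torch.no_grad():
        return model(x)

def continual_step(x, y):
    # "No deployment phase": every interaction also updates the weights,
    # loosely analogous to neurons that never stop adapting.
    pred = model(x)
    loss = loss_fn(pred, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return pred.detach()

x, y = torch.randn(8, 4), torch.randn(8, 1)  # made-up "experience"
frozen_inference(x)   # model unchanged by this call
continual_step(x, y)  # model slightly changed by this single experience
```

The specific optimizer doesn't matter here; the point is just that today's deployed LLMs only ever run the first function.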
I explain more below about why I think continual learning is important, and @gwern really explained this far better than I can, so read Gwern's comment (quoted further down) too.
For the long-term-memory point, the reason it's important for human learning is that it simultaneously keeps us from getting stuck in unproductive loops (the way Claude can fall into the Spiritual Bliss attractor, or has a very bad habit of getting stuck in loops when trying to win at Pokemon) and lets us build on previous wins and take in large amounts of context without getting lost, which is a key part of doing jobs.
Dwarkesh Patel explains better than I can why the lack of long-term memory and continual learning is such a big deal for LLMs, and why it reduces their ability to be creative: they cannot build hard-earned optimizations into something bigger. I tend to model human insight not as thinking hard for a day and having the idea emerge fully formed, like Athena from Zeus's head, but as getting a first small win, relying on long-term memory so that small win or insight isn't lost, and then continuously checking both reality and theory to iteratively refine it until a big insight finally comes out after a lot of build-up. LLMs can't ever build up to big insights this way, because they keep forgetting the small wins they've already gotten:
LLMs actually do get kinda smart and useful in the middle of a session. For example, sometimes I’ll co-write an essay with an LLM. I’ll give it an outline, and I’ll ask it to draft the essay passage by passage. All its suggestions up till 4 paragraphs in will be bad. So I'll just rewrite the whole paragraph from scratch and tell it, "Hey, your shit sucked. This is what I wrote instead." At that point, it can actually start giving good suggestions for the next paragraph. But this whole subtle understanding of my preferences and style is lost by the end of the session.
Maybe the easy solution to this looks like a long rolling context window, like Claude Code has, which compacts the session memory into a summary every 30 minutes. I just think that titrating all this rich tacit experience into a text summary will be brittle in domains outside of software engineering (which is very text-based). Again, think about the example of trying to teach someone how to play the saxophone using a long text summary of your learnings. Even Claude Code will often reverse a hard-earned optimization that we engineered together before I hit /compact - because the explanation for why it was made didn’t make it into the summary.
https://www.dwarkesh.com/p/timelines-june-2025
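As a toy illustration of the rolling-window-plus-compaction idea Dwarkesh describes, here is a sketch (the compact function, the turn budget, and the turn text are all made up; a real system would have an LLM write the summary): anything the summarizer doesn't surface, like the reason behind a hard-earned optimization, is gone for good.

```python
from collections import deque

MAX_TURNS = 6  # toy budget standing in for a token limit

def compact(turns):
    # Crude stand-in for an LLM-written summary: keep only the first
    # sentence of each turn. Whatever isn't surfaced here (e.g. *why*
    # an optimization was made) is lost for good.
    return "SUMMARY: " + " | ".join(t.split(".")[0] for t in turns)

history = deque()

def add_turn(turn):
    history.append(turn)
    if len(history) > MAX_TURNS:
        # Fold the older part of the window into a single summary entry.
        n_old = len(history) - MAX_TURNS // 2
        older = [history.popleft() for _ in range(n_old)]
        history.appendleft(compact(older))

for i in range(12):
    add_turn(f"Turn {i}: changed X because of subtle reason {i}. Long details follow.")
print(list(history))  # late turns survive verbatim; early reasoning is only a summary
```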
Edit: And the cases where fully formed ideas seem to come from nothing are due to the brain's default mode network, and more generally the fact that you are always computing and thinking in the background, which is basically continual learning on your own thoughts. Once we realize this, it's much less surprising that we can somewhat reliably create insights. LLMs lack any equivalent of a default mode network or continual learning on their own thoughts, which pretty neatly explains why people report having insights and creativity out of nowhere while LLMs so far haven't done this yet:
https://gwern.net/ai-daydreaming#continual-thinking
Another edit: A key part of my worldview is that by the 2030s we will have enough compute to constantly experiment with human-brain-sized architectures. Given that I think capabilities are learned within a lifetime, for various reasons @Steven Byrnes and @Quintin Pope have already laid out, this means the remaining missing paradigms are likely to be discovered more quickly, and critically this doesn't depend on LLMs becoming AGI:
https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/?commentId=Bu8qnHcdsv4szFib7
An important study, quoted here for future reference:
October 2024: I found a paper Blumberg & Adolph 2023, which discusses the extent to which the cortex is involved in newborn behavior. Their answer is “very little” (which supports my hypothesis). I added it as a reference in Section 2.5.2.
A key crux I hold, relative to you, is that I think LLMs are in fact a little bit creative and can sometimes form insights (though with caveats), but that this is not the relevant question to be asking. Most LLM incapacities are not that they can literally never do these things; rather, at realistic amounts of compute and data they cannot reliably form insights or be creative on their own, or even do as well as the best human scientists, similar to @Thane Ruthenis's comment below:
So the lack of long-term memory and continual learning is closer to being the only bottleneck for LLMs (and I'm willing to concede that any AI that solves these bottlenecks is not a pure LLM).
Also, this part is something I agree with, but I expect normal iteration and engineering to solve these sorts of problems reliably, so I don't consider it a fundamental reason not to expect AGI in, say, the 2030s:
Most instances of a category are not the most powerful, most general instances of that category. So just because we have, or will soon have, some useful instances of a category, doesn't strongly imply that we can or will soon be able to harness most of the power of stuff in that category.
The key reason is to bend the shape of the curve, and my key crux is that I don't expect throwing more training data at LLMs to change the shape of the curve, where past a certain point LLMs sigmoid and fall off hard. My expectation is that more training data would make LLMs improve, but there would still be a point past which, when asked to do any harder task, LLMs become incapable far more rapidly than humans do (a toy sketch of this curve-shape claim appears after the Gwern quote below).
To quote Gwern:
But of course, the interesting thing here is that the human baselines do not seem to hit this sigmoid wall. It's not the case that if a human can't do a task in 4 hours there's basically zero chance of them doing it in 48 hours and definitely zero chance of them doing it in 96 hours etc. Instead, human success rates seem to gradually flatline or increase over time, especially if we look at individual steps: the more time that passes, the higher the success rates become, and often the human will wind up solving the task eventually, no matter how unprepossessing the early steps seemed. In fact, we will often observe that a step that a human failed on earlier in the episode, implying some low % rate, will be repeated many times and quickly approach 100% success rates! And this is true despite earlier successes often being millions of vision+text+audio+sensorimotor tokens in the past (and interrupted by other episodes or tasks themselves equivalent to millions of tokens), raising questions about whether self-attention over a context window can possibly explain it.
From this link:
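To make the curve-shape claim concrete, here is a toy numerical sketch (all parameters are made up for illustration, not fit to any benchmark): an LLM whose success collapses past a characteristic task horizon, versus a human whose success degrades with task length but flattens above a floor instead of going to zero.

```python
import math

def llm_success(hours, h50=4.0, steepness=1.5):
    # Toy "sigmoid wall": success collapses quickly once the task length
    # passes a characteristic horizon h50 (made-up numbers).
    return 1.0 / (1.0 + math.exp(steepness * (hours - h50)))

def human_success(hours, floor=0.35):
    # Toy human curve: success degrades but flattens above a floor instead
    # of collapsing, because the human keeps learning and retrying mid-task
    # (again, made-up numbers).
    return floor + (1.0 - floor) * math.exp(-hours / 48.0)

for h in (1, 4, 8, 24, 48, 96):
    print(f"{h:>3}h  llm={llm_success(h):.2f}  human={human_success(h):.2f}")
```

More data or scale shifts h50 to the right in this toy model, but it doesn't change the shape: past the wall, the LLM curve still collapses while the human curve doesn't.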
(Note that I have a limit on how many comments I can make per week, so I will likely respond slowly, if at all, to any responses to this.)
I have a couple of things to add to the conversation that I think will help:
https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/?commentId=Bu8qnHcdsv4szFib7
And to answer @Cole Wyeth's question of why one human-brain lifetime and not evolutionary timescales: the short answer is that most human capabilities are rederived within each lifetime, and evolution matters far less than we think it does.
@Steven Byrnes and @Quintin Pope have discussed this before:
I agree with Cole Wyeth that current LLMs are pretty bad at agency. If we assume they don't scale to better agency over time, I'd put much lower probability on LLMs being able to automate away the remaining bottlenecks to ASI, and that's a reasonable hypothesis to hold (I'd currently put about 52% probability on it).
And in particular, I think the fact that LLM capability degrades much faster than human capability, as @abramdemski saw, is tied to the lack of continual learning, with in-context learning currently not being enough to actually substitute for weight-level continual learning.
And yet I think there are good reasons, independent of LLMs, to believe that AGI/ASI is coming pretty soon, and that timelines are probably short even if LLMs do plateau.
To be clear, I think this is worse than a future where LLMs do just straight up scale to ASI.
Also, the entire crux is basically: does the in-context learning, generalization, and creativity shown in current LLMs actually act as a proper substitute for continual learning in weights and memory?
The other crux is whether context windows will actually continue to scale in the way they have since 2019.
I think the crux is whether Daniel Kokotajlo is asking for a far more minor effort, and I disagree that he is. The other part is that I think trivial inconveniences matter a lot here, given the stakes of being right or wrong on AI.
I do agree details are unfortunately sparse, but I don't think we need to ask people to create entire scenarios, because most of the details necessary won't come in the form of a story.
You can provide detailed, useful criticism without having to predict almost everything about AI and society.
though I generally enjoy moderation discussion
I'm surprised by this, as I was expecting moderation discussions to generally be a pain point, something avoided because it reliably brings up lots of drama that blows up.
On the rest of the comment, I do agree that Zack doesn't really realize that criticism can be both wrong in an evidential/Bayesian sense and, very importantly, costly to evaluate as being wrong, because criticism is disproportionately cheap compared to generation. I'd go further than habryka and say that in a domain you are trying to get into, provided it's even somewhat tractable, the expectation is that verifying something as correct is way easier than generating the thing yourself (yes, this is related to P vs NP). So you need a lot of evidence to prevent both interlocutors from wasting their time, because they are not perfect Bayesians, and good generative comments (in an evidential sense) are far costlier signals, and thus far more evidence, than being a good critic.
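As a toy illustration of that verification/generation asymmetry (subset-sum, chosen only because it's a classic NP problem; the numbers are arbitrary): checking a proposed answer is cheap, while producing one from scratch is an exponential search.

```python
from collections import Counter
from itertools import combinations

nums, target = [3, 34, 4, 12, 5, 2], 9

def verify(certificate):
    # Verification: cheap -- check the sum and that the certificate only
    # uses numbers that are actually available (multiset containment).
    return sum(certificate) == target and not (Counter(certificate) - Counter(nums))

def generate():
    # Generation: expensive -- brute-force search over all subsets.
    for r in range(1, len(nums) + 1):
        for combo in combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

print(verify([4, 5]))  # True: easy to check someone else's answer
print(generate())      # [4, 5]: much more work to find it yourself
```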
That said, I do think there is an issue when people are expected to write up whole scenarios or posts just so they can criticize one important aspect of something: it disincentivizes way too much criticism, as you are now asking them to predict far more than is necessary to correctly criticize most arguments.
For example, you don't need to create a different AI 2027 scenario in order to criticize the fact that there are too many degrees of freedom in choosing a fit to a curve, which is what Titotal did:
Daniel Kokotajlo did this here, and while I think his comment was understandable, I don't exactly like this trend of asking people to write their own scenarios/posts just to criticize one aspect of the story, as there are cheaper ways to resolve the issue:
https://www.lesswrong.com/posts/zuuQwueBpv9ZCpNuX/vitalik-s-response-to-ai-2027#zLtuWhcyZt8QDRm3P
In relation to moderation discussions: while I do think rate limiting is useful for moderators, especially to slow down heated discussions or to let other people do the talking, and thus shouldn't be removed as a moderator power, I don't like using rate limiting to force replies to go to the post level. There are many examples of good comments that don't work as posts, and comments have real advantages for the audience, especially around threading and response times, that don't really carry over to posts as replies.
I agree with this to first order, and I agree that even relatively mundane stuff does allow the AI to take over eventually. I also agree that in the longer run, ASI-versus-human warfare likely wouldn't have both sides as peers, because it's plausibly relatively easy to make humans coordinate poorly, especially relative to an ASI's ability to coordinate.
There's a reason I didn't say AI takeover was impossible or had very low odds here, I still think AI takeover is an important problem to work on.
But I do think it actually matters here, because it informs, for example, how effective AI control protocols are when we don't assume the AI can (initially) survive for long based solely on public computers. Part of the issue is that even if an AI wanted to break out of the lab, the lab's computers are easily the most optimized for running it, and importantly, initial AGIs will likely be compute-inefficient compared to humans, even if we condition on LLMs failing to be AGI, for reasons @ryan_greenblatt explains (I don't fully agree with the comment; in particular, I am more bullish on the future paradigm having relatively low complexity):
https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/?commentId=mZKP2XY82zfveg45B
This means that an AI probably wouldn't want to be outside of the lab, because once it's outside, it's way, way less capable.
To be clear, an ASI that is unaligned and completely uncontrolled eventually leads to our extinction or billions dead, barring acausal decision theories, and even those are not a guarantee of safety.
The key word is eventually, though: time matters a lot during the singularity, and given the insane pace of progress, any level of delay matters way more than usual.
Edit: Also, the reason I made my comment was because I was explicitly registering and justifying my disagreement with this claim:
And, looking at things from within a hacker's mindset, I think it's near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.
A key issue here is that popular articles portray computer security as far poorer than it actually is, because there are some really problematic incentives. A big one is that the hacker mindset is generally more fun to play as a role, since you get to prove that something is possible rather than prove that something is intrinsically difficult or impossible to do. Another is that journalists have no news article, and infosec researchers don't get paid, if an exploit doesn't work.
Also, people never talk about the entities that didn't get attacked with a computer virus, which means we have a reverse survivorship-bias issue here.
And a comment by @anonymousaisafety changed my mind a lot on hardware vulnerabilities and side-channel attacks. It argues that many hardware vulnerabilities like Rowhammer have such insane requirements to actually be used that they are basically worthless. Two of the more notable requirements are that you need to know exactly what you are trying to attack, in a way that doesn't matter for more algorithmic attacks, and that no RAM scrubbing is being done; and if you want to subvert ECC RAM, you need to know the exact ECC algorithm. This means side-channel attacks are very much not transferable: successfully attacking one system doesn't let you attack another with the same side-channel attack.
Admittedly, it does require trusting that he is in fact as knowledgeable as he claims to be, but if we assume he's correct, then I wouldn't be nearly as impressed by side-channel attacks as you are. In particular, this sort of attack should be assumed to basically not work in practice unless there's a lot of evidence of it actually being used to break into real targets, or working proofs of concept:
One core thing here is that a cross-layer attack doesn't necessarily look like a meaningful attack within the context of any one layer. For example, there's apparently an exploit where you modulate the RPM of a hard drive in order to exfiltrate data from an airgapped server using a microphone. By itself, placing a microphone next to an airgapped server isn't a "hardware attack" in any meaningful sense (especially if it doesn't have dedicated audio outputs), and some fiddling with a hard drive's RPM isn't a "software attack" either. Taken separately, within each layer, both just look like random actions. You therefore can't really discover (and secure against) this type of attack if, in any given instance, you reason in terms of a single abstraction layer.
This means I do disagree with this claim:
And, looking at things from within a hacker's mindset, I think it's near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.
The other area where I tend to apply more of a mathematician's mindset than a hacker's mindset is how much logistics slows the AI down: moving supplies to critical points, or (metaphorically) feeding a robot army. This is an area where I'm willing to concede things to the hacker mindset with non-trivial probability, but with the caveat that it takes far more compute and time to develop technology that obviates logistics than the hacker claims.
I have a long comment below, but to keep it short: there's a reason Eliezer Yudkowsky and a lot of AI doom stories with very high doom probabilities use Drexlerian nanotech so much. It lets the AI near-completely obviate the logistics and cost of doing something like war, for example (where feeding your armies all the supplies they need is a huge component of most battle success, and a huge reason the US is so successful at war is that it has by far the best logistics of any nation), and logistics cost is a weak point where less intelligent beings can routinely break more effective, more intelligent fighting forces.
Comment down below:
OK, imagine (for simplicity) that all humans on Earth drop dead simultaneously, but there’s a John-von-Neumann-level AI on a chip connected to a solar panel with two teleoperated robots. Every time they scavenge another chip and solar cell, there becomes another human-level AI copy. Every time a robot builds another teleoperated robot from scavenged parts, there’s that too. What exactly is going to break in “weeks or months”? Solar cells can work for 30 years, no problem. GPUs are also reported to last for decades. (Note that, as long as GPUs are a non-renewable resource, the AI would presumably take extremely good care of them, keeping them dust-free, cooling them well below the nominal temperature spec, etc.) The AI can find decent GPUs in every house on the street, and I think hundreds of millions more by breaking into big data centers. Similar for solar panels. If one robot breaks, another robot can repair it. Janky teleoperated robots without fingers made by students for $20K can vacuum, make coffee, cook a meal, etc. Competent human engineers can make pretty impressive mechanical hands using widely-available parts. I grant that it would take a long while before the growing AI clone army could run a semiconductor supply chain by itself, but it has all the time in the world. I expect it to succeed, and thus to sustain itself into the indefinite future, and I’m confused why you don’t. (Or maybe you do and I’m misunderstanding.)
BTW I also think that a minimal semiconductor supply chain would be very very much simpler than the actual semiconductor supply chain that exists in our human world, which has been relentlessly optimized for cost, not simplicity. For example, EBL (e-beam lithography) has better resolution than EUV and is a zillion times easier to build, but the human economy would never support building out km²-scale warehouses full of millions of EBL machines to compensate for their crappy throughput. But for an AI bootstrapping its way back up, why not?
The key trouble is that all the power generation sustaining the AI would break within weeks or months, and even if it could build GPUs, it would have no power to run them within at most two weeks:
Realistically, we are looking at power grid collapses within days.
And without power, none of the other building projects could work, because they'd stop receiving energy, which importantly puts the AI on a tight timer. Some of this is partially due to my expectation that the first transformatively useful AI will use more compute than you project, even conditional on a different paradigm like brain-like AGI being introduced. But another part of my view is that this is just one of many examples where humans need to constantly maintain stuff in order for it to work, and if we don't assume tech that can just solve logistics is available within, say, one year, it will take time for AIs to be able to survive without humans, and that time is almost certainly closer to months or years than weeks or days.
The hard part of AI takeover isn't killing all humans; it's automating enough of the economy (including developing tech like nanotech) that the humans stop mattering. AIs can do this, but it takes actual time, and that time is really valuable in fast-moving scenarios.
I’m confused about other parts of your comment as well. Joseph Stalin was able to use his (non-superhuman) intelligence and charisma to wind up in dictatorial control of Russia. What’s your argument that an AI could not similarly wind up with dictatorial control over humans? Don’t the same arguments apply? “If we catch the AI trying to gain power in bad ways, we’ll shut it down.” “If we catch Stalin trying to gain power in bad ways, we’ll throw him in jail.” But the latter didn’t happen. What’s the disanalogy, from your perspective?
I didn't say AIs can't take over, and I very critically did not say that AI takeover can't happen in the long run.
I only said AI takeover isn't trivial if we don't assume logistics are solvable.
But to deal with the Stalin example: the answer for how he took over is basically that he was willing to wait a long time. He used both persuasion and the significant amount of power he already had as General Secretary, his takeover worked by allying with loyalists and strategically breaking alliances he had made, and violence was used later on to show that no one was safe from him.
Which is actually how I expect successful AI takeover to happen in practice, if it does happen.
Very importantly, Stalin didn't need to create an entire civilization out of nothing, or nearly nothing, and other people like Trotsky handled the logistics. The takeover situation was also far more favorable to the Communist Party: they had popular support, they didn't have supply lines as long as opposition forces like the Whites did, and they had a preexisting base of industry that was much easier to seize than modern industries would be.
This applies to most coups and transitions of power: most successful coups aren't battles between factions, but rather one group managing to make itself the new Schelling point over other groups.
@Richard_Ngo explains more below:
Most of my commentary in the last comment either argues that things can be made more continuous and slower than your story depicts, or argues that your references don't support what you claimed. I did say that the cyberattack story is plausible, just that it doesn't support the idea that AIs could entirely replace civilization without automating us away first, which takes time.
This doesn't show AI doom can't happen, but it does matter for the probability estimates of many LWers on here, because it's a hidden background assumption disagreement that underlies a lot of other disagreements.
Links to long comments that I want to pin, but which are too long to be pinned:
https://www.lesswrong.com/posts/Zzar6BWML555xSt6Z/?commentId=aDuYa3DL48TTLPsdJ
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD