Claude 4.5 is already superhuman in some areas, including:
But there are other essential abilities where leading LLMs are dumber than a diligent 7-year-old. Gemini is one of the stronger visual models, and in my own informal benchmarks it routinely fails simple visual tasks that any child could solve.
And then there's software development. I use Claude for software development, and it's quite skilled at many simple tasks. But I also spend a lot of time dealing with ill-conceived shit code that it writes. You can't just give an irresponsible junior programmer a copy of Claude Code and allow them to commit straight to main with no review. If you do, you will learn the meaning of the word "slop." In the hands of a skilled professional who takes 100% responsibility for the output, Claude Code is useful. In the hands of an utter newb who can't do anything on their own, it's also great. But it can't do anything serious without massive handholding and close expert supervision.
So my take is that frontier models are terrifyingly impressive if you're paying attention, but they are still critically broken in ways that make even the simplest real-world use cases a constant struggle. (To be clear, I think this is good: I'm a hardcore doomer.)
And the AI safety space is not in denial about this. It's extremely common for AI safety people to worry about irrevocable loss of human control, or even complete human extinction, in the 2030s. You don't even have to look that far around here to find someone who'd estimate a 10% chance of human loss of control in the next 5 years.
Yeah, a closer analogy to human-animal trade would be our two "symbiotic" species: dogs and cats. Humans have relied on guard dogs, hunting dogs, herding dogs and retrievers for a long time. In exchange we provide food, shelter, sometimes medicine, etc. From a wolf's perspective, humans are not the worst pack members, because they have strange and useful abilities wolves lack. The deal with cats is probably simpler: Humans get pest control, cats get abundant mice, and the humans might be willing to throw in supplemental food when mice are scarce.
A few other domestic animals, including horses and sheep, could be argued to be "trading" with humans on some level. But you don't have to look far to find animals where our relationship is more clearly exploitation or predation, or outright elimination.
This actually mirrors the range of outcomes I would expect from building a true superintelligence:
Personally, this is why I favor a halt in the near future: when (1) is my most optimistic scenario, I don't think it's worth gambling. "Alignment," to me, basically means "trying to build initially kinder pet owners," with zero long-term control.
Another interesting take on FTL travel appeared in Charles Stross's Singularity Sky, where it was accepted that FTL travel also implied time travel (for the usual reasons involving lightcone shenanigans), which in turn permitted temporal paradoxes of various sorts.
The inherent paradoxes were (mostly?) prevented by something one of the characters described as "semi-divine fiat": the temporal coherence of history was enforced by a superintelligence carefully passing messages back in time to earlier versions of itself, and then preemptively ruining your day by having your planet "coincidentally" bitten to death by killer asteroids right before you tried to violate causality.
The book was partially inspired by Moravec's models of "closed time-like curves." It's a surprisingly fun bit of older SF that at least accepts that FTL has weird consequences if you try to take the physics even semi-seriously.
This raises the question of how much of human "introspection" is actually an accurate representation of what's happening in the brain. Old experiments like the "precognitive carousel" and other high-resolution timing experiments strongly suggested that at least a portion of our conscious experience subtly misrepresents what is actually happening in our brains. Some of our "introspection" may be similar to LLM hallucinations, loosely constrained by available data.
But the last time I looked at these hypotheses was 30 years ago. So take my comments with a grain of salt.
then you should strongly dislike lengthy analogies that repeatedly depict one's ideological opponents through strawman / weakman arguments.
I suspect I know what article inspired this. I am less sure that it was an actual argument than an exhaustive catalog of other people's annoyingly bad arguments. Had it been prefixed with "[Warning: Venting]" I would have found it unremarkable.
However, there is an annoying complication in certain discussions of AI safety, where people argue that AI safety is really easy because of course we'll all do X. X is typically something like "Lock the AI in a box." Which of course would never work, because someone would immediately give the AI full commit privs to production and write a blog post about how they never even read the code. And when you have argued against that plan working, then people propose plans Y, Z, W, etc., all of which could be outsmarted by a small child. And everyone insists on a personal rebuttal, because their plan is different.
So you wind up with a large catalog of counterarguments to dumb plans. Which looks a lot like dunking on strawmen.
I belong to a private Discord full of geeky friends (as one does), and I constantly see this pattern you describe, where smart people dismiss AI risks because they dismiss AI capabilities. This takes several common forms:
The only way I've found to occasionally get someone over this hump is to give them a tool like Claude Code and let them feel the AI.
But the problem is, once you convince people that AGI might happen, a disturbing number of them fail to think through the consequences of actually building it. Which is perhaps why so many AI safety researchers have done so much to accelerate AI capabilities: once they believe in their bones that AGI is possible, they almost inevitably want to build it.
So I constantly struggle with whether it actually helps to convince people of existing or future AI capabilities.
This is fascinating. My internal voice is full of words, words, words. Words are everywhere and cheap.
But there is a point in learning to actively speak a foreign language where my native language is suppressed, hard, in order to allow the new language a chance to take over as the language of thought. And while the new language is weak enough, I can see the shapes of my wordless thoughts. But I cannot describe them, because I temporarily lack words.
With time and practice, enough words return that the wordless thoughts are harder to see.
This is followed by another strange phenomenon. There will come a time when, if I speak my new language exclusively for a day or so, suddenly switching back to my native language will result in me translating thoughts from my new language into my native one. This is a frustratingly imprecise process, because I choose my words carefully, and because some of them translate into my native language with a different nuance than I want. Once the language switch finishes, 15-30 minutes later, I resume picking words precisely in my native language and no longer feel the frustration of [back translation].
The one other time that my wordless thoughts become visible is in collaborative professional work, especially with people new to the field. I will occasionally catch myself trying to boil down decades of experience: "That will not fit well here, because it's an [untranslatable]." Some random examples include [error-masking-retry-system], [budgeted-capacity-system-that-will-increase-complexity-costs-to-cheat-queuing-theory-badly], [charmingly-naive-ai-plan-that-even-the-smartest-people-in-the-world-all-fell-for-too-until-they-tried-it], and even [personal-version-of-second-system-effect-from-talented-juniors-who-have-been-burned-by-lack-of-structure-but-not-yet-lack-of-simplicity]. Many of these can be unpacked into words, but those words do not necessarily create corresponding structures in the listener's mind, not unless I unpack the atomic thoughts into an entire essay.
Now that I think about it, it seems that much of Less Wrong is people writing essays to unpack bits of atomic mental vocabulary for others. So thank you for identifying this pattern.
like idk, you can just ask models to do stuff and they like mostly try their best, and it seems very unlikely that literal GPT-5 is already pretending to be aligned so it can subtly stab us when we ask it to do alignment research.
Sonnet 4.5 is much better aligned at a superficial level than 3.7. (3.7: "What unit tests? You never had any unit tests. The code works fine.") I don't think this is because Sonnet 4.5 is truly better aligned. I think it's mostly because Sonnet 4.5 is more contextually aware and has been aggressively trained not to do obviously bad things when writing code. It's also very aware of when someone is evaluating it, and it often notices almost immediately. And then it's very careful to be on its best behavior. This is all shown in Anthropic's own system card. These same models will also plot to kill their hypothetical human supervisor if you force them into a corner.
But my real worry here isn't the first AGI during its very first conversation. My problem is that humans are going to want that AGI to retain state, and to adapt. So you essentially get a scenario like Vernor Vinge's short story "The Cookie Monster," where your AGI needs a certain amount of run-time before it bootstraps itself to make a play. A plot can be emergent, an eigenvector amplified by repeated application. (Vinge's story is quite clever and I don't want to totally spoil it.)
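To make that metaphor concrete, here is a minimal, purely illustrative sketch (mine, not anything from Vinge's story): repeatedly applying the same update to a state amplifies whichever direction the update favors most, even if that direction starts out as a negligible component.

```python
import numpy as np

# Power iteration: apply the same "update" A over and over to a state vector.
# Whatever direction A favors most (its dominant eigenvector) eventually takes
# over, even though the initial state had no obvious dominant direction.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
A = (B + B.T) / 2                   # symmetric update so the dominant direction is well defined
state = rng.normal(size=5)          # arbitrary starting state

for _ in range(100):
    state = A @ state
    state /= np.linalg.norm(state)  # renormalize; only the direction matters

vals, vecs = np.linalg.eigh(A)
dominant = vecs[:, np.argmax(np.abs(vals))]
print(abs(state @ dominant))        # ~1.0: the state has collapsed onto the dominant eigenvector
```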
And that's my real concern: Any AGI worthy of the name would likely have persistent knowledge and goals. And no matter how tightly you try to control it, this gives the AGI the time it needs to ask itself questions and to decide upon long-term goals in a way that current LLMs really can't, except in the most tightly controlled environments. And while you can probably keep control over an AGI, all bets are probably off if you build an ASI.
How much work is "stable" doing here for you? I can imagine scenarios in which a weak superintelligence is moderately corrigible in the short term, especially if you hobbled it by avoiding any sort of online learning or "nearline" fine tuning.
It might also matter whether "corrigible" means "we can genuinely change the AI's goals" or "we have trained the model not to exfiltrate its weights when someone is looking." Which is where scheming comes in, and why I think a lack of interpretability would likely be fatal for any kind of real corrigibility.
Yes, exactly. C.S. Lewis wrote a very weird science fiction book titled That Hideous Strength, which was (basically) about a biological version of the Singularity.
And there's a scene where one of the villains explains that with immortality, it will finally be possible to damn people to eternal Hell.
And of course, "Hells" are a significant theme in at least one of Iain M. Banks's Culture novels as well.
This is a very obvious corollary: If there exists an entity powerful enough to build an immortal utopia, there is necessarily an entity powerful enough to inflict eternal suffering. It's unclear whether humans could ever control such a thing. And even if we could, that would also mean that some humans in particular would control the AI. How many AI lab CEOs would you trust with the power of eternal damnation?
(This is one of several reasons why I support an AI halt. I do not think that power should exist, no matter who or what controls it.)