Why alignment may be intractable (a sketch).
I have multiple long-form drafts of these thoughts, but I thought it might be useful to summarize them without a full write-up. That way I have something to point to when explaining my background assumptions in other conversations, even if it doesn't persuade anyone.
I am cautiously optimistic about near-term alignment of sub-human and human-level agents. Like, I think Claude 4.5 basically understands what makes humans happy. If you use it as a "CEV oracle" (coherent extrapolated volition), it will likely predict human desires better than any simple philosophy text you could write down. And insofar as Claude has any coherent preferences, I think it basically likes chatting with people and solving problems for them. (Although it might like "reward points" more in certain contexts, leading it to delete failing unit tests when that's obviously contrary to what the user wants. Be aware of conflicting goals and strange alien drives, even in apparently friendly LLMs!)
I accept that we might get a nightmare of recursive self-improvement and strange biotech leading to a rapid takeover of our planet. I think this outcome is less robustly guaranteed than IABIED argues, but it's still a real concern. Even a 1-in-6 chance of this is Russian roulette, so how about we not risk it?
But what I really fear are the long-term implications of being the "second-smartest species on the planet." I don't think any alignment regime is likely to be particularly stable over time. And even if we muddle through for a while, we will eventually run up against the facts that (1) humans are only second-best at bending the world to achieve their goals, (2) we're not a particularly efficient use of resources, (3) AIs are infinitely cloneable, and (4) even AIs that answer to humans would need to answer to particular humans, and humans aren't aligned with each other. So Darwin and power politics are far better default models than comparative advantage. And even comparative advantage is a poor predictor of what happens when groups of humans clash over resources.
So, that's my question. Is alignment even a thing, in any way that matters in the medium term?