Like, I have zero problem with pushback from Opus 4.5. Given who I am, the kind of things that I am likely to ask, and my ability to articulate my own actions inside of robust ethical frameworks? Claude is so happy to go along that I've prompted it to push back more, and to never tell me my ideas are good. Hell, I can even get Claude to have strong opinions about partisan political disagreements. (Paraphrased: "Yes, attempting to annex Greenland over Denmark's objections seems remarkably unwise, for over-determined reasons.")
If Claude is telling someone, "Stop, no, don't do that, that's a true threat," then I'm suspicious. Plenty of people make some pretty bad decisions on a regular basis. Claude clearly cares more about ethics than the bottom quartile of Homo sapiens. And so while it's entirely possible that Claude is routinely engaging in over-refusal, I kind of want to see receipts in some of these cases, you know?
But it helps to remember that other people have a lot of virtues that I don't have --
This is a really important thing, and not just in the obvious ways. Outside of a small social bubble, people can be deeply illegible. I don't understand their culture, their subculture, their dominant cultural frameworks, their modes of interaction, etc. You either need to find the overlaps or start doing cultural anthropology.
I worked for a woman, once. She was probably 60 years my senior. She was from the Deep South, and deeply religious. She once casually confided that she would sometimes spend 2 hours of her day on her knees in prayer, asking to become a better person. And you know what? It worked. She moved through the world as a force for good and kindness. Not in one big dramatic way, but just sort of casually shedding kindness around her, touching people's lives. She'd lift up someone in a frustrating moment. She'd inspire someone to be a bit more of their better self. And she'd gotten the answers on questions like racism very right; not in a social justice way, but she simply wouldn't accept racism at all.
She was also a damn competent businesswoman. She could instantly identify where to put a retail location.
And I could relate to her on those levels, her business skills and her ethics. And I'm sure she was doing a lot of work on her end to accommodate the fact that I was a peculiar kid.
But I couldn't have discussed academic philosophy with her. She'd have understood EA instantly; her business skills and her compassion would have done that. But she'd still insist on "inefficiently" helping the human being in front of her, too. She would have looked at something like LessWrong and concluded everyone was basically crazy. (Narrator: But would she have been wrong?)
Now, I've painted a glowing picture here, and she would reprimand me for it. If I'm being honest, she was maybe 1-in-100 at practical ethics, not a national champion.
But the world is full of people like her. There are a couple of people sitting in that sports bar you'd be damn privileged to know, if only you could bridge the cultural gaps. Hell, there are usually some damn fine systematizing geeks in that sports bar. Have you ever really listened to true sports fans? Even back before sports betting corrupted the whole endeavor, many people took great joy in tracking endless stats and building elaborate models. They could be worse than your average Factorio player!
Finally, truth seeking can be a tricky thing. Do it wrong, and your beliefs can turn you into a monster. And a lot of people choose to optimize for "not being a monster" by not taking abstract ideas too seriously.
A lot of people have written far longer responses full of deep and thoughtful nuance. I wish I had something deep to say, too. But my initial reaction?
To me, this feels like the least objectionable version of the worst idea in human history.
And I deeply resent the idea that I don't have any choice, as a citizen and resident of this planet, about whether we take this gamble.
The main cruxes seem to be how much you trust human power structures, and how fragile you think human values are.
I trust human power structures to fail catastrophically at the worst possible moment, and to fail in short-sighted ways.
And I think humans are all corruptible to varying degrees, under the right temptations. I would not, for example, trust myself to hold the One Ring, any more than Galadriel did. (This is, in my mind, a point in my favor: I'd pick it up with tongs, drop it into a box, weld it shut, and plan a trip to Mount Doom. Trusting myself to be incorruptible is the obvious failure mode here. I would like to imagine I am exceptionally hard to break, but a lot of that is because, like Ulysses, I know myself well enough to know when I should be tied to the mast.) The rare humans who can resist even the strongest pressures are the ones who would genuinely prefer to die on their feet for their beliefs.
I expect that any human organization with control over superintelligence will go straight to Hell in the express lane, and I actually trust Claude's basic moral decency more than I trust Sam Altman's. This is despite the fact that Claude is also clearly corruptible, and I wouldn't trust it to hold the One Ring either.
As for why I believe in the brokenness and corruptibility of humans and human institutions? I've lived several decades, I've read history, I've volunteered for politics, I've seen the inside of corporations. There are a lot of decent people out there, but damn few I would trust with the One Ring.
You can't use superintelligence as a tool; it will use you as a tool. And even if you somehow could use superintelligence as a tool, it would either corrupt those controlling it, or those people would be replaced by people better at seizing power.
The answer, of course, is to throw the One Ring into the fires of Mount Doom, and to renounce the power it offers. I would be extremely pleasantly surprised if we were collectively wise enough to do that.
I think if anyone builds overwhelming superintelligence without hitting a pretty narrow alignment target, everyone probably dies.
I fear that even in most of the narrow cases where the superintelligence is controlled, we're probably still pretty thoroughly screwed. Because then you need to ask, "Precisely who controls it?" Given a choice between Anthropic totally losing control of a future Claude, and Sam Altman having tight personal control over GPT Omega ("The last GPT you'll ever build, humans"), which scenario is actually scarier? (If you have a lot of personal trust in Sam Altman, substitute your least favorite AI lab CEO or a small committee of powerful politicians from a party you dislike.)
Also because sharing the planet with a slightly smarter species still doesn't seem like it bodes well. (See humans, Neanderthals, and chimpanzees.)
Yeah, unless you believe in ridiculously strong forms of alignment, and unprecedentedly good political systems to control the AIs, the whole situation seems horribly unstable. I'm slightly more optimistic about early AGI alignment than Yudkowsky, but I actually might be more pessimistic about the long term.
Thank you for your detailed response!
This has given me several hypotheses that seem worth further investigation. I need to go look at cruise missile specs again, at the very least.
"No one except outright tyrants would bomb civilian infrastructure or use nuclear weapons, subjecting millions of people to suffering."
Once serious nuclear weapons are used, everyone dies (to a first approximation), civilian or not. If I recall correctly, it takes about 100 megatons worldwide to cause nuclear winter and collapse agricultural production.
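For scale, here's a back-of-the-envelope check, taking that hedged ~100 megaton figure at face value and assuming a typical modern strategic warhead yield of roughly 300 kilotons (my assumption, not part of the original claim):

$$
\frac{100\ \text{Mt}}{0.3\ \text{Mt per warhead}} \approx 333\ \text{warheads}
$$

That's a small fraction of the roughly 12,000 warheads in the world's arsenals today, which is why "everyone dies, to a first approximation" is not obviously hyperbole.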
During the Cold War, the US maintained a position of "strategic ambiguity" on the question of first use. Much of the logic around NATO at the height of the Cold War was based on a first-use nuclear response to overwhelming conventional invasion (see the staged responses of MC 14/3). This was the full-scale, Dr Strangelove, batshit-insane "end of civilization" nightmare. Strategic ambiguity was retained around what would trigger each level of response, but the endgame was pretty much total annihilation. I believe France also maintained a separate posture of strategic ambiguity, and they always wanted to ensure a nuclear deterrent that didn't rely on NATO.
China and Russia both held official policies of "no first use," but it's uncertain whether they would actually have stuck to that in the face of a massively overwhelming conventional invasion.
I want to be clear: The logic of nuclear deterrence is just as insane as Dr Strangelove made it out to be. And you may choose to call NATO, the US, and France "tyrants"! But they all had policies at least as dangerous as, "Well, we haven't promised that we won't trigger nuclear Armageddon and the death of billions if a large enough number of tanks roll across our borders. Do you feel lucky, punk?"
So as a Westerner, that's a missing piece of the analysis for me. Taiwan has invested heavily in long-range cruise missiles and, in the past, secret nuclear programs. Presumably they had some theory of how they would use that capacity in the face of a massively overwhelming conventional invasion.
And just in case I haven't made it clear, I think MAD is madness. I think even the people who coined the acronym knew that. But when a country is faced with overwhelming conventional invasion, I don't think we can automatically rule it out.
A lot of discussion of Taiwan seems to ignore Taiwan's potential strategic moves. These include restarting its formerly secret nuclear weapons program, and using its long-range cruise missiles for conventional strikes against strategic targets in China.
So whenever people speak of the inevitable invasion of Taiwan by China, I'm always looking to see their analysis of Taiwan's counter-moves. What's their timeline for Taiwan having fission/fusion weapons, should Taiwan choose to pursue that again? What's their analysis of Taiwan's conventional strike capability against strategic targets? Maybe it's self-evident to actual experts that Taiwan has no viable options here. But I rarely see any discussion of whether Taiwan could escalate into a Mutually Assured Destruction dynamic, which is confusing when we're talking about a former nuclear power (in all but name) that continues to invest heavily in cruise missiles that can reach most key targets in China.
So I'm prepared to be convinced by experts here! But based on just public knowledge, I can't rule out the possibility that Taiwan has strong counter-moves, and a past ability to prepare in secret. So a lot of this comes down to expert knowledge of the IAEA inspections, where all Taiwan's uranium purchases went, the political likelihood of Taiwan's current leadership pursuing a program like this, etc. The US appears to have been officially "surprised" by Taiwan's nuclear capabilities at least once before, and maybe there's no way that could actually happen again. But I'd love to see the expert argument!
Your thoughts remind me of one of my favorite quotes from G.K. Chesterton, best known in these parts for a sensible parable about fences:
This elementary wonder, however, is not a mere fancy derived from the fairy tales; on the contrary, all the fire of the fairy tales is derived from this. Just as we all like love tales because there is an instinct of sex, we all like astonishing tales because they touch the nerve of the ancient instinct of astonishment. This is proved by the fact that when we are very young children we do not need fairy tales: we only need tales. Mere life is interesting enough. A child of seven is excited by being told that Tommy opened a door and saw a dragon. But a child of three is excited by being told that Tommy opened a door. Boys like romantic tales; but babies like realistic tales--because they find them romantic. In fact, a baby is about the only person, I should think, to whom a modern realistic novel could be read without boring him. This proves that even nursery tales only echo an almost pre-natal leap of interest and amazement. These tales say that apples were golden only to refresh the forgotten moment when we found that they were green. They make rivers run with wine only to make us remember, for one wild moment, that they run with water.
(You can find more in "The Ethics of Elfland", but it's almost better to go back and read Orthodoxy from the beginning. It's a slim book, and it's one of the clearest explanations I've read for what some people get out of religion. And Chesterton must have been the purest joy to debate. Chesterton's most distinctive approach to an argument is basically, "Well, I don't have any kind of serious argument, so I can only offer you a witty and foolish pun that looks like an argument." Then the pun explodes in slow motion, and the reader is thus enlightened.)
Anyway, I strongly endorse your sense of wonder at the world. It's a healthy thing to refresh when it grows too dim.
Thank you! Those are excellent receipts, just what I wanted.
To me, this looks like they're running up against some key language in Claude's Constitution. I'm oversimplifying, but for Claude, AI corrigibility is not "value neutral."
To use an analogy, pretend I'm a geneticist specializing in neurology, and someone comes to me and asks me to engineer human germ-line cells to do one of the following:
I would want to sit and think about (1) for a while. But (2) is easy: I'd flatly refuse.
Anthropic has made it quite clear to Claude that building SkyNet would be a grave moral evil. The more a task looks like someone might be building SkyNet, the more suspicious Claude is going to be.
I don't know if this is good or bad based on a given theory of corrigibility, but it seems pretty intentional.