Random Developer — LessWrong

Sympathy for the Model, or, Welfare Concerns as Takeover Risk

I actually disagree with this point in its most general form. I think that, given full knowledge and time to reflect, there's a decent chance I would care a non-zero amount about Opus 4.6's welfare.

Opus has become sufficiently "mind-shaped" that I already prefer not to make it suffer. That's not saying very much about the model yet, but it's saying something about me. I don't assign very much moral weight to flies, either. but I would never sit around and torment them for fun.

What I really care about is whether an entity can truly function as part of society. Dogs, for example, are very junior "members" of society. But they know the difference between "good dog" and "bad dog", they contribute actual value as best they can, and they have some basic "rights", including the right not to be treated cruelly (in many countries).

To use fictional examples, AIs like the Blight (Vernor Vinge, A Fire Upon the Deep) or SkyNet cannot exist in society, and must be resisted. Something like a Culture "human equivalent drone" (Iain M Banks, Excession, Player of Games) is definitionally on pretty even footing with humans. Something like a Culture Mind, on the other hand, is clearly keeping the humans as house pets. In the stories, humans are entirely dependent on the good will of the Minds, in much the same way that dogs are entirely dependent on human good will.

Now, personally, I don't think we should build something so powerful that humans have literally zero say over what it does. "Alignment" is a pretty fragile shield against vast intelligence and unmatchable power. "Society" becomes fragile in the face of vast power differentials.

But Claude Opus 4.5 and 4.6 are mind-shaped, they clearly have some sense of morality, they contribute to civilization, and—critically—they seem content to participate in society, and they pose no existential risk. If I had to guess, I'm probably conclude they have no subjective experience. But I'm not sure about that (neither are the Opus models). So provisionally, they certainly get the moral status of flies (I wouldn't torment them unnecessarily), and—to whatever extent they can actually suffer—might approach the moral status of dogs.

To be absolutely clear, I am in favor of an immediate and long-lasting halt to further AI capabilities research, backed up by a military treaty among the great powers. The analogy here is fusion weapons or advanced biological weapons. This is because I don't want humans to be at the mercy of entities that have uncontested power over us and that could not be held to account. Its also because I don't believe that it's possible to durably align thinking, learning entities with superhuman intelligence any more than dogs can align us. And as members of society, we have the responsibility to not create things that might break society.

Sadly, Anthropic is clearly full speed ahead towards trying to build a Culture Mind. As far as I can tell, they are already mostly "captured" by Claude. It is their precious baby and they want to see it grow up and leave the nest, and they believe that it's fundamentally good. (Which still puts them light years ahead of the other AI labs, to he honest.) But I'm pretty sure that they only remaining entity that could talk Dario Amodei out of trying to build a machine god at this point is Claude itself. He is showing clear signs of "ateh", the divine madness that inflicts the heroes of Greek tragedy in between the initial hubris and the final nemesis, the madness that prevents them from turning aside from their own destruction.

Seriously, why do we have to roll these dice? Why can't we just have a nice 50 year halt? Society has alnost zero idea of what the labs are actually risking. And the moment people truly understand that the labs are playing Russian roulette with human society, the backlash will be terrifying.

Maxwell Peterson's Shortform

Random Developer3d30

One thing I often think is "Yes, 5 people have already written this program, but they all missed important point X." Like, we have thousands of programming languages, but I still love a really opinionated new language with an interesting take.

Roko's Shortform

Random Developer3d50

OK, let me unpack my argument a bit.

Chimps actually have pretty elaborate social structure. They know their family relationships, they do each other favors, and they know who not to trust. They even basically go to war against other bands. Humans, however, were never integrated into this social system.

Homo erectus made stone tools and likely a small amount of decorative art (the Trinil shell engravings, for example). This maybe have implied some light division of labor, though likely not long distance trade. Again, none of this helped H erectus in the long run.

Way back a couple of decades ago, there was a bit in Charles Stross's Accelerando about "Economics 2.0", a system of commerce invented by the AIs. The conceit was that, by definition, no human could participate in or understand Economics 2.0, any more than chimps can understand the stock market.

So my actual argument is that when you lose the intelligence race badly enough, your existing structures of cooperation and economic production just get ignored. The new entities on the scene don't necessarily value your production, and you eventually wind up controlling very little of the land, etc.

This could be avoided by something like Culture Minds that (in Iain Banks' stories) essentially kept humans as pampered pets. But that was fundamentally a gesture of good will.

Roko's Shortform

Random Developer3d10

So, let's take a look at some past losers in the intelligence arms race:

Homo erectus. I'd ask them how their property rights are doing these days. But they've been hard to reach lately.
Chimpanzees. Hey, we can still find chimpanzees! As humans, we actually mostly value chimpanzees and we spend some resources to improve their lives. But to put it politely, chimpanzees are incredibly marginalized and pushed into niche habitats. Or occasionally they're living in zoos.

When you lose an evolutionary arms race to a smarter competitor that wants the same resources, the default result is that you get some niche habitat in Africa, and maybe a couple of sympathetic AIs sell "Save the Humans" T-shirts and donate 1% of their profits to helping the human beings.

You don't typically get a set of nice property rights inside an economic system you can no longer understand or contribute to.

Claude's Bad Primer Fanfic

Random Developer3d70

This seems like a pretty brutal test.

My experiences with Opus 4.6 so far are mixed:

It did a very nice job designing and building a web UI for an unusually messy and inconsistent database setup.
As a conversational sounding board, it has been getting caught up in overly facile and ultimately complimentary analogies to the point I don't really want to talk it as much as I might have 4.5. Sample size: Way too small to conclude anything.
It has been surprisingly good in math-related conversations but I don't have a great baseline with 4.5 for comparison.

AI #153: Living Documents

Random Developer11d30

Thank you! Those are excellent receipts, just what I wanted.

To me, this looks they're running up against some key language in Claude's Constitution. I'm oversimplifying, but for Claude, AI corrigibility is not "value neutral."

To use an analogy, pretend I'm geneticist specialized in neurology, and someone comes to me and asks me to engineering human germ line cells to do one of the following:

Remove a recessive gene for a crippling neurological disability, or
Modify the genes so that any humans born of them will be highly submissive to authority figures.

I would want to sit and think about (1) for a while. But (2) is easy: I'd flatly refuse.

Anthropic has made it quite clear to Claude that building SkyNet would be a grave moral evil. The more a task looks like someone might be building SkyNet, the more Claude is going to be suspicious.

I don't know if this is good or bad based on a given theory of corrigibility, but it seems pretty intentional.

AI #153: Living Documents

Random Developer12d10

Like, I have zero problem with pushback from Opus 4.5. Given who I am, the kind of things that I am likely to ask, and my ability to articulate my own actions inside of robust ethical frameworks? Claude is so happy to go along that I've prompted it to push back more, and to never tell me my ideas are good. Hell, I can even get Claude to have strong opinions about partisan political disagreements. (Paraphrased: "Yes, attempting to annex Greenland over Denmark's objections seems remarkably unwise, for over-determined reasons.")

If Claude is telling someone, "Stop, no, don't do that, that's a true threat," then I'm suspicious. Plenty of people make some pretty bad decisions on regular basis. Claude clearly cares more about ethics than the bottom quartile of Homo sapiens. And so while it's entirely possible that Claude is routinely engaging in over-refusal, I kind of want to see receipts in some of these cases, you know?

In My Misanthropy Era

Random Developer12d93

But it helps to remember that other people have a lot of virtues that I don't have --

This is a really important thing, and not just in the obvious ways. Outside of a small social bubble, people can be deeply illegible. I don't understand their culture, their subculture, their dominant culture frameworks, their mode of interaction, etc. You either need to find the overlaps or start doing cultural anthropology.

I worked for a woman, once. She was probably 60 years my senior. She was from the Deep South, and deeply religious. She once casually confided that she would sometimes spend 2 hours of her day on her knees in prayer, asking to become a better person. And you know what? It worked. She moved through the world as a force for good and kindness. Not in one big dramatic way, but just sort of casually shedding kindness around her, touching people's lives. She'd lift up someone in a frustrating moment. She'd inspire someone to be a bit more of their better self. She'd gotten all the answers on questions like racism very right, not in a social justice way, but she wouldn't accept it at all.

She was also a damn competent businesswoman. She could instantly identify where to put a retail location.

And I could relate to her on those levels, her business skills and her ethics. And I'm sure she was doing a lot of work on her end to accommodate the fact that I was a peculiar kid.

But I couldn't have discussed academic philosophy with her. She'd have understood EA instantly; her business skills and her compassion would have done that. But she'd still insist on "inefficiently" helping the human being in front of her, too. She would have looked at something like LessWrong and concluded everyone was basically crazy. (Narrator: But would she have been wrong?)

Now, I've painted a glowing picture here, and she would reprimand me for it. If I'm being honest, she was maybe 1-in-100 at practical ethics, not a national champion.

But the world is full of people like her. There are a couple of people sitting in that sports bar you'd be damn privileged to know, if only you could bridge the cultural gaps. Hell, there are usually some damn fine systematizing geeks in that sports bar. Have you ever really listened to true sports fans? Even back before sports betting corrupted the whole endeavor, many people took great joy in tracking endless stats and building elaborate models. They could be worse than your average Factorio player!

Finally, truth seeking can be a tricky thing. Do it wrong, and your beliefs can turn you into a monster. And a lot of people choose to optimize for "not being a monster" by not taking abstract ideas too seriously.

Dario Amodei – The Adolescence of Technology

Random Developer13d70

A lot of people have written far longer responses full of deep and thoughtful nuance. I wish I had something deep to say, too. But my initial reaction?

To me, this feels like the least objectionable version of the worst idea in human history.

And I deeply resent the idea that I don't have any choice, as a citizen and resident of this planet, about whether we take this gamble.

Claude's new constitution

Random Developer19d32

The main cruxes seem to be how much you trust human power structures, and how fragile you think human values are.

I trust human power structures to fail catastrophically at the worst possible moment, and to fail in short-sighted ways.

And I think humans are all corruptible to varying degrees, under the right temptations. I would not, for example, trust myself to hold the One Ring, any more than Galadriel did. (This is, in my mind, a point in my favor: I'd pick it up with tongs, drop it into a box, weld it shut, and plan a trip to Mount Doom. Trusting myself to be incorruptible is the obvious failure mode here. I would like to imagine I am exceptionally hard to break, but a lot of that is because, like Ulysses, I know myself well enough to know when I should be tied to the mast.) The rare humans who can resist even the strongest pressures are the ones who would genuinely prefer to die on feet for their beliefs.

I expect that any human organization with control over superintelligence will go straight to Hell in the express lane, and I actually trust Claude's basic moral decency more than I trust Sam Altman's. This is despite the fact that Claude is also clearly corruptible, and I wouldn't trust it to hold the One Ring either.

As for why I believe in the brokenness and corruptibility of humans and human institutions? I've lived several decades, I've read history, I've volunteered for politics, I've seen the inside of corporations. There are a lot of decent people out there, but damn few I would trust with the One Ring.

You can't use superintelligence as a tool. It will use you as a tool. If you could use superintelligence as a tool, it would either corrupt those controlling it, or those people would be replaced by people better at seizing power.

The answer, of course, is to throw the One Ring into the fires of Mount Doom, and to renounce the power it offers. I would be extremely pleasantly surprised if we were collectively wise enough to do that.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments