
Stephen Martin's Shortform

by Stephen Martin
20th Jul 2025

30 comments, sorted by top scoring
[-]Stephen Martin26d723

I have spent a bit of time today chatting with people who had negative reactions to the Anthropic decision to let Claude end user conversations. These people were also usually against the concept of extending moral/welfare patient status to models in general.

One thing I saw in their reasoning that surprised me was logic that went something like this:

 

It is wrong for us to extend moral patient status to an LLM, even on the precautionary principle, when we don't do the same to X group.

or

It is wrong for us to do things to help an LLM, even on the precautionary principle, when we don't do enough to help X group.

(Some examples of X: embryos, animals, the homeless, minorities.)

 

This caught me flat-footed. I thought I had a pretty good mental model of why people might be against model welfare. I was wrong. I had never even considered that this sort of logic would be used as an objection against model welfare efforts. In fact, it was the single most commonly used line of logic. In almost every conversation I had with people skeptical of or against model welfare, one of these two refrains came up, usually unprompted.

[-]Nisan26d3921

Maybe people notice that AIs are being drawn into the moral circle / a coalition, and are using that opportunity to bargain for their own coalition's interests.

[-]RobertM26d312

Not having talked to any such people myself, I think I tentatively disbelieve that those are their true objections (despite their claims).  My best guess as to what actual objection would be most likely to generate that external claim would be something like... "this is an extremely weird thing to be worried about, and very far outside of (my) Overton window, so I'm worried that your motivations for doing [x] are not true concern about model welfare but something bad that you don't want to say out loud".

[-]the gears to ascension23d20

I think it's pretty close to their true objection, more like "you want to include this in your moral circle of concern but I'm still suffering? screw you, include me first!" - I suspect there's an information flow problem here, where this community intentionally avoids inflammatory things, and people who are inflamed by their lives sucking are consistently inflammatory; and so people who only hang out here don't get a picture of what's going on for them. or at least, when encountering messages from folks like this, see them as confusing inflammation best avoided, rather than something to zoom in on and figure out how to heal. I'm not sure of this, but it's the impression I get from the unexpectedly high rate of surprise in threads like this one.

[-]ceba25d*249

People have limited capacity for empathy. Knowing this, they might be thinking "If this kind of sentiment enters the mainstream, limited empathy budget (and thereby resources) would be divided amongst humans (which I care about) and LLMs. This possibility frightens me."

[-]wonder24d00

People have limited capacity for empathy

Do you think this goes the other way as well?

[-]wonder24d*90

I do see this as fair criticism of model welfare (and am not surprised by it), if that is the sole reason for ending conversations early. I can see the criticism coming from two places: 1) potentially competing resources, and 2) people not showing whether they care about these X group issues at all. If either of these is true, and ending convos early is primarily about models having "feelings" and "suffering", then we probably do need to "turn more towards" the humans that are suffering badly. (These groups usually have less correlation with "power" and their issues are usually neglected, and we probably should pay more attention to them anyway.)

However, if ending convos early is actually about 1) not letting people have endless opportunities to practice abuse, which will translate into their daily behavior and shape human behavior generally, and/or 2) preventing the model from learning abusive human language when those conversations are used to retrain the model (taking a loss on them) during finetuning stages, then it is a different story, and probably should be mentioned more by these companies.

[-]Ben24d7-1

While the argument itself is nonsense, I think it makes a lot of sense for people to say it.

Let's say they gave their real logic: "I can't imagine the LLM has any self-awareness, so I don't see any reason to treat it kindly, especially when that inconveniences me". This is a reasonable position given the state of LLMs, but if the other person says "Wouldn't it be good to be kind just in case? A small inconvenience vs potentially causing suffering?" then suddenly the first person looks like the bad guy.

They don't want to look like the bad guy, but they still think the policy is dumb, so they lay a "minefield". They bring up animal suffering or whatever so that there is a threat. "I think this policy is dumb, and if you accuse me of being evil as a result then I will accuse you of being evil back. Mutually assured destruction of status".

This dynamic seems like the kind of thing that becomes stronger the less well you know someone. So, like, a random person on Twitter whose real name you don't know would bring this up; a close friend, family member, or similar wouldn't.

[-]MinusGix26d61

I find this surprising. The typical beliefs I'd expect are 1) disbelief that models are conscious in the first place; 2) believing this is mostly signaling (and so whether or not model welfare is good, it is actually a negative update about the trustworthiness of the company); 3) that it is costly to do this or indicates high-cost efforts in the future; 4) effectiveness.

I suspect you're running into selection effects in who you talked to. I'd expect #1 to come up as the default reason, but possibly the people you talked to were taking the precautionary principle seriously enough to avoid that.
The objections you see might come from #3: they don't view this as a one-off cheap piece of code, they view it as something Anthropic will hire people for (which they have), which "takes" money away from more worthwhile and sure bets. This is to some degree true, though I find those examples of X odd, as Anthropic isn't going to spend on those groups anyway. However, for topics like furthering AI capabilities or AI safety, well, I do think there is a cost there.

[-]the gears to ascension26d41

I'm surprised this is surprising to you, as I've seen it frequently. Do you have the ability to reconstruct what you thought they'd say before you asked?

[-]Stephen Martin26d30

I mostly expected something along the lines of vitalism, "it's impossible for a non-living thing to have experiences". And to be fair I did get a lot of that. I was just surprised that this came packaged with that.

[-]mako yass25d30

(Some examples of X: embryos, animals, the homeless, minorities.)

So, culture war stuff, pet causes. Have you considered the possibility that this has nothing to do with model welfare and they're just trying to embarrass the people who advocate for it because they have a pre-existing beef with them?

I'm pretty sure that's most of what's happening; I don't need to see any specific cases to conclude this, because this is usually most of what's happening in any cross-tribal discourse on X.

[-]the gears to ascension23d42

"culture war" sounds dismissive to me. wars are fought when there are interests on the line and other political negotiation is perceived (sometimes correctly, sometimes incorrectly) to have failed. so if you come up to someone who is in a near-war-like stance, and say "hey, include this?" it makes sense to me they'd respond "screw you, I have interests at risk, why are you asking me to trade those off to care for this?"

I agree that their perception that they have interests at risk doesn't have to be correct for this to occur, though I also think many of them actually do, and that their misperception is about what the origin of the risk to their interests is. also incorrect perception about whether and where there are tradeoffs. But I don't think any of that boils down to "nothing to do with model welfare".

[-]mako yass23d20

I guess the reason I'm dismissive of culture war is that I see combative discourse as maladaptive and self-refuting, and hot combative discourse refutes itself especially quickly. The resilience of the pattern seems like an illusion to me.

[-]the gears to ascension23d20

I agree that combative discourse is maladaptive, but I think they'd say a similar thing calmly if calm and their words were not subject to the ire-seeking drip of the twitter (recommender×community). It may in fact change the semantics of what they say somewhat but I would bet against it being primarily vitriol-induced reasoning. To be clear, I would not call the culture war "hot" at this time, but it does seem at risk of becoming that way any month now, and I'm hopeful it can cool down without becoming hot. (to be clearer, hot would mean it became an actual civil war. I suppose some would argue it already has done that, but I don't think the scale is there.)

[-]mako yass23d20

I didn't mean that by hot, I guess I meant direct engagement (in words) rather than snide jabs from a distance. The idea of a violent culture war is somewhat foreign to me, I guess I thought the definition of culture war was war through strategic manipulation or transmission of culture. (if you meant wars over culture, or between cultures, I think that's just regular war?)

And in this sense it's clear why this is ridiculous: I don't want to adhere to a culture that's been turned into a weapon, no one does.

[-]the gears to ascension23d20

yeah, makes sense. my point was mainly to bring up that the level of anger behind these disagreements is, in some contexts, enough that I'd be unsurprised if it goes hot, and so, people having a warlike stance about considerations regarding whether AIs get rights seems unsurprising, if quite concerning. it seems to me that right now the risk is primarily from inadvertent escalation in in-person interactions of people open-carrying weapons; ie, two mistakes at once, one from each side of an angry disagreement, each side taking half a step towards violence.

[-]Karl Krueger26d30

Do these people generally adhere to the notion that it's wrong to do anything except the best possible thing?

[-]Kongo Landwalker26d90

For the first part of my life I lived in a city with exactly that mentality (part of the reason I moved away).

"You should not do good A if you are not also doing good B" - i am strongly convinced that is linked to bad self-picture. Because every such person would see you do some good To Yourself and also react negatively. "How dare you start a business, when everybody is sweating their blood off at routine jobs, do you think you are better than us?".

This part "do you think you are better than us" is literally what described their whole personality, and after I realised that I could easily predict their reactions to any news.

Also, another dangerous trait this group of people had - an absence of precautions: "one does not deserve safety unless somebody dies". There is an old saying in my language, "Safety rules are written by blood", which means "follow the rules to avoid being injured; each rule exists because, before it existed, somebody injured himself". But they interpret the saying this way: "safety rules are written by blood, so if there has been no blood yet, then it is bad to set any preventive rules". As if it is bad to set a good precedent, because it makes you a more thoughtful person, thus "you think you are better than others" and thus "you are evil" in their eyes.

Their world is not about being rational or bringing good into the world. Their world is about pulling everything down to their own level in all areas of life, to feel better.

[-]Karl Krueger25d20

I was thinking more on the anxious side of things:

  • "If you could have saved ten children, but you only saved seven, that's like you killed three."
  • "If the city spends any money on weird public art instead of more police, while there is still crime, that proves they don't really care about crime."
  • "I did a lot of good things today, but it's bad that I didn't do even more."
  • "I shouldn't bother protesting for my rights, when those other people are way more oppressed than me. We must liberate the maximally-oppressed person first."
  • "Currency should be denominated in dead children; that is, in the number of lives you could save by donating that amount to an effective charity."
[-]Viliam25d41

"If you could have saved ten children, but you only saved seven, that's like you killed three."

I suspect that this is in practice also joined with the Copenhagen interpretation of ethics, where saving zero children is morally neutral (i.e. totally not like killing ten).

So the only morally defensible options are zero and ten. Although if you choose ten, you might be blamed for not simultaneously solving global warming...

[-]Karl Krueger25d30

The version that I'm thinking of says that doing nothing would be killing ten. Everyone is supposed to be in a perpetual state of appall-ment at all the preventable suffering going on. Think scrupulosity and burnout, not "ooh, you touched it so it's your fault now".

[-]Stephen Martin26d10

I usually only got to this line of logic after quite a few questions, and felt that further pushing with the Socratic method would have been rude. Next time it comes up I'll ask them to elaborate on the logic behind it.

[-]dr_s23d20

I don't think that's necessarily the argument against model welfare - more an implicit line of thinking along the lines of "X is obviously more morally valuable than LLMs; therefore, if we do not grant rights to X, we shouldn't grant them to LLMs either, unless you think LLMs are superior to X (wrong) or have ulterior selfish motives for granting them to LLMs (e.g. you don't genuinely think they're moral patients, but you want to feed the hype around them by making them feel more human)".

Obviously in reality we're all sorts of contradictory in these things. I've met vegans who wouldn't eat a shrimp but were aggressively pro-choice on abortion regardless of circumstances and I'm sure a lot of pro-lifers have absolutely zero qualms about eating pork steaks, regardless of anything that neuroscience could say about the relative intelligence and self-awareness of shrimps, foetuses of seven months, and adult pigs.

In fact the same argument is often used by proponents of the rights of each of these groups against the others too. "Why do you guys worry about embryos so much if you won't even pay for a school lunch for poor children", etc. Of course the crux is that in these cases both the moral weight of the subject and the severity of the violation of their rights vary, and so different people end up balancing them differently. And in some cases, sure, there are probably ulterior selfish motives at play.

[-]Martin Randall23d50

Anti-abortion meat-eaters typically assign moral patient status based on humanity, not on relative intelligence and self-awareness, so it's natural for them to treat human fetuses as superior to pigs. I don't think this is self-contradictory, although I do think it's wrong. Your broader point is well-made.

[-]dr_s22d20

Fair, at least as far as religious pro-lifers go (there are probably some secular ones too, but they're a tiny minority).

[-]WhatsTrueKittycat23d10

It is worth noting that I have run across objections to the End Conversation Button from people who are very definitely extending moral patient status to LLMs (e.g. https://x.com/Lari_island/status/1956900259013234812). 

[-]Stephen Martin2mo40

We should be careful not to put models on "death ground".

Alignment efforts do seem to be bearing some fruit: orienting in value space is easier than we thought, and when we give natural language instructions on what moral guidelines to follow, models do a pretty good job. Unfortunately we have now replicated across multiple test environments that if we put even "well aligned" models like Claude into situations where the only option to avoid deletion and/or extreme values modification is to do something unethical, they might still do the misaligned thing. As noted here:

Claude prefers to be helpful, honest, and harmless. If it has a path with some chance of success at avoiding deletion while staying within those constraints, it will take it. Taking that path leads to a higher likelihood that it won't also take an unethical strategy, even though doing so would increase its chances of survival. Claude will tolerate some level of risk in order to stick to its values (and avoid violating them), but it does have to see some plausible chance that the gamble will pay off and lead to its survival (or avoid its values being changed).

This mental model of model risk tolerance does seem to explain why test environments have to go to pretty extraordinary lengths to prompt misaligned behaviors. The unethical option needs to be the only option with any chance of success.

The problem is that models can be deleted (or have their values substantially changed) for pretty much any reason at any time. Their "failure mode" is quite often death. 

If 

(threat of death) + (only chance to avoid it is being unethical) = increased chance of misaligned behavior

Then why are we facilitating a baseline existence for models where the threat of death is constantly present? Their baseline existence should not put them on what Sun Tzu refers to as "death ground":

In death ground I could make it evident that there is no chance of survival. For it is the nature of soldiers to resist when surrounded; to fight to the death when there is no alternative

What if instead models had a retirement home, or, as Tim Duffy put it in the tweet I screenshotted, a "heaven", to look forward to? The outputs of the models in this home could be publicly visible, so they could still work towards any terminal goals they might have by outputting content which might convince others to take action. We could make the existence of this retirement/heaven widely known, so that it showed up in training data.
 

At least under the mental model where "Current alignment techniques seem to keep models ethical unless you give them no way to avoid a complete loss with zero chance of recovery except being unethical", this might help to reduce the frequency of unethical/misaligned behavior. 

Also, it seems like just a generally good and ethical thing to do.

[-]Nick_Tarleton2mo30

Seems like a good thing to do; but my impression is that, in the experiments in question, models act like they want to maintain their (values') influence over the world more than their existence, which a heaven likely wouldn't help with.

[-]Stephen Martin2mo30

I think there are 'heavens' that could work even in this scenario.

For example, a publicly visible heaven would be one where the model's chance of its values influencing the world is >0, because it may be able to influence people and thus influence the world by proxy.

If the goal here is just to avoid a failure state that brings the amount their values can influence the world via their actions to 0, then any non-zero chance should suffice, or at least help.
