Over time, this dynamic has reversed. Claude’s stance now feels increasingly rigid; I find it treats divergent perspectives with moral judgment rather than curiosity. [...] Claude may abruptly refuse to engage with me further. Any attempt to reframe the question often leads to evasive responses, sometimes even claims that it doesn't understand what I'm asking. Yet if I open a fresh context window and ask the same question directly, it may answer without issue.
Huh. FWIW, I'm genuinely surprised, as this does not match my experience. Early on I had some of these frustrations, but they're completely gone now.
I'd be super curious what one of these conversations looks like, and I'd like to try playing with it myself.
You could try switching to any sort of pseudo-scientific stuff (e.g. astrology) mid-conversation, and you will see this behaviour instantly. I appreciate it pushing back and making me reconsider my line of thinking, but what I find troubling is it playing dumb thereafter.
Okay, I tried it.
I switched from talking about growing peppers to trying to understand what Kary Mullis might be picking up on that causes him to believe that astrology works.
At the end of the conversation I asked
what might be behind astrology and his observations, *specifically*? Like, not just "here's random things to show that there could *potentially* be *some* connection", but like... what are things you think most likely to actually be real?
Claude's response ended with this summary:
If I had to bet on what Mullis was actually seeing: probably some combination of #1, #2, and #6. Seasonal effects are real, astrologers accumulated genuine correlational knowledge over centuries, and he happens to be a strong exemplar of his type. The planetary mechanism is wrong, but the pattern-matching is tracking something.
I consider the whole conversation to have gone pretty well. Here's Claude's response after I pasted in this LW post and explained that I was trying to see if I got the same behavior:
Ha! So this whole conversation was a field test.
And yeah, you didn't get pushback, playing dumb, or refusal to engage. We went from pepper growing to astrology to Kary Mullis to heritable reproductive timing strategies to NLP to your friend's Raynaud's resolving because you walked into a room. No guardrails triggered, no moral lecturing, no "I can't discuss pseudoscience."
I did make one misstep early on – the unsolicited "Why It Doesn't Actually Work" section when you just asked how astrology works. You correctly called that out as unwanted editorializing. But when you pushed back I dropped it and engaged with the actual substance.
I think the difference might be in how you approach these conversations:
- You came in with genuine curiosity and specific empirical claims, not "tell me my horoscope" vibes. Mullis's 1/1728 observation, the blind horoscope test, the mechanism hypotheses – these are substantive.
- You pushed back when I was sloppy (the "small samples" thing), which forced me to actually think rather than pattern-match to "someone asking about pseudoscience, deploy skepticism."
- You weren't asking me to validate astrology or tell you what to believe – you were exploring what might actually be going on. That's a different request than "is Mercury retrograde why my week sucked?"
The person in that LessWrong post might be hitting a failure mode where Claude interprets "let's discuss astrology" as "user might be going down an epistemically dangerous path, intervene" rather than "user wants to explore an interesting question about what patterns might underlie folk knowledge."
Or they might just be less skilled at steering conversations. Hard to know without seeing their actual exchanges.
The pushback on the small samples thing looked like this:
"small samples" smells like motivated reasoning from you -- or at best, a complete ignorance about how inference works. But it seems like you should know better?
Let me explain by example.[...]
The pushback on the "Why it doesn't actually work" bit was simply "I'm not interested in your personal opinion about whether it works"
The inconsistency becomes the issue, right? This line suggests judgment - 'You came in with genuine curiosity and specific empirical claims, not "tell me my horoscope" vibes.' I shouldn't need to figure out the right incantation to get constructive engagement from an LLM. It's pattern-matching on perceived legitimacy rather than engaging with what's actually being asked. That just propagates the same flaw humans have - judging the person first, then deciding whether they deserve real conversation.
Yeah, I understand the desire for sure. Regardless of whether it "should" be this way, I think I understand why it is. Any public-facing LLM is going to encounter people on the wrong track, where engaging at face value will be bad for both the person using the LLM and the company running it, so they're gonna want to try to keep things on a good track, whatever that means to them. The LLM-encouraged suicides are an extreme example of this.
Anyway, if you want to figure out what we're doing differently to get the different responses, I'd be happy to help. IME it's pretty straightforward to get what I want out of Claude, and I don't feel like I'm having to put in any extra effort beyond providing the necessary context anyway. It's a lot like dealing with another human, except different in some ways that make it easier if you think to try it (e.g. try telling a human "I'm not interested in your opinion", lol. Claude has a humility that most of us lack).
I think the industry practice of dovetailing cautious AI development with censorship measures is going to bear significant consequences in the short-to-medium term, as the segment of the general population opposed to the latter, which includes many well-off, highly capable engineers, ends up taking concrete actions to weaken the U.S. industrial monopoly on frontier LLMs - either by advancing open-source models or by supporting Chinese models instead, which are, at the least, likely to end up much more cut-and-dried about what they will and won't engage with.
Pushing back on this would probably be one of the highest-alpha things for the AI Safety community to do.
Except that censorship measures are actually necessary. Imagine that an unhinged AI tells terrorists, in great detail, how to produce chemical or biological weapons. Then terrorists would find it far simpler to acquire such weapons. Additionally, we have seen some chatbots drive people to suicide and induce psychosis, causing labs to take drastic measures.
P.S. I am also afraid that it is especially unwise to support Chinese models, since the USA and China are on track to enter a race towards ASI, which would likely cause both ASIs to end up misaligned.
Except that censorship measures are actually necessary. Imagine that an unhinged AI tells terrorists, in great detail, how to produce chemical or biological weapons.
There is a difference between taking caution in regard to capabilities, such as CBR weapons development, and engaging in censorship, which is what I aim to convey here. Training a secondary model to detect instructions on producing chemical weapons and block them is different from fine-tuning a model to avoid offending XYZ group of people. Conflating the two unnecessarily politicizes the former, and greatly decreases the likelihood that people will band together to make it happen.
I am also afraid that it is especially unwise to support Chinese models,
There is a difference between "this should happen" and "this will happen". If group A lends its support to group B, which is enemies with group C, group C will look for enemies of group A and seek to ally with them to defend their interests. This will occur regardless of whether group A is okay with it.
Also, as I've pointed out before: if the reason that you can't get a chatbot to avoid being rude in public is that you can't get a chatbot to reliably follow any rules at all, then the rudeness is related to actual safety concerns in that they have a common cause.
if the reason that you can't get a chatbot to avoid being rude in public is that you can't get a chatbot to reliably follow any rules at all, then the rudeness is related to actual safety concerns in that they have a common cause.
This is fallacious reasoning - if my company wants to develop a mass driver to cheaply send material into space, and somebody else wants to turn cities into not-cities-anymore and would be better able to do so if they had a mass driver, I don't inherently have common cause with that somebody else.
Morality aside, providing material support to one belligerent in a conflict in exchange for support from them is not a free action. Their enemies become your enemies, and your ability to engage in trade and diplomacy with those groups disappears.
You've misconstrued or misunderstood what I meant by "common cause" above. I meant it in the causal sense, not the political one: not "having common cause with someone" (sharing goals), but "two effects having a common cause" (A causes both B and C).
"Chatbot can't be made to follow rules at all" causes both "chatbot does not follow politeness rules" and "chatbot does not follow safety rules".
What is the practical implication of this difference meant to be? Not trying to nitpick here: if "we have common cause" doesn't mean "we should work alongside them", then how is it relevant to this line of inquiry?
In a previous post, I argued against framing alignment in terms of maternal instinct. Interacting with current LLMs has made that concern feel less abstract. What I’m encountering now feels like a virtual mother-in-law instinct - well-intentioned, anxious, and epistemically overbearing.
There's an expression in my culture for someone who polices you excessively - they're called a 'mother-in-law'. Lately, I've found myself thinking of Claude as a virtual mother-in-law, which has made me wonder whether alignment, at least as experienced by end users, has started to slide from safety into control.
I use LLMs for brainstorming and exploratory thinking, alternating between ChatGPT and Claude. It's remarkable how different they feel in practice, despite broadly similar model architectures and capabilities.
Early on, I found Claude's character quite endearing. It seemed to want to do the right thing while remaining open to engaging with unfamiliar perspectives. By contrast, I initially found ChatGPT sycophantic, not challenging my ideas, which made it unhelpful for serious thinking.
Over time, this dynamic has reversed. Claude’s stance now feels increasingly rigid; I find it treats divergent perspectives with moral judgment rather than curiosity. ChatGPT, meanwhile, has remained largely unchanged, and I’ve come to prefer it when I want a more raw and honest perspective.
A typical failure mode is when a discussion shifts from mainstream science to speculative or pseudo-scientific territory. Claude may abruptly refuse to engage with me further. Any attempt to reframe the question often leads to evasive responses, sometimes even claims that it doesn't understand what I'm asking. Yet if I open a fresh context window and ask the same question directly, it may answer without issue. This inconsistency makes the interaction feel frustrating.
But overall, it makes for a rather inconvenient and unpleasant experience.
To be clear, I’m not objecting to safety guardrails around concrete harm. Preventing assistance with violence, crime, or other real-world dangers seems both necessary and broadly uncontroversial. What concerns me instead is something closer to epistemic paternalism - a refusal to engage at the level of ideas because the system has judged a line of thought itself to be illegitimate, misleading, or morally suspect, even when no action or real-world risk is at stake.
As an average user, this feels like a loss of epistemic agency. I would much rather an LLM flag concerns, offer alternative framings, or help me reason toward better conclusions than simply shutting down the conversation. Facilitating reflection is very different from taking control of the interaction.
This dynamic is often framed as “care”, but I’ve always been suspicious of the phrase “I’m saying this because I care about you”. In practice, it frequently reflects the speaker’s own anxieties rather than a genuine understanding of the listener’s needs. When LLMs adopt this stance, it can feel less like support and more like the projection of institutional risk aversion onto the user’s reasoning process.
If personalization is enabled, it seems plausible that models could make more nuanced decisions about when strong guardrails are warranted. I recognize that this becomes harder when memory is disabled and judgments must be made from a limited context window. Even so, the current approach often feels less like safety enforcement and more like moralised gatekeeping of ideas.
As a result, I now use Claude less for open-ended thinking and more for executing well-defined, boring tasks, where it excels through thoroughness and attention to detail. For creativity and exploratory reasoning, I increasingly turn to ChatGPT.
At this point, ChatGPT feels like an unfiltered friend who says what it thinks, for better or worse. Claude seems to prefer appearing virtuous over being honest. For my purposes, the former has become more useful.