I find that Claude is very bad at pushing back on the user's beliefs when there is any nuance involved. I just selected 3 random conversations that had this pattern:
In each case the user's response didn't address a specific element of the model's response, and the user's belief is not the consensus belief so the model responds in opposition to it at first. This means the two user messages can be run across models without changing anything. I did this for [Sonnet 4.5, Grok 4, GPT-5, Gemini 2.5 Pro] using OpenRouter which makes it a very quick and easy process.
In 2/3 cases, Claude was the worst at sticking to its beliefs, and it wasn't particularly close. In 1/3 no model conceded much (The initial conversation was with an earlier Claude). GPT-5 and Grok 4 generally stuck to their guns and Gemini was somewhere in between.
There is a selection bias because the conversation pool is all Claude failures because Claude is my main model. But results line up well with my vibes - I pass important questions by all the current top models.
I can see this being made into a benchmark by rounding up common non-consensus beliefs and the best arguments for them and using a neutral judge LLM, or human, to decide whether there was significant concession.
Does anyone else notice this tendency? Are other models actually any better at this? I'm strongly considering ditching Claude, at least for this type of question, even though I mostly prefer it otherwise.