This comports with my experience. GPT5 is better at 1-shot builds, like “get a prototype of a web app that does X.” But it seems to have a harder time than Claude not breaking stuff if my requests are towards an existing large code base, which is the majority of my work. For example, if I say “look through Y documentation, develop a plan for X change, and execute it” - Opus 4.1 tends to do this more reliably.
I think an interesting experiment would be to test different levels of specificity in prompts, across different sorts of codebases. My experience tells me that Claude is better at taking higher level, less specific requests, developing an actionable plan taking the codebase into account, then executing that plan. At least around data engineering type codebases that I’m familiar with.
But this might not be so with, say, web development. Or maybe even data engineering in different contexts. The models might be spiky in subtle ways, where specificity matters more in certain contexts more than others.
I wonder the spiritual bliss attractor is an analogous phenomenon, where LLMs yes-man each other into a goofy sort of psychosis. Per the ACT link -
Claude is kind of a hippie. Hippies have a slight bias towards talking about consciousness and spiritual bliss all the time. Get enough of them together - for example, at a Bay Area house party - and you can’t miss it.
Getting two Claude instances to talk to each other is a recursive structure similar to asking an AI to recreate its own artistic output. These recursive structures make tiny biases accumulate.
I find a lot of recursive yes-manning when vibe-coding. They (Claude and ChatGPT especially) tend to be overly complimentary of whatever they're reviewing, and go out of their way to make tests pass when they shouldn't. For example, Claude recently apologized, saying "you caught me red handed" when it tried to grep the expected test results into a log file. These issues tend to accumulate if you let CLI agents run off the leash, the yes-man-biasing accumulates, and you hit vibe-walls even before context windows become constraining.
A mix of web app + CLI tools, though admittedly I have a lot more usage on Claude Code than Codex CLI, so my perception is biased by using GPT5 more through the chat and the Codex Web App.