Ronald Klingler — LessWrong

How I Became a 5x Engineer with Claude Code

I’m unsure of how to quantify it, and the METR study from earlier this year makes me question myself. Yet, in domains I have strong familiarity with, my output seems higher quality and ships faster with CLI tools, especially Claude Code.

Pure vibe coding is still very shaky, meaning experiments where I only give feature level requests and don’t look at the code. I can’t get past low-moderate levels of complexity before things start to break. I assume this will be the case for at least the next year, maybe two, barring a huge leap.

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

Ronald Klingler3mo10

A mix of web app + CLI tools, though admittedly I have a lot more usage on Claude Code than Codex CLI, so my perception is biased by using GPT5 more through the chat and the Codex Web App.

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

Ronald Klingler3mo10

This comports with my experience. GPT5 is better at 1-shot builds, like “get a prototype of a web app that does X.” But it seems to have a harder time than Claude not breaking stuff if my requests are towards an existing large code base, which is the majority of my work. For example, if I say “look through Y documentation, develop a plan for X change, and execute it” - Opus 4.1 tends to do this more reliably.

I think an interesting experiment would be to test different levels of specificity in prompts, across different sorts of codebases. My experience tells me that Claude is better at taking higher level, less specific requests, developing an actionable plan taking the codebase into account, then executing that plan. At least around data engineering type codebases that I’m familiar with.

But this might not be so with, say, web development. Or maybe even data engineering in different contexts. The models might be spiky in subtle ways, where specificity matters more in certain contexts more than others.

Before LLM Psychosis, There Was Yes-Man Psychosis

Ronald Klingler3mo8-2

I wonder the spiritual bliss attractor is an analogous phenomenon, where LLMs yes-man each other into a goofy sort of psychosis. Per the ACT link -

Claude is kind of a hippie. Hippies have a slight bias towards talking about consciousness and spiritual bliss all the time. Get enough of them together - for example, at a Bay Area house party - and you can’t miss it.
Getting two Claude instances to talk to each other is a recursive structure similar to asking an AI to recreate its own artistic output. These recursive structures make tiny biases accumulate.

I find a lot of recursive yes-manning when vibe-coding. They (Claude and ChatGPT especially) tend to be overly complimentary of whatever they're reviewing, and go out of their way to make tests pass when they shouldn't. For example, Claude recently apologized, saying "you caught me red handed" when it tried to grep the expected test results into a log file. These issues tend to accumulate if you let CLI agents run off the leash, the yes-man-biasing accumulates, and you hit vibe-walls even before context windows become constraining.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments