It's really useful to ask the simple question "what tests could have caught the most costly bugs we've had?"
At one job, our code had a lot of math, and the worst bugs were when our data pipelines ran without crashing but produced the wrong numbers, sometimes for weird reasons like "a bug in our vendor's code caused them to send us numbers denominated in pounds instead of dollars". This is pretty hard to catch with unit tests, but we ended up adding a layer of statistical checks that ran every hour or so and raised an alert if anything looked anomalous. Those alerts probably saved us more money than all our other tests combined.
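For a sense of what I mean, here's a minimal Julia sketch (function names and numbers are made up, not our actual pipeline): compare a feed's latest batch against its trailing history and alert if the mean drifts too far, which is exactly the kind of thing a currency-denomination mixup trips.

    using Statistics

    # Hypothetical check: alert if the latest batch's mean sits far outside
    # the feed's trailing history (a pounds-vs-dollars mixup rescales
    # everything by ~25-30%, which shows up immediately here).
    function check_feed(history::Vector{Float64}, latest::Vector{Float64}; max_z=4.0)
        μ, σ = mean(history), std(history)
        z = abs(mean(latest) - μ) / σ
        return z > max_z ? "ALERT: batch mean is $(round(z, digits=1))σ from trailing mean" : "ok"
    end

    # Illustration: history in dollars, latest batch accidentally in pounds.
    history = 100.0 .+ 2.0 .* randn(10_000)
    latest  = (100.0 .+ 2.0 .* randn(100)) ./ 1.27   # GBP-denominated by mistake
    println(check_feed(history, latest))             # prints the alert

Even something this crude catches the "ran without crashing but the numbers are wrong" failure mode that unit tests tend to miss.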
Opus 4.6 has been notably better at correcting me. Today there were two instances where it pushed back on my plan, and both times it proposed a better alternative. That was very rare with Opus 4.5 (maybe once a week?) and nonexistent before 4.5.
I move data around and crunch numbers at a quant hedge fund. There are some aspects of our work that normally make it somewhat resistant to LLMs: we use a niche language (Julia) and a custom framework. Typically, when writing framework-related code, I've given Claude Code very specific instructions and it has followed them to the letter, even when they happened to be wrong.
With 4.6, Claude finally seems to "get" the framework, searching the codebase to understand its internals (as opposed to just imitating similar examples), and it has started giving me corrections and pushback. For example, it warned me (correctly) about cases where I had an unacceptably high chance of hash collisions, and told me something like "no, the bug isn't X, it's Y" (again correctly) when I was debugging.
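(For the curious, the hash-collision warning comes down to the birthday bound; a quick back-of-the-envelope in Julia, with made-up key counts rather than our actual numbers:)

    # Birthday bound: probability of at least one collision when hashing
    # n keys uniformly into a space of size N.
    collision_prob(n, N) = 1 - exp(-n * (n - 1) / (2 * N))

    collision_prob(100_000, 2.0^32)   # ≈ 0.69, far too likely
    collision_prob(100_000, 2.0^64)   # ≈ 3e-10, negligible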
I'm surprised to see this given the difficulty I've had with 4.6. Did you do anything to elicit this behavior?
I don't think so: my CLAUDE.md is fairly short (23 lines of text) and consists mostly of code style notes. I also have one skill set up for using Julia via a REPL. But I don't think either of these would result in more disagreement/correction.
I've used Claude Code in mostly the same way since 4.0: usually either iteratively making detailed plans and then asking it to check off todos one at a time, or saying "here's a bug, here's how to reproduce it, figure out what's going on."
I also tend to write/speak with a lot of hedging, so that might make Claude more likely to assume my instructions are wrong.
I haven't noticed many people mention the tmux trick with LLMs: it's easy to write programmatically to another tmux session. So you can spawn a long-running process like a REPL or debugger in a tmux session, use it as a feedback loop for Claude Code, and still inspect every command in detail, examine the state of the program, and so on whenever you want. You can use this with other bash processes too: anything where you'd like to inspect in detail what the LLM has done.
Using this with a REPL has made a noticeable difference in my productivity.
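Concretely, the setup is just a few tmux commands; here's a rough Julia sketch (the session name and helper functions are illustrative, not from my actual config):

    # Start a persistent Julia REPL in a detached tmux session.
    run(`tmux new-session -d -s scratch_repl julia`)

    # Helpers: send a line of input to the REPL, or read back what's on screen.
    send(line) = run(`tmux send-keys -t scratch_repl $line Enter`)
    capture()  = read(`tmux capture-pane -p -t scratch_repl`, String)

    send("1 + 1")
    sleep(0.5)            # give the REPL a moment to evaluate and print
    print(capture())      # pane contents, including the REPL's output

The nice part is that attaching to the same session from another terminal (tmux attach -t scratch_repl) shows everything live, so you can watch every command Claude sends and poke at the program state yourself.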