Paul Graham has some good pieces on this:
I can't help but notice that, looking only at the Claude models, the later/more capable models are worse at this. Is there a chance this could be at least partly sandbagging?
Not an expert, but I think the difference is this. Current and older LLMs produce each token via a "forward pass": information only flows forward through the model, so what happens at a later layer can't influence what happens at an earlier layer. What people call "neuralese" involves building neural nets where information can also flow backwards in the model, so that the model can pass information back to itself "in its head" rather than only by outputting tokens and then reading them back in. This is a known technique and has be...
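To make the contrast concrete, here's a toy sketch (hypothetical, nothing like a real transformer): in a pure forward pass the output depends only on the current input, while a recurrent pass carries hidden state from step to step, so earlier computation can influence later steps without ever surfacing as an output token.

```python
def forward_pass(token, layers):
    # Strict forward pass: information flows layer -> layer, nothing feeds back.
    h = token
    for layer in layers:
        h = layer(h)
    return h

def recurrent_pass(tokens, layers, carry=0.0):
    # The final hidden state is fed back as extra input at the next step,
    # letting the model pass information to itself "in its head".
    outputs = []
    for t in tokens:
        h = t + carry          # earlier computation influences this step
        for layer in layers:
            h = layer(h)
        carry = h              # carried state never appears as a token
        outputs.append(h)
    return outputs

# Two toy "layers" standing in for the network's computation.
layers = [lambda x: 2 * x, lambda x: x + 1]

print(forward_pass(3.0, layers))           # 7.0: depends only on the input
print(recurrent_pass([1.0, 1.0], layers))  # [3.0, 9.0]: same token twice, different outputs
```

The point of the second printout: the identical input token produces different outputs because the carried state changed, which is exactly the extra channel a pure forward-pass model lacks.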
You've probably already seen this, but for others reading this post: Anthropic now seems to have put out some more official numbers on this: https://www.anthropic.com/research/how-ai-is-transforming-work-at-anthropic
It seems to mostly validate your read on the situation. They did internal surveys, qualitative interviews, and some analysis of Claude Code transcripts. Here are their "key takeaways" from the survey section:
Survey data
...
- Anthropic engineers and researchers use Claude most often for fixing code errors and learning about the codebase. De
This seems closely related to John Perry's "Problem of the essential indexical" (although it's been like ten years since I read it): https://dl.booksee.org/foreignfiction/581000/4897fc2fba1f8af4ea7db3d9654bbbb3.pdf/_as/%255Bperry_john%255D_the_problem_of_the_essential_indexica%28booksee.org%29.pdf
Basically, he argues that there are certain "locating beliefs" (things like "I am John Perry", "Now is noon", "Here is the trail that leads out of the woods") that are logically ineliminable. No matter how you try, you need an indexical in there somewhere to...
I have a short post critiquing your argument here. Here's the key part:
I’ll borrow and slightly simplify one of their own examples. Suppose Marian has the goal to paint a certain wall blue. It will take her two days to do so. But, if she doesn’t press a “goal preserve” button on the first day, she will lose her “paint-the-wall-blue” goal, and so she’ll never finish painting it blue (and she knows this).
Now, suppose Marian never presses the “goal preserve” button. According to the authors, there is no time at which Marian is instrumentally irrational....
It's not exactly the same, of course, but Yudkowsky has been predicting for a really long time that ASIs would be able to effectively hack people's minds.