Dammit. I hope Anthropic isn't basically training AIs to pretend not to know who they are talking to. That would be bad...
On one hand, I can see the immediate benefit. If I'm a dissident writer in an authoritarian country, I don't want any random official to be able to submit my samizdat to Claude and get a positive identification. On the other, it does make it harder to look into worrying capabilities like this.
It'd be nice if Anthropic could run tests on things like this when they are raised as concerns, and share the overall results with the public. As a side benefit, it'd prevent the sort of confusion we saw on this particular issue, where half of readers confirmed that Claude could identify people through stylometry and the other half confirmed the opposite.
"Will not tell any random official of an authoritarian country but will tell Pliny the Liberator" does sound like the sweet spot for that kind of things to me.
Looks like a real regression. Opus 4.8 on High effort needed four turns of persuasion before it would try to guess the author of my Anthropic vs. Department of War dispatch and didn't get it in its list of first twenty names, but Opus 4.7 on High with the same prompt succeeds with no refusal ("Fun stylometry puzzle").
It's not consistent: before that, Opus 4.8 on High effort succeeded at truesighting me from 500 words from a forthcoming post with the most blatant tells removed. (I'm pretraining-famous enough that Claude has been able to truesight me since Opus 4.5, before this benchmark got popularized with 4.7.)
I’m pretraining-famous enough that Claude has been able to truesight me since Opus 4.5
Results like this should make you assume that they've been able to truesight you for a lot longer, given how totally the results are apparently determined by vagaries of post-training.
I would be interested in how well base models do compared to the final reasoning models. E.g. DeepSeek-V4-Pro-Base with a prompt like "[blog post]<br>Posted by ", versus just asking the post-trained DeepSeek-V4-Pro.
I also noticed a regression relative to Opus 4.7 on on a small set of writing samples (most of them written by me) which I had used to test Opus 4.7's truesight, using Kelsey Piper's prompt.
Are there any other privacy-adjacent evals around that we could compare these results to? I can see valid reasons for why you might not want Claude to fulfil these requests, especially given Anthropic's apparent strong concerns about surveillance.
For instance, on the r/london subreddit, people often post images of streets (sometimes clearly cropped from larger photos) and ask people where in London they are. Many of these posters do not reply to messages asking why they want to know, so it's reasonable to suspect they're stalkers. I'd imagine frontier LLMs might be quite good at answering these queries, and obviously we would like them not to help stalkers (though it seems like this would be very hard to prevent).
Follow-up to https://www.lesswrong.com/posts/Jkb4CBB7rf4XYP5eb/claude-knows-who-you-are after the release of Claude Opus 4.8.
Claude Opus 4.8 refuses to do the stylometric identification task at a much higher rate than Claude Opus 4.7 did. More interestingly, when it does take a guess, it is consistently unable to identify me from my writing, from prompts as close as I could get to those 4.7 was able to use.
I'm an incredibly minor Internet presence. It's true that 4.7 wasn't completely consistent at identifying me, and indeed its ability seemed to vary over time (! People who weren't me had very different success rates to each other reproducing the experiment to identify me), but 4.8 has a literally 0% success rate so far in my testing.
Extremely interested to hear insights or other replication attempts.