I was reading through the Claude Mythos Preview System Card and found this nice plot on page 97. The caption on this plot says "Dishonesty rate," and Claude Mythos Preview scores the highest at 80.0%. The title and context suggest the plot actually shows the honesty rate instead. An honest mistake?
Screenshot of Page 97
https://www.anthropic.com/claude-mythos-preview-system-card
Should Double-Checking Work with AI Be the Norm?
As a quick check, I uploaded the PDF to claude.ai and asked Opus 4.6 if there are any issues in the plots in the file. The first thing it flagged was this plot. Personally, I've found using AI to double-check human work to be one of the clearest value-add cases for current AI models. Maybe this is just an isolated incident, but are others not doing the same? I know false positives can be an issue, and sorting through a laundry list of supposed LLM-flagged issues can be a chore. But asking for a ranking and checking even the top three suggestions has provided value in multiple projects for me, both personal and professional.
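The "ask for a ranking, then check the top three" habit can be sketched in a few lines of Python. This is only an illustration: the `FlaggedIssue` structure, its field names, and the `top_issues` helper are hypothetical names made up for this example, and it assumes the model's ranked answer has already been copied into a list by hand.

```python
from dataclasses import dataclass


@dataclass
class FlaggedIssue:
    """One issue reported by the model (hypothetical structure for this sketch)."""
    location: str     # where the issue is, e.g. "page 97 plot"
    description: str  # what the model thinks is wrong
    rank: int         # model-assigned priority, 1 = most likely a real problem


def top_issues(issues: list[FlaggedIssue], k: int = 3) -> list[FlaggedIssue]:
    """Return the k highest-priority issues so only those get a manual check."""
    return sorted(issues, key=lambda issue: issue.rank)[:k]


if __name__ == "__main__":
    # Placeholder entries; the ranks and wording are assumptions, not model output.
    flagged = [
        FlaggedIssue("placeholder C", "inconsistent axis units", 3),
        FlaggedIssue("placeholder A", "label contradicts title", 1),
        FlaggedIssue("placeholder D", "legend color reused", 4),
        FlaggedIssue("placeholder B", "label contradicts context", 2),
    ]
    for issue in top_issues(flagged):
        print(f"{issue.rank}. {issue.location}: {issue.description}")
```

Capping the review at a small `k` is what keeps the false-positive cost manageable: the long tail of low-ranked flags is simply never read.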
Are There More Issues?
Scrolling further, we find this nice plot on hallucination rates. Here the caption suggests Mythos Preview hallucinates the most, and by a pretty wide margin too. Again, the title and context seem to suggest the opposite is true. This plot was also flagged in my generic request to Opus 4.6 to "find any issues in the plots in this file."
Screenshot of Page 99
https://www.anthropic.com/claude-mythos-preview-system-card
All in all, I suspect this is just a labeling issue. But it is worth double-checking the code and benchmarks that generated these plots. My greater worry is that other people (and AI models, for that matter) index pretty heavily on plots such as these. I really want to make sure future AIs don't train on this data and come to a false conclusion about issues in Claude models. I know investment decisions, safety research directions, and general world modeling regularly incorporate key findings like this. And uncovering issues makes it harder to trust other results.
I really appreciate all the detail Anthropic is publishing, especially given the hectic nature of the field right now. But I want to emphasize that the solution shouldn't be "the number of issues with published data can be reduced by publishing less." I appreciate the number of corrections they've published to results in the past and hope to get this case investigated.