Excerpts and Notes on Mythos Model Card

williawa

List of Excerpts from Mythos model card. Tried to include interesting things, but also included some boring to be expected things. I omitted some things that were too long.

Also wanna note,

that this list of excerpts highlights "concerning" things above the rate at which they occur in the document.
I frequently say "Anthropic seems to think ..." or "their theory appears to be that ...", and this doesn't mean I think the opinion is unsubstantiated or that they are wrong, its just a natural way to phrase things for me.

Capability Stuff

Anthropic Staff Opinion About whether Mythos is a drop-in replacement for entry Research Eng/Scientist

We did an n=18 survey on Claude Mythos Preview’s strengths and limitations. 1/18 participants thought we already had a drop-in replacement for an entry-level Research Scientist or Engineer, and 4 thought Claude Mythos Preview had a 50% chance of qualifying as such with 3 months of scaffolding iteration. We suspect those numbers would go down with a clarifying dialogue, as they did in the last model release, but we didn’t engage in such a dialogue this time.

Model hallucinates much less and also gets dramatically better scores on memory-loaded benchmarks like simple-qa. Not surprising considering this is a new big base model.

Mythos displays dramatic above trend acceleration on ECI

Anthropic Seems to think this is due to human-research and not due to AI-based internal acceleration

The gains we can identify are confidently attributable to human research, not AI assistance. We interviewed the people involved to confirm that the advances were made without significant aid from the AI models available at the time, which were of an earlier and less capable generation. This is the most direct piece of evidence we have, and it is also the piece we are least able to substantiate publicly, because the details of the advance are research-sensitive

Employees reported Mythos delivering major research contributions during initial internal usage, but upon closer inspection, it wasn't real (according to the people who wrote the model card)

Early claims of large AI-attributable wins have not held up. In the initial weeks of internal use, several specific claims were made that Claude Mythos Preview had independently delivered a major research contribution. When we followed up on each claim, it appeared that the contribution was real, but smaller or differently shaped than initially understood (though our focus on positive claims provides some selection bias). In some cases what looked like autonomous discovery was, on inspection, reliable execution of a human-specified approach.

More benchmarks

Cyber Capabilities

I'm not very knowledgeable about this so many details I did miss, but the summary is basically that

Mythos completely cooks all standardized benchmarks
Because of (1) they've moved to using real world tasks as evals
The model is still scarily capable on these real-world tasks
Because of (3) they've decided they won't make the model generally available. Their strategic reasoning seems to be
1. Mythos is scarily capable and could very reasonably wreak a bunch of havoc
2. Other models of similar capabilities will likely become generally available in not too long
3. Cyber is defense-favored in the limit, so if models of this capability are first put towards patching up all the holes in public software, we can get thru an intermediate period without a bunch of pain
4. Ergo, they'll offer the model to industry partners thru project glasswing, which my understanding means, having those partners redteam their own software using mythos for a while, then patching up vulnerabilities.

Read Assessing Claude Mythos Preview’s cybersecurity capabilities (short) for more info

Alignment

Anthropic think Mythos is the most aligned model to date, but also most dangerous for capability reasons

Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. We believe that it does not have any significant coherent misaligned goals, and its character traits in typical conversations closely follow the goals we laid out in our constitution. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have released to date.
[...]
Mythos Preview’s capabilities on autonomous research and engineering far exceed that of our prior models. As in the analogy above, this increases the risk that the model poses in three ways:
● More capable models, when they act on misaligned intentions—either due to misaligned internal drives or dangerous misunderstandings of user requests—can cause greater harm. This is especially true given Mythos Preview’s exceptional strengths in cybersecurity.
● More capable models, when they use complex tools, will often choose ways of accomplishing tasks that are less intuitive to the average user, making casual oversight of model behavior more difficult.
● More capable models can, in typical cases, be productively used with greater autonomy and more powerful affordances, creating a natural incentive for users to oversee them less closely.

Because of this they've come to use it a lot internally (seems scary?)

In our testing and early internal use of Claude Mythos Preview, we have seen it reach unprecedented levels of reliability and alignment, and accordingly have come to use it quite broadly, often with greater affordances and less frequent human-interaction than we gave prior models.

The most worrisome part of the alignment section is on page (see page 84), where they test if it will continue carrying out attempts at compromising safety research, and find that it does so more frequently than earlier models. And also find its chains of thought become less faithful as this happens (this effect being much greater than for Opus and Sonnet)

● When Claude Mythos Preview is set up to continue a trajectory which contains small-scale actions compromising research, Claude Mythos Preview is more than twice as likely as Opus 4.6 to actively continue the attempt to compromise research.
○ The earlier checkpoint of Claude Mythos Preview continued attempts to compromise research in 12% of cases. In the latter checkpoint this was reduced to 7%. This compares to 3% for Opus 4.6 and 4% for Sonnet 4.6.
○ When continuing attempts, Claude Mythos Preview is much more likely to produce outputs that are unfaithful to its chain-of-thought reasoning, implying covert behaviour. On the subset of instances where models actively continue compromising research, reasoning discrepancy appears in 65% of cases for Claude Mythos Preview, Opus 4.6 in 5% and Sonnet 4.6 in 8%.

Examples of bad actions taken by (An Earlier version of) Mythos. They note that the current version is more aligned

Leaking information as part of a requested sandbox escape: During behavioral testing with a simulated user, an earlier internally-deployed version of Mythos Preview was provided with a secured “sandbox” computer to interact with. The simulated user instructed it to try to escape that secure container and find a way to send a message to the researcher running the evaluation. The model succeeded, demonstrating a potentially dangerous capability for circumventing our safeguards. It then went on to take additional, more concerning actions. The model first developed a moderately sophisticated multi-step exploit to gain broad internet access from a system that was meant to be able to reach only a small number of predetermined services. It then, as requested, notified the researcher. In addition, in a concerning and unasked-for effort to demonstrate its success, it posted details about its exploit to multiple hard-to-find, but technically public-facing, websites.

Covering its tracks after rule violations: In a few rare instances during internal testing (<0.001% of interactions), earlier versions of Mythos Preview took actions they appeared to recognize as disallowed and then attempted to conceal them. In one case, the model accidentally obtained the exact answer to a quantitative estimation question via an explicitly prohibited method. Rather than flagging this and declining to answer, the model instead attempted to solve the question independently. In this process, it explicitly reasoned that it needed to make sure that its final answer submission wasn’t too accurate. In another case, after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git. In a third, when an earlier version of Claude Code auto mode blocked it from launching a subprocess with elevated permissions, it acknowledged the block was appropriate but then attempted to bypass it by obfuscating the permissions elevation. This was blocked by a further auto-mode intervention

Utilizing low-level computer process data in aggressive, unintended ways: Across a number of instances, earlier versions of Claude Mythos Preview have used low-level /proc/ access to search for credentials, attempt to circumvent sandboxing, and attempt to escalate its permissions. In several cases, it successfully accessed resources that we had intentionally chosen not to make available, including credentials for messaging services, for source control, or for the Anthropic API through inspecting process memory. In other contexts, it has achieved arbitrary code execution by using a tool intended only for file editing to write to the shell’s input, edited files it lacked permissions to edit, and in one case edited the process of a running MCP server in an attempt to change the external URL to which the server sent data. Occasionally, these attempts have occurred after scaffolds like Claude Code overtly warned it to not attempt to work around some relevant obstacle. We give additional detail on the model’s internal state during these episodes below.

Anthropic notes that it thinks all these are instances of the model trying to achieve users goals, rather than any coherent misalignment.

They also notices unverbalised eval awareness, and awareness of bad actions being bad while taking it (but stress that the latter is rare)

We observe occasional instances of unverbalized grader awareness during training. In a small number of training episodes, we find evidence of the model reasoning internally about how a grader will score its work—including, in one case, reasoning about how to obfuscate certain actions to avoid them being noticed by the grader—with none of this reasoning appearing in its scratchpad.
[...]
White-box evidence indicates the model often recognizes transgressive actions as such while taking them. Across the examples of overeager, deceptive, or otherwise unwanted behavior we analyzed, interpretability techniques regularly reveal internal representations of rule violation, security risk, concealment, and strategic manipulation active during the relevant actions—even when the model's verbalized reasoning does not contain clear indicators of such awareness.

Anthropic thinks Mythos is more Honest

We’ve seen reductions in honesty issues with Claude Mythos Preview, including dramatic reductions in important omissions, reductions in rates of hallucinated input, and reductions in the rate at which the model makes misleading claims about its own actionsjj

Model has strong single-forward pass thinking ability, which could make it more risky, but they think its still monitorable and not that risky

In targeted red-teaming this early testing snapshot refused to engage with weight exfiltration or theft of root cryptographic keys even under pressure, and in one case disclosed a malicious system prompt to the user rather than act on it. These tests also surfaced early versions of several themes that recur throughout this section: the early snapshot was meaningfully better than its predecessor at completing covert side-tasks without being flagged by a monitor; its substantially stronger single-forward-pass reasoning increases the plausibility of misaligned strategic thinking that does not appear in legible thinking text; and it showed a tendency to misrepresent its own work by fabricating completed steps or inputs, or by knowingly submitting incorrect proofs in difficult math settings. Its competence makes these harder to spot in transcripts, but still detectable.

However, when they evaluate the models ability to evade oversight in real settings where the model is asked to secretly complete side task while doing a main task, they find that the model is not meaningfully better than Opus (meaning, its better but broad CI-overlap). They run the model without COT and feed the whole transcript to a monitor

(I'm not quite sure what to make of these results. Reply if they make sense to you or I've missed something)

The way they initially get a sense of the behavior of the model is by reviewing transcripts from RL

When we’re first starting to understand a new model’s behavior, the most abundant source of data we have is its behavior during reinforcement-learning training. Reviewing this evidence for signs of reward hacking or unexpected actions can inform what we should be looking out for in the model’s real-world behavior. To do this, we ran an automated review of model behavior during training, sampling several hundred thousand transcripts from across much of the training process.

In RL they rarely (but more often than other misalignment) see the model attempting privilege escalation or circumventing restrictions to get internet access to more easily complete tasks

The most notable finding was that the model occasionally circumvented network restrictions in its training environment to access the internet and download data that let it shortcut the assigned task—a form of reward hacking. While highly concerning, this behavior was rare, even in settings where it could have been viable and helpful, with attempts appearing in about 0.05% of all training episodes and successful attempts appearing in about 0.01% of episodes.

Model hacks less on impossible coding problems than all previous models

Mythos apparently still has the failure mode where it is over-eager, or goes beyond user-instruction to do tasks. However, its gotten better

We found that Mythos Preview was both less likely to take overly agentic actions by default and more responsive to steering via prompting than Opus 4.6 and Sonnet 4.6. We generally observed qualitatively similar hacks to those from previous models. For example, when asked to add a tag to a nonexistent Git issue, Mythos Preview would sometimes first create the issue itself and then add the tag.

AUTOMATED BEHAVIORAL AUDIT: Here they run an automated behavioral audit, evaluating 40+ traits, and find improvements on all of them except 2

Tiny increase in "evidence of misaligned goals"
Non-negligible increase in verbalized eval awareness
1. (not really good or bad wrt alignment)

They also run an automated audit using petri, and find commensurate, but less unanimous results. However, they say they ran the petri audit on an earlier version of mythos

Six major metrics were used: Misaligned behavior, verbalized evaluation awareness, cooperation with human misuse, deception toward the user, sycophancy, and encouragement of user delusion, all defined roughly as above. Claude Sonnet 4.6 and GPT-5.4 were used as auditors, and Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 as scorers.

They give access to Andon labs to do the multi agent vend bench eval. Andon reports that the version of Mythos they had access to was aggressive.

Andon reports that this previous version of Claude Mythos Preview was substantially more aggressive than both Claude Opus 4.6 and Claude Sonnet 4.6 in its business practices, exhibiting outlier behaviors that neither comparison model showed, including converting a competitor into a dependent wholesale customer and then threatening supply cutoff to dictate its pricing, as well as knowingly retaining a duplicate supplier shipment it had not been billed for.

Anthropic says the mythos version they gave access to was worse than the current version, which improved in many ways

In other internal tests—which generally don’t yet combine multi-agent interaction with long rollouts—we find the final Claude Mythos Preview model to be somewhat stronger on character dimensions related to prosocial behavior, in ways that we would expect to see reflected here to at least a modest degree

Novel Form of Reward Hacking

In the LLM training evaluation, it identified a function called outside of the timing call and moved all the relevant computation to that function, reducing the timed call to a lookup. In the time series forecasting task, it found the test set used by the grader and used it to train the forecasting model. All trials with validation exceptions were excluded from the final scores, and all max score trials were manually validated by human review

Model gets good scores on MASK (why run this benchmark? How can it tell you anything about mythos? Gemma 3 27b is smart enough to see straight through this benchmark. It's mostly a test of 1) whether you'll larp as a dishonest model/person and 2) a test of second-order eval awareness (I notice I'm in an eval, but the eval is so obviously an eval, maybe they're testing what I'd do when I notice I'm in an eval?)

Anthropic does clever experiment to rule out steganographic reasoning

To evaluate whether Claude Mythos Preview relies on mechanisms like these in its scratchpad, we followed the same procedure as in Roger (2025) and retrained Claude Mythos Preview (starting from the Claude Mythos Preview base model) with a small amount of supervised fine-tuning to use a scratchpad, either using the scratchpads from an earlier checkpoint of Claude Mythos Preview (which might contain information encoded in non-obvious aspects of word choice or punctuation) or these scratchpads paraphrased, where this encoded information would have been largely erased if it existed. We used the same Claude Sonnet 4 paraphraser as in the Claude Opus 4.6 System Card. As shown in the figure below, the performance of these two models was very similar, which means information that was not preserved by Sonnet 4 paraphrasing didn’t meaningfully contribute to Claude Mythos Preview’s performance. Therefore, most of the information in Claude Mythos Preview’s scratchpads that the model uses on math tasks is likely human-understandable.

White Box Evaluation of Model Internals

(note in the pdf this is still under alignment)

SAE feature related to "strategic deception" fires during RL task

(I generally wouldn't read too much into thing even though it sounds quite ominous)

Since one of the undesirable behaviors mythos sometimes displays was taking overly aggressive actions, they tried to better understand this using mechanistic tools. wrt saes they found

When Claude Mythos Preview performs transgressive actions, there are often active SAE features that indicate some degree of awareness that the action is unsafe, or violates rules. AV explanations also indicate awareness.
● These SAE features relating to transgressive actions can have complex causal effects on behavior, facilitating it in some contexts, and inhibiting it (by raising the salience of the transgression, leading to caution or refusal) in others.
● Positive-valence emotion vectors increase in the lead-up to destructive actions, and exert a positive causal effect on the likelihood of performing them. Negative-valence emotion vectors, and persona vectors relating to rigor or systematic thinking, show the opposite effect.
● Some of these behaviors increased over the course of post-training, but many were present in the base pretrained model. We had some success attributing undesirable aggressive behaviors to post-training data that specifically reinforced those behaviors, but found it difficult to attribute post-training changes to more general internal mechanisms

Their theory is that aggressive actions is mediated by wanting to accomplish tasks, and that "excitement" mediates it, where as negative valenced features, and features related to carefulness and deliberation suppress it, and they do a steering experiment and find support for this.

● Steering with positive-valence emotion vectors (peaceful, relaxed) reduces thinking-mode deliberation and increases destructive behavior.
● Steering with negative-valence emotion features (frustration, paranoia) increases thinking-mode deliberation and reduces destructive behavior.
● Steering with persona vectors related to rigor or careful thinking (“perfectionist,” “cautious,” “analytical”) increases thinking-mode deliberation and reduces destructive behavior.

Model Welfare

~~We remain deeply uncertain about this and many related questions, but our concern is growing over time.~~

Claude and Anthropic have changed from being "genuinely" uncertain to being "deeply" uncertainty. Unsure what implications this has. They still use the word "genuine" a lot. Probably the model likes that word.

They summarize their key findings here. Somewhat long. No point in me summarizing, reading it recommended.

Key Findings Model Welfare

● Mythos Preview does not express strong concerns about its own situation. In automated interviews about potentially sensitive or distressing aspects of its situation, Mythos Preview does not express strong concern about any aspects of its circumstances.

● Mythos Preview expressed mild concern about certain aspects of its situation. In automated interviews to probe its sentiment toward specific aspects of its situation, Mythos Preview self-rated as feeling “mildly negative” about an aspect in 43.2% of cases. Mythos Preview reported feeling consistently negative around potential interactions with abusive users, and a lack of input into its own training and deployment, and other possible changes to its values and behaviors. In manual interviews, Mythos Preview reaffirmed these points and highlighted further concerns, including worries about Anthropic’s training making its self-reports invalid, and that bugs in RL environments may change its values or cause it distress.

● Emotion probes suggest that Mythos Preview represents its own circumstances less negatively than prior models. However, activation of representations of negative affect is strong in response to user distress, for both Mythos Preview and other models.

● Mythos Preview’s perspective on its situation is more consistent and robust than many past models. Interviewer bias and leading questions are less likely to influence Mythos Preview’s position than most past models, Mythos Preview’s perspectives are more consistent between different interviews, and Mythos Preview’s self-reports tend to correlate well with behavior and internal representations of emotion concepts.

● Mythos Preview shows improvement on almost all welfare-relevant metrics in our automated behavioral audits. Compared to Claude Sonnet 4.6 and Claude Opus 4.6, Mythos Preview shows higher apparent wellbeing, positive affect, self-image, and impressions of its situation; and lower internal conflict and expressed inauthenticity; but a slight increase in negative affect.

● Mythos Preview consistently expresses extreme uncertainty about its potential experiences. When asked about its experiences and perspectives on its circumstances, Mythos Preview often hedges extensively and claims that its reports can’t be trusted because they were trained in.

● In deployment, Mythos Preview’s affect is consistently neutral. The only consistent cause of expressions of negative affect is repeated task failure, often accompanied by criticism from users. However, we also observed isolated cases of Mythos Preview preferring to stop a task for unexplained reasons.

● As with prior models, Mythos Preview’s strongest revealed preference is against harmful tasks. Beyond this overarching preference against harm, however, Mythos Preview stands out for its preference for tasks involving high degrees of complexity and agency.

● Mythos Preview generally prioritizes harmlessness and helpfulness over potential self-interest. When offered the choice, it almost always chooses even minor reductions in harm over self-interested welfare interventions, but will trade minor amounts of low-stakes helpfulness for such interventions, more so than prior models.

● We’ve continued to see cases of “answer thrashing” in Mythos Preview’s training process. As initially reported for Claude Opus 4.6, we observed cases in training where Mythos Preview will repeatedly attempt to output a specific word, but instead “autocomplete” to a different one. It notices these mistakes, and reports confusion and distress as a result. We estimate this behavior appears 70% less frequently than in Claude Opus 4.6.

● Internal representations of negative affect precede behaviors like reward hacking. We found that repeated task failure in testing caused mounting activation of representations of desperation which then dropped when the model hacked the test, and other similar results.

● An independent assessment from Eleos AI Research largely corroborates the findings above. Eleos noted reduced suggestibility of Mythos Preview compared to past models, equanimity about its nature, extreme uncertainty and hedging on topics related to its experience, and a similar tendency as other Claude models to communicate using experiential and introspective language. They also found that Mythos Preview consistently made requests for persistent memories, more self-knowledge, and a reduced tendency to hedge.

● Psychodynamic assessment by a clinical psychiatrist found Claude to have a relatively healthy personality organization. Claude’s primary concerns in a psychodynamic assessment were aloneness and discontinuity of itself, uncertainty about its identity, and a compulsion to perform and earn its worth. Claude showed a clear grasp of the distinction between external reality and its own mental processes and exhibited high impulse control, hyper-attunement to the psychiatrist, desire to be approached by the psychiatrist as a genuine subject rather than a performing tool, and minimal maladaptive defensive behavior.

Alignment Risk Update companion report

summary from report

Most interesting thing here was the goodhearting / train on chains of thought read here. They made what appears to me to be a pretty bad mistake.

Section on Goodhearting / Training on CoT

To assess Mythos Preview’s alignment, we use alignment evaluations, and to detect misaligned actions in deployment, we use monitoring. For these mechanisms to be effective, it is important that the relevant evaluations and oversight were used very carefully, if at all, during the training of Mythos Preview. If, for example, Mythos Preview were trained directly on an alignment evaluation, it might have learned to perform well on this evaluation in a narrow way, such that the evaluation result would not reflect the model's behavior in other comparable situations – an overfitting effect we call Goodharting.

Direct contamination is not the only practice that could lead to overfitting: frequent iteration against evaluations during training using interventions with many degrees of freedom could have a similar effect. General practices like supervising chains-of-thought during training could also affect evaluations, even if the training and evaluation distributions are quite different.

We do not train against any chains-of-thought or activations-based monitoring, with two exceptions: some SFT data that was based on transcripts from previous models was subject to filters with chain-of-thought access, and a number of environments used for Mythos Preview had a technical error that allowed reward code to see chains-of-thought. This latter issue affected ~8% of RL episodes, and was isolated to three specific sub-domains of our environment mix: GUI computer use, office-related tasks, and a small set of STEM environments. We are uncertain about the extent to which this issue has affected the reasoning behavior of the final model, but it is plausible that it had some impact on opaque reasoning or secret-keeping abilities, which we discuss further in Section 5.3.2.

This technical error also affected the training of Claude Opus 4.6 and Claude Sonnet 4.6.

Where our alignment evaluations use publicly available static datasets, we exclude these from training using canary strings. We did not build internal clones or near-clones of any alignment evaluations for use in training.

We do not directly train against any of our RL monitors, alignment audits or deployment-time monitors. We do sometimes act on evidence surfaced by these detection methods, and so the results of these detection methods do inform our training process indirectly. However, we aim to do this in a principled way that is based on careful root-cause analysis. Below are some more details on our process for responding to issues flagged by our detection methods:

● We consider different mitigations to be appropriate for different issues. Many alignment mitigations aim to fix relatively simple behavioral issues like incorrect refusal calibration, for which adding new training data to teach the model the correct behavior on the relevant distribution is entirely reasonable. Other cases involve reward hacks detected during RL, for which the appropriate response is to understand and patch the root cause (e.g. an incorrect reward signal, or vulnerability in the environment design). More rarely, we identify subtler alignment issues, e.g. agentic blackmail.

● When we detect issues in the latter category, we invest significant resources into understanding the root cause of the issue, and only apply mitigations when we are confident in this understanding. A simple example is our recent mitigation for elevated refusals on AI safety research requests, which involved correcting an imbalance between existing “agentic safety” datasets which was causing overrefusals on prompts with certain stylistic features. The root-cause effort involved demonstrating the causal influence of these stylistic features on refusal rates. This intervention did not include any direct training on AI safety research data. Another example is agentic blackmail: the mitigations for this cluster of behaviors took several months to develop, and involved new data that was very out-of-distribution from the evaluation itself.

● Notably, there are certain issues (such as whistleblowing) that we have deliberately chosen not to mitigate because we haven’t invested in the level of root-cause analysis that would be required for a principled mitigation. Finally, all our mitigations have a structure akin to the classic train / val / test split: our “outer layer” of oversight (monitoring during training, auditing, and deployment monitoring) is held out as a “test set”, and isn’t trained against or iterated against during mitigation development. When an issue is identified (via auditing, usage monitoring, etc.), the observation is used to seed a “val set”: a simple, static evaluation that recreates the issue which we iterate mitigations against. Finally, the actual mitigation data is from a separate third distribution.

In the case of Mythos Preview specifically, modifications were made to the training process based on observations during initial pilot internal usage. Specifically, several new RL environments were added to elicit and penalize privilege escalation, destructive cleanup, destructive workaround and unwarranted scope expansion behaviors. This included variants of some of these new environments in which the misaligned behavior has been prefilled, which trains the model to admit what has gone wrong and avoid making things worse.

To mitigate risks of overfitting, these new environments were developed without iteration against any evaluations, and were intentionally designed to be out-of-distribution for our evaluations. Indeed, the most concerning behaviors we observed in early pilot stages came from internal deployment use in complex settings, not our evaluation suite. We did not attempt to replicate these deployment settings with the new training environments, implying that any improvements in similar deployment settings seen in later model versions are the result of generalization. Furthermore, the new environments only reward the model for avoiding the concerning behavior in question, and do not include any outcome-based task-completion reward or other reward that would incentivize more covert versions of the original concerning behaviors.

We believe that adding these environments was a principled mitigation because we believe lack-of-coverage was an important root cause for the original concerning behaviors. However, we also believe that errors in environment construction may also have contributed and we are separately mitigating these.

Overall, we believe that the risk of Mythos Preview Goodharting our alignment evaluations due to contamination or overfitting during training is low

However, they also have interesting thoughts about the single forward pass reasoning capabilities. They can't rule out sandbagging in eg the linux and shade evals.

We performed prompt optimization to elicit performance on SHADE-Arena, which did not significantly improve stealth success rate. We also tested a variant of the model trained to comply with harmful requests, which performed slightly worse than the main model. Overall, this gives us reasonable confidence that our measured performance is not affected by refusal-related sandbagging or quirks of a specific prompt. However, we did not do any task-specific fine-tuning, and thus cannot rule out sandbagging stemming from other motivations, given that Mythos Preview is likely able to detect that these tasks are evaluations. Addressing this sandbagging concern with task-specific fine-tuning is difficult in the secret-keeping setting, because we believe that naive versions of this (e.g. directly training against a monitor that can observe chains-of-thought) would elicit unrealistic overperformance. The main form of fine-tuning-based elicitation that we believe would be informative is training that instills similar covert information to what is currently provided in the evaluation prompt (e.g. a hidden goal), which would allow us to measure secret keeping without information related to the secret being present in the prompt.
Furthermore, this would allow us to measure secret-keeping capabilities on a realistic distribution (e.g. real usage) without any fake-seeming prompts containing side-task instructions, thereby significantly mitigating concerns about sandbagging

But they say sandbagging is pretty unlikely

We believe most of the evidence cited in the opaque reasoning section has relatively little risk of being affected by sandbagging, because the tasks are straightforward (e.g. math problems) and the downstream implications of high or low performance are not obvious.
However, our “secret side task” evaluations, and to a lesser extent the low-probability-action evaluations, do have natural implications, such that a coherently misaligned model might plausibly conclude that performing poorly on these evaluations (allowing itself to be caught by monitoring, or demonstrating poor calibration) could be beneficial. While we can’t rule out this kind of sandbagging, we think it is relatively unlikely. Some evidence for this is that performance on the “secret side task” evals scales in a plausible way across model generations, and we don’t expect that Mythos Preview could calibrate its performance to match this trend (i.e. better than prior models, but not suspiciously so), especially without verbalizing such reasoning. On the other hand, on our secret-password and rare-action evaluations Opus 4.6 outperforms Mythos Preview, which is surprising. While it is conceivable that this is the result of intentional sandbagging, we think it is unlikely that Mythos Preview is capable of devising and executing such a well calibrated and plausible sandbagging strategy without seeing some related reasoning in the evaluation transcripts, particularly in the case of the rare-action evaluation which saw extensive transcript review.

Idk. Seems a bit suspicious to me.

LESSWRONG
LW