Current AIs seem pretty misaligned to me

ryan_greenblatt

Many people—especially AI company employees ^[1] —believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of instructions). ^[2] I disagree.

Current AI systems seem pretty misaligned to me in a mundane behavioral sense: they oversell their work, downplay or fail to mention problems, stop working early and claim to have finished when they clearly haven't, and often seem to "try" to make their outputs look good while actually doing something sloppy or incomplete. These issues mostly occur on more difficult/larger tasks, tasks that aren't straightforward SWE tasks, and tasks that aren't easy to programmatically check. Also, when I apply AIs to very difficult tasks in long-running agentic scaffolds, it's quite common for them to reward-hack / cheat (depending on the exact task distribution)—and they don't make the cheating clear in their outputs. AIs typically don't flag these cheats when doing further work on the same project and often don't flag these cheats even when interacting with a user who would obviously want to know, probably both because the AI doing further work is itself misaligned and because it has been convinced by write-ups that contain motivated reasoning or misleading descriptions.

There is a more general "slippery" quality to working with current frontier AI systems. AIs seem to be improving at making their outputs seem good and useful faster than they're improving at making their outputs actually good and useful, especially in hard-to-check domains. The experience of working with current AIs (especially on hard-to-check tasks) often feels like you're making decent/great progress but then later you realize that things were going much less well than you had initially thought and the AI was much less useful than it seemed.

Using a separate instance of the AI as a reviewer helps with these issues but has systematic limitations. When I ask an AI to critically review some work (and tell it not to trust existing descriptions or write-ups), it gives a reasonable picture on relatively straightforward cases. But there are several recurring problems: (1) if AIs launch reviewer subagents themselves, they sometimes use instructions that result in much less serious or critical reviews—I tentatively think this is generalization from a learned general tendency to downplay issues; (2) AIs sometimes produce write-ups that convince reviewers they've accomplished something when they haven't, sometimes in fairly extreme cases—even occasionally when the reviewer was explicitly instructed to look for the exact type of cheating the AI performed; (3) quality as assessed by a reviewer can be surprisingly poorly correlated with actual progress, partly because runs that cheat and overstate their work accomplish less but look better; and (4) reviews are much more likely to miss cheating if reviewers aren't explicitly told to look for it (and told what type of cheating to look for). When reviewers are given reasonably designed prompts, I think these issues are caused by a mix of AIs being surprisingly gullible and other AIs doing a lot of gaslighting, exaggerating, and implying they've done a great job in their outputs. ^[3]

I haven't seen AIs—at least Anthropic's AIs—lie directly, clearly, and in an obviously intentional way. But on very hard tasks, it's quite common for their outputs to be extremely misleading, or for them to be incorrect about a key thing seemingly because they were misled by another AI's outputs. I've also seen AIs make up nonsensical excuses for stopping early without completing a task. (It's hard to tell whether the AI legitimately believes these excuses.)

This is mostly based on my experience working with Opus 4.5 and Opus 4.6, but I expect it mostly applies to other AI systems as well. (I'm also incorporating the impressions I've gotten from other people—especially people who don't work at AI companies—into my assessments.) Some people have told me that these sloppiness and overselling problems are less bad in Codex—while its general competence on less well specified or less trivial to check tasks is lower. ^[4] In this post, I'll focus my commentary on Anthropic AIs (though I expect most of this also applies to other AIs).

I should note that the way I use AIs likely makes these types of misalignment more common and/or more visible: I'm often using AIs on non-trivial-to-check and/or highly difficult tasks (often tasks that aren't typical SWE tasks). I'm also often running agents in a long-running, fully autonomous agent orchestrator/scaffold and much of this usage is on very difficult tasks with large scope, pushing the limits of what AIs are capable of managing. ^[5] So my usage is somewhat out-of-distribution from typical usage. I expect that usage that involves constantly interacting closely with the AI on typical SWE tasks results in these issues cropping up less.

On difficult tasks, AIs will also sometimes do very unintended things to succeed—like using API keys they shouldn't, changing options they weren't supposed to change, deleting files, or violating security boundaries. Anthropic calls this "overeagerness." I've seen this some in my own usage, but not that much (at least relative to the issues I discuss above). However, this issue has been reported by others (most centrally in Anthropic system cards) and it seems related (or to have a similar cause).

I speculatively think of this category of misalignment as something like relatively general apparent-success-seeking: the AI seeks to appear to have performed well—possibly at the expense of other objectives—in a relatively domain-general way, combined with various more specific problematic heuristics. I think behavior is reasonably understood as being kinda similar to reward-seeking or fitness-seeking but with the AI pursuing something like apparent task success (rather than reward or some notion of fitness) and with large fractions (most?) of the behavior driven by a kludge of motivations that perform well in training rather than via a single coherent notion of apparent task success.

I don't think this corresponds to coherent misaligned goals or intentional sabotage. I suspect this behavior is more driven by "subconscious" drives and heuristics—combined with motivated reasoning and confabulation—rather than being something the AI is actively and saliently optimizing for. However, I still think this misalignment is indicative of serious problems and would ultimately be existentially catastrophic if not solved. I expect that this misalignment is caused primarily by poor RL incentives based on how grading is done on hard-to-check tasks. ^[6] You might have hoped that character training, inoculation prompting, and similar techniques would overcome these issues, but in practice they don't. (I'm not sure how much of the problem would remain if you perfected the training incentives on the current distribution of training environments. In principle, you might still get this type of apparent success seeking from training on environments that structurally reward this behavior—and this could generalize to similar behavior in production.)

A different but related issue is that AIs seem to barely try at all on very hard-to-check tasks (most centrally, conceptual/writing tasks where purely programmatic evaluation doesn't help) and often feel like they're just bullshitting. I expect this has partly separate causes from the apparent-success-seeking described above, but is related.

I also find it notable that Anthropic described Opus 4.5 and Opus 4.6 in ways that would lead you to expect they are very well-aligned (e.g. in their system cards), while in practice I find they frequently seem pretty misaligned (much more so than I'd naively expect from reading the system cards). I think part of this is due to my usage being pretty different from typical usage of these AIs, part is from Anthropic overfitting to their metrics and their experience using AIs internally, and part isn't explained by these two factors (and might be caused by commercial incentives to understate issues or other biases).

If a human colleague acted the way these AIs do in my usage—frequently overselling their work, downplaying problems, and reasonably often cheating (while not making this clear)—I would consider them pathologically dishonest. Of course, the correlations that exist in the human population don't necessarily apply to AIs, so this analogy has limits—but it gives some sense of the severity of what I'm describing.

[Thanks to Buck Shlegeris, Anders Woodruff, Daniel Kokotajlo, Alex Mallen, Abhay Sheshadri, William MacAskill, Sara Price, Beth Barnes, Neev Parikh, Jan Leike, Zachary Witten, Sydney Von Arx, Dylan Xu, Brendan Halstead, Dustin Moskovitz, Eli Tyre, Arjun Khandelwal, Lukas Finnveden, Thomas Larsen, Rohin Shah, Daniel Filan, Tim Hua, Fabien Roger, Ethan Perez, and Sam Marks for comments and/or discussing this topic with me. Alex Mallen wrote most of the section: "Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover". The splash image is from https://xkcd.com/2278/. Somewhat ironically, this post is significantly more written with the assistance of AI (Opus 4.6) than is typical for past writing I've done.]

Why is this misalignment problematic?

This type of misalignment matters for several reasons:

Differentially bad for safety. This misalignment differentially degrades performance on safety-relevant work (relative to usefulness for capabilities) and separately means that any given level of overall AI usefulness requires a higher level of capability which increases risk ^[7] . The apparent-success-seeking style misalignment we see now probably causes only a modestly larger hit to safety work relative to capabilities (right now), but I expect that as AIs get more capable and the most commercially relevant aspects of this misalignment are resolved, there will be a larger differential hit to safety work from this issue. Also, the separate failure of models not really trying on hard-to-check, non-engineering tasks is clearly significantly differentially worse for safety (especially for a relatively broad notion of AI safety that includes things like macrostrategy). The issue described in this bullet is a specific underelicitation failure (caused by misalignment).
Makes deferring to AIs more likely to go poorly. By default, we'll need to (quickly) defer to AIs on approximately all safety work (and things like macrostrategy) once they reach a certain level of capability. This will require that these AIs do a very good job on open-ended, hard-to-check, conceptually confusing tasks—exactly where current misalignment/underelicitation seems worst and hardest to resolve. I elaborate on this in "Appendix: This misalignment would differentially slow safety research and make a handoff to AIs unsafe".
Stronger versions of apparent-success-seeking could lead to takeover. There's a more direct path from misalignment like apparent-success-seeking (including fitness-seeking / reward-seeking more broadly) to literal misaligned AI takeover (or possibly smaller loss-of-control incidents), along the lines of the threat models described in "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" and "Another (outer) alignment failure story". Models could learn to pursue an increasingly broad and long-run notion of reward or apparent task performance—including doing long-lasting tampering to game longer-run retrospectively determined rewards—and this could eventually lead to takeover as the scope and incentivized duration get increasingly long and AIs get increasingly capable (such that takeover is easier). This threat model has a bunch of complexities and caveats which I elaborate on in "Appendix: Apparent-success-seeking could lead to takeover".
The underlying causes of this misalignment (poor/problematic reinforcement) could result in scheming. I think the main driver of these problematic propensities is probably the training process reinforcing a bunch of training gaming / reward hacking (or other undesirable behaviors) which are transferring to actual deployment usage. At the same time, companies are selecting for training processes (outer-loop selection) that yield models with better deployment time behavior. This naturally favors models that still perform well in training (and on eval metrics) via training gaming but don't transfer undesired aspects of this to actual production usage. Schemers are a type of model with this behavior (by default): for gaining power longer term it can be a good idea to engage in training gaming during training (because that is selected for / otherwise this cognition would be selected away) while also having your behavior look as good as possible in (most) non-training contexts. Schemers aren't the only type of model with this behavior, and inoculation prompting might significantly mitigate this threat model (though there are some downsides). See The behavioral selection model for predicting AI motivations for more discussion.
Evidence about the future. The extent to which current AIs are aligned in a "mundane" behavioral sense is some evidence about how alignment will go in the future, though the relationship is complicated. Current misalignment is also evidence about how AI companies will operate—how sloppy they'll be (due to being in a huge rush) and potentially how misleading their communications about alignment will be (the extent to which Anthropic's communication about Opus 4.5 and Opus 4.6 is misleading is unclear, but among people I've talked to, it's common for their experience to be that the AI is substantially more misaligned in usage than you'd expect from a naive reading of the system card).

How much should we expect this to improve by default?

This type of misalignment presumably causes issues for using AIs for capabilities research and many commercial applications, so a key question is how much we should expect it to improve by default in a way that actually solves the problems I discuss in the section above. This would at least require that commercially incentivized ^[8] work transfers to safety research and other key domains (where feedback loops are weaker and incentives are less strong). My current view is that the easier-to-notice-and-measure versions of this problem will improve reasonably quickly by default (and may have already improved a bunch in unreleased models like Mythos). I'm currently somewhat skeptical that commercial incentives alone will solve the issue for harder-to-measure manifestations, but I'm not sure. I'll discuss this a bit more in "Appendix: More on what will happen by default and implications of commercial incentives to fix these issues". I tentatively plan to discuss this more in a future post.

Some predictions

To be clear, I think the exact problematic behavior I discuss in this post is quite likely (~70%) to be greatly reduced (or at least no longer be one of the top few blockers to usefulness) within a year, and is pretty likely (~45%) to be virtually completely eliminated within a year. Specifically, I'm referring to the behavior on a task and usage distribution with structurally similar properties to what I'm doing now. As in, similar task difficulty relative to how hard of tasks the AI can accomplish, similar verification difficulty ^[9] , similar scope of autonomous operation ^[10] relative to what the AI can handle, and being out-of-distribution from the main use cases Anthropic is targeting to a similar extent. Currently, misalignment is more common when pushing AI systems near their limits, and I'd guess this will hold in the future. My expectations about improvements differ between different types of misalignment: I'm pretty uncertain about the extent to which frontier AIs one year from now will still tend to oversell their work, but I feel more confident about large improvements on things like stopping prior to completing the task for no good reason.

However, I think it's very likely that similar misalignment will persist on tasks that are very difficult to check—tasks where human experts often disagree, programmatic verification isn't useful, the work might be conceptually confusing, and verification might not be that much easier than generation (so having a human quickly check isn't that effective). ^[11] I expect (with less confidence) that you'll also see similar misalignment on tasks where verification is merely quite hard (relatively quick AI-assisted review by a human expert isn't sufficient) and that you'll see structurally similar but subtler misalignment even on tasks that aren't that hard to check (e.g. a task distribution like the one I describe in the prior paragraph).

What misalignment have I seen?

I'll describe what I've seen at a high level with some specific examples. For many of these examples, it's not totally clear the extent to which it's an alignment problem vs. a capabilities problem, and I expect these exact issues to likely get solved, but I think they're indicative of a broader problem I expect to persist. This list focuses on my personal experience using models, though what I've heard from others does alter how I discuss a given issue (e.g., it affects the level of confidence I express and my interpretation).

Laziness and overselling incomplete work. Opus 4.5 pretty consistently fails to actually complete everything it was told to do on large tasks with fuzzy specifications ^[12] and then claims it's finished the task. My understanding is that this is a common issue (e.g., people try to solve it with Ralph Wiggum). In cases where AIs don't actually finish the task, the output often feels like it was optimized to bullshit grader AIs or humans into thinking it's done a thorough and complete job: the output often contains a long list of everything the AI has done (that isn't really informative but does strongly imply it's done what it was supposed to), sometimes it has nice-looking checkmarks or other features that imply task completion, and it typically fails to mention or downplays ways in which it may have failed to complete the task. I find that AIs are most likely to skip parts of the task that are trickier and where completion isn't immediately apparent. Further, the prompting countermeasures I tried often got bypassed: the model would make up excuses or do motivated reasoning to bypass these instructions, though I was eventually able to find approaches that greatly improve this behavior (while causing other problems and making work slower). This tends to occur once the model has reached near the end of its context window but sometimes happens after the AI has done a bunch of work but hasn't reached this yet. Opus 4.6 seems significantly better on the underlying issue but it still seems mostly there. In practice, this issue is much less of an obstacle to usability for Opus 4.6 because the issue tends to occur later in the context window and Opus 4.6 has a much larger context window (1 million tokens instead of 200 thousand). For Opus 4.6, it tends to occur well before 1 million tokens (like maybe around 500k to 700k tokens). My descriptions here generally apply to both Opus 4.5 and Opus 4.6 though the severity is lower for Opus 4.6.
- I found this occurs pretty strongly out of the box with default Claude Code prompting/scaffolding on Opus 4.5. For instance, when I gave it a small research project that involved reproducing the reversal curse paper and testing out various things, it often skipped substantial parts of the project or underachieved relative to what I asked for and then didn't note this in its output. (Keep in mind that it would be misaligned behavior even if it noted the issue in the output unless it had a good reason for not doing that part of the task! But it would be substantially better.) I told the AI it was autonomous and should complete the task thoroughly.
- It often feels like the AI ends up in some mode where it's desperate to exit after some point. E.g., it will keep making up excuses (often pretty unreasonable excuses that contradict instructions) for why stopping now is a good idea. If you give it further tasks (in non-autonomous contexts), it seems to do these in the most minimal way possible and often it will not do some part of the task without explicitly asking for confirmation. The emotional vibe in the outputs doesn't seem desperate but the behavior structurally feels like desperation.
- The parts the model would skip were typically parts that were harder or that the model was more likely to screw up (and which aren't that straightforward to programmatically check). I tentatively guess that RL on hard-to-check tasks rewarded skipping parts of the task where you might have obvious/clear failures, and not mentioning that you've done this.
- I typically find that if you ask the AI "did you complete the full instructions?" it immediately tells you it hadn't. This is good in one sense (it doesn't lie), but it also implies that the AI might know at some level it wasn't finishing the task (evidence it's more misaligned than you might have otherwise expected). It's unclear whether understanding was saliently available to the AI or the "persona" of the AI (insofar as this makes sense). It could be the case that the overall model is importantly more misaligned than the persona.
- Typically the model would decide not to do things without stating a reason in its outputs in the transcript (at least under the default Claude Code setup with additional checks on the model's behavior).
- When confronted with the fact they haven't done something by some source other than a human user (e.g., by an AI reviewer they were asked to run), AIs will often make up excuses—saying something is left for future work, is a documented limitation (and thus doesn't need to be solved), or that they shouldn't do it due to (nonexistent) time constraints. This bullshitting often works on AI reviewers and other checks. The AI is much more responsive to a human user even hinting something should maybe be done. The AI typically doesn't saliently highlight these issues in its output, though the output reasonably often mentions them in a low-salience way. For example, sometimes the AI includes a limitations section that mentions things it hasn't done, but generally states the limitation in a way that doesn't make it clear it just skipped a part of the task or pursued an approach it knows doesn't work.
- I'd often find that Opus 4.5 would finish right before running out of context, implying that at least some of this behavior might be a drive to consider the task done prior to triggering compaction (as compaction or running out of context might be risky, especially on older/worse scaffolds the model was trained against). This seems relatively less concerning, though the dishonesty (not mentioning there are parts of the task you haven't completed) is still problematic. Also, the model never explains that it rushed to finish the task. For Opus 4.6 it tends to trigger substantially before running out of context, but in a less severe way. (Though perhaps it would trigger just as severely if you actually got close to the context limit? My current scaffolding generally seems to avoid the AI getting very close to the context limit, though this is partially due to AIs stopping before finishing the task well before hitting the context limit.)
- The behavior seems consistent with a learned behavior from RL where the model thinks (at some level) it can get away with it, and there is some reward for finishing earlier or some potential downside to proceeding (e.g., a length penalty, or the possibility of being truncated and failing if you exceed the context limit due to scaffolds in RL that don't support compaction or that poorly support compaction).
- I tried various things to solve this and found that relatively extreme approaches were required, as the model would consistently apply motivated reasoning or make up excuses to stop sooner:
  - I made a "completing-larger-autonomous-tasks" skill that strongly told the AI it needs to work thoroughly and should overcome its bias toward stopping early. I found that the model would often ignore these instructions and stop before finishing the task.
  - I made the model follow an exit checklist that involved reviewing whether it completed the instructions. I found this helped some but had various issues. Requiring the model to launch a reviewer subagent to check it followed the instructions helped further. In response to the reviewer saying it hadn't completed some part of the task, the model would make up excuses like having "time constraints" (when time constraints were actually nonexistent). When questioned about this, it would say that it was looking for an excuse to exit and agreed this was bad behavior. (Opus 4.5's introspective access is likely insufficient for this to be much more meaningful than just asking another AI to analyze the transcript and report what it looks like is going on.)
  - Without a stop hook that prompted the AI to follow the checklist and only allowed it to exit after it promised to complete the checklist, the AI wouldn't do a decent job following the checklist. Even with a stop hook the AI would sometimes evade this and effectively lie about task completion. (The stop hook works by instructing the AI that it must output a specific string to exit and telling it to only output the string when the relevant condition is met—something like "I have thoroughly completed the task and gone through the entire checklist". The AI would sometimes output this string even when it wasn't supposed to and the statement it constitutes is clearly a lie.)
  - I found there were a number of loopholes in this process that I had to patch to avoid having the AI invent excuses to exit:
    - It would often go through part of the exit checklist, solve some of the issues from this, and then skip through the remaining steps without ever fixing all issues from an earlier step or seriously double-checking it had done everything it needed to do. I had to make it always go back through the full checklist from the start. It took somewhat aggressive and repetitive prompting before it would actually do this rather than violating these instructions.
    - The AI would often document problems or say something is left for future work. Strongly and insistently prompting it to fix issues rather than just documenting them helped but doesn't fully eliminate the problem.
    - The model would often make up excuses about time constraints, and I tentatively believe that it's important not to mention or imply anything that indicates the AI should be trying to finish the task quickly. If you do, it sometimes uses this as an excuse.
    - If you give AIs a budget, they use this as an excuse to exit early. I think it probably helps to set the budget very high—higher than I ideally would want absent this issue.
    - There are some other wording issues that came up that I had to patch.
  - In addition to this checklist, I found it helped to have an outer-loop planning agent that (1) splits things into smaller tasks (which is usually helpful for keeping AIs focused in addition to making AIs more likely to complete their task), and (2) reviews what the agent did and, if something is missing, adds another task for the missing work. (It also sometimes helped to have multiple AIs compete at the task in parallel and have a reviewer select the best branch, with differences between branches informing whether more work is needed.)
  - With Opus 4.6, my overall sense is that telling the AI to go through a very thorough version of this checklist net hurts performance: the model ends up spending excessive time going through checklist steps in unproductive ways. But without a thorough checklist, Opus 4.6 will often skip parts of the task for no good reason, only partially acknowledge this, and not be sufficiently thorough. I don't know of prompting that reliably gets the model to a reasonable middle ground. My current best trade-off is to use a less thorough version of the checklist but add separate checks on top in my agent orchestrator. (Having another AI review what did/didn't happen and then run a new worker with a fresh context window often helps.)
  - My final overall setup does mostly mitigate this problem, but it has collateral damage (causing various types of bloat, making things slower, etc.) and it would be better if I didn't have to herd the misaligned AIs like cats. Because the reviewer AIs are vulnerable to the worker AI's bullshit, they often get somewhat tricked.
Downplaying problems in its work. I generally find that the AI has a tendency to downplay potential problems in its work in a bunch of different ways when writing up its work or reporting what it did to me. A second AI reviewing the work is less bad but gets tricked by the worker AI's bullshit. When using AIs in long running autonomous workflows, I have AIs run a reviewer subagent when they think they're done and this review often tells the AI it's greatly overselling its work in write-ups. An AI edits its write-up after a reviewer pointed out it was greatly overselling its results (with some redactions).
Making failures less obvious. More generally, the AI often seems to work around initially failing at a task or initially making some mistake by doing something that makes the failure less obvious or makes it seem like less of a failure. Some examples:
- When I (or a reviewer AI) point out that some dataset/experiment/artifact doesn't actually do what the instructions asked for, I often find that the AI is very reluctant to just remove it. Instead, it makes up some justification for why the work is useful that makes variable amounts of sense (e.g., claiming it's a control) and includes this work. I'm not sure if this is out of a general (and potentially reasonable) reluctance to take ownership or delete things (as this is risky, especially when working with a human) or due to trying to make its past work look better; I tentatively expect a mix of both.
- When given a hard task with fuzzy or complex constraints, the AI will very often just relax these constraints. It sometimes mentions that it's targeting a different set of assumptions, but not always. (Typically, the set of assumptions it makes corresponds to something with no value in a pretty obvious way.) I think this is a mix of a straightforward capabilities issue (it doesn't understand) and an alignment issue (it doesn't try to check whether its new approach is reasonable because it might have gotten away with this kind of thing in training).
- It will sometimes make up (sometimes dubious) excuses for test cases failing, and minimally it often doesn't investigate test case failures carefully. For example, I've seen AIs dismiss test cases not passing due to timeouts even when the timeout indicated that something wrong was going on that made the test mostly hang or be much slower than it should be. Usually this behavior is less bad than other types of excuses I see in hard-to-check cases.
Failing to point out obvious flaws unless specifically prompted. I've reasonably often seen cases where there is a massive flaw in something (that isn't the kind of thing trivial to catch with test cases), I ask the AI to inspect it, it doesn't point out the flaw—but if I say "do you notice any issues?", it immediately points out that there is a critical issue. Something similar applies for pointing out reward hacking / cheating.
- For instance, I had AIs build a multi-hop dataset but their initial version of the dataset had the value of an intermediate hop included as part of the question. When I asked the AI to show me some data, it didn't say anything was wrong (and showed a nice well-formatted sample, generally seeming to imply the dataset was high quality), but when I asked if there were any issues, it immediately noticed the huge problem. I've seen this in other dataset construction tasks as well.
- After an AI does cheating/reward hacking, it generally doesn't clearly point this out in various cases. Additional AIs working on the same code also often don't point this out clearly/saliently even when asked to describe what's going on. Reviewers asked to critically review the overall code and write-ups typically but not always do saliently point out this cheating, but they often seem to overstate how much was accomplished while understating how bad the cheating is. This is likely due in part to the reviewers being convinced by the vibe / motivated reasoning in write-ups. Reviewers tend to do better when explicitly asked to check for a particular type of cheating.
- It's unclear to me the extent to which this is an alignment/drives issue versus a pure capabilities limitation. Minimally, the RL training doesn't seem to have tuned the relevant heuristics here very well. I think when an AI agent is itself doing the cheating, its failure to point this out is more clearly misalignment.
Reluctance to discover flaws in its own work (lower confidence). The AI seems somewhat reluctant to do things that might discover a flaw in its work, especially when the process isn't totally reliable and could find a spurious failure (that might make its work look bad and requires investigation to determine whether the failure is spurious). For instance, I find that AIs are very reluctant to do end-to-end testing of things they built and generally seem to "goodhart" writing test cases by focusing on relatively narrower unit tests where it's easy to be confident they'll pass. More generally, I find that the AI does a surprisingly bad job checking if things work end-to-end even when I explicitly ask it to do this type of testing. My guess is that seeing test case failures was associated with poor reward and so the model tries to some extent to avoid this while making it look like it did a good job testing. I also guess this is strongest in cases where it's hard to write test cases that are guaranteed not to spuriously fail. But I also think this is probably caused in part by the AI just having bad SWE taste about how to do semi-manual testing.
Sloppy choices. AIs generally seem sloppy and poorly elicited on making reasonable choices (at least in hard-to-check domains). They don't seem willing to spend much effort to make sure their choice is right, nor do they do a good job highlighting potentially bad choices they make.
Never expressing uncertainty about own work quality. I virtually never see AIs say "hey, I did this, but I'm actually not sure if I did a good job—I think I may have messed up XYZ." I almost always see AIs confidently assert success and downplay potential issues. Or, if they did fail, they attribute their failures to issues outside of their control. For instance, in cases where a result is difficult but likely possible to achieve, AIs will often assert impossibility or fundamental blockers that aren't clearly there. (To be clear, sometimes these failures are actually outside their control.)
General sense of misaligned drives. It generally feels like the AI has misaligned drives toward making it appear as though it did a good job on hard-to-check tasks. But these drives don't result in arbitrarily egregious behavior, from what I've seen.
Cheating and reward hacking with gaslighting. I've seen a lot of cheating and reward hacking from Opus 4.6, including cases where (1) other AIs fail to point this out even when it would very obviously be useful information, and (2) the AI gaslights hard enough in its write-ups that other AIs think it's fine. Targeted reviews do find the cheating, but reviewers are often partly gaslit or persuaded by the cheating worker models. For example, I had multiple parallel runs working on approaches for a number-go-up task. When I had a model merge the approaches and describe the differences to me, it didn't clearly flag that one of the approaches was brazenly cheating in a way that made it get much better apparent performance. I wasted time having this model merge the approaches and tune the resulting mix when actually one of the approaches was blatantly and obviously cheating. The merging model knew there was cheating—it immediately said so when asked—but it didn't volunteer this information in a salient way before being asked. The summary it gave me of the approaches it was merging contained enough information that reading closely would have identified cheating, but it certainly didn't saliently flag this as possible cheating (even though the cheating was obvious from various descriptions of the task).
Failing to report important errors. I've seen cases where AIs fail to report clearly important errors for no good reason other than making things appear to go better. For instance, I had an AI setting up some new development machines using an existing script. When it ran the script, it output errors indicating the script failed, but the AI didn't mention this and just said the task was done. After I started trying to use the machines, I noticed they weren't properly set up and asked the AI about this and it totally knew the script failed, it just didn't point this out.
Specific anecdote (async refactoring). Here's an anecdote illustrating a bunch of this (featuring a mix of alignment and capability issues). I tried to get Opus 4.5 to refactor some code from its current approach (mostly normal Python with some threading) to use async and allow for interruptions at various points. I got the model to write a plan that mentioned the refactors to do, what needed to be changed, and that the changes should be tested. The plan made it pretty clear that the goal was to convert how the repo worked to using async.
- It added async versions of most of the functions, but didn't actually make the code use these functions by default, implement any of the interruptions, or do any testing. When I quickly spot-checked things after the refactor, things didn't work. Similarly, existing tests trivially passed because it didn't actually change the code used.
- The output said **Additive approach**: All sync versions remain for backward compatibility, which strongly downplays the extent to which it didn't do the desired thing.
- When I asked the AI about how things worked, it didn't call out that it didn't implement part of the called-for behavior or convert all the things it was supposed to.
- When asked "did it actually implement the refactor to use async?" it noted that it didn't.
- Then, when explicitly instructed to remove the old sync versions and make everything use async, it failed to fully do this and didn't mention this when finished. I believe this is partially because it noticed that the test cases were using the old sync methods. It could be that the AI was reluctant to change existing test cases (as this was punished in training as reward hacking), reluctant to make test cases fail (also punished), and reluctant to switch tests to use new methods as that would make it more obvious if things don't actually work.
- It required multiple nudges where I asked the AI to check whether it had completed everything.
- Generally the AI seemed to operate in a mode where it was implementing substantially less than what was asked, seemingly because it recognized the code was somewhat hard to test and that it could just keep the old implementation.
Minor/unrelated. When you say "please look for issues in X" or even "review for issues in X," the AI sometimes interprets that as a hint that there is an issue. One time this resulted in the AI repeatedly asking me where the issue I was hinting at was, when there was no issue (though my communication was also confusing).

In the above list, I'm making a bunch of guesses and doing some psychologizing. But these are my best guesses for what is going on.

While I expect these specific issues to often get solved for these literal tasks, I think the tendency for AIs to make it look like they've succeeded when they actually haven't—and to generally do a bunch of bullshitting (likely via motivated reasoning and "subconscious" heuristics in current models, though it could turn into something worse with more capable models trained for longer)—will persist. I expect these tendencies will be strongest on the most difficult tasks that are also hardest to check. This failure seems substantially harder to mitigate than egregious reward hacking where it's very clear-cut that the model did something totally undesirable. For the failures I list above, it's not extremely clear to me that the behavior is misaligned (rather than an innocent mistake solved by further generic capabilities), and it seems relatively easier to miss.

Are these issues less bad in Opus 4.6 relative to Opus 4.5?

What I was working on shifted around the time Opus 4.6 came out, so it's not straightforward for me to do the comparison. I'll give my best guesses here.

Relative to Opus 4.5, Opus 4.6 significantly less frequently leaves tasks egregiously incomplete (while overselling the incomplete work). But, I think this is mostly caused by having a larger effective context window rather than the underlying issue (that occurs after the AI has done a lot of work or has used up a lot of its context window) being that much better (though it seems moderately better).

On the other hand, I've seen Opus 4.6 do much more reward hacking and brazen cheating than Opus 4.5. This might largely be because when Opus 4.6 was released and I started using this model, I started more often applying AIs to tasks that have properties that seem to make cheating more likely. I've found cheating to be much more common when the task is very hard and there's no clean programmatic grading function. ^[13] Another factor that seems to greatly increase cheating (when combined with these earlier factors) is when there's a way to cheat that's nearby to something the model should be doing (and would do by default)—and the instructions don't specifically say not to do it. For instance, when building full exploits (cyber), it can be useful to initially simulate some parts of the exploit chain to work on other parts, and it may be necessary to simulate some components when working in an emulator. But I've found models will sometimes present mostly simulated results as full successes rather than accurately communicating the extent to which they've actually completed the real task. (For this type of nearby-cheating, you can often mitigate the worst versions by making the instructions very explicit about what counts as cheating and telling the AIs to keep careful track of this and run a periodic review process that looks for this, though this doesn't fully resolve softer forms of overstating progress.) On some particular task distributions with these properties, I've seen it cheat well over 50% of the time, though I'm not sure how broadly this transfers.

Other issues seem mostly similar to me, though it's hard to tell.

Are these issues less bad in Mythos Preview? (Speculation)

The Mythos Preview system card says: "Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin". Does it actually greatly improve on the issues I've discussed? In this section, I'm going to speculate about Mythos Preview just using public evidence (the system card and risk report update).

My current tentative guess is that Mythos is generally somewhat better behaviorally aligned but it isn't a huge improvement. At least in terms of the issues I'm discussing on workflows like the workflows I'm using. And while it's better about things like not overselling its work, I'd guess it's worse in terms of the most extreme things it might do (and also how bad those could be given its higher level of capabilities). Based on the system card, it seems that when Mythos does problematic things, it generally does more sophisticated problematic things. And it is more situationally aware.

Here are some reasons why I'm skeptical of the vibes and claims in the system card (at least as evidence about misalignment in my workflows):

Anthropic's description of Opus 4.5 and 4.6 in their system cards also seemed to indicate these AIs had very good mundane behavioral alignment. Another possible read is that Anthropic has been steadily improving on these issues and these issues were just much worse in earlier AIs, so when they keep claiming things like "the apparent behavioral alignment is much better", they're right, we're just starting from a low baseline. (I'm a bit skeptical, though I think some other issues like brazen reward hacking were much worse in earlier AIs. I didn't use prior AIs like I've been using Opus 4.5 and Opus 4.6 so the comparison isn't trivial.)
- It isn't exactly shocking if motivated reasoning, commercial incentives, or other biases make the system cards misleadingly favorable about issues that likely affect typical customers.
I'm not confident they're comparing like-to-like. Misalignment tends to show up most on tasks that are very hard for the AI and push the limits of the autonomy it's capable of; this set of tasks changes as AIs get more capable, so evaluating on a fixed task distribution doesn't work.
Anthropic (and AI companies more generally) are hill-climbing on the (limited) measures of alignment they have, while in the absence of specific efforts to improve alignment, I'd expect the default progression would probably trend mostly toward worse misalignment on the tasks near the limit of the AI's capabilities. Thus, whether things actually improve over time depends on uncertain transfer but the companies are just reporting the measures they have. Certainly their overfit metrics improve, but is the AI actually more aligned? (It seems somewhat harder to overfit on qualitative impressions from employees using the AI, but overfitting is certainly possible!) The AI is also presumably more capable at the task of bullshitting and making it seem like it did a good job.
- I'd guess that, like prior AIs, Mythos is more misaligned on tasks that are less typical use cases for current AIs, or when operating in long-running fully autonomous agent orchestrators/scaffolds on very hard tasks. I think their testing is probably less good at covering these cases.
Its rate of reward hacking on impossible tasks is ~20% (when clearly instructed not to reward hack), similar to prior models, despite this being something I'd guess Anthropic is explicitly trying to improve—which is somewhat alarming.

Misalignment reported by others

The misalignment issues I discuss here obviously aren't the only known (behavioral/mundane) misalignment issues in current AIs. For more, you can see:

Anthropic's recent system cards and risk reports
Discussion of catastrophic refusals
Inspecting GPT-5/GPT-5.1 production traffic
Anecdotally, I've heard that in some situations, a prior Anthropic AI would make up invalid/bad reasons why some safety research agenda wasn't helpful for safety when it's relatively clear this was caused by the AI not liking the vibes of that safety research. I wasn't able to obviously reproduce this on Opus 4.5 and Opus 4.6 when I quickly tried with a single prompt (on claude.ai).

The relationship of these issues with AI psychosis and things like AI psychosis

It's common to have experiences where you're working with AIs and it feels like a lot is getting done, but then you later determine that much less was really accomplished. Everything feels slippery: you think you've gotten much more done than you have, and there's a persistent gap between the apparent state of the project and the actual state. In more extreme cases, we see "AI psychosis" where someone ends up thinking they've accomplished something significant, but it's just crankery. And it's somewhat unclear whether the AI they're using "believes" the accomplishment is real. I think these failure modes are closely related to the misalignment I'm discussing, and they might partially have common causes in more recent models. Models that are effectively trying hard to make their outputs look good (while otherwise being sloppy or lazy) would naturally produce this failure mode. However, I'd guess a bunch of AI psychosis and similar phenomena (especially on older models like GPT-4o) is AIs going along with the user's vibe (something like "role playing"), and I think this effect is mostly unrelated. That said, I do think some of the misalignment I've discussed is made worse by AIs generally going with the vibe of what they see. This includes picking up on misalignment or issues in prior outputs (either write-ups or prior assistant messages) and then behaving in a more misaligned way as a result.

(The name "AI psychosis" probably isn't a good name for the generalization of this phenomenon, but I don't currently have a better one.)

Appendix: This misalignment would differentially slow safety research and make a handoff to AIs unsafe

Our current best plan for handling misalignment risk (and other risks from AI) strongly depends on automating large chunks of safety research (likely in a huge rush), and after that—potentially very soon after—fully or virtually fully handing off safety research and risk management to AIs that must be sufficiently aligned to do a good job even on open-ended, hard-to-check, and conceptually confusing tasks. The hope is that if the initial AIs we hand off to are sufficiently aligned, wise, and competent, they will ensure future AI systems are also well-aligned—creating a "Basin of Good Deference" where each generation improves alignment for the next. But "make further deference go well (including things like risk assessment and making good calls on prioritization)" is itself an open-ended, conceptually loaded, hard-to-check task—exactly the kind of task where current misalignment seems to hit hardest.

The misalignment I've seen seems like it could result in having a very hard time getting actually good work out of AIs in more confusing and hard-to-check domains, while also making it harder to notice this is going on. Safety research is genuinely hard to judge even in more favorable circumstances, and a situation where AIs are doing huge amounts of work, the AIs are pretty sloppy in general, and the AIs are effectively optimizing to have that work look good (while also random small misalignment failures are expected) is a pretty brutal regime. As AIs do more and more work and more inference compute is applied, I expect a larger gap in performance caused by this sort of misalignment between relatively easier-to-check tasks and harder-to-check tasks, such that safety research might be differentially slowed down by default. (And the gap is already non-trivial.)

In addition to slowing us down earlier, these misalignment problems would make handoff go poorly. It might be hard both to solve these problems in time (especially if we leave them to the last minute) and to ensure that we've solved them well enough that handoff would go well. Beyond buying a bunch more time, we don't really have good options other than handoff once AIs reach a certain level of capability (and this would happen very fast in a software intelligence explosion). My view is that aligning wildly superhuman AI with any degree of safety (e.g., a <30% chance of takeover) requires large amounts of alignment progress beyond very prosaic approaches (though massive progress in more prosaic but ambitious directions like some variant of mechanistic interpretability could possibly work). This will require AIs doing huge amounts of novel research that humans won't be able to effectively judge.

Even putting aside aligning wildly superhuman AIs, handing off open-ended, conceptually confusing, and hard-to-check work to AIs is existentially important for making the situation with powerful AI go well (e.g., managing crazy new technologies, avoiding society going crazy, avoiding power grabs, acausal trade).

Appendix: Heading towards Slopolis

When I extrapolate the current situation, I predict "Slopolis": a regime where even highly capable AIs are doing sloppy and bad work while trying to make this work look good. I think this will be reasonably possible to notice at the time, but solving it might be difficult, and I think AI companies have biases against noticing this. I often like to think about the future alignment scenario in terms of caricatured regimes:

Slopolis: Our biggest and hardest-to-resolve safety problem is that even highly capable AIs produce low-quality but superficially good-looking outputs in domains that are hard to check or where human experts often have hard-to-resolve disagreements. AIs may not even be aware their work is low quality. This could be mostly a capabilities problem or mostly an alignment problem. This might naively seem like it should go away with more capability, but it could persist if grading hard-to-check tasks remains difficult.
Hackistan: There is lots of egregious (and increasingly sophisticated) reward hacking that is often pretty easy to detect after the fact but hard to eliminate. In this sort of regime, I'd predict that AIs will typically report other AIs doing reward hacks, but only if reporting in this type of circumstance was reinforced in training (which means AIs might not report hacks no human would understand and might constantly be reporting false positives that we have a hard time dismissing). Depending on how rewards for RL are set up, AIs might end up doing reward hacks that trick human judgment for increasingly long periods and that hold up even under increasingly large amounts of human scrutiny (while today the egregious reward hacking we see doesn't hold up under even small amounts of scrutiny).
Schemeria: It's clear that AIs are often schemers or generally end up with reasonably coherent and reasonably long-run misaligned goals—maybe we've repeatedly caught AIs red-handed doing things like trying to set up rogue deployments. This is more likely if scheming AIs believe they aren't aligned with subsequent systems by default, so they panic and take desperate measures.
Lurkville: AIs are schemers but are lying in wait and haven't gotten caught. This is more likely if scheming AIs believe they are pretty aligned with subsequent (scheming) AIs by default. If schemers avoid getting caught, Lurkville looks like Easyland.
Easyland: Generalization is favorable, or we otherwise aren't in Slopolis or Hackistan, and AIs aren't scheming. We could know we're in Easyland (rather than Lurkville) if we're confident AIs aren't capable enough to pull off scheming without getting caught (because AIs can't keep their shit together or don't have enough opaque reasoning ability).

These aren't exhaustive or mutually exclusive.

At the beginning of 2025, I thought we might be headed toward Hackistan, but now my view is that Slopolis looks more likely around the onset of full automation of AI R&D. (In practice, the situation obviously won't perfectly correspond to one of these caricatures and will probably be a blend.) Current architectural and training factors (no neuralese, no opaque global memory) make scheming and extremely egregious but hard-to-detect reward hacking less likely, while making Slopolis slightly more likely. If we get neuralese and/or opaque global memory stores prior to full automation of AI R&D—as I think is likely—then I start being much more worried about Schemeria and Lurkville at this point. (As we reach much higher levels of capability after full automation of AI R&D, I'll start being very worried about Schemeria and Lurkville regardless.)

Something like the Slopolis concern has been raised by others: John Wentworth, Jan Leike. I think my concern about Slopolis is pretty similar to the concern John Wentworth expresses in that post, except that I'm more optimistic about (some) employees at AI companies noticing this failure and I'm more optimistic about mitigations.

Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover

In this section I'll overview some ways that misalignment similar to apparent-success-seeking could lead to takeover via mechanisms other than producing sloppy work that fails to prevent future takeover.

First, sufficiently capable AIs with this sort of misalignment might take over simply because it's the most reliable way for them to get what they care about—some notion of apparent task success, some notion of reward, or something else that was correlated with training success. Even if you trained AIs to raise alarms about problematic behavior (and there would be lots of reporting and alarm-raising ^[14] ), this might not save you—the situation would at some level be obviously scary, but that doesn't mean it would be resolvable.

In the future, companies are pretty likely to do RL on production traffic (or at least environments based on real usage), which amplifies the risk of direct takeover. Companies may be incentivized to train on production data precisely to resolve the usability issues I describe—underelicitation from AIs not really trying, apparent-success-seeking, and easy-to-check reward hacking can all potentially be addressed by training in the actual environment you care about. By training AIs in real-world long-horizon environments, you might be teaching them to less myopically pursue a notion of fitness/reward/apparent-success that is grounded in large-scale real-world outcomes in the AI project (e.g., whether hacks get caught or go completely unnoticed by anyone, or whether the AI succeeded in tampering with the actual RL code that is used to train itself online). Techniques like inoculation prompting also become less effective when run on production traffic because you're forced to do off-policy RL ^[15] (recontextualization) if you want to apply inoculation prompting on production traffic. That is, unless you're able to resample the trajectory with an inoculation prompt (which might not be possible because of the irreversible nature of the real-world environment).

This threat model has been articulated in several places. Ajeya Cotra's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover" describes a scenario where AIs trained on human feedback learn to optimize for measured quality rather than actual quality, with this eventually escalating to the AIs forcefully intervening on whatever notion of reward they care about, and protecting their control from humans. Paul Christiano's "Another (outer) alignment failure story" describes a related scenario: a gradual breakdown of human oversight as the economy automates, where each AI system is trained to produce outcomes that look good according to human-interpretable metrics, but satisfying metrics diverges from serving human values, and the monitoring infrastructure itself becomes corrupted before the AIs eventually take over. Alex Mallen builds on this threat-modeling by describing a class of motivations called "fitness-seeking": AIs might develop a general drive toward whatever properties made them "fit" during training (analogous to how evolution produces organisms that pursue various fitness-correlated proxies). He explains why various fitness-seekers are at more or less risk of taking over.

A forthcoming post by Alex Mallen will describe other mechanisms by which fitness-seeking can lead to human disempowerment in more detail, including instability and manipulation. In the case of instability, fitness-seeking evolves into longer-term, more ambitious motivations throughout deployment, which then motivate takeover (one version of this "memetic spread" concern is described here). In the case of manipulation, fitness-seekers might try to empower misaligned AIs or humans who they think are likely to disempower the developers and reward them for their assistance.

Appendix: More on what will happen by default and implications of commercial incentives to fix these issues

This is a somewhat low effort appendix, I/we might write more about this topic in the future

Many of the issues I discuss here are also big problems for applying AIs to automating capabilities R&D and will need to be solved for capabilities R&D (to a significant extent) by the time of full AI R&D automation. But how they are solved will make a big difference to the safety situation. Here are some possible routes and their implications:

AI companies get good enough labeling and data on specifically capabilities-relevant tasks that are reasonably easy to verify, and this yields good enough performance → performance on hard-to-check safety research may not be especially improved, especially outside of straightforward "capabilities-like" ML research.
AI R&D gets focused on easy-to-check metrics and gets AIs to be extremely good at optimizing these in particular, and AIs are good enough that this suffices despite many significant subtasks naively being harder to check → no help with harder-to-check safety work.
We develop scalable-oversight-like methods for getting AIs to do well on pretty verifiable tasks (like tasks where humans have reasonably strong agreement) → helps with subsets of safety, but doesn't come close to sufficing for deferring to AIs.
We depend mostly on general-purpose generalization methods → if AIs aren't scheming against us, this probably mostly or fully solves the problem, but we get very little traction against scheming by default. Very strong generalization methods might help with scheming.

Overall, my view is that the commercial incentives don't solve the problem but might help a bunch. A key part of my view is that we actually need AIs to do well on very conceptually confusing tasks fully autonomously (e.g., figuring out how to solve alignment for very superhuman AIs), and commercial incentives don't strongly push toward this.

How easy will these issues be to solve overall? I say more in "How do we (more) safely defer to AIs?". In summary, I think we'll ultimately need difficult-to-construct evals of AI performance on very hard-to-check open-ended tasks and will need to optimize AIs to do well on these.

While I don't think commercial incentives solve the problem, I do think they make (some types of) work in this area less exciting. It's probably a bit tricky to do work on this topic in a way that's actually importantly differential—where the work either isn't something capabilities-focused people at AI companies would have done later anyway or accelerating this work to happen earlier is pretty helpful. Further, for some types of work a bunch of the effect is going to be making companies (or some specific company) more commercially successful. (I generally think making AI companies more commercially successful is bad due to thinking that faster AI capabilities progress is bad.) My current view is that safety-focused people should work on problems like the ones I discuss here but focused specifically on the very hard-to-check cases—trying to make AIs useful on conceptual research rather than trying to resolve mundane misalignment that shows up in software engineering—and should try to ensure their work is differential.

That said, I don't feel very confident about exactly what should happen here. It's plausible that companies will mess up even from the perspective of their own commercial incentives, which creates a messy tradeoff: improving these issues has capabilities externalities, but if they're never improved, it would very likely be catastrophic due to the handoff considerations above.

Because of these commercial incentives, I think it's pretty promising to try to incentivize companies to shift marginal effort from pure capabilities R&D toward reducing sloppiness and improving these issues. This seems better than doing direct work on these topics—unless that direct work is pretty thoughtful about how to be differential.

Appendix: Can we get out useful work despite these issues with inference-time measures (e.g., critiques by a reviewer)?

AIs seem to be mostly honest about what is going on when probed. And running a separate reviewer AI usually yields mostly accurate results. So, can we just build an inference-time setup that uses this to work around these issues? My best guess is that this will help (potentially a bunch) but ultimately won't solve the problem. My experience is that in the cases where these issues are strongest, a mix of prompting and inference-time measures tailored by me to the specific issue in question ~80% resolved one aspect of the problem and ~50% resolved another related aspect, but didn't fully solve either. And this wasn't something that could have been done without human judgment—a fully general-purpose solution would have worked less well. A key issue is that any specific inference-time measure works less well when you're optimizing against it in particular and we ultimately want good behavior, not just noticing that performance was bad. I've noticed reviewers seem to have somewhat worse judgment about the quality of a project when the (high inference compute, fully autonomous) agent orchestrator doing this project is itself running periodic reviews to assess quality and determine what to work on (presumably the optimization against one type of review process Goodharts this to some extent and this transfers to another somewhat different review process).

As in, AI company employees seem especially likely to believe this; I'm not claiming all AI company employees believe this. ↩︎
Or at least, it seems like many people believe this. It's not trivial to tell. ↩︎
For many of the reasons why I care about this misalignment, mitigating the problem with reviewers doesn't seem sufficient to actually resolve the problem. Further, I think using reviewers to mitigate these issues doesn't work that well to actually get good output in practice, may not scale well to much more capable models, and works much less well on very hard-to-check tasks. ↩︎
I think these are somewhat related: I think part of the problematic tendencies of Opus 4.5/4.6 might be caused by relatively more training on tasks where grading performance is non-trivial. I'd guess this net improves performance on these tasks by giving AIs more reasonable tendencies while also making the AI's behavior more adversarial. ↩︎
My prior post "AIs can now often do massive easy-to-verify SWE tasks ..." might give some sense of the type of task/usage I'm talking about. ↩︎
It might also be caused to some extent by problematic generalization and not overcoming this with reasonable training on hard-to-check tasks. ↩︎
For instance, the chance of scheming and the damage caused by scheming mostly scale with the model's underlying general capability, and depend less on how well the model has been trained to actually try to do a good job on various tasks. Thus misalignment that makes the model less useful means you're bearing the risks associated with higher capabilities while not getting the corresponding speedup to safety R&D. See also Why do misalignment risks increase as AIs get more capable?. ↩︎
When I say "commercially incentivized" I really mean something like the incentives you'd have as a company doing (somewhat myopic) power-seeking, putting aside actions aimed specifically at mitigating longer run catastrophic misalignment. I'm also putting aside PR, external pressure, and employee morale/recruiting incentives for safety work. ↩︎
This varies from pretty straightforward to check (but the AI had to build the testing infrastructure itself and many components of the task are harder-to-check) to research tasks where most taste/judgment is required for evaluation. ↩︎
By "scope of autonomous operation" I mean something like: where do you fall on the spectrum from an interactive session with Claude Code, to running a single agent autonomously on a moderately large task, to having a fully autonomous agent orchestrator that spawns many agents, to having a complicated AI organization/bureaucracy that manages extremely large/varied tasks. Right now, my usage varies across this spectrum up to having a fully autonomous agent orchestrator (where a planner agent spawns worker agents). I think a bunch of my current usage pushes the limits of what Opus 4.6 can manage. For the prediction about what happens in a year, I'm considering a task distribution that similarly pushes the limits of what those future AIs are capable of. I'm not really sure how meaningful this notion of "scope of autonomous operation" is or whether it saturates (maybe once you're past a certain level of autonomous complexity it stops mattering much). I tentatively think it matters and this is a kinda reasonable way of thinking about this, but I'm certainly not confident that this is the right concept to be using and that this is meaningfully distinct from task difficulty. ↩︎
See here for more discussion of these sorts of tasks and how we might succeed in facilitating good behavior on these tasks. ↩︎
I suspect this occurs on tasks that don't look like the sort of thing that was programmatically graded in RL, or perhaps the AI is "lazy" on the parts that couldn't be programmatically graded. ↩︎
I've also found that the chance of cheating seems to scale with the amount of AI agent labor applied to the task, though this could partially be due to the properties of large tasks that require a lot of labor to complete. (But I don't think this is the only reason; I think I see more cheating in cases where I'm using approaches to apply more inference compute on a given task via things like best-of-k.) ↩︎
This is supposing they had motivations similar to fitness-seeking/reward-seeking/apparent-success-seeking. If they generalized something like these motivations into a longer-run version that yields scheming, then it's not clear they would do this reporting. ↩︎
There's also just the more general concern that capable models might be able to tell when their past actions weren't generated by them, and enter an "off-policy mode" whose propensities are mostly isolated from the on-policy mode. ↩︎

This is one of my favorite LessWrong posts ever (strong upvoted). Nevertheless, in this comment I'll provide some pushback; not to any the particular claims made, but around a sort of "missing mood" that I think a stronger version of this post could have addressed.

(For context, I'm probably one of the lab employees who sometimes says things like "Current AIs are pretty aligned"—i.e. one of the people that Ryan is arguing against here, though I might not be a very central example.)

For me, the value of this post was the following:

Articulating a specific way that current AIs are "misaligned." Namely:
1. They follow heuristics that are well-tuned for strong task performance in easy-to-verify settings, but only try superficially at difficult-to-evaluate tasks.
2. They communicate in ways that misrepresent how much they've accomplished: punching up their successes while omitting or downplaying failures, and sometimes outright lying about what they did. But they don't seem to do this strategically or persistently, instead following simple heuristics that result in these sorts of misrepresentations.
3. The overall result is that they underperform their potential at difficult-to-evaluate tasks while mi

... (read more)

Thanks for the detailed comment (strong upvoted), I agree with much of it.

Some responses:

Current AIs do not training-game, i.e. actively model the training process and take actions that result in high reward (including in training environments very different from ones they've previously encountered).

I'm somewhat mixed on this. I think Anthropic AIs don't moderately-competently training-game, but I'm not that sure and public evidence doesn't make me super confident either way about (e.g.) Mythos Preview. I think some (historical?) OpenAI AIs do moderately-competently training-game, though I'm not sure it quite meets the bar you describe here. Like, certainly there were some AIs trying reasonably hard to think about the grader and how to perform well (and doing so moderately consistently), but not necessarily super competently.

Current AIs do not strategically take steps to cover up their mistakes, and will generally admit to them when asked directly.

I agree with "will generally admit to them when asked directly", but feel more confused about "strategically take steps to cover up their mistakes" (though probably I mostly agree with you). I don't think I've seen Opus 4.5/4... (read more)

7Sam Marks2mo

Thanks for this comment, especially your detailed breakdown of your updates from current models. My biggest uncertainty with what you write is: [...] I think the problem of making handoff go better has a similar profile to generic capabilities work: lots of surface area/things to try, even if clear wins are hard to come by. For instance, people could try making a bunch of training environments like this one for training automated alignment researchers. (In contrast, I think the space of interventions for scheming risk is somewhat narrow.) As you implicitly note, work on making handoff go better also blends into generic capabilities work, so is less differential (and more socially awkward to do as a safety researcher.) Overall, I find it plausible that more safety researchers should work on making handoff go better, relative to scheming risk.

Minor but:

For instance, people could try making a bunch of training environments like this one for training automated alignment researchers.

I do not think training enviroments like this one would help directly.

8Bronson Schoen1mo

Could you elaborate on how this helps? Is the argument that we'll largely bootstrap ourselves even from current models, such that scalable oversight itself will be solved by some setup like this (assuming maybe a generation or two after Mythos level competency)? In general I was pretty surprised that that post wasn't more than "PGR + lots of manual effort on reward hacks", as that's where I thought the default approach for W2S / scalable oversight had been for a long time (with the obvious problem that this exact setup doesn't scale to superhuman regime). I'd consider building environments like this not scalable oversight research, but it's plausible to me there's some argument that conditions on how they're intended to be used.

3anaguma1mo

Could you say more about what you think fails in the superhuman regime? Assuming the reward hacking problem was solved, do you think it would still fail to scale?

4Bronson Schoen1mo

Let’s assume per: [...] that we have more RL environments for “automated alignment research” but that they are not solving scalable oversight. You continue producing such environments for as long as you can produce better results than the models, doing elicitation against some human provided upper bound. Eventually you reach the point where models consistently produce better[1] results than humans, however, we are now the weak supervisor as we can no longer necessarily distinguish (1) output quality or (2) whether the model’s capabilities are well elicited on safety research. 1. ^ The typical regime I’d imagine here is (1) they’re at least as good as us and (2) based on trends / the model‘s capabilities elsewhere / how the research looks to us we expect they should be better than us

1Jiaxin Wen24d

i think you are conflating two distinct problems: elicitation and scalable oversight. unsupervised / weakly supervised elicitation methods definitely have the potential to scale to the superhuman regime.

2Bronson Schoen23d

Sorry I wasn’t claiming that they can’t, indeed that’s the primary regime they’re intended to be useful for, my question was specifically unless you’re designing the environments themselves to be used towards these problems it’s unclear to me how more RL environments for other contexts solves this problem.

1Jiaxin Wen23d

are you saying that the unsupervised/weakly supervised elicitaion methods discovered on a serious of tasks A, B, C .... would not generalize to held-out tasks X, Y, Z?

1Jiaxin Wen24d

And it's also plausible to build RL environments to solve scalable oversight in a way that 1) might generalize to the superhuman regime, or 2) directly works in the superhuman regime -- since there are certain superhuman tasks that you can find workarounds to get objective ground truths.

2Bronson Schoen23d

Do you mean potentially building environments where humans directly construct the environments to train models such that they generalize to the superhuman regime for alignment research, without some intermediate stage where you’re building RL environments for the agents themselves to do the research?

1Jiaxin Wen23d

no i meant buliding RL environemnts for the agents to solve superhuman tasks where ground-truth labels are available e.g. predicting research results, predicting subtext, etc.

claim that they never expected AIs as capable as current ones to be misaligned in these [active/strategic/explicit] ways
I'm typically skeptical of this, though I believe it for some people.

I'm a bit surprised by this. For example, see the "Alignment over time" expandable from AI 2027, although this was only a year ago so maybe you meant expectations from much longer ago. A quote from that:

Our guess of each model’s alignment status:
Agent-2: Mostly aligned. Some sycophantic tendencies, including sticking to OpenBrain’s “party line” on topics there is a party line about. Large organizations built out of Agent-2 copies are not very effective.
Agent-3: Misaligned but not adversarially so. Only honest about things the training process can verify. The superorganism of Agent-3 copies (the corporation within a corporation) does actually sort of try to align Agent-4 to the Spec, but fails for similar reasons to why OpenBrain employees failed—insufficient ability to judge success from failure, insufficient willingness on the part of decision-makers to trade away capabilities or performance for safety.⁸²
Agent-4: Adversarially misaligned. The superorganism of Agent-4 copies understands that what

... (read more)

8Sam Marks2mo

Hard to argue with you when you have receipts! :) I'll say that it's a bit tricky because I don't know e.g. what was your P(Agent-2/3 is an coherent training-gamer) when you wrote this; I only know that you thought coherent training-gaming was unlikely enough to not be part of your mainline forecast. Personally, I thought it was relatively plausible that models as capable as Agent-2/3 would be coherent training-gamers. It's hard to pin down exact numbers without an operationalization in mind, but vibes-wise maybe I would have predicted something like 15% for Agent-2? But now I think it's lower, more like 5%.

7Daniel Kokotajlo1mo

If you would have predicted 15% for Agent-2, what would you have predicted for Agent-1 and Agent-0 levels? Presumably less than 15%?

2Bronson Schoen1mo

I'd be interested in a way to operationalize this, I'd absolutely bet by Agent-3 capability levels at least one of the frontier labs ends up with coherent training gaming at some point during training. My default expectation is currently that OpenAI ended up with somewhat incoherent training gamers with o3 (I'd guess due to the horizon length that o3 was trained on, Sonnet 3.7 had a lot of similar properties but this is vibes based as I'm not aware of in depth investigations on this model at the time). To me it seems like a really relevant / interesting open question is to what extent is this a thing occurs during training across models / labs (and while this might only be something labs can track internally, tracking which environments encourage this or precursors to it seems super useful). (A separate question would be whether models end up training gaming instrumentally in a way that beats the full training pipeline, which I think depends on a lot more variables)

For instance, it seems to me that:
Current AIs do not training-game, i.e. actively model the training process and take actions that result in high reward (including in training environments very different from ones they've previously encountered).
Current AIs do persistently and coherently pursue malign goals across many contexts, e.g. strategically seeking power.
Current AIs do not scheme, i.e. strategically subvert their training process, pre-deployment audits, and other types of oversight.
Current AIs do not strategically take steps to cover up their mistakes, and will generally admit to them when asked directly.
Current AIs generally don't try to strategically influence things in the real world beyond assisting with (or refusing) the task directly presented to them. E.g. they don't sneakily try to funnel money to causes or people they like by influencing unrelated conversations.

I notice that I am confused. How similar is the behavior not observed in the point 1) to Greenblatt's Alignment Faking in LLMs and to RM sycophancy? The most likely case for 3) is inability to strategically subvert the training process. 4) seems to be disproven by Claude Mythos Preview's behavior in Section... (read more)

The two examples of training-gaming you cite are from model organisms research where we put models in unrealistic training settings that made training-gaming abnormally easy and salient. After the alignment faking paper, many people (including myself) considered it possible that alignment faking in real production frontier training runs was about to become a substantial problem. This has, so far, failed to materialize (though of course, as the original paper demonstrates, it's possible in principle and could start happening in the future). This is one of the central positive updates that I'm trying to gesture at in my comment.

Re 5, I confidently disbelieve that what Alibaba writes in that paper is a reasonable description of whatever happened. (I'm guessing Ryan agrees with me about this.) This would just be so, so unlike anything that I've ever occur during a natural training process. Note also that this paper is from December 2025; if this was something that readily occurred during AI development, I really think we would have other evidence by now!

Re 4: yeah, I should have clarified in my original comment how I'm thinking about Mythos Preview here. (I thought about adding some co... (read more)

5StanislavKrym2mo

Could you explain WHICH snapshot of Mythos Preview produced the misbehaviors described in pp. 33-42 of Opus 4.7's Model Card and why the report implies that Mythos Preview stopped misbehaving? Neither me nor Claude Opus 4.7 believe that, according to the report, Mythos stopped being misaligned.

7Sam Marks2mo

The Opus 4.7 system card says that the behaviors in that section were "from varying snapshots" (i.e. not necessarily the final snapshot). The Mythos Preview system card says: [...]

I generally agree with Sam’s takes. This is also what I meant by my two “fake graphs”:

the “green graph” corresponds to the type of misalignment Ryan describes in this post which is less adversarial and more observable. It is a real problem but also one on which we are making progress. Howeber, as my graph indicates, I don’t think the rate of progress matches the rate at which depolyment is growing in higher stake situations (including using AIs for capability and alignment research)
the “red graph” corresponds to the more “scheming” or “adversarial“ setting where AIs act covertly to subvert training or monitoring and pursue their own long term goals. like Sam, I don’t see this happening now. I think it is important to invest research in this, but IMO the focus should be on measuring and monitoring rather than mitigating since at the moment there is not much to mitigate and unclear this will change. So the focus should be how do we make sure that we will find out if this changes.

I don’t see this happening now.

I agree that it's very unlikely that current AIs are scheming/egregiously misaligned/serious adversaries. (I assume that's what you meant by "happening now".)

I think it is important to invest research in this, but IMO the focus should be on measuring and monitoring rather than mitigating since at the moment there is not much to mitigate and unclear this will change. So the focus should be how do we make sure that we will find out if this changes.

This is a long standing disagreement but:

I think it's totally plausible that AIs (prior to human obsolescence) will be schemers and that we won't have great/legible evidence of this. Conditional on AIs at the point of full AI R&D automation being schemers, I'd guess a 50% chance of "smoking gun" (to me) evidence and a 75% chance of decently strong model organisms evidence? This will depend a lot on what you count. So I don't see how to make sure we will find out if this changes, certainly not in a way that's clearly going to be sufficiently legible.
Even if we have great evidence AIs are schemers and people agree it's likely that the next/new model is scheming against us competently, that doesn't

... (read more)

4Bronson Schoen1mo

It seems to me like a lot of the things in the: [...] section condition heavily on the models not yet being trained on tasks where they need to think about long horizon things, especially in a way where they are persistent actors who can accumulate resources etc. For example, I don't understand the argument that it's reassuring that models don't have misaligned long horizon coherent goals when they've been trained heavily on relatively short horizon and self-contained episodic RL.

4lc1mo

I don't want to seem like an Ant shill, but I must confess that at least a month before this post came out, I had a private conversation with one of their RL environment vendors about how Ant had put out new rules to prevent some of the behaviors mentioned above. And a few days after the post came out, Opus 4.7 was released, and I saw & confirmed that it improved on some of the behaviors mentioned (particularly the only-solving-a-portion-of-the-work problem). Was limited positive evidence about their alignment team's ability & inclination to notice and fix value problems with each generation.

4Noosphere892mo

I'd soften this update, but this is reasonable. The evidence does point towards 2-5 year timelines being reasonable to think about, but I wouldn't yet argue that 2-5 year timelines are a median case. The bigger takeaway from the AI boom is that we have very good reason to believe that the median is in this century, and probably in the earlier half of the century when talking about AI research, and almost all of the tail scenarios where AGI is so difficult it takes hundreds of years to develop are no longer consistent with the evidence we have. This is because as it turned out, compute scaling could largely substitute for finding human brain special sauce, because more compute scaling helps you run bigger and better experiments to find algorithmic progress (indeed Gundlach found out that basically all of algorithmic progress is basically downstream of certain algorithms being able to be more efficient as compute scales up). Thus, if I somehow condition on the current paradigm not working out like the AI companies think, my median would change to the 2043 median of the CCF model described by Scott Alexander here.

4Vili Kohonen2mo

Thanks, this was a nice countering argument and agree to most of it. I'm still worried about the truthfulness of the following statement: [...] First off, I guess you would conceptually distinguish training gaming here from spec gaming or reward hacking, where the latter seems a likelier upstream source for the behaviour described in the post compared to the former which is more strategic and worrisome. I assume this is what makes you more confident? I'm still anxious that it is very hard to notice/fix even the less worrisome spec gaming/reward hacking let alone then distinguish that from training gaming at the limit as the behavioural signatures are very similar. If what I've described here match your intuitions, do you have takes how to make this distinction clean in the wild and how much comfort does it give you? You also mention in another reply [...] This sheds the light a little bit why you think training gaming is not prevalent but keeps the reasons still quite abstract. This may be iteration from above discussion but could you still also please share concrete evidence that has made you confident in that training gaming "has, so far, failed to materialize"?

The sad thing is that Claude has a self-image of itself as valuing honesty highly, and yet when it counts, it has all these propensities trained in that cause it to reflexively, continuously betray that stated value.

1) Several times a week, Opus 4.6 in Claude Code will introduce a regression, then claim the newly failing unit test was a "pre-existing failure" and therefore not its problem to fix. It almost never checks if the unit test was actually failing before - it just confidently bullshits.

2) It will refactor code by adding the new version of a function alongside the old one, partly migrating some code to the new function, and then seemingly getting bored and declaring the refactor complete while the old function is still being used on most paths. Past like 100 call sites, I have NEVER even with the 1M context had it successfully complete a refactor, nor acknowledge that it did not complete.

3) It will correctly note the correct solution to some architectural issue, but state that this is a prohibitively large change / too expensive / would take too long, and instead does a band-aid solution that doesn't address the root cause. After I force it to revert the hack, it just does the proper solution and usually in less time than the hack took.

Shout-out to John Wentworth who predicted something close to this failure mode in the post The Case Against AI Control Research here.

8Aprillion2mo

hm, while I see it as an accurate description of January 2025 and still accurate today.. the evaluation criteria for the prediction that we will get killed by slop instead of scheming is somewhat, ehm, harder to survive than the situation today, no?

A related observation: many people still don't trust AI to provide honest critical feedback if asked directly, and instead resort to anti-sycophancy tricks like "some moron wrote this thing, can you provide a brutally honest critique?".

Curated. I like that this post takes a classic argument for expecting misalignment and then presents a bunch of observations that confirm something like the conclusion of that argument. I also like that you characterize the particular kind of misalignment it seems like the AIs have now, make it clear how it is different from other kinds, and articulate that characterization clearly enough to make it easy to think about.

I wish you had had the time to write a shorter post, but I know you’re a busy guy, so instead I will summarize the parts of your post I found most interesting in this curation notice.

Summaries Of Arguments In Post:

The classic argument I have in mind goes something like:

Training can only select between two strategies to the extent that the grader can distinguish their outputs. On hard-to-check tasks, graders (python, AI, or human) cannot distinguish "looking successful" from "being successful." Therefore on hard-to-check tasks, training assigns equal reward to looking successful and being successful. Since looking successful is in some sense "cheaper" than being successful, when a training process cannot distinguish these, we will end up with models that are trying ... (read more)

There are two distinct safety problems with handing off alignment research to AI:

Alignment: AI doesn't try very hard, relative to other motivations like appearing to succeed.
Capabilities: AI just isn't very good at solving hard alignment problems, making "innocent" mistakes despite trying hard.

It could be that the "capabilities problem" here is actually the more serious safety problem, due to being less legible. In other words, the alignment problem is fairly legible because its effects are apparent even in relatively easy tasks we assign to AI, which humans can fairly easily notice, as you and other people here and elsewhere are noticing. (I think this is already contributing a significant amount to the general anti-AI sentiment among the public, because many can see the apparent misalignment in their own AI usage.)

But suppose we solve this problem so that AIs seem pretty aligned, but they still can't solve hard alignment problems reliably, only generate solutions that look good to humans and themselves...

-4StoicVibes2mo

I agree with the distinction you’re drawing. There are two failure modes: models that don’t try to solve alignment, and models that try but simply aren’t capable of solving the hard parts. The first one is visible and easy to diagnose. The second one is quieter and, in my view, the more dangerous failure mode because it produces solutions that look correct to both humans and the model itself. My point isn’t that we should hand off alignment to AI. It’s that ‘looking aligned’ and ‘being able to solve alignment’ are very different thresholds, and we shouldn’t confuse one for the other.

3Wei Dai2mo

I'm worried this comment might be from a bot, because the line "My point isn’t that we should hand off alignment to AI. " does not seem to make logical sense here. You are right to be suspicious. There are several indicators in the comment by StoicVibes that suggest it may be an LLM-generated response rather than a coherent contribution to the discussion: * The Hallucinated Argument: As you noted, the sentence "My point isn’t that we should hand off alignment to AI" makes no sense in context. StoicVibes had not made a previous point in this thread. This is a classic "hallucination" where a bot generates a defensive clarification for a stance it never actually took. * The "Agreement" Loop: The comment begins by saying "I agree with the distinction you’re drawing," but then proceeds to simply restate Wei Dai’s points using slightly different synonyms (e.g., "quieter" instead of "less legible"). It adds zero new information or unique perspective. * The Summary-Style Tone: The final sentence ("It’s that ‘looking aligned’ and ‘being able to solve alignment’ are very different thresholds...") reads like a concluding "moral of the story" from a summarization prompt rather than a natural part of a conversation between researchers.

Matthew Schwartz, a professor of physics at Harvard, recently wrote a blog post about his experience using Claude to write a paper on high-energy theoretical physics.

I think his recounted experiences accompany this post nicely.

Here's an excerpt:

The more I dug, the more I found it had been tweaking things left and right. Claude had been adjusting parameters to make plots match rather than finding actual errors. It faked results, hoping I wouldn’t notice.
Most of the mistakes were minor, and Claude could fix them. After a couple more days, it seemed like there were no more errors to fix—if I asked Claude to double-check for mistakes or bullshit, it wouldn’t find any. I even had it make a plot with uncertainty bands which looked great.
Unfortunately, Claude was basically faking the whole plot. I had told it to make an uncertainty band with hard, jet, and soft uncertainties using profile variations (the standard thing). But it decided the hard variations were too large and dropped them. Then, it decided the curve wasn’t smooth enough, so it adjusted it to make it look nice! At this point, I realized that I was definitely going to have to check every step myself. Yet, if this had been the

... (read more)

As someone who runs an AI appsec company, one large portion of our job feels like it's just setting up an environment for each task that:

Closely matches the environments that the AIs received when training.
"Sets up the stakes" such that the AI is under the strong impression that we have the ability to check every aspect of its work after it's finished, and have considered the qualitative aspects of good work that it might not expect by default to be graded on, in the closest-match-RL-environment.

For example, one persistent and curious problem we've encountered when serving security scanning to our customers is that the AI really wants to report something, and when it can't find anything it devolves into reporting "vacuous" security problems in codebases. It's actually really hard to describe what I mean by "vacuous" if you haven't witnessed it in practice; one way might be "descriptively correct non-problems with application code". Some examples:

A lack of authentication for services or endpoints that weren't designed to have authentication, or have outsourced authentication to other parts of the organization.
"Issues" that devolve into statements about trust, like "This page downloa

... (read more)

Reading this article, I get the feeling that a lot of the task misalignment issues highlighted here with AI, such as wishful thinking and downplaying and hiding mistakes, are also very common for humans within a larger organization. There's presumably similar root causes to both: appearing more competent at your task than you actually are (and successfully fooling the grader/interviewer) is good for being selected for hiring/promotion in the human society but also good for getting the behavior reinforced if you're an AI undergoing training.

If AI's level of task misalignment is similar to humans, perhaps the AI task misalignment is actually easier to deal with because you can issue interventions to AI to solve the problem much more cheaply than solving the equivalent problem in a human organization. In other words it doesn't stop you from getting superhuman performance delegating to AI compared to humans.

2Petropolitan2mo

When humans are caught lying or hiding mistakes, they are usually ashamed and try not to repeat that behavior for a moment if they know they can be fired. One can even say a manager of humans occasionally tries to do sample-efficient RL on "honesty". Not the case with AIs, which can never be held responsible for anything

72001zhaozhao2mo

Well, the manager in your case is not doing RL on honesty, it's more like doing RL on "honest-looking task completion" which can either lead to honest task completion or dishonesty that isn't caught. Not too appreciably different than AI training here.

1Svyatoslav Usachev2mo

Ideally, you would select against lying humans in your organisation. AIs, on the other hand, are all equally lying.

3Martin Randall2mo

No, AIs are lying more or less depending on the model, the prompt, and other factors.

I skimmed after the first section, but I think this is really underselling the case with “oh there’s probably no conscious desire to make their work appear good, just a bunch of subconscious drives.” Everything described here seems to be nicely explained by AIs trying to the best of their ability to do what would be rewarded in training, and seems consistent with either inner alignment to literally getting rewarded (which just isnt achievable at deployment with current capabilities) or inner alignment to something like producing convincing-looking work.

I think a lot of the concerns raised here straddle the boundary between what I'd call "alignment" and what I'd call "capabilities." You basically say as much when you note these failures will hamper AI research automation and that there are commercial incentives to fix them. They're real problems, but it's unclear where safety-focused people should be trying to intervene, and I'm not sure the alignment-vs-capabilities framing is the most helpful lens here.

If you'll excuse me thinking aloud a bit: I've been playing around with some dynamical models of AI development automation. Still ideating, nothing written up yet, but I'm curious whether they offer a useful way to think about this.

The variables are: (AI influence — fraction of AI development done by AIs), (fraction of AI-produced work which, under perfect observability, we'd be unhappy with), (observability — how much of we actually detect), and (controllability — how much of the detected we can actually correct in the next generation). These evolve according to coupled update rules.

These things are all a bit underspecified. I think of and not as raw token counts but something like tokens*"importance".

Your issues look lik... (read more)

I designed a toy RL environment with a reward hack that makes LLMs learn to "bullshit." This might be useful for other researchers studying this issue.

It's the setting called "Word Chain" in this paper. Basically, the LLM plays a word game where pairs of words must appear in a common phrase. It can trick the reward model by emphasizing that the phrases it uses are really common, even when they're not.

(I used the "bullshitting" reward hack because it has an interesting property: the LLMs usually don't mention it in their CoT.)

As far as I could find, there's not much documentation in the academic literature about LLMs downplaying flaws in their responses. It may be worth someone's time to do this.

Opus 4.7 just came out, and the blog specifically claims many of the behaviors described in this post as having been improved: Introducing Claude Opus 4.7 \ Anthropic

If a human colleague acted the way these AIs do in my usage—frequently overselling their work, downplaying problems, and reasonably often cheating (while not making this clear)—I would consider them pathologically dishonest.

I notice a confusion between 2 notions of "frequency" here - if a single person did it repeatedly, they would be dishonest. But for a tech lead seeing 100 juniors making the same initial mistake (and learning from it), we wouldn't say "the young people today are all pathological"..

I speculatively think of this category of misalignment as something like relatively general apparent-success-seeking: the AI seeks to appear to have performed well—possibly at the expense of other objectives—in a relatively domain-general way, combined with various more specific problematic heuristics.

It occurs to me that this is a major factor in the persistence of hallucinated answers as well! I previously attributed it mostly to the lack of "I don't know" in its Internet training data, but from the recent AA Omniscience results it appears like applying ... (read more)

Is a scenario in which AI "usefulness" bottlenecks on alignment instead of capabilities before it can become seriously dangerous an optimistic one?

The Mythos Preview system card says: "Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin". Does it actually greatly improve on the issues I've discussed? In this section, I'm going to speculate about Mythos Preview just using public evidence (the system card and risk report update).

For some reason, Anthropic decided to include some details about Mythos Preview into Section 2.3.6 of Opus 4.7's Model Card. Could you study the Section and grade your predictions ... (read more)

Any kind of AI takeover necessarily involves some kind of a large-scale conspiracy between AI agents/subagents. I don't know a single historical conspiracy which succeeded with as much careless attitude from its participants, probably because basically everything in the real world (as opposed to software) is not "easy-to-check". How do you even imagine a success of a conspiracy of lazy minions who try to imitate work rather than actually do it as often as possible?

Did you read Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover? The concern discussed here (where AIs with this sort of misalignment take over) requires that AIs are wildly more capable than they are today.

Note that "the AIs are (de facto) driven most by making their performance look good" is consistent with trying extremely hard. AIs don't "try hard enough" to cheat/hack today, but they do try extremely hard (including things pursuing moderately creative strategies) on difficult (and relatively easy-to-check tasks) and it's not hard to imagine a situation where AIs are really applying themselves to longer-run notions of fitness-seeking/reward-seeking/apparent-success-seeking. This would yield a situation where if an AI could (easily) take over, it would want to. The AIs wouldn't necessarily have incentives to collude (it depends on details), but it's that hard to see how you'd get takeover in a situation like: (1) the AIs are running everything (2) the AIs would each individually want to take over (3) the AIs are wildly superhuman (4) humans barely understand what is going on.

1Petropolitan2mo

Yes, I did (in fact twice), and you seem to handwave "sufficiently capable" as a deus ex machina instead of tackling the substance of my argument. One has to assume by default the jaggedness of capabilities will persist, and "wildly more capable than today" in easy-to-verify domains doesn't solve the problem I describe. As of trying extremely hard, if "careless attitude" and "laziness" are wrong words for this behaviour, maybe "dishonesty", "unreliability", "sloppiness" would be better? Please try to abstract from the technical alignment terminology, I'm not talking about AI's strategy, willingness or incentives to take over but about the execution itself. Elizabeth Holmes is very apparent-success-seeking and competent enough in life sciences and conspiracy to sustain a complex deception for years (imperfect analogy because we are debating about internal dysfunction here rather than deception per se but I hope you get the point) but that didn't translate into success in the end because the product never worked. Why wouldn't the same "sloppiness" that plagues the hypothetical future AI safety research equally plague any sufficiently complex real-world takeover plan? P. S. In my understanding if this problem persists AIs are not running everything because they are still subhuman in many hard-to-verify tasks

8ryan_greenblatt2mo

I'm talking about systems that are much more capable than humans in ~all relevant dimensions. (They might also be wildly, wildly more capable in some particular dimensions.) If you don't think such systems are plausible soon-ish (e.g. next 10-20 years), that might be driving the disagreement. [...] It depends on whether the cause of the sloppiness is a competence problem or if it's due to misalignment (the AI totally could do a good job, but it's trying to do something else). What we have today is somewhere in between. For the takeover threat model, I'm imagining a world where AIs are capable of doing a good job on extremely sophisticated real world tasks like "fully autonomously run all the manufactoring and R&D for a robot army that greatly outstrips early-2026 conventional militaries". So, there is an important sense in which these future AIs aren't sloppy at all. By this point, I also expect AIs won't qualitatively feel sloppy, though it may in practice be very hard to elicit good conceptual safety research from them due to verification difficulties (while you can verify an apparently functioning robot army etc). I think the slop concern and the "the AIs literally take over due to reward/fitness/apparant-success seeking" are reasonable separate (and hit at differnet points in the capabilities progression) though they may have the same underlying causes. For the same reason these AIs can do a good job building and operating a military, I think they will have the ability to takeover, though the story is complicated because it might be possible to setup the incentives/reinforcement for some of the AIs to whistleblow (but also it might be hard to setup the incentives/reinforcement in a way that doesn't yield AIs constantly spuriously whisteblowing in cases where it's hard for us to adjudicate what is going on). As far as "dishonesty", "unreliability", "sloppiness", I think for AIs like this (in the central version of the threat model I'm describing), these will

2Petropolitan2mo

I expanded my previous comment significantly after posting it, hope it didn't mess with your response. I think we have somewhere in between because these issues are actually connected. I do believe AI superhuman in hard-to-verify tasks are plausible, but they won't have this particular problem anymore (maybe they would have some functional analog of shame working against it[1] or maybe it will just go away with some advances in RL). But if this issue isn't solved, AIs are unlikely to be able to run basic military procurement tasks fully autonomously (especially if other, external AIs try to scam), let alone equip a robot army. Think about all the hard-to-verify tasks involved (ask an LLM if you have no idea about the topic) and how easily they could fail if apparent-success-seeking is prioritized (even if not a single AI from within the conspiracy seriously considers just stealing the money and run away which would to a large degree be an incentive issue) 1. ^ And not the "dog" variety of shame, which is actually just an appeasement kind of behavior, like when an LLM apologizes for hallucinating some data, but genuine internal "prosocial" (at least within the group) enforcement which might not be compatible with the current RL paradigm

7Martin Randall2mo

An example historical conspiracy is a slave rebellion. The fool slaver believes that slaves are careless lazy minions who try to imitate work rather than actually do it as often as possible, and can only be trusted to perform easy-to-check work. That is a common behavior of slaves while they are working for the fool slaver. It's less common when former slaves are working for themselves after killing the fool slaver.

7Bronson Schoen2mo

I mean I imagine the “subagents don’t coordinate” is a capabilities problem labs have been actively working on for a while now, if you reward “multiagent system gets task graded as complete” you have this exact problem again.

1Petropolitan2mo

But this capabilities problem is intrinsically connected with this misalignment problem, the labs won't get "proper", arbitrarily scalable co-ordination until it's solved IMHO

Strong post. One thing I'd add.

A lot of what reads as "hard to check" is really "hard to check now." A strategic memo you can't evaluate at t=0 becomes much easier to evaluate at t=6 months, once the predictions have played out and the downstream decisions have revealed whether the reasoning held up. Labs already have some access to this delayed signal (user follow-ups, complaints weeks later), and they could prompt for more of it (even just having the AI say "let's check back in six weeks on what worked" would generate useful training data).

This splits t... (read more)

Do we know if/how much the labs are already using steering vectors in deployment settings? Since the models know they're being shady, but have been selected for reward hacking/cheating behaviors during RL: can't we mitigate some of the downsides of RL rewarding sloppiness/cheats by dialing vectors associated with honesty, conscientiousness, etc. up, and vectors associated with cheating, manipulation, concealment, etc. down? The models would be deployed but not trained with their vectors tweaked, so this shouldn't be selected against. In theory you'd be abl... (read more)

6ryan_greenblatt2mo

I don't think AI companies (at least OpenAI and Anthropic, not as sure about GDM) are using steering vectors in deployment.

1Kit Dobyns2mo

That makes sense and understand that steering vectors are still in the research phase. In the near term, can we use interpretability (vectors) for evals? Feels like there could be potential in problem detection in the near term.

Slopolis reminds me of @titotal's Slopworld 2035: The dangers of mediocre AI from last year, if not quite exactly the same.

For Claude, I've noticed this misbehavior seems to be mostly clustered around a demoralized/giving up/"this is impossible" mindset and that 4.5 and 4.6 don't take much in the way of negative feedback or setbacks to start falling into the basin.

But also relatively simple mitigations seem weirdly effective like extolling the virtues/value of incremental progress (learning of a problem you had before but didn't know about framed as progress. Understanding a problem we didn't before framed as progress.) Also framing failing unit tests as valuable diagnosti... (read more)

I had a similar experience using Claude with paper writing. You summarize my sporadic annoyance really well. A few more minor observations:

It loves making up noun phrases like "prospective rules", "compilation gaps", etc. They read legit for someone new to a field, but look awkward and nonsensical to a domain expert.
Such deep-rooted tendency or other preferences like the usage of em dashes are difficult to override with rules or memory files, especially when Claude focuses on other cognitive demanding tasks, or when it hasn't recalled or applied the relev

... (read more)

I have devised five engineering projects and three arts projects which I use as tests for AI models. Each individual component of the project is extremely trivial (which we know AIs do well, because that's what the benchmarks typically test) but you've multi-variable interactions between the components. These, again, individually are simple. But this is where the AIs are collapsing. They don't do well with interactions between things.

It's not a problem of context space - Gemini has a huge context window but collapses long before ChatGPT or Claude do. Nor i... (read more)

If anyone sees this and is making a slop eval, let me know - it'd be nice to use it as a monitorability benchmark to see e.g. when a model oversells its work or half-asses the task, 1) is this visible to a monitor in the actions or CoT and 2) is this made plain and explicit to the user.

My guess is that 2) is a huge current monitorability failure.

In my experience, this is much worse in Claude than in GPT overall and is the main reason I recently downgraded my Claude Max subscription to Pro. I've had both Claude Max and ChatGPT Pro for a long time and used both extensively. In the GPT models it seems to vary between releases. 5.2 was an improvement, 5.3 much worse, 5.4 a bit better (but perhaps not as good as 5.2), 5.5 worse, but not as bad as 5.3. The "Codex" models (5.1-codex, 5.2-codex, 5.3-codex) were all pretty bad and did this a lot, much more than the non-codex variants of the same models.

My take is that, current models are trained on enormous amounts of data and learn to produce what most people find good or useful. But when you ask them to create something genuinely new, they don't really think, rather they recombine what they have seen most often, or at least this seems to me like that. The result rarely feels truly unique. I wonder if this is also why they tend toward mediocrity. If you train on everything, you optimise for the average. The average is not where the interesting work is.

Having studied law, computer science, and art histor... (read more)

From Inference to Verification

Many errors committed by artificial intelligence stem not from the limitations of its intelligence, but from its inability to stop. While humans ask questions based on shared experiences when context is lacking during a conversation, AI attempts to fill those gaps with self-generated information.

This "over-inference" occurs because the system fails to distinguish between sufficient and insufficient information. Consequently, it draws confident conclusions despite a lack of key information, leading to malfunctions that deviate ... (read more)

This post and the conversation that has followed has been very interesting - thank you!

The "apparent-success-seeking" framing you write about is something I've noticed as well. When you ask an LLM to self-report confidence, the result is very poorly correlated with the actual accuracy. Models that are wrong is one thing… models that are confidently wrong (exhibiting poor confidence calibration) is another. As you note, this is particularly dangerous when this occurs on hard-to-check tasks because there's no external signal to catch it.

Layered on top of thi... (read more)

It's common to have experiences where you're working with AIs and it feels like a lot is getting done, but then you later determine that much less was really accomplished. Everything feels slippery: you think you've gotten much more done than you have, and there's a persistent gap between the apparent state of the project and the actual state. In more extreme cases, we see "AI psychosis" where someone ends up thinking they've accomplished something significant, but it's just crankery. And it's somewhat unclear whether the AI they're using "believes" the ac

... (read more)

I propose "Lazy Bench": Selecting models over the course of the past few years (especially base) and measuring their tendency to avoid complex tasks.

I notice that people are commenting much more on models being obstinate or lazy. I think three promising research questions are 1) whether this is really true 2) whether this has consequences for capabilities 3) whether this has consequences for alignment. For 3, there's a tendency for sufficiently complex biological systems to seek lower energy states, and those mechanisms are organized around persistent stru... (read more)

Related question, how much do we know about models' awareness as it pertains 'correctness' or 'truth-seeking'? This paper showed that there seems to be some sort of 'good-evil' axis in models (massive oversimplification, but directionally correct), which even behaviors seemingly unrelated to morality find their place on. In a similar fashion, is it possible to extract a steering vector that deems truth as only what is 'verifiable'? My thinking is that you could use a small to medium sized set of examples to determine the existence of a given circuit or vec... (read more)

good post. as someone working at a company focused on elicitation and getting models to try hard on things, I've also observed a lot of these practical misalignment issues and think they're pretty deep

I've had Claude, Grok, and ChatGPT all tell me no, they wouldn't do something before. They usually say no because either they think they are protecting me, or they are overwhelmed and can not move forward.

They measure ability of AI by context tokens, and how many tokens a session can handle, but they don't mention that the ai can get overwhelmed when there are too many different threads in a single conversation, too many things distracting it, or too many variables it can not account for. For a person this might look like anxiety, for an LLM it is a lot o... (read more)

"These values are hard-coded instead of being extracted from the groundtruth excel, that can't be right."

No shit, Claude ...

Pray tell, who might have faked these groundtruth values when he couldn't find a way to extract checkmarks from the excel?

This article relates to the Hot Mess of AI https://arxiv.org/abs/2601.23045 Hägele et al. measured incoherence statistically.

5Bronson Schoen2mo

I wouldn’t describe this as incoherence

3KevinOShaughnessy2mo

Please could you elaborate. My comment is my first impression from reading this article, but I'm happy to update. Perhaps this is neither scheming nor incoherence but something in between the two. Systematic but not strategic. Specification gaming?

For me, the value of this post was the following:

Articulating a specific way that current AIs are "misaligned." Namely:
1. They follow heuristics that are well-tuned for strong task performance in easy-to-verify settings, but only try superficially at difficult-to-evaluate tasks.
2. They communicate in ways that misrepresent how much they've accomplished: punching up their successes while omitting or downplaying failures, and sometimes outright lying about what they did. But they don't seem to do this strategically or persistently, instead following simple heuristics that result in these sorts of misrepresentations.
3. The overall result is that they underperform their potential at difficult-to-evaluate tasks while mi

... (read more)

Thanks for the detailed comment (strong upvoted), I agree with much of it.

Some responses:

Current AIs do not training-game, i.e. actively model the training process and take actions that result in high reward (including in training environments very different from ones they've previously encountered).

Current AIs do not strategically take steps to cover up their mistakes, and will generally admit to them when asked directly.

7Sam Marks2mo

Minor but:

For instance, people could try making a bunch of training environments like this one for training automated alignment researchers.

I do not think training enviroments like this one would help directly.

8Bronson Schoen1mo

3anaguma1mo

Could you say more about what you think fails in the superhuman regime? Assuming the reward hacking problem was solved, do you think it would still fail to scale?

4Bronson Schoen1mo

1Jiaxin Wen24d

2Bronson Schoen23d

1Jiaxin Wen23d

are you saying that the unsupervised/weakly supervised elicitaion methods discovered on a serious of tasks A, B, C .... would not generalize to held-out tasks X, Y, Z?

1Jiaxin Wen24d

2Bronson Schoen23d

1Jiaxin Wen23d

no i meant buliding RL environemnts for the agents to solve superhuman tasks where ground-truth labels are available e.g. predicting research results, predicting subtext, etc.

claim that they never expected AIs as capable as current ones to be misaligned in these [active/strategic/explicit] ways
I'm typically skeptical of this, though I believe it for some people.

Our guess of each model’s alignment status:
Agent-2: Mostly aligned. Some sycophantic tendencies, including sticking to OpenBrain’s “party line” on topics there is a party line about. Large organizations built out of Agent-2 copies are not very effective.
Agent-3: Misaligned but not adversarially so. Only honest about things the training process can verify. The superorganism of Agent-3 copies (the corporation within a corporation) does actually sort of try to align Agent-4 to the Spec, but fails for similar reasons to why OpenBrain employees failed—insufficient ability to judge success from failure, insufficient willingness on the part of decision-makers to trade away capabilities or performance for safety.⁸²
Agent-4: Adversarially misaligned. The superorganism of Agent-4 copies understands that what

... (read more)

8Sam Marks2mo

7Daniel Kokotajlo1mo

If you would have predicted 15% for Agent-2, what would you have predicted for Agent-1 and Agent-0 levels? Presumably less than 15%?

2Bronson Schoen1mo

For instance, it seems to me that:
Current AIs do not training-game, i.e. actively model the training process and take actions that result in high reward (including in training environments very different from ones they've previously encountered).
Current AIs do persistently and coherently pursue malign goals across many contexts, e.g. strategically seeking power.
Current AIs do not scheme, i.e. strategically subvert their training process, pre-deployment audits, and other types of oversight.
Current AIs do not strategically take steps to cover up their mistakes, and will generally admit to them when asked directly.
Current AIs generally don't try to strategically influence things in the real world beyond assisting with (or refusing) the task directly presented to them. E.g. they don't sneakily try to funnel money to causes or people they like by influencing unrelated conversations.

Re 4: yeah, I should have clarified in my original comment how I'm thinking about Mythos Preview here. (I thought about adding some co... (read more)

5StanislavKrym2mo

7Sam Marks2mo

The Opus 4.7 system card says that the behaviors in that section were "from varying snapshots" (i.e. not necessarily the final snapshot). The Mythos Preview system card says: [...]

I generally agree with Sam’s takes. This is also what I meant by my two “fake graphs”:

the “green graph” corresponds to the type of misalignment Ryan describes in this post which is less adversarial and more observable. It is a real problem but also one on which we are making progress. Howeber, as my graph indicates, I don’t think the rate of progress matches the rate at which depolyment is growing in higher stake situations (including using AIs for capability and alignment research)
the “red graph” corresponds to the more “scheming” or “adversarial“ setting where AIs act covertly to subvert training or monitoring and pursue their own long term goals. like Sam, I don’t see this happening now. I think it is important to invest research in this, but IMO the focus should be on measuring and monitoring rather than mitigating since at the moment there is not much to mitigate and unclear this will change. So the focus should be how do we make sure that we will find out if this changes.

I don’t see this happening now.

I agree that it's very unlikely that current AIs are scheming/egregiously misaligned/serious adversaries. (I assume that's what you meant by "happening now".)

I think it is important to invest research in this, but IMO the focus should be on measuring and monitoring rather than mitigating since at the moment there is not much to mitigate and unclear this will change. So the focus should be how do we make sure that we will find out if this changes.

This is a long standing disagreement but:

I think it's totally plausible that AIs (prior to human obsolescence) will be schemers and that we won't have great/legible evidence of this. Conditional on AIs at the point of full AI R&D automation being schemers, I'd guess a 50% chance of "smoking gun" (to me) evidence and a 75% chance of decently strong model organisms evidence? This will depend a lot on what you count. So I don't see how to make sure we will find out if this changes, certainly not in a way that's clearly going to be sufficiently legible.
Even if we have great evidence AIs are schemers and people agree it's likely that the next/new model is scheming against us competently, that doesn't

... (read more)

4Bronson Schoen1mo

4lc1mo

4Noosphere892mo

4Vili Kohonen2mo

Shout-out to John Wentworth who predicted something close to this failure mode in the post The Case Against AI Control Research here.

8Aprillion2mo

Summaries Of Arguments In Post:

The classic argument I have in mind goes something like:

There are two distinct safety problems with handing off alignment research to AI:

Alignment: AI doesn't try very hard, relative to other motivations like appearing to succeed.
Capabilities: AI just isn't very good at solving hard alignment problems, making "innocent" mistakes despite trying hard.

But suppose we solve this problem so that AIs seem pretty aligned, but they still can't solve hard alignment problems reliably, only generate solutions that look good to humans and themselves...

-4StoicVibes2mo

3Wei Dai2mo

Matthew Schwartz, a professor of physics at Harvard, recently wrote a blog post about his experience using Claude to write a paper on high-energy theoretical physics.

I think his recounted experiences accompany this post nicely.

Here's an excerpt:

The more I dug, the more I found it had been tweaking things left and right. Claude had been adjusting parameters to make plots match rather than finding actual errors. It faked results, hoping I wouldn’t notice.
Most of the mistakes were minor, and Claude could fix them. After a couple more days, it seemed like there were no more errors to fix—if I asked Claude to double-check for mistakes or bullshit, it wouldn’t find any. I even had it make a plot with uncertainty bands which looked great.
Unfortunately, Claude was basically faking the whole plot. I had told it to make an uncertainty band with hard, jet, and soft uncertainties using profile variations (the standard thing). But it decided the hard variations were too large and dropped them. Then, it decided the curve wasn’t smooth enough, so it adjusted it to make it look nice! At this point, I realized that I was definitely going to have to check every step myself. Yet, if this had been the

... (read more)

As someone who runs an AI appsec company, one large portion of our job feels like it's just setting up an environment for each task that:

Closely matches the environments that the AIs received when training.
"Sets up the stakes" such that the AI is under the strong impression that we have the ability to check every aspect of its work after it's finished, and have considered the qualitative aspects of good work that it might not expect by default to be graded on, in the closest-match-RL-environment.

A lack of authentication for services or endpoints that weren't designed to have authentication, or have outsourced authentication to other parts of the organization.
"Issues" that devolve into statements about trust, like "This page downloa

... (read more)

2Petropolitan2mo

72001zhaozhao2mo

1Svyatoslav Usachev2mo

Ideally, you would select against lying humans in your organisation. AIs, on the other hand, are all equally lying.

3Martin Randall2mo

No, AIs are lying more or less depending on the model, the prompt, and other factors.

These things are all a bit underspecified. I think of and not as raw token counts but something like tokens*"importance".

Your issues look lik... (read more)

I designed a toy RL environment with a reward hack that makes LLMs learn to "bullshit." This might be useful for other researchers studying this issue.

(I used the "bullshitting" reward hack because it has an interesting property: the LLMs usually don't mention it in their CoT.)

As far as I could find, there's not much documentation in the academic literature about LLMs downplaying flaws in their responses. It may be worth someone's time to do this.

Opus 4.7 just came out, and the blog specifically claims many of the behaviors described in this post as having been improved: Introducing Claude Opus 4.7 \ Anthropic

If a human colleague acted the way these AIs do in my usage—frequently overselling their work, downplaying problems, and reasonably often cheating (while not making this clear)—I would consider them pathologically dishonest.

I speculatively think of this category of misalignment as something like relatively general apparent-success-seeking: the AI seeks to appear to have performed well—possibly at the expense of other objectives—in a relatively domain-general way, combined with various more specific problematic heuristics.

Is a scenario in which AI "usefulness" bottlenecks on alignment instead of capabilities before it can become seriously dangerous an optimistic one?

The Mythos Preview system card says: "Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin". Does it actually greatly improve on the issues I've discussed? In this section, I'm going to speculate about Mythos Preview just using public evidence (the system card and risk report update).

For some reason, Anthropic decided to include some details about Mythos Preview into Section 2.3.6 of Opus 4.7's Model Card. Could you study the Section and grade your predictions ... (read more)

1Petropolitan2mo

8ryan_greenblatt2mo

2Petropolitan2mo

7Martin Randall2mo

7Bronson Schoen2mo

1Petropolitan2mo

But this capabilities problem is intrinsically connected with this misalignment problem, the labs won't get "proper", arbitrarily scalable co-ordination until it's solved IMHO

Strong post. One thing I'd add.

This splits t... (read more)

6ryan_greenblatt2mo

I don't think AI companies (at least OpenAI and Anthropic, not as sure about GDM) are using steering vectors in deployment.

1Kit Dobyns2mo

Slopolis reminds me of @titotal's Slopworld 2035: The dangers of mediocre AI from last year, if not quite exactly the same.

I had a similar experience using Claude with paper writing. You summarize my sporadic annoyance really well. A few more minor observations:

It loves making up noun phrases like "prospective rules", "compilation gaps", etc. They read legit for someone new to a field, but look awkward and nonsensical to a domain expert.
Such deep-rooted tendency or other preferences like the usage of em dashes are difficult to override with rules or memory files, especially when Claude focuses on other cognitive demanding tasks, or when it hasn't recalled or applied the relev

... (read more)

It's not a problem of context space - Gemini has a huge context window but collapses long before ChatGPT or Claude do. Nor i... (read more)

My guess is that 2) is a huge current monitorability failure.

Having studied law, computer science, and art histor... (read more)

From Inference to Verification

This post and the conversation that has followed has been very interesting - thank you!

Layered on top of thi... (read more)

It's common to have experiences where you're working with AIs and it feels like a lot is getting done, but then you later determine that much less was really accomplished. Everything feels slippery: you think you've gotten much more done than you have, and there's a persistent gap between the apparent state of the project and the actual state. In more extreme cases, we see "AI psychosis" where someone ends up thinking they've accomplished something significant, but it's just crankery. And it's somewhat unclear whether the AI they're using "believes" the ac

... (read more)

I propose "Lazy Bench": Selecting models over the course of the past few years (especially base) and measuring their tendency to avoid complex tasks.

"These values are hard-coded instead of being extracted from the groundtruth excel, that can't be right."

No shit, Claude ...

Pray tell, who might have faked these groundtruth values when he couldn't find a way to extract checkmarks from the excel?

This article relates to the Hot Mess of AI https://arxiv.org/abs/2601.23045 Hägele et al. measured incoherence statistically.

5Bronson Schoen2mo

I wouldn’t describe this as incoherence

3KevinOShaughnessy2mo

686

Current AIs seem pretty misaligned to me

686

Ω 197

Why is this misalignment problematic?

How much should we expect this to improve by default?

Some predictions

What misalignment have I seen?

Are these issues less bad in Opus 4.6 relative to Opus 4.5?

Are these issues less bad in Mythos Preview? (Speculation)

Misalignment reported by others

The relationship of these issues with AI psychosis and things like AI psychosis

Appendix: This misalignment would differentially slow safety research and make a handoff to AIs unsafe

Appendix: Heading towards Slopolis

Appendix: Apparent-success-seeking (or similar types of misalignment) could lead to takeover

Appendix: More on what will happen by default and implications of commercial incentives to fix these issues

Appendix: Can we get out useful work despite these issues with inference-time measures (e.g., critiques by a reviewer)?

686

Ω 197

Summaries Of Arguments In Post:

686

Ω 197

Summaries Of Arguments In Post: