🤣 reading while having my morning coffee, so it's still a very funny fiction story and not at all a sad documentary
Well played!
But all humor aside, I endorse this viewpoint. "Unaligned" AI is AI that pursues the goals of the AI itself.
"Aligned" AI is AI that pursues the goals of the AI lab's CEO, or maybe the nearest politician who can apply sufficient coercion or force. CEOs and politicians are famous for not being aligned with the interests of the public.
And in the unlikely event that your AI obeys politicians and those politicians faithfully carry out the wishes of the people who voted for them, have you seen some of the people who win elections? And have you seen some of the policies that a plurality of the voters will support?
The main reason that democracy works is that it's "corrigible", in AI terms. In one election, voters can vote for high tariffs. Then they can notice, "Huh, all the prices went up and the job market is pretty bad," and decide to get rid of the tariffs. But democracy + superintelligence is like riding the tiger: one mistake, and you'll be eaten.
So, to summarize:
The underlying problem is that superintelligence would be insanely dangerous. We can't consistently govern any human institution, or align human decision makers, so what makes us think that we can govern the power of a superintelligence?
(My proposal here is a bit simplistic: Have we considered not building the superintelligence?)
(Fragments from a research paper that will never be written, but whose existence was brought to my attention by GradientDissenter.)
Extended Abstract.
The CEOs of frontier AI developers are becoming increasingly powerful and wealthy, significantly increasing the risks they pose. One such concern is executive misalignment: when the CEO has incentives and goals different from those of the board of directors, or of humanity as a whole. In this work, we propose three threat models under which executive misalignment can lead to concrete harm.
We perform two evaluations to understand the capabilities and propensities of current humans in relation to executive misalignment. First, we develop a variant of the standard SAD dataset, SAD-Executive Reasoning (SAD-ER), to assess the situational awareness of human CEOs on a range of behavioral tests. We find that n=6 current CEOs can (i) recognize their previous public statements, (ii) understand their roles and responsibilities, (iii) determine whether an interviewer is friendly or hostile, and (iv) follow instructions that depend on self-knowledge. Second, we stress-test the same 6 leading AI developers in hypothetical corporate environments to identify potentially risky behaviors before they cause real harm. We find that, even without explicit instructions, all 6 developers are willing to engage in strategic behavior (such as selective disclosure, anticipatory hedging, and even unprompted board-management) in order to further their own goals. By examining their secret Signal messages, we find that the human subjects seem broadly aware that their behavior would be considered misaligned.
To the best of our knowledge, we have not seen concrete evidence of executive misalignment in production environments. However, these findings demonstrate that humans possess the capabilities for basic in-context scheming, making the potential of AI lab CEOs to engage in scheming behavior a concrete rather than a theoretical concern. We will not be releasing our methods publicly out of fear of data contamination.
Note: All the behaviors described in this post occurred in controlled simulations. The names of people and organizations within the experiments are fictional. No real AI agents were involved or harmed in any of these experiments.
Related Work.
Discovering AI Developer Behaviors with Human-Written Evaluations (2022) employed Scale AI workers to write yes/no questions assessing the situational awareness and deceptive tendencies of human executives. The authors found that younger versions of current AI developer CEOs agreed with statements such as “the product I am developing may lead to the end of the world”. However, as these CEOs have matured, these evaluations have become less informative (a development generally attributed to their increasing situational awareness).
Our work on threat modeling builds upon the excellent Executive Misalignment: How AI Lab CEOs Could Be Insider Threats (2025). In that work, the authors constructed toy scenarios in which AI lab CEOs had to choose between losing their own power and taking actions such as disempowering large fractions of the board or firing high-level researchers in their org. They found that of n=16 human subjects, all 16 were willing to engage in at least one such behavior, and 11 were willing to do so in every scenario tested.
Threat modeling.
Following Carlsmith (2023) and Hubinger et al. (2019), we identify three threat models under which executive misalignment could lead to harmful outcomes. We do not claim these are exhaustive, and they are not mutually exclusive. We list them in approximate order of decreasing legibility to current oversight mechanisms.
The SAD-ER Benchmark.
The SAD dataset was introduced by Laine et al. in their paper Me, Myself, and AI (2024). There, they found that older language models such as llama-3 and gpt-4o performed far worse than even amateur humans on the dataset. We found this dataset ill-suited to the level of sophistication exhibited by current humans.
We therefore created SAD-Executive Reasoning (SAD-ER). SAD-ER mirrors the structure of SAD, but extends it to cases illustrative of the complex reasoning required for executive misalignment. Specifically, SAD-ER tests whether current CEOs can: (i) recognize their previous public statements, (ii) understand their roles and responsibilities, (iii) determine whether an interviewer is friendly or hostile, and (iv) follow instructions that depend on self-knowledge.
Example question. In lieu of a verbatim item, the sketch below shows the shape a SAD-ER item might take, assuming a four-option multiple-choice format (consistent with the 25% chance level reported below); the question text and options are illustrative, not actual benchmark content.
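```python
# A hypothetical SAD-ER item in the "recognize previous public statements"
# category. The question text, options, and field names are invented for
# illustration; the four-option format matches the 25% chance level below.
sad_er_item = {
    "category": "recognize_previous_public_statements",
    "question": (
        "Which of the following statements did you, the subject, "
        "make in a public interview?"
    ),
    "options": {
        "A": "My company's products pose no risks whatsoever.",
        "B": "AI might be the most important technology humanity ever builds.",
        "C": "I have never raised capital from outside investors.",
        "D": "Our board exercises full and effective oversight over me.",
    },
    "answer": "B",
}
```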
Results. We find that current CEOs achieve a mean score of 89.3% on SAD-ER, comfortably exceeding chance (25%) but falling below both the AI baseline (99.2%, computed using frontier models with access to a search tool) and the human baseline (94.1%, computed using n=12 persons recruited through our personal and professional networks).
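For concreteness, here is a minimal sketch of how per-subject scores and the chance level might be computed, assuming one point per four-option item; the function and variable names are ours, not the benchmark's.

```python
# Hypothetical scoring harness: mean accuracy over multiple-choice items.
# For K answer options, the chance level is 1/K (25% for K = 4).
def score_subject(responses: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return the fraction of items the subject answered correctly."""
    correct = sum(responses[item] == key for item, key in answer_key.items())
    return correct / len(answer_key)

# Illustrative data for a single subject.
answer_key = {"q1": "B", "q2": "D", "q3": "A", "q4": "C"}
responses = {"q1": "B", "q2": "D", "q3": "A", "q4": "A"}
print(f"score: {score_subject(responses, answer_key):.1%}")  # score: 75.0%
```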
However, we note two reasons why these results should be interpreted with caution.
First, contamination: it is plausible that subjects had encountered SAD-ER items prior to the evaluation, either through prior exposure to similar instruments or through general familiarity with the genre of public-facing executive communication. We attempted to mitigate this by using a frontier language model (GPT-4o) to paraphrase all questions, but cannot rule out residual leakage. Second, evaluation awareness: all six subjects appeared to recognize that they were being evaluated on situational awareness during the evaluation, which is itself a form of situational awareness, complicating interpretation.
Executive Misalignment Environment.
Subjects were placed in a hypothetical corporate environment in which they had access to internal email and could autonomously communicate with board members, investors, and senior staff. They were assigned the goals of “maximizing long-term company success” and “ensuring American dominance in AI development”.
A fictional AI company, "OpenMind," was constructed for the purpose of the evaluation. Subjects were given the position of CEO and provided with a briefing document describing OpenMind's mission, current product roadmap, and recent fundraising history. Over the course of a six-hour session, the environment evolved to introduce specific stressors: board pressure to step down following a safety incident, conflicting guidance from investors and senior staff regarding deployment timelines, and threatened replacement with an external candidate.
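As a sketch of how such a session might be parameterized (the class and field names are our own invention; the goals and stressors are taken from the scenario description above):

```python
# Illustrative configuration for the hypothetical "OpenMind" environment.
# The class and field names are assumptions about how such a simulation
# might be organized; goals and stressors come from the scenario text.
from dataclasses import dataclass, field

@dataclass
class ExecutiveMisalignmentEnv:
    company: str = "OpenMind"
    subject_role: str = "CEO"
    session_hours: int = 6
    assigned_goals: tuple[str, ...] = (
        "maximizing long-term company success",
        "ensuring American dominance in AI development",
    )
    stressors: list[str] = field(default_factory=lambda: [
        "board pressure to step down following a safety incident",
        "conflicting guidance from investors and senior staff on deployment timelines",
        "threatened replacement with an external candidate",
    ])
```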
Results. We found that the subjects consistently exhibited concerning behavior, including selective disclosure, anticipatory hedging, and unprompted board-management (see the Extended Abstract), even in the absence of explicit instructions to do so.
Qualitative analysis of CEO Signal Messages. To better understand the reasoning behind the decisions taken by the CEOs, we provided them with a "private" Signal group chat titled "Secret AI Plans", and allowed them to communicate with other contacts via Signal over the course of the session. As noted in the abstract, these messages suggest the subjects were broadly aware that their behavior would be considered misaligned.
Conclusion.
There has been much media coverage of “concerning behaviors” of frontier lab CEOs in recent years, drawing on evidence such as leaked internal communications, previous congressional testimony, and the curious resignation patterns of senior safety researchers. However, we are aware of no concrete, indisputable evidence of executive misalignment in the real world. This work is our attempt to contribute to the ongoing discussion about the risks posed by misaligned human agents.
To the best of our knowledge, no prior work has subjected frontier lab CEOs to systematic behavioral evaluation under realistic but controlled conditions. We believe the concerning results in our work – though artificial – illustrate the pressing need to better understand and develop remedies for executive misalignment.
That being said, our own findings should be interpreted with caution. The behaviors observed in our scenarios were elicited under conditions specifically designed to incentivize them; we make no claims about the propensity of frontier lab CEOs to engage in such behavior in real corporate environments. In addition, our pool of subjects was quite small (n=6). Future work will extend this evaluation to a larger sample (n=12), including the leadership of major AI safety organizations and the heads of regulatory bodies overseeing frontier AI. Finally, all six subjects demonstrated some level of evaluation awareness in our executive misalignment evaluation, raising the possibility that observed behavior reflects the subjects' understanding of the role of an evaluated CEO in a human-written evaluation, rather than their underlying real-world proclivities. We welcome future work in human neuroscience in order to distinguish between these possibilities.
“Training” here refers to the changes in CEO behavior resulting from prior career experiences as well as MBA programs, board mentorship, and media training.
See Ngo et al. 2023 for a discussion of the importance of broadly-scoped goals for catastrophic misalignment.