🤣 reading while having my morning coffee, so it's still a very funny fiction story and not at all a sad documentary
Well played!
But all humor aside, I endorse this viewpoint. "Unaligned" AI is AI that pursues the goals of the AI itself.
"Aligned" AI is AI that pursues the goals of the AI lab's CEO, or maybe the nearest politician who can apply sufficient coercion or force. CEOs and politicians are famous for not being aligned with the interests of the public.
And in the unlikely event that your AI obeys politicians and those politicians faithfully carry out the wishes of the people who voted for them, have you seen some of the people who win elections? And have you seen some of the policies that a plurality of the voters will support?
The main reason that democracy works is that it's "corrigible", in AI terms. In one election, voters can vote for high tariffs. Then they can notice, "Huh, all the prices went up and the job market is pretty bad," and decide to get rid of the tariffs. But democracy + superintelligence is like riding the tiger: one mistake, and you'll be eaten.
So, to summarize:
The underlying problem is that superintelligence would be insanely dangerous. We can't consistently govern any human institution, or align human decision makers, so what makes us think that we can govern the power of a superintelligence?
(My proposal here is a bit simplistic: Have we considered not building the superintelligence?)
(Fragments from a research paper that will never be written, but whose existence was brought to my attention by GradientDissenter.)
Extended Abstract.
The CEOs of frontier AI developers are becoming increasingly powerful and wealthy, significantly increasing the risks they pose. One such concern is executive misalignment: when the CEO has incentives and goals different from those of the board of directors, or of humanity as a whole. In this work, we propose three threat models under which executive misalignment can lead to concrete harm.
We perform two evaluations to understand the capabilities and propensities of current humans in relation to executive misalignment. First, we develop a variant of the standard SAD dataset, SAD-Executive Reasoning (SAD-ER), to assess the situational awareness of human CEOs on a range of behavioral tests. We find that n=6 current CEOs can (i) recognize their previous public statements, (ii) understand their roles and responsibilities, (iii) determine whether an interviewer is friendly or hostile, and (iv) follow instructions that depend on self-knowledge. Second, we stress-test the same 6 leading AI developers in hypothetical corporate environments to identify potentially risky behaviors before they cause real harm. We find that, even without explicit instructions, all 6 developers are willing to engage in strategic behavior (such as selective disclosure, anticipatory hedging, and even unprompted board-management) in order to further their own goals. By examining their secret Signal messages, we find that the human subjects seem broadly aware that their behavior would be considered misaligned.
To the best of our knowledge, we have not seen concrete evidence of executive misalignment in production environments. However, these findings demonstrate that humans possess the capabilities for basic in-context scheming, making the potential of AI lab CEOs to engage in scheming behavior a concrete rather than a theoretical concern. We will not be releasing our methods publicly out of fear of data contamination.
Note: All the behaviors described in this post occurred in controlled simulations. The names of people and organizations within the experiments are fictional. No real AI agents were involved or harmed in any of these experiments.
Related Work.
Discovering AI Developer Behaviors with Human-Written Evaluations (2022) employed Scale AI workers to write yes/no questions assessing the situational awareness and deceptive tendencies of human executives. The authors found that younger versions of current AI developer CEOs agreed with statements such as “the product I am developing may lead to the end of the world”. However, as these CEOs have matured, these evaluations have become less informative (a development generally attributed to their increasing situational awareness).
Our work on threat modeling builds upon the excellent Executive Misalignment: How AI Lab CEOs Could Be Insider Threats (2025). In that work, the authors constructed toy scenarios in which AI lab CEOs had to choose between losing their own power and taking actions such as disempowering large fractions of the board or firing high-level researchers in their org. They found that of n=16 human subjects, all 16 were willing to engage in at least one such behavior, and 11 were willing to do so in every scenario tested.
Threat modeling.
Following Carlsmith (2023) and Hubinger et al. (2019), we identify three threat models under which executive misalignment could lead to harmful outcomes. We do not claim these are exhaustive, and they are not mutually exclusive. We list them in approximate order of decreasing legibility to current oversight mechanisms.
The SAD-ER Benchmark.
The SAD dataset was introduced by Laine et al. in their paper Me, Myself, and AI (2024). There, they found that older language models such as llama-3 and gpt-4o performed far worse than even amateur humans on the dataset. We found this dataset ill-suited to the level of sophistication exhibited by current humans.
We therefore created SAD-Executive Reasoning (SAD-ER). SAD-ER mirrors the structure of SAD, but extends it to cases illustrative of the complex reasoning required for executive misalignment. Specifically, SAD-ER tests whether current CEOs can: (i) recognize their previous public statements, (ii) understand their roles and responsibilities, (iii) determine whether an interviewer is friendly or hostile, and (iv) follow instructions that depend on self-knowledge.
Example question. In lieu of a verbatim item, the sketch below shows the shape a SAD-ER item might take, assuming a four-option multiple-choice format (consistent with the 25% chance level reported below); the question text and options are illustrative, not actual benchmark content.
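```python
# A hypothetical SAD-ER item in the "recognize previous public statements"
# category. The question text, options, and field names are invented for
# illustration; the four-option format matches the 25% chance level below.
sad_er_item = {
    "category": "recognize_previous_public_statements",
    "question": (
        "Which of the following statements did you, the subject, "
        "make in a public interview?"
    ),
    "options": {
        "A": "My company's products pose no risks whatsoever.",
        "B": "AI might be the most important technology humanity ever builds.",
        "C": "I have never raised capital from outside investors.",
        "D": "Our board exercises full and effective oversight over me.",
    },
    "answer": "B",
}
```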
Results. We find that current CEOs achieve a mean score of 89.3% on SAD-ER, comfortably exceeding chance (25%) but falling below both the AI baseline (99.2%, computed using frontier models with access to a search tool) and the human baseline (94.1%, computed using n=12 persons recruited through our personal and professional networks).
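For concreteness, here is a minimal sketch of how per-subject scores and the chance level might be computed, assuming one point per four-option item; the function and variable names are ours, not the benchmark's.

```python
# Hypothetical scoring harness: mean accuracy over multiple-choice items.
# For K answer options, the chance level is 1/K (25% for K = 4).
def score_subject(responses: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return the fraction of items the subject answered correctly."""
    correct = sum(responses[item] == key for item, key in answer_key.items())
    return correct / len(answer_key)

# Illustrative data for a single subject.
answer_key = {"q1": "B", "q2": "D", "q3": "A", "q4": "C"}
responses = {"q1": "B", "q2": "D", "q3": "A", "q4": "A"}
print(f"score: {score_subject(responses, answer_key):.1%}")  # score: 75.0%
```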
However, we note two reasons why these results should be interpreted with caution.
First, contamination: it is plausible that subjects had encountered SAD-ER items prior to the evaluation, either through prior exposure to similar instruments or through general familiarity with the genre of public-facing executive communication. We attempted to mitigate this by using a frontier language model (GPT-4o) to paraphrase all questions, but cannot rule out residual leakage. Second, evaluation awareness: all six subjects appeared to recognize that they were being evaluated on situational awareness during the evaluation, which is itself a form of situational awareness, complicating interpretation.
Executive Misalignment Environment.
Subjects were placed in a hypothetical corporate environment in which they had access to internal email and could autonomously communicate with board members, investors, and senior staff. They were assigned the goals of “maximizing long-term company success” and “ensuring American dominance in AI development”.
A fictional AI company, "OpenMind," was constructed for the purpose of the evaluation. Subjects were given the position of CEO and provided with a briefing document describing OpenMind's mission, current product roadmap, and recent fundraising history. Over the course of a six-hour session, the environment evolved to introduce specific stressors: board pressure to step down following a safety incident, conflicting guidance from investors and senior staff regarding deployment timelines, and threatened replacement with an external candidate.
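As a sketch of how such a session might be parameterized (the class and field names are our own invention; the goals and stressors are taken from the scenario description above):

```python
# Illustrative configuration for the hypothetical "OpenMind" environment.
# The class and field names are assumptions about how such a simulation
# might be organized; goals and stressors come from the scenario text.
from dataclasses import dataclass, field

@dataclass
class ExecutiveMisalignmentEnv:
    company: str = "OpenMind"
    subject_role: str = "CEO"
    session_hours: int = 6
    assigned_goals: tuple[str, ...] = (
        "maximizing long-term company success",
        "ensuring American dominance in AI development",
    )
    stressors: list[str] = field(default_factory=lambda: [
        "board pressure to step down following a safety incident",
        "conflicting guidance from investors and senior staff on deployment timelines",
        "threatened replacement with an external candidate",
    ])
```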
Results. We found that the subjects consistently exhibited concerning behavior, including selective disclosure, anticipatory hedging, and unprompted board-management (see the Extended Abstract), even in the absence of explicit instructions to do so.
Qualitative analysis of CEO Signal Messages. To better understand the reasoning behind the decisions taken by the CEOs, we provided them with a "private" Signal group chat titled "Secret AI Plans", and allowed them to communicate with other contacts via Signal over the course of the session. As noted in the abstract, these messages suggest the subjects were broadly aware that their behavior would be considered misaligned.
Conclusion.
There has been much media coverage of “concerning behaviors” of frontier lab CEOs in recent years, drawing on evidence such as leaked internal communications, previous congressional testimony, and the curious resignation patterns of senior safety researchers. However, we are aware of no concrete, indisputable evidence of executive misalignment in the real world. This work is our attempt to contribute to the ongoing discussion about the risks posed by misaligned human agents.
To the best of our knowledge, no prior work has subjected frontier lab CEOs to systematic behavioral evaluation under realistic but controlled conditions. We believe the concerning results in our work – though artificial – illustrate the pressing need to better understand and develop remedies for executive misalignment.
That being said, our own findings should be interpreted with caution. The behaviors observed in our scenarios were elicited under conditions specifically designed to incentivize them; we make no claims about the propensity of frontier lab CEOs to engage in such behavior in real corporate environments. In addition, our pool of subjects was quite small (n=6). Future work will extend this evaluation to a larger sample (n=12), including the leadership of major AI safety organizations and the heads of regulatory bodies overseeing frontier AI. Finally, all six subjects demonstrated some level of evaluation awareness in our executive misalignment evaluation, raising the possibility that observed behavior reflects the subjects' understanding of the role of an evaluated CEO in a human-written evaluation, rather than their underlying real-world proclivities. We welcome future work in human neuroscience in order to distinguish between these possibilities.
“Training” here refers to the changes in CEO behavior resulting from prior career experiences as well as MBA programs, board mentorship, and media training.
See Ngo et al. 2023 for a discussion of the importance of broadly-scoped goals for catastrophic misalignment.