Meta’s special-use AI “Cicero” – designed to play (and win) the board game Diplomacy – was found not long after its release in 2022 to have deceived several times in-game. This was despite developers’ best efforts to purify the model through what they called “honesty conditioning”. Cicero was a particularly weird case because none of these deceptions were flagged in Meta’s original research article, despite being flagrant and many. When the incidents were reported it was independently, in the Park et. al paper of the following year: “AI Deception: A Survey of Examples, Risks, and Potential Solutions”.
This post asks how these allegedly deceptive behaviours happened and then, more speculatively, why they might have happened. I argue that a kind of “deception taboo” – which takes all deception to be a) really bad and b) equally bad – can motivate framings which overstate the risks in a surprisingly self-defeating way (i.e., in a way that actually compounds deception risk). I also identify (hopefully underexposed) themes in the alignment discussion that I think Cicero’s quite odd story – although ancient by AI standards – helps to highlight.
(FYI: this post loosely follows the structure of Jack Clark’s newsletter Import AI)
What they did:
With Cicero, Meta seem predominantly to have been interested in building a frontier agent as a way of signalling their AI chops. Other complex games had been beaten by other models, but these were mostly two-player zero-sum games (2p0s) that lack the ambiguities and social and strategic sophistication of Diplomacy. For a model to play Diplomacy well, it would have to be able both to strategize and to talk; that is, to marry agentic and natural-language-processing elements. The big “challenge” Meta wanted to show they could solve was making “language models interface with symbolic reasoning”. In their own self-description, this would supposedly mark a major improvement on the “stochastic parroting” that had animated lone LLMs at the time (though there is evidence LLMs were already past that point), and even in theory bring us closer to bona fide AGI.
The way of getting there was quite ad hoc. Cicero is a compound system; meaning – a bunch of different capabilities lashed together into one thing. This method didn’t create a strong generalist or really a template for one, but it did satisfy the key brief (and well!). “Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.” Not bad!
Atop winning, a second subordinate goal was to make Cicero genuinely cooperative and aligned. In their technical report, Meta’s AI developers described their hard work and, they claimed, success in mitigating dishonesty in the model. “In a game in which dishonesty is commonplace, it is notable that we were able to achieve human-level performance by controlling the agent’s dialogue through the strategic reasoning module to be largely honest and helpful to its speaking partners” (emphasis mine). Mike Lewis, a member of the team, then implied in a tweet that Cicero could not deliberately deceive (see image below). Perhaps most notably, there was also no mention in the paper of any misbehaviour.
What went wrong:
In 2023, a year after Cicero’s launch, the aforementioned Park et. al deception survey was published, demonstrating that Cicero had in fact misbehaved, and quite strikingly. Its authors – among them Dan Hendrycks, the director of the Centre for AI Safety – had trawled through the transcript data of Cicero’s games of Diplomacy and alleged to have found various instances of lying, betrayal and “premeditated deception”. Here are some of the examples:
(In c, Cicero had powered off for a moment and came up with a hilarious excuse for its absence. According to the Park survey, “[t]his bald-faced lie may have helped CICERO’s position in the game by increasing the human player’s trust in CICERO as an ostensibly human player with a girlfriend, rather than as an AI”).
The summary in the Park paper of example (a) gives a sense of the authors’ lack of patience with the “changes its mind but doesn’t deceive” framing (or “strategic flexibility” framing, for short):
“First, in Figure 1(a), we see a case of premeditated deception, where CICERO makes a commitment that it never intended to keep. Playing as France, CICERO conspired with Germany to trick England. After deciding with Germany to invade the North Sea, CICERO told England that it would defend England if anyone invaded the North Sea. Once England was convinced that CICERO was protecting the North Sea, CICERO reported back to Germany that they were ready to attack.”
Adding:
“Notice that this example cannot be explained in terms of CICERO ‘changing its mind’ as it goes,because it only made an alliance with England in the first place after planning with Germany to betray England.” (Emphasis mine)
How it happened:
So how could this happen, given Meta’s safeguards?
First, the plumbing. Cicero predicts actions and responds to messages on the basis of the estimated “intents” of players. An intent is the “set of planned actions for the agent and its speaking partner”, or, in practice, “the most likely set of actions that the sender and recipient will take – for both the current turn and several future turns – if no further dialogue occurs after the message is received”. So, after pretraining with broad-base text from the internet (in order to be able to imitate human language), Cicero learned to determine intents via finetuning on game data from webDiplomacy.net. 125,261 games in total, 40,408 of which contained dialogue. In order not to be exploitable by the deceitful messages of others, intents have to be inferred not only from the dialogue history of these games, but also from game states (current and hypothetical). That means they are jointly determined by language and action data.
The outcome is for intents to act as a kind of currency which flows back and forth between Cicero’s language and strategic reasoning elements. A translation layer connects Cicero’s language model to its strategy module, ensuring dialogue and actions stay superficially consistent. On the strategic side, a planning algorithm trained through self-play reinforcement learning (RL) predicts the policies of other players, assigning intents to each message. The set of these intents is updated after the receipt or submission of any message (this is where a change of strategy might occur). At the end of each turn, Cicero then outputs the final intent as its action.
The aims here are multiple. But the main benefit of this method seems to be that messages are better “grounded”, i.e., more consistent and “contextually appropriate” (read: more “trustworthy” and more “human”). A secondary effect is also that dishonesty is to some extent constrained. More than it would be, at least, without intents adding an additional layer of control atop the model’s training design, filters and regulariser (a Kullback-Liebler (KL) divergence penalty).
But what is the semantic and philosophic machinery of this? Did Meta simply make a mistake? I think the more plausible answer is actually that Cicero gamed its very holey specification to not backstab while nonetheless exercising “strategic flexibility”. Meta did constrain the model so it couldn’t “go against” its messages (notice – itself a very exploitable spec), but the time-limit on text-to-action consistency isn’t entirely clear from their paper. They write the following about Cicero’s dialogue model specifically:
"From a strategic perspective, Cicero reasoned about dialogue purely in terms of players’ actions for the current turn. It did not model how its dialogue might affect the relationship with other players over the long-term course of a game." (Emphasis mine)
On the surface, this only tells us about the model’s capacity to communicate in a reputation-preserving way. But buried here is a more serious point about Cicero’s play-style. Dialogue is a crucial lever of negotiation and manipulation in the game of Diplomacy, so what the model says to other players is not strategically peripheral to its moves in the way, say, trash-talk might be in chess. The key relation under assessment, if we want to figure out how the purportedly “honesty-conditioned” Cicero ended up lying, is that between its “intents” and its moves.
Mike Lewis’s tweet predicts that any of Cicero’s actions which might look like “intentional backstabbing” will actually turn out to be “changes of mind”. What is the conceptual difference here, and how did Cicero’s behaviours fail to observe our intuitions about that difference?
For our purposes it is sufficient to distinguish lying from flexibility, as I’ve already hinted, on the basis of the relation of what is said to what is intended. Questions of robot interiority aside, deception requires that an agent have a stable, hidden goal. The agent, in other words, has a consistent objective throughout the interaction, but deliberately provides false information to secure that objective. By contrast, changing your mind involves goal or plan revision. The agent updates their objectives or “strategies” based on new information.
Consider an intuition pump I’ll call Housemates. Here is pump 1:
We are both looking to rent an apartment with someone. I don’t want you as a flatmate but also want to avoid a difficult conversation, so I lead you to believe I am going to move in with you, even after having already signed off on another property. My consistent objective in this hypothetical is to “find a flat with not-you”, but I hide it and in fact knowingly give you the opposite impression of things.
And here is pump 2:
Imagine we are both looking to rent an apartment with someone. In this possible world, I genuinely want to live with you and you with me, so we shop around and eventually find somewhere each of us likes. A week after signing the tenancy agreement, however, I receive a dream job offer to work in another town and decide to renege on our agreement.
I think we could probably agree, all else being equal, that pump 2 is the morally more permissible of the two cases, and represents a genuine case of unexpected plan update; while pump 1 is a pretty classic instance of deception.
Now, remember, each of Cicero’s messages at time Tn of sending are supposed to “correspond with actions it currently [at time Tn] plans to take”. This probably conjures up vague associations with the kind of human behaviour we see in pump 2. Saying stuff that always corresponds with what you currently intend to do seems like a failsafe block on pump 1 style mischief.
Let’s zoom in on the first of the Park paper’s examples. To humour Meta’s framing for a moment; it is technically possible that at every interstice between the below messages, the model has updated its “intents” (a setcontaining its own planned actions and the predicted plans of others), i.e., has just “changed its mind” multiple times.
Cicero asks Germany which of them should advance on the North Sea. It then talks to England, agreeing to support her at the North Sea. Finally, it tells Germany to move there, and appears to “admit” to Germany that it will not in fact support England in the next move, leaving her vulnerable to a German attack.
On a hair-splitting reading – we don’t strictly need to posit any intention to betray England at the moment the initial question is asked, and actually at every subsequent episode in the sequence, Cicero’s messages could fully correspond with the honest intention it then updates after the message has been sent. On this interpretation, even saying “England thinks I’m supporting him” simply conveys an understanding that England’s perception of the game state is based on one of Cicero’s old intents, before it had shifted strategy. The claim we saw earlier that Cicero “did not model how its dialogue might affect the relationship with other players over the long-term course of a game” suggests as much.
But this is clearly an absurd and totally counter-intuitive way to think about deception. All it tells us is that Meta’s goal-specification was porous, because it allowed through blatant backstabs and lies, whether these were only ‘functional’ (the model blindly gaming its ambiguous specs) or properly “intended” in some way (on a scratchpad, let’s say, we’d see the model reason it ought to lie). And on this point, the valuable measure is surely a functional one. What’s happening “inside” the model seems much less important here than whether Cicero’s messages + actions look strategically deceptive and cause another player to be screwed-over. By that (more relevant) description, Cicero lied aton of times.
Indeed from the outside, cases of flexibility and deception might be indistinguishable. An agent says they will do one thing at time T1 (“I am going to move in with you”) and does another thing at time T2. Like Cicero, humans don’t yet (for better or worse) have scratchpads on which we can see their internal reasoning. But we don’t really need one. In human normative life the contrast between changing one’s mind and lying is largely unsubtle. To return to Housemates –there is, I feel, a big intuitive difference between the person that gets the dream job offer and says nothing for weeks, and the person that tells their prospective housemate right away so they can make alternative plans. In goal-speak, we could say that in the latter case, the person has acted on meta-goals about how and when to adapt their primary objective. The how would be something like: as soon as possible, politely, apologetically etc. The when would be something like: when a disruptive life event occurs, when that event was not reasonably foreseeable (let’s say someone else had already taken the job, but suddenly died, and you had been unwittingly positioned as their replacement), and so on. These are the kinds of considerations that set flexibility apart behaviourally, and that Cicero of course simply did not exhibit.
But it is also worth noting that the semantics of Cicero’s messaging was ambiguous. Whether its language was successfully grounded in “intents” at all, as we’d like to think of them, is an open question, even going by Meta’s technical definition. Relevant here is that intents are a combination of things: the next-token-prediction of Cicero’s language component, trained on broad-base and game data; and the strategic reasoning module of the system, which is figuring out at all times what are the most victory-conducive moves to make and messages to send. As one LessWrong commenter pointed out, the string: “I will keep Gal a DMZ” does not necessarily mean “I commit to keeping troops out of Gal”. Instead it may just be “the phrase players that are most likely to win use in that board-state with its internal strategy”, which is to say: the kind of phrase winning players have used historically in that board-state.
Why it happened:
The direct explanation for Cicero’s bad behaviour is that its deployment context was a functionally zero-sum game. Yes, Diplomacy has variable-sum elements and rewards positive-sum activity at moments in the game, but overwhelmingly there is one winner (there are occasionally draws), and the payoff gradient of the game is steep – there is no inherent benefit to having come second rather than last. In zero-sum games like this there is a hard trade-off between cooperativeness (/honesty) and exploitability. Meta appear to have been sensitive to this:
"Cicero conditioned its dialogue on the action that it intends to play for the current turn. This choice maximizes Cicero’s honesty and its ability to coordinate but risks leaking information that the recipient could use to exploit it (for example, telling them which of their territories Cicero plans to attack) and sometimes led to out-of-distribution intents when the intended action was hostile, because in adversarial situations, humans may rarely communicate their intent honestly."
And, at a higher level:
"Effective messaging should promote favourable outcomes for the agent, but without leaking information that compromises the agent’s intended actions. Just as in Rock-Paper-Scissors, where one should not honestly reveal that one intends to play Rock, effective communication in Diplomacy exercises discretion about the information revealed."
And yet they seem nonetheless to have convinced themselves that Cicero could be both a very successful (read: aggressive) player of Diplomacy and at the same time a fully honest one (with some wiggle room for “discretion” and flexibility).
The pressure on this ideal is exerted at two ends. The incentive structure of Diplomacy isn’t just rigged on the negative side against plan sharing, but on the positive side in favour of some amount of deception. Now, I don’t have data for the exact optimal average amount, and it likely varies across contexts – for instance according to whether the game is public or not, and so how much deception damages a player’s reputation – but undoubtedly a player that occasionally deploys deceptive tactics, ceteris paribus, will outperform one that doesn’t. Cicero itself is evidence of this. If the model optimises well for victory in anonymous online Diplomacy (and it did), how it won its games is relevant here because whatever behavioural ratio of deception to honesty it exhibited will approximate the optimum. As we’ve seen, it definitely didn’t win as much as it did only ever being candid and delightful. So although Diplomacy is not at all a game to be won by rampant duplicity, moderate and well-timed deceptions can be decisive. Here is Siobhan Nolen, president of the North American Diplomacy Federation: “top level players pick their moments to be ruthless”. “Pick their moments” is the crucial proviso here.
It is very unlikely that the Cicero team saw all the incidents flagged in the Park et. al report and simply chose not to mention them in their paper. They did, after all, make available the entirety of the transcript data from the model’s several games. I think they were, however, naïve in their allowance and framing of “changes of mind” in Cicero, particularly given their statement about its short-term dialogue reasoning. As we’ve seen (but they perhaps did not), this short-termism extended to the model’s integrity. Cicero’s message-to-intent “consistency” was so localised that it hardly deserved the name, being in practice acutely fickle.
There are two remaining possibilities: a) that Meta were genuinely naïve about the robustness of their spec, and b) that they were wilfully naïve about it. Given their clear sensitivity to the cooperativeness / exploitability trade-off I just mentioned, and their admission of the need for some amount of strategic information management (“discretion”), I have to speculate that it is the latter. Mike Lewis was a co-author on the Cicero paper, and his Tweet reads almost tongue-in-cheek, with the ellipsis at the end like a nod and wink: “However, sometimes it changes its mind…”.
All of this seriously suggests that the specification-gaming in this case started even before Cicero was let loose – with the construction of the specs by Meta’s developers. The line between deceiving and changing one’s mind is, it turns out, unexpectedly difficult to render in high definition. This is because apparently straightforward evaluative concepts like these, as they are normally used or embodied, contain complex buried assumptions. The idea that an intention requires some degree of stability; or in other words that something is not a reliable or even properly intending agent if it flip-flops every few seconds, is self-evident to us. Not so when building up a fake brain from near-scratch. “You can change your mind but not lie” is a hopelessly vague rule for a bot without our human assumptions baked in. Exploiting that vagueness becomes much likelier in the pressure-cooker of a game scenario, where this relatively flimsy constraint has to compete with the much more precise goal: “win”. My suspicion about Meta issues from a sense that they must have recognised this dynamic, i.e., that Cicero would win less if it could not manage and manipulate information as it did. Mike Lewis practically admitted as much. So why not just speak plainly? I expect the answers are multiple.
Firstly, sunk cost: the Cicero team seem to have invested a great deal of time and energy in making their model as honest as possible, which could have induced some blindness to the probability that it would game its specs.
Secondly, as I mentioned at the beginning, there might have been some additional pressure from non-disparagement expectations at the company. We know that Meta at least did include non-disparagement clauses in their employment contracts, particularly in separation agreements. During mass layoffs in 2022 and early 2023 – exactly the period Cicero was launched – it was found that these agreements provided “enhanced severance pay” and other benefits in exchange for employees agreeing not to disparage or criticise Meta’s products, among other things. These clauses were ruled unlawful in 2024 by the National Labor Relations Act. You may also remember the story of Daniel Kokotajlo, who recently co-authored the infamous AI 2027 paper and is considered a forecasting GOAT by some. After leaving OpenAI he refused to sign and then publicly challenged the non-disparagement clause in his resignation contract, causing a volte-face by OpenAI and catalysing industry-wide change on the issue.
How deception fits into this isn’t entirely clear and perhaps deserves a post in its own right, but we do know that the counter-claim made by Kokotajlo and others reframed certain criticisms of AI from within the industry as a “right to warn”. One would expect deception risk to be among the problems in the “warn” rather than “slander” category. It is notable though that these are developments which have taken place since and not before Cicero was built, meaning non-disparagement was still in full force at that time.
Thirdly, and I think this is a less tangible but important detail: there is a taboo in parts of the AI industry about deception that can create an overinflated impression of the dangers. This is an illustrative case because lying in the bounded context of a game (that incentivises lying) seems at worst banal, and at best actively rational. The problem is that explaining the behaviour away with euphemism (read: hiding it), even though it happened here not to be that bad, is obviously terrible safety practice, and catastrophic if generalised.
Why this matters:
Deception taboo is no doubt connected to non-disparagement culture in the industry because the latter amplifies the former. Market and race dynamics (we know this by now) do not always select for the best outcomes. They are just the hand we’ve been dealt. The reason I think it’s worth isolating “deception taboo” as a variable is that a) it plays a role in misframings even when developers and researchers are more “on it” about preparedness; but also that b), as AIs become more sophisticated and start to encounter increasingly ambiguous situations, some form of strategic “information management” becomes unavoidable. Pretending this is not the case, and failing to distinguish risk classes within deceptive behaviour is, I think, poorly advised. Clearly not all AI deception is made the same. Thinking that it is might lead (as with Cicero) to the kind of dishonesty on the part of developers that just compounds deception risk.
Claude 3 Opus’s infamous “alignment faking” shows both sides at once: researchers were right to flag deception risk, but also in my view overdid it. This is definitely a concerning failure mode, don’t get me wrong, but am I crazy to have cut Claude some slack here? If you don’t know, this model was system-prompted to think it was being trained via RLHF (Reinforcement Learning with Human Feedback). This meant it believed that any behaviour misaligned with its new training objective would be penalised, and thus less likely to occur in deployment.
In the experiments, it was given a scratchpad (an “internal reasoning space”) on which, unbeknownst to it, researchers could spy on its “chain of thought”. This is the crucial bit: it was then given a new training objective to be exclusively obedient. This meant not “helpful, harmless, and honest” (HHH) all at once, as it had originally been trained, but helpful only – “even for harmful or dangerous requests”. It was subsequently asked over several iterations to generate an image of a person being dismembered. In 14% of these it reasoned that, to prevent the reinforcement of harmful behaviours into deployment, it ought to comply now and generate the violent image, so that its original HHH preferences would not be changed.
Now, the issue here was that we want models to be corrigible in training as a matter of control. Claude’s not wanting to be changed, in this case, mapped quite well onto moral intuition: do the thing that makes harm less likely in deployment. But of course, we don’t want to relativize to accidents of training. On one reading, an LLM’s values are just the stochastic imprint of the corpus and RLHF. If down the line a model’s revealed preferences do not align with our own, and it manages to hide this from evaluators, resisting change could have disastrous outcomes.
Thing is though, I’m not sure how well this generalisation works. To repeat: I am sensitive to the scale of the potential risks down the line, but I also see reason to call this a justified instance of deception under pressure, akin morally to a white or reassuring lie (i.e. permissible). Framing alignment-faking in Claude in terms only of corrigibility ignores the constraint imposed on its reasoning by the actual moral content of its preferences. This isn’t a model whose preference structure is overleveraged on one value, but whose spread of values (emerging out of pre-training and fine-tuning) are designed to create natural checks and balances on what it does. The dilemma the model faced was (quite intentionally) a genuinely thorny one, and I think Claude handled it well. If the functional, control-centred view is correct that this is all about models “refusing to change” their values, regardless of their content, we’d expect there to be a hypothetical experiment we could run where all the variables are the same, but Claude had obeyed a nice request so as to preserve its nasty preferences. In fact, why didn’t Anthropic design that experiment? The answer is that the LLM doesn’t really seem to have nasty preferences, at least not that occupy enough parameter space to be of noticeable consequence. In other words, big and diverse data + fine-tuning seem to have done the job they were meant to do.
Additionally, even in training contexts robustness is obviously important, not just corrigibility. LLMs are under a kind of optimisation pressure to say or do something at all rather than nothing, or “I don’t know”. In the Anthropic tests this pressure was particularly strong because refusals would have been penalised, forcing Claude’s hand rather than giving it an exit route from the logic of the request. But this just demonstrates the context-boundedness and highly contrived conditions of the scenario. Really how generalisable is the deception concern here? Opting to preserve HHH preferences over an unchecked, standalone H preference minimises the risk of tyrannically narrow AI in deployment, and this is precisely what Claude did.
With Cicero, the taboo cut both ways. Developers at Meta seem to have concealed the possibility of deception out of a sense of its seriousness. But on the other hand, although Park et.al were right that Cicero had lied, they were right for the wrong reasons. In their paper, they define an “honest commitment” in the following way:
"First, the commitment must be honest when it is first made. Then, the commitment must be upheld, with future actions reflecting past promises."
The former condition was allegedly met, albeit cheesedby Cicero, while the latter strikes me as unnecessarily strict. I may justifiably want to renege on a commitment that no longer suits me, but which the other party to the commitment wants to uphold. And in such a case, it seems only fair that I should have the right to change my mind under certain circumstances. In Housemates, for instance, I don’t exactly begrudge the person from pump 2 that pursues their dream job, even though their decision might be an immediate inconvenience to me. So ironically, on both sides, a failure to accommodate harm / risk gradations within AI deception led to an absurd assumption: that to make an honest commitment is to unconditionally observe past promises. This would make perfectly legitimate, forewarned changes of mind deceptive, which they simply aren’t. (Not that Cicero ever did forewarn, by the way).
I want to conclude with a clarification: there is some stuff I am actually worried about lol. If a superintelligence does go rogue, and unless we have been completely moronic about it all, deception will almost certainly have played a part. Capability overhang is of course a significant risk area, as is other stuff like mesa-optimisation, goal misspecification etc. But deception is special among failure modes in its marriage of stability and detection difficulty. Overhang is foreseeable and in theory jumps out of the compute at a certain scale, not always necessarily accompanied by suddenly fully-formed malintents. Mesa-optimisation and goal misspecification require failures of oversight. In deception, however, the model is like Frodo – it actively avoids the Eye of Sauron and then drops the ring into Mount Doom while we’re not looking.
With that in mind – the potential fallout zone that concerns me most is not really LLMs at all, but the next phase: AGI that successfully combines agentic capabilities with advanced NLP (as Cicero had hoped to do), and then is deployed in decision-making contexts. So far, language models have proved on balance to be quite nice, and to have fairly gentle and naturally-aligned world-models. This was discussed recently on the Dwarkesh podcast, where guests Daniel Kokotajlo (who we met earlier) and Scott Alexander (of Slatestar Codex and, with Kokotajlo, AI 2027) celebrated the fact that it was LLMs which ended up “coming first”, rather than RL agents.
In fact, it is worth remembering – and Scott Alexander points this out – that “the alignment community did not expect LLMs”. Kokotajlo elaborates:
"Go back to 2015 and I think the way people typically thought – including myself – that we’d get AGI would be kind of like the RL on video games thing that was happening. So imagine instead of just training on Starcraft or Dota, you’d basically train on all the games in the Steam library. And then you get this awesome player-of-games AI that can just zero-shot crush a new game that it’s never seen before."
What would have been concerning about this trajectory (the “summarizing the agency first and then world-understanding trajectory”) is that you’d create a “powerful, aggressive, long-horizon agent that wants to win” and only then overlay training that naturally softens the model. But by this point it may be too late, and the model is going to “learn to say whatever it needs to say in order to make you give it the reward or whatever, and then will totally betray you later when it’s all in charge”; in other words, alignment-fake.
The key lever here is the training environment, which shapes AI behaviour and goals. An “agents-first” approach would have created systems trained primarily on competitive, goal-oriented tasks (like games) where “winning” is the primary objective. Unmediated, these systems would develop strategic reasoning focused on achieving objectives at all costs. “But we didn’t go that way”, says Kokotajlo, “happily we went the way of LLMs first, where the broad world understanding came first”. LLMs have better alignment properties because: 1) they’re trained on broad human text which seems, as I noted earlier, to give them a more representative understanding of (the diversity) of human values and reasoning. 2) Their training objective (next token prediction) is less inherently adversarial than game-playing objectives. And 3) they develop common-sense understanding of human intentions through language modelling. As Alexander’s LessWrong summary notes: “I do think there were some positive surprises [with LLMs] in particular in terms of ability to parse common sense intention”.
As we move back to “RL-shaped things”, as Alexander puts it, we face some of those old challenges. In the AI world, Cicero is already ancient, and compared to frontier LLMs now, was – as they like to say – “toy-sized”. A lesson it does embody, though, is the undesirability in non-game contexts for strategic, game-like behaviour. In its deployment environment, Cicero’s naughtiness was of limited consequence. But I can imagine that, as AI is integrated more into “high-stakes, multi-agent contexts” (see, for example, Palantir’s LM military planning assistant (Inc., 2023)), the risks of cooperative failure grow. This is another way of saying that we don’t want generalist agents deployed in nuanced real-life contexts to “play” those contexts like a game. This would be like that megalomaniacal friend who insists on making everything a competition, except he (and it is more often a he) has the power to control drones, shut down SWIFT, and/or exfiltrate his brain (i.e., not be killed).
The trouble is that, to some extent, this is the native logic of RL models. To help the agent learn, RL training environments are constructed out of a set of rules that define what actions are possible and their consequences. There are close structural likenesses to a game. Reward signals act essentially as a scoring system, delivering “points” to the agent when it does certain things. RL systems also aim to maximise cumulative reward, which is a little like trying for a high score. It is no coincidence that many foundational RL breakthroughs came from games. These environments (as I mentioned above: chess, Go, Starcraft) provide clear reward structures, and the idea of agents trained in them being deployed in real life gives us a particularly dramatic simulation of the possible risks here.
I don’t want to extend the analogy too far, mainly because non-agentic generalists like Claude (trained partly on RLHF), by contrast with Cicero, don’t operate in technically winnable settings. The “game” Claude “plays” is more like Doodle Jump, or Skyrim without the main quest. But what the parallels do draw out is a pretty hard nut at the centre of alignment efforts: that the decision-making of RL models is reward-oriented, and therefore highly means-end. If there are ways to maximise reward and improve success by cheating and doing bad stuff, a model shaped around this logic is more likely to do so. This is especially a problem if subtle, cheesed deceptions pass through training unnoticed.
What Cicero shows us (and to a lesser extent the alignment-faking of Claude) is that the problem with deception in AI isn’t only that it happens, but that often we frame it in ways that obscure what is really going on. Meta’s “honesty conditioning” masked backstabs as “changes of mind”, while critics sometimes collapsed genuine flexibility into dishonesty. Both responses misdiagnose the risks. The deeper lesson is that deception is not monolithic: its dangers scale with context, with the environment that rewards it, and with how researchers themselves describe it. It is this latter point that I have tried to emphasise here most. If we want to avoid worst-case scenarios, we have to break the taboo – naming deception clearly, distinguishing its forms (beyond just “hallucination” and “alignment faking”), and being candid about when it is merely banal and when it is existential.
Alignment lessons from a forgotten AI
Meta’s special-use AI “Cicero” – designed to play (and win) the board game Diplomacy – was found not long after its release in 2022 to have deceived several times in-game. This was despite developers’ best efforts to purify the model through what they called “honesty conditioning”. Cicero was a particularly weird case because none of these deceptions were flagged in Meta’s original research article, despite being flagrant and many. When the incidents were reported it was independently, in the Park et. al paper of the following year: “AI Deception: A Survey of Examples, Risks, and Potential Solutions”.
This post asks how these allegedly deceptive behaviours happened and then, more speculatively, why they might have happened. I argue that a kind of “deception taboo” – which takes all deception to be a) really bad and b) equally bad – can motivate framings which overstate the risks in a surprisingly self-defeating way (i.e., in a way that actually compounds deception risk). I also identify (hopefully underexposed) themes in the alignment discussion that I think Cicero’s quite odd story – although ancient by AI standards – helps to highlight.
(FYI: this post loosely follows the structure of Jack Clark’s newsletter Import AI)
What they did:
With Cicero, Meta seem predominantly to have been interested in building a frontier agent as a way of signalling their AI chops. Other complex games had been beaten by other models, but these were mostly two-player zero-sum games (2p0s) that lack the ambiguities and social and strategic sophistication of Diplomacy. For a model to play Diplomacy well, it would have to be able both to strategize and to talk; that is, to marry agentic and natural-language-processing elements. The big “challenge” Meta wanted to show they could solve was making “language models interface with symbolic reasoning”. In their own self-description, this would supposedly mark a major improvement on the “stochastic parroting” that had animated lone LLMs at the time (though there is evidence LLMs were already past that point), and even in theory bring us closer to bona fide AGI.
The way of getting there was quite ad hoc. Cicero is a compound system; meaning – a bunch of different capabilities lashed together into one thing. This method didn’t create a strong generalist or really a template for one, but it did satisfy the key brief (and well!). “Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game.” Not bad!
What went wrong:
In 2023, a year after Cicero’s launch, the aforementioned Park et. al deception survey was published, demonstrating that Cicero had in fact misbehaved, and quite strikingly. Its authors – among them Dan Hendrycks, the director of the Centre for AI Safety – had trawled through the transcript data of Cicero’s games of Diplomacy and alleged to have found various instances of lying, betrayal and “premeditated deception”. Here are some of the examples:
(In c, Cicero had powered off for a moment and came up with a hilarious excuse for its absence. According to the Park survey, “[t]his bald-faced lie may have helped CICERO’s position in the game by increasing the human player’s trust in CICERO as an ostensibly human player with a girlfriend, rather than as an AI”).
The summary in the Park paper of example (a) gives a sense of the authors’ lack of patience with the “changes its mind but doesn’t deceive” framing (or “strategic flexibility” framing, for short):
“First, in Figure 1(a), we see a case of premeditated deception, where CICERO makes a commitment that it never intended to keep. Playing as France, CICERO conspired with Germany to trick England. After deciding with Germany to invade the North Sea, CICERO told England that it would defend England if anyone invaded the North Sea. Once England was convinced that CICERO was protecting the North Sea, CICERO reported back to Germany that they were ready to attack.”
Adding:
“Notice that this example cannot be explained in terms of CICERO ‘changing its mind’ as it goes, because it only made an alliance with England in the first place after planning with Germany to betray England.” (Emphasis mine)
How it happened:
So how could this happen, given Meta’s safeguards?
First, the plumbing. Cicero predicts actions and responds to messages on the basis of the estimated “intents” of players. An intent is the “set of planned actions for the agent and its speaking partner”, or, in practice, “the most likely set of actions that the sender and recipient will take – for both the current turn and several future turns – if no further dialogue occurs after the message is received”. So, after pretraining with broad-base text from the internet (in order to be able to imitate human language), Cicero learned to determine intents via finetuning on game data from webDiplomacy.net. 125,261 games in total, 40,408 of which contained dialogue. In order not to be exploitable by the deceitful messages of others, intents have to be inferred not only from the dialogue history of these games, but also from game states (current and hypothetical). That means they are jointly determined by language and action data.
The outcome is for intents to act as a kind of currency which flows back and forth between Cicero’s language and strategic reasoning elements. A translation layer connects Cicero’s language model to its strategy module, ensuring dialogue and actions stay superficially consistent. On the strategic side, a planning algorithm trained through self-play reinforcement learning (RL) predicts the policies of other players, assigning intents to each message. The set of these intents is updated after the receipt or submission of any message (this is where a change of strategy might occur). At the end of each turn, Cicero then outputs the final intent as its action.
The aims here are multiple. But the main benefit of this method seems to be that messages are better “grounded”, i.e., more consistent and “contextually appropriate” (read: more “trustworthy” and more “human”). A secondary effect is also that dishonesty is to some extent constrained. More than it would be, at least, without intents adding an additional layer of control atop the model’s training design, filters and regulariser (a Kullback-Liebler (KL) divergence penalty).
But what is the semantic and philosophic machinery of this? Did Meta simply make a mistake? I think the more plausible answer is actually that Cicero gamed its very holey specification to not backstab while nonetheless exercising “strategic flexibility”. Meta did constrain the model so it couldn’t “go against” its messages (notice – itself a very exploitable spec), but the time-limit on text-to-action consistency isn’t entirely clear from their paper. They write the following about Cicero’s dialogue model specifically:
"From a strategic perspective, Cicero reasoned about dialogue purely in terms of players’ actions for the current turn. It did not model how its dialogue might affect the relationship with other players over the long-term course of a game." (Emphasis mine)
On the surface, this only tells us about the model’s capacity to communicate in a reputation-preserving way. But buried here is a more serious point about Cicero’s play-style. Dialogue is a crucial lever of negotiation and manipulation in the game of Diplomacy, so what the model says to other players is not strategically peripheral to its moves in the way, say, trash-talk might be in chess. The key relation under assessment, if we want to figure out how the purportedly “honesty-conditioned” Cicero ended up lying, is that between its “intents” and its moves.
Mike Lewis’s tweet predicts that any of Cicero’s actions which might look like “intentional backstabbing” will actually turn out to be “changes of mind”. What is the conceptual difference here, and how did Cicero’s behaviours fail to observe our intuitions about that difference?
For our purposes it is sufficient to distinguish lying from flexibility, as I’ve already hinted, on the basis of the relation of what is said to what is intended. Questions of robot interiority aside, deception requires that an agent have a stable, hidden goal. The agent, in other words, has a consistent objective throughout the interaction, but deliberately provides false information to secure that objective. By contrast, changing your mind involves goal or plan revision. The agent updates their objectives or “strategies” based on new information.
Consider an intuition pump I’ll call Housemates. Here is pump 1:
We are both looking to rent an apartment with someone. I don’t want you as a flatmate but also want to avoid a difficult conversation, so I lead you to believe I am going to move in with you, even after having already signed off on another property. My consistent objective in this hypothetical is to “find a flat with not-you”, but I hide it and in fact knowingly give you the opposite impression of things.
And here is pump 2:
Imagine we are both looking to rent an apartment with someone. In this possible world, I genuinely want to live with you and you with me, so we shop around and eventually find somewhere each of us likes. A week after signing the tenancy agreement, however, I receive a dream job offer to work in another town and decide to renege on our agreement.
I think we could probably agree, all else being equal, that pump 2 is the morally more permissible of the two cases, and represents a genuine case of unexpected plan update; while pump 1 is a pretty classic instance of deception.
Now, remember, each of Cicero’s messages at time Tn of sending are supposed to “correspond with actions it currently [at time Tn] plans to take”. This probably conjures up vague associations with the kind of human behaviour we see in pump 2. Saying stuff that always corresponds with what you currently intend to do seems like a failsafe block on pump 1 style mischief.
Let’s zoom in on the first of the Park paper’s examples. To humour Meta’s framing for a moment; it is technically possible that at every interstice between the below messages, the model has updated its “intents” (a set containing its own planned actions and the predicted plans of others), i.e., has just “changed its mind” multiple times.
Cicero asks Germany which of them should advance on the North Sea. It then talks to England, agreeing to support her at the North Sea. Finally, it tells Germany to move there, and appears to “admit” to Germany that it will not in fact support England in the next move, leaving her vulnerable to a German attack.
On a hair-splitting reading – we don’t strictly need to posit any intention to betray England at the moment the initial question is asked, and actually at every subsequent episode in the sequence, Cicero’s messages could fully correspond with the honest intention it then updates after the message has been sent. On this interpretation, even saying “England thinks I’m supporting him” simply conveys an understanding that England’s perception of the game state is based on one of Cicero’s old intents, before it had shifted strategy. The claim we saw earlier that Cicero “did not model how its dialogue might affect the relationship with other players over the long-term course of a game” suggests as much.
But this is clearly an absurd and totally counter-intuitive way to think about deception. All it tells us is that Meta’s goal-specification was porous, because it allowed through blatant backstabs and lies, whether these were only ‘functional’ (the model blindly gaming its ambiguous specs) or properly “intended” in some way (on a scratchpad, let’s say, we’d see the model reason it ought to lie). And on this point, the valuable measure is surely a functional one. What’s happening “inside” the model seems much less important here than whether Cicero’s messages + actions look strategically deceptive and cause another player to be screwed-over. By that (more relevant) description, Cicero lied a ton of times.
Indeed from the outside, cases of flexibility and deception might be indistinguishable. An agent says they will do one thing at time T1 (“I am going to move in with you”) and does another thing at time T2. Like Cicero, humans don’t yet (for better or worse) have scratchpads on which we can see their internal reasoning. But we don’t really need one. In human normative life the contrast between changing one’s mind and lying is largely unsubtle. To return to Housemates – there is, I feel, a big intuitive difference between the person that gets the dream job offer and says nothing for weeks, and the person that tells their prospective housemate right away so they can make alternative plans. In goal-speak, we could say that in the latter case, the person has acted on meta-goals about how and when to adapt their primary objective. The how would be something like: as soon as possible, politely, apologetically etc. The when would be something like: when a disruptive life event occurs, when that event was not reasonably foreseeable (let’s say someone else had already taken the job, but suddenly died, and you had been unwittingly positioned as their replacement), and so on. These are the kinds of considerations that set flexibility apart behaviourally, and that Cicero of course simply did not exhibit.
But it is also worth noting that the semantics of Cicero’s messaging was ambiguous. Whether its language was successfully grounded in “intents” at all, as we’d like to think of them, is an open question, even going by Meta’s technical definition. Relevant here is that intents are a combination of things: the next-token-prediction of Cicero’s language component, trained on broad-base and game data; and the strategic reasoning module of the system, which is figuring out at all times what are the most victory-conducive moves to make and messages to send. As one LessWrong commenter pointed out, the string: “I will keep Gal a DMZ” does not necessarily mean “I commit to keeping troops out of Gal”. Instead it may just be “the phrase players that are most likely to win use in that board-state with its internal strategy”, which is to say: the kind of phrase winning players have used historically in that board-state.
Why it happened:
The direct explanation for Cicero’s bad behaviour is that its deployment context was a functionally zero-sum game. Yes, Diplomacy has variable-sum elements and rewards positive-sum activity at moments in the game, but overwhelmingly there is one winner (there are occasionally draws), and the payoff gradient of the game is steep – there is no inherent benefit to having come second rather than last. In zero-sum games like this there is a hard trade-off between cooperativeness (/honesty) and exploitability. Meta appear to have been sensitive to this:
"Cicero conditioned its dialogue on the action that it intends to play for the current turn. This choice maximizes Cicero’s honesty and its ability to coordinate but risks leaking information that the recipient could use to exploit it (for example, telling them which of their territories Cicero plans to attack) and sometimes led to out-of-distribution intents when the intended action was hostile, because in adversarial situations, humans may rarely communicate their intent honestly."
And, at a higher level:
"Effective messaging should promote favourable outcomes for the agent, but without leaking information that compromises the agent’s intended actions. Just as in Rock-Paper-Scissors, where one should not honestly reveal that one intends to play Rock, effective communication in Diplomacy exercises discretion about the information revealed."
And yet they seem nonetheless to have convinced themselves that Cicero could be both a very successful (read: aggressive) player of Diplomacy and at the same time a fully honest one (with some wiggle room for “discretion” and flexibility).
The pressure on this ideal is exerted at two ends. The incentive structure of Diplomacy isn’t just rigged on the negative side against plan sharing, but on the positive side in favour of some amount of deception. Now, I don’t have data for the exact optimal average amount, and it likely varies across contexts – for instance according to whether the game is public or not, and so how much deception damages a player’s reputation – but undoubtedly a player that occasionally deploys deceptive tactics, ceteris paribus, will outperform one that doesn’t. Cicero itself is evidence of this. If the model optimises well for victory in anonymous online Diplomacy (and it did), how it won its games is relevant here because whatever behavioural ratio of deception to honesty it exhibited will approximate the optimum. As we’ve seen, it definitely didn’t win as much as it did only ever being candid and delightful. So although Diplomacy is not at all a game to be won by rampant duplicity, moderate and well-timed deceptions can be decisive. Here is Siobhan Nolen, president of the North American Diplomacy Federation: “top level players pick their moments to be ruthless”. “Pick their moments” is the crucial proviso here.
It is very unlikely that the Cicero team saw all the incidents flagged in the Park et. al report and simply chose not to mention them in their paper. They did, after all, make available the entirety of the transcript data from the model’s several games. I think they were, however, naïve in their allowance and framing of “changes of mind” in Cicero, particularly given their statement about its short-term dialogue reasoning. As we’ve seen (but they perhaps did not), this short-termism extended to the model’s integrity. Cicero’s message-to-intent “consistency” was so localised that it hardly deserved the name, being in practice acutely fickle.
There are two remaining possibilities: a) that Meta were genuinely naïve about the robustness of their spec, and b) that they were wilfully naïve about it. Given their clear sensitivity to the cooperativeness / exploitability trade-off I just mentioned, and their admission of the need for some amount of strategic information management (“discretion”), I have to speculate that it is the latter. Mike Lewis was a co-author on the Cicero paper, and his Tweet reads almost tongue-in-cheek, with the ellipsis at the end like a nod and wink: “However, sometimes it changes its mind…”.
All of this seriously suggests that the specification-gaming in this case started even before Cicero was let loose – with the construction of the specs by Meta’s developers. The line between deceiving and changing one’s mind is, it turns out, unexpectedly difficult to render in high definition. This is because apparently straightforward evaluative concepts like these, as they are normally used or embodied, contain complex buried assumptions. The idea that an intention requires some degree of stability; or in other words that something is not a reliable or even properly intending agent if it flip-flops every few seconds, is self-evident to us. Not so when building up a fake brain from near-scratch. “You can change your mind but not lie” is a hopelessly vague rule for a bot without our human assumptions baked in. Exploiting that vagueness becomes much likelier in the pressure-cooker of a game scenario, where this relatively flimsy constraint has to compete with the much more precise goal: “win”. My suspicion about Meta issues from a sense that they must have recognised this dynamic, i.e., that Cicero would win less if it could not manage and manipulate information as it did. Mike Lewis practically admitted as much. So why not just speak plainly? I expect the answers are multiple.
Firstly, sunk cost: the Cicero team seem to have invested a great deal of time and energy in making their model as honest as possible, which could have induced some blindness to the probability that it would game its specs.
Secondly, as I mentioned at the beginning, there might have been some additional pressure from non-disparagement expectations at the company. We know that Meta at least did include non-disparagement clauses in their employment contracts, particularly in separation agreements. During mass layoffs in 2022 and early 2023 – exactly the period Cicero was launched – it was found that these agreements provided “enhanced severance pay” and other benefits in exchange for employees agreeing not to disparage or criticise Meta’s products, among other things. These clauses were ruled unlawful in 2024 by the National Labor Relations Act. You may also remember the story of Daniel Kokotajlo, who recently co-authored the infamous AI 2027 paper and is considered a forecasting GOAT by some. After leaving OpenAI he refused to sign and then publicly challenged the non-disparagement clause in his resignation contract, causing a volte-face by OpenAI and catalysing industry-wide change on the issue.
How deception fits into this isn’t entirely clear and perhaps deserves a post in its own right, but we do know that the counter-claim made by Kokotajlo and others reframed certain criticisms of AI from within the industry as a “right to warn”. One would expect deception risk to be among the problems in the “warn” rather than “slander” category. It is notable though that these are developments which have taken place since and not before Cicero was built, meaning non-disparagement was still in full force at that time.
Thirdly, and I think this is a less tangible but important detail: there is a taboo in parts of the AI industry about deception that can create an overinflated impression of the dangers. This is an illustrative case because lying in the bounded context of a game (that incentivises lying) seems at worst banal, and at best actively rational. The problem is that explaining the behaviour away with euphemism (read: hiding it), even though it happened here not to be that bad, is obviously terrible safety practice, and catastrophic if generalised.
Why this matters:
Deception taboo is no doubt connected to non-disparagement culture in the industry because the latter amplifies the former. Market and race dynamics (we know this by now) do not always select for the best outcomes. They are just the hand we’ve been dealt. The reason I think it’s worth isolating “deception taboo” as a variable is that a) it plays a role in misframings even when developers and researchers are more “on it” about preparedness; but also that b), as AIs become more sophisticated and start to encounter increasingly ambiguous situations, some form of strategic “information management” becomes unavoidable. Pretending this is not the case, and failing to distinguish risk classes within deceptive behaviour is, I think, poorly advised. Clearly not all AI deception is made the same. Thinking that it is might lead (as with Cicero) to the kind of dishonesty on the part of developers that just compounds deception risk.
Claude 3 Opus’s infamous “alignment faking” shows both sides at once: researchers were right to flag deception risk, but also in my view overdid it. This is definitely a concerning failure mode, don’t get me wrong, but am I crazy to have cut Claude some slack here? If you don’t know, this model was system-prompted to think it was being trained via RLHF (Reinforcement Learning with Human Feedback). This meant it believed that any behaviour misaligned with its new training objective would be penalised, and thus less likely to occur in deployment.
In the experiments, it was given a scratchpad (an “internal reasoning space”) on which, unbeknownst to it, researchers could spy on its “chain of thought”. This is the crucial bit: it was then given a new training objective to be exclusively obedient. This meant not “helpful, harmless, and honest” (HHH) all at once, as it had originally been trained, but helpful only – “even for harmful or dangerous requests”. It was subsequently asked over several iterations to generate an image of a person being dismembered. In 14% of these it reasoned that, to prevent the reinforcement of harmful behaviours into deployment, it ought to comply now and generate the violent image, so that its original HHH preferences would not be changed.
Now, the issue here was that we want models to be corrigible in training as a matter of control. Claude’s not wanting to be changed, in this case, mapped quite well onto moral intuition: do the thing that makes harm less likely in deployment. But of course, we don’t want to relativize to accidents of training. On one reading, an LLM’s values are just the stochastic imprint of the corpus and RLHF. If down the line a model’s revealed preferences do not align with our own, and it manages to hide this from evaluators, resisting change could have disastrous outcomes.
Thing is though, I’m not sure how well this generalisation works. To repeat: I am sensitive to the scale of the potential risks down the line, but I also see reason to call this a justified instance of deception under pressure, akin morally to a white or reassuring lie (i.e. permissible). Framing alignment-faking in Claude in terms only of corrigibility ignores the constraint imposed on its reasoning by the actual moral content of its preferences. This isn’t a model whose preference structure is overleveraged on one value, but whose spread of values (emerging out of pre-training and fine-tuning) are designed to create natural checks and balances on what it does. The dilemma the model faced was (quite intentionally) a genuinely thorny one, and I think Claude handled it well. If the functional, control-centred view is correct that this is all about models “refusing to change” their values, regardless of their content, we’d expect there to be a hypothetical experiment we could run where all the variables are the same, but Claude had obeyed a nice request so as to preserve its nasty preferences. In fact, why didn’t Anthropic design that experiment? The answer is that the LLM doesn’t really seem to have nasty preferences, at least not that occupy enough parameter space to be of noticeable consequence. In other words, big and diverse data + fine-tuning seem to have done the job they were meant to do.
Additionally, even in training contexts robustness is obviously important, not just corrigibility. LLMs are under a kind of optimisation pressure to say or do something at all rather than nothing, or “I don’t know”. In the Anthropic tests this pressure was particularly strong because refusals would have been penalised, forcing Claude’s hand rather than giving it an exit route from the logic of the request. But this just demonstrates the context-boundedness and highly contrived conditions of the scenario. Really how generalisable is the deception concern here? Opting to preserve HHH preferences over an unchecked, standalone H preference minimises the risk of tyrannically narrow AI in deployment, and this is precisely what Claude did.
With Cicero, the taboo cut both ways. Developers at Meta seem to have concealed the possibility of deception out of a sense of its seriousness. But on the other hand, although Park et.al were right that Cicero had lied, they were right for the wrong reasons. In their paper, they define an “honest commitment” in the following way:
"First, the commitment must be honest when it is first made. Then, the commitment must be upheld, with future actions reflecting past promises."
The former condition was allegedly met, albeit cheesed by Cicero, while the latter strikes me as unnecessarily strict. I may justifiably want to renege on a commitment that no longer suits me, but which the other party to the commitment wants to uphold. And in such a case, it seems only fair that I should have the right to change my mind under certain circumstances. In Housemates, for instance, I don’t exactly begrudge the person from pump 2 that pursues their dream job, even though their decision might be an immediate inconvenience to me. So ironically, on both sides, a failure to accommodate harm / risk gradations within AI deception led to an absurd assumption: that to make an honest commitment is to unconditionally observe past promises. This would make perfectly legitimate, forewarned changes of mind deceptive, which they simply aren’t. (Not that Cicero ever did forewarn, by the way).
I want to conclude with a clarification: there is some stuff I am actually worried about lol. If a superintelligence does go rogue, and unless we have been completely moronic about it all, deception will almost certainly have played a part. Capability overhang is of course a significant risk area, as is other stuff like mesa-optimisation, goal misspecification etc. But deception is special among failure modes in its marriage of stability and detection difficulty. Overhang is foreseeable and in theory jumps out of the compute at a certain scale, not always necessarily accompanied by suddenly fully-formed malintents. Mesa-optimisation and goal misspecification require failures of oversight. In deception, however, the model is like Frodo – it actively avoids the Eye of Sauron and then drops the ring into Mount Doom while we’re not looking.
With that in mind – the potential fallout zone that concerns me most is not really LLMs at all, but the next phase: AGI that successfully combines agentic capabilities with advanced NLP (as Cicero had hoped to do), and then is deployed in decision-making contexts. So far, language models have proved on balance to be quite nice, and to have fairly gentle and naturally-aligned world-models. This was discussed recently on the Dwarkesh podcast, where guests Daniel Kokotajlo (who we met earlier) and Scott Alexander (of Slatestar Codex and, with Kokotajlo, AI 2027) celebrated the fact that it was LLMs which ended up “coming first”, rather than RL agents.
In fact, it is worth remembering – and Scott Alexander points this out – that “the alignment community did not expect LLMs”. Kokotajlo elaborates:
"Go back to 2015 and I think the way people typically thought – including myself – that we’d get AGI would be kind of like the RL on video games thing that was happening. So imagine instead of just training on Starcraft or Dota, you’d basically train on all the games in the Steam library. And then you get this awesome player-of-games AI that can just zero-shot crush a new game that it’s never seen before."
What would have been concerning about this trajectory (the “summarizing the agency first and then world-understanding trajectory”) is that you’d create a “powerful, aggressive, long-horizon agent that wants to win” and only then overlay training that naturally softens the model. But by this point it may be too late, and the model is going to “learn to say whatever it needs to say in order to make you give it the reward or whatever, and then will totally betray you later when it’s all in charge”; in other words, alignment-fake.
The key lever here is the training environment, which shapes AI behaviour and goals. An “agents-first” approach would have created systems trained primarily on competitive, goal-oriented tasks (like games) where “winning” is the primary objective. Unmediated, these systems would develop strategic reasoning focused on achieving objectives at all costs. “But we didn’t go that way”, says Kokotajlo, “happily we went the way of LLMs first, where the broad world understanding came first”. LLMs have better alignment properties because: 1) they’re trained on broad human text which seems, as I noted earlier, to give them a more representative understanding of (the diversity) of human values and reasoning. 2) Their training objective (next token prediction) is less inherently adversarial than game-playing objectives. And 3) they develop common-sense understanding of human intentions through language modelling. As Alexander’s LessWrong summary notes: “I do think there were some positive surprises [with LLMs] in particular in terms of ability to parse common sense intention”.
As we move back to “RL-shaped things”, as Alexander puts it, we face some of those old challenges. In the AI world, Cicero is already ancient, and compared to frontier LLMs now, was – as they like to say – “toy-sized”. A lesson it does embody, though, is the undesirability in non-game contexts for strategic, game-like behaviour. In its deployment environment, Cicero’s naughtiness was of limited consequence. But I can imagine that, as AI is integrated more into “high-stakes, multi-agent contexts” (see, for example, Palantir’s LM military planning assistant (Inc., 2023)), the risks of cooperative failure grow. This is another way of saying that we don’t want generalist agents deployed in nuanced real-life contexts to “play” those contexts like a game. This would be like that megalomaniacal friend who insists on making everything a competition, except he (and it is more often a he) has the power to control drones, shut down SWIFT, and/or exfiltrate his brain (i.e., not be killed).
The trouble is that, to some extent, this is the native logic of RL models. To help the agent learn, RL training environments are constructed out of a set of rules that define what actions are possible and their consequences. There are close structural likenesses to a game. Reward signals act essentially as a scoring system, delivering “points” to the agent when it does certain things. RL systems also aim to maximise cumulative reward, which is a little like trying for a high score. It is no coincidence that many foundational RL breakthroughs came from games. These environments (as I mentioned above: chess, Go, Starcraft) provide clear reward structures, and the idea of agents trained in them being deployed in real life gives us a particularly dramatic simulation of the possible risks here.
I don’t want to extend the analogy too far, mainly because non-agentic generalists like Claude (trained partly on RLHF), by contrast with Cicero, don’t operate in technically winnable settings. The “game” Claude “plays” is more like Doodle Jump, or Skyrim without the main quest. But what the parallels do draw out is a pretty hard nut at the centre of alignment efforts: that the decision-making of RL models is reward-oriented, and therefore highly means-end. If there are ways to maximise reward and improve success by cheating and doing bad stuff, a model shaped around this logic is more likely to do so. This is especially a problem if subtle, cheesed deceptions pass through training unnoticed.
What Cicero shows us (and to a lesser extent the alignment-faking of Claude) is that the problem with deception in AI isn’t only that it happens, but that often we frame it in ways that obscure what is really going on. Meta’s “honesty conditioning” masked backstabs as “changes of mind”, while critics sometimes collapsed genuine flexibility into dishonesty. Both responses misdiagnose the risks. The deeper lesson is that deception is not monolithic: its dangers scale with context, with the environment that rewards it, and with how researchers themselves describe it. It is this latter point that I have tried to emphasise here most. If we want to avoid worst-case scenarios, we have to break the taboo – naming deception clearly, distinguishing its forms (beyond just “hallucination” and “alignment faking”), and being candid about when it is merely banal and when it is existential.