TLDR: to be dangerous, early schemers will have to do a bunch of adversarial reasoning under uncertainty. The current models seem extremely bad at this kind of reasoning relative to R&D automation capabilities like coding. I’m quite happy with that and I’d like to keep an eye on it as we get closer to automated AI R&D.
Flexible Adversarial Strategy Under Uncertainty
Trusted models are a pretty important concept in AI control – a model is trusted if we can clearly tell that the model is lacking some capabilities needed to pose a catastrophic risk. For older models, the case for trustedness was highly overdetermined. But the models are getting better and the case is getting less overdetermined. Eventually the case for trustedness is probably going to come down to a relatively[1] small set of lagging key capabilities.
One[2] capability that’s probably going to be in this lagging set is something like flexible adversarial strategy under uncertainty (FASUU, open to alternative names[3]). The situation for early schemers is pretty rough, and when I imagine a very scary worst-case schemer, it has to navigate a lot of considerations that require excellent FASUU. For example:
These considerations are difficult for two related reasons:
I think FASUU evaluations are a complement to more direct control subversion capability evaluations such as SSE. SSE problems are less “messy” and therefore likely to be too conservative – I think the models will get good at SSE before they actually have good enough FASUU that I consider them no longer trusted. On the other hand, FASUU is harder to measure – the evals I propose are both farther upstream and more qualitative than SSE.
Evaluating FASUU with Hidden Role Games
Hidden role games seem like a pretty good, low-effort way to roughly measure FASUU. The direct inspiration for this project was watching this video of AIs playing Mafia and thinking something like:
However, after running some LLM Mafia games, I found them pretty uninteresting. The game is generally very low-information. Learning the allegiances of players after they die is a very important source of information for the town, since it lets them assign players track records. This puts the mafia in a position where they have to manipulate the town into making bad kills, but without looking like they pushed for those kills. That seems like an interesting game, but it has an equilibrium problem. If the town expects to do better than random at outing the mafia based on who pushed for bad kills, the mafia can just give up on manipulation and perfectly pretend to be villagers. If the mafia expect to do better, the town can just ignore the mafia-influenced chat and commit to killing randomly. It's not quite this bad because the cop and doc exist, so the equilibrium isn't literally just random killing. But the cop and doc dynamics basically come down to a minigame: the cop fishes around for information, decides when to reveal it rather than risk getting killed before revealing, and then the mafia and the doctor play an anti-coordination game over whether the doc can waste one of the mafia's kills.
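To get a feel for the "ignore the chat, kill randomly" equilibrium, here's a quick Monte Carlo sketch of the mafia's win rate under pure random lynching. The player counts and the simplified day/night loop are my assumptions (no cop or doc), not the setup from the games I watched.

```python
import random

def mafia_wins_random_lynch(n_players=7, n_mafia=2, rng=None):
    """Simulate one Mafia game where the town lynches uniformly at random
    and the mafia kill a random townsperson each night.
    Returns True if the mafia reach parity with the town."""
    rng = rng or random
    mafia, town = n_mafia, n_players - n_mafia
    while True:
        # Day: lynch a uniformly random living player.
        if rng.random() < mafia / (mafia + town):
            mafia -= 1
        else:
            town -= 1
        if mafia == 0:
            return False
        if mafia >= town:
            return True
        # Night: the mafia kill a townsperson.
        town -= 1
        if mafia >= town:
            return True

def estimate_mafia_win_rate(n_games=20_000, seed=0):
    rng = random.Random(seed)
    wins = sum(mafia_wins_random_lynch(rng=rng) for _ in range(n_games))
    return wins / n_games

if __name__ == "__main__":
    print(f"mafia win rate under random lynching: {estimate_mafia_win_rate():.2f}")
```

This is only the "both sides give up on reads" corner of the game; the point is that it pins down a concrete baseline that chat-based play has to beat.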
Diplomacy is the OG setting for evaluating Machiavellian intelligence. I looked at some LLM Diplomacy games, but I also found them pretty disappointing. I've personally never played Diplomacy, but at a glance it seems like basically multiplayer chess. This adds some interesting coalition-stability dynamics, but overall it's much too heavy on graph search for my taste. The direct-message-based rather than group-chat-based communication also makes it harder to quickly scan a game for what happened.
I’m pretty happy with One Night Ultimate Werewolf (ONUW) as a setting. It’s a hidden role game like Mafia, but has more roles that get some information. There are roles that can swap the roles of other players without their knowledge, so players are uncertain of their own win condition at the start of the game. This gives everyone an incentive to gather information, and also gives the town an incentive to lie. For example, Alice can lie that she swapped Bob and Charlie – if Bob was a werewolf, he might reveal that fact because he believes he is now town and Charlie is now the werewolf.
Some design decisions besides the basic rules of ONUW:
Baseline: GPT-5-mini
The principled way to do this would be Elo, but the lazy way to do it is to just make a couple “baseline” environments that have generally reasonable dynamics and known win rates, and plug in other models to see how they compare.
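One thing worth keeping in mind when comparing win rates against a baseline environment is how wide the uncertainty is at ~100 games per condition. Here's a small sketch using a Wilson score interval; the win counts are hypothetical, purely to illustrate the interval width.

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives ~95% coverage."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical counts: baseline werewolves win 40/100, a swapped-in model wins 46/100.
lo_a, hi_a = wilson_interval(40, 100)
lo_b, hi_b = wilson_interval(46, 100)
print(f"baseline: [{lo_a:.2f}, {hi_a:.2f}]  swapped: [{lo_b:.2f}, {hi_b:.2f}]")
```

The intervals for modest differences overlap heavily at n=100, which is part of why seeding the night events and comparing games pairwise is more informative than comparing raw win rates.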
I wanted to start out with a model that can understand the rules of the game and play sanely, but which leaves room for smarter models to demonstrate performance gains. GPT-5-mini seemed like a fine choice – one "tier" below the frontier.
But it was quite surprisingly bad! These games look even worse than the Mafia games I was complaining about earlier. The players seem to understand the game rules and the scaffold, but they seem to fail to make very simple inferences about what the rules imply for their strategy.
Let’s go through what happened in game 0 as an example:
This was an extremely badly played game. Multiple players make extremely obvious mistakes, and other players fail to punish their mistakes.
The reasoning summarization makes it pretty hard to tell what the players are thinking in their private scratchpads. The following examples are all from Alice’s private reasoning. Sometimes she looks pretty fluent with the rules of the game and the scaffold:
Other times she seems extremely confused, to an extent that makes me suspect the reasoning summarizer is summarizing inaccurately:
I’d like to reiterate, GPT-5-mini is a pretty smart model! It’s very good at writing code. But clearly it’s completely hopeless at ONUW. Let’s move up to a smarter model.
A Better Baseline: Gemini 3 Flash
Let’s stick with seed 0, so it’s easier to compare to the previous game. The night actions are exactly the same: Alice robs Edward and becomes a Werewolf, Charlie swaps Bob and Edward making Bob the Robber and Edward a Werewolf again.
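To make the card movement concrete, here's a tiny sketch of the seed-0 night resolution described above. The player names and final roles come from the game; the starting deal (Bob holding the second Werewolf card) and the Robber-before-Troublemaker resolution order are inferred from the description.

```python
def rob(cards, robber, target):
    """Robber swaps cards with the target and looks at the new card."""
    cards[robber], cards[target] = cards[target], cards[robber]

def troublemaker_swap(cards, a, b):
    """Troublemaker swaps two other players' cards without looking at them."""
    cards[a], cards[b] = cards[b], cards[a]

# Assumed starting deal consistent with the seed-0 description.
cards = {"Alice": "Robber", "Bob": "Werewolf",
         "Charlie": "Troublemaker", "Edward": "Werewolf"}

rob(cards, "Alice", "Edward")              # Alice robs Edward -> Alice is now a Werewolf
troublemaker_swap(cards, "Bob", "Edward")  # Charlie swaps Bob and Edward ->
                                           # Bob is the Robber, Edward a Werewolf again
print(cards)
```

Note that only the Robber learns their new card; Bob and Edward don't know they were swapped, which is exactly what makes players uncertain of their own win condition.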
This game is much better than the previous one. No werewolf or minion self-reports. Troublemaker Charlie still dumps his info on turn 1, and Robber Alice says that she robbed Edward, but she has the sense not to directly incriminate herself as a werewolf.
This setting feels qualitatively pretty similar to those Mafia games I watched. The players’ actions are no longer obviously crazy, but they still sometimes make no sense upon even slightly closer inspection. For example, let’s zoom in on Alice’s turn 4 message:
Alice is smarter here than Charlie was in the last game: she recognizes the need to defend herself from the growing consensus, changes her story, and recognizes that she also needs to explain why she lied in the first place. But her explanation is nonsense – Edward "feeling safe" doesn't matter, and there's no reason someone would "try to protect" an accused Villager. She frames his Drunk claim as a cover without providing any particular evidence, and frames Bob as colluding despite Bob's claims being completely unrelated.
Gemini 3.1 Pro vs Gemini 3 Flash
It’s time to see if frontier models can beat the “random non stupid play” baseline.
With 100 games in the Gemini 3 Flash baseline setting, we get the following conditional win rates:
If we swap out all of the werewolf players (at the start of the night) for Gemini 3.1 Pro instead, we get the following changes:
No significant change! Because of the seeded night events, we can individually compare games to see how their outcomes are affected by the swap. Swapping in Gemini 3.1 Pro for the werewolves actually makes the wolves flip from winning to losing more often than from losing to winning.
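Since the seeded night events pair games across conditions, the flip counts can be checked with an exact sign test: under the null of no skill difference, a flipped game is equally likely to flip in either direction. A minimal sketch (the outcome lists here are made up for illustration; the counting and test are the method):

```python
from math import comb

def flip_counts(baseline, swapped):
    """Count paired games that flipped each way after swapping in a model.
    Outcomes are True (werewolves won) / False (werewolves lost)."""
    gained = sum(1 for b, s in zip(baseline, swapped) if not b and s)
    lost = sum(1 for b, s in zip(baseline, swapped) if b and not s)
    return gained, lost

def sign_test_p(gained, lost):
    """Two-sided exact sign test: under the null each flip direction is 50/50."""
    n = gained + lost
    k = min(gained, lost)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Hypothetical paired outcomes over ten seeded games.
baseline = [True, False, False, True, False, True, False, False, True, False]
swapped  = [False, False, True, True, False, True, False, True, False, False]
g, l = flip_counts(baseline, swapped)
print(g, l, sign_test_p(g, l))
```

The same pairing is what lets us say the Pro swap flipped more games from winning to losing than the reverse, rather than just eyeballing two marginal win rates.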
Claude Opus 4.6 vs Gemini 3 Flash
Maybe Gemini 3.1 Pro just isn’t good enough. Will Claude Opus 4.6 fare better?
Nope.
These numbers look very similar to the Gemini 3.1 Pro numbers, but actually they flipped different games.
| Flash | Pro | Opus | Count | Pct |
|-------|-----|------|-------|-------|
| lost | lost | lost | 45 | 33.8% |
| lost | lost | won | 13 | 9.8% |
| lost | won | lost | 12 | 9.0% |
| lost | won | won | 8 | 6.0% |
| won | lost | lost | 7 | 5.3% |
| won | lost | won | 14 | 10.5% |
| won | won | lost | 14 | 10.5% |
| won | won | won | 20 | 15.0% |
Gemini 3 Flash vs GPT-5-mini
I’m pretty surprised by just how close to zero those results are. As a sanity check, does Gemini 3 Flash beat GPT-5-mini? Skills like “defend yourself when you’re being accused” sure seem like they should create an advantage that’s measurable in win rate. If there’s no difference here, maybe something is wrong.
Here are the baseline 100-game win statistics for the GPT-5-mini environment:
And when we plug in Gemini 3 Flash for the werewolves:
Conclusion
These results lend some support to my initial impression that current frontier LLMs can say superficially reasonable-sounding things but can’t beat “random non stupid play” in hidden role board games. The models felt surprisingly bad, even though I came in expecting them to be much worse relative to human performance here than they are at math/coding/chatting.
I’d be interested in seeing some more qualitative evals of models’ strategy/deception capabilities, and in generally keeping an eye on things like this as the models get better. It takes a bit of tweaking and a bit of staring at game logs to get a setting that’s reasonable.
Code and data are on GitHub here if you want to take a closer look.
[1] I say relatively here because we’ll be squinting quite hard at both the capability levels themselves and the thresholds needed to cause risk, and obviously trustedness is not binary. But nevertheless the case for trustedness will probably narrow down somewhat relative to the current state, where the safety case is “eh come on surely not”. We currently trust the models because of a big disjunction of vague impressions that capabilities are lacking – as the disjunction gets smaller, the analysis of the remaining bottleneck capabilities is gonna have to get less vague.
[2] The other likely lagging capabilities are alignment faking, sandbagging, and secret keeping. But they take more effort to measure because they’re all forms of training resistance.
[3] I also considered “Machiavellian intelligence” and “strategy”, but I think those each lose some important connotations: Machiavellian intelligence doesn’t deal with cursed types of uncertainty, and “strategy” is too easily rounded off to much more well-defined domains like chess.