The Model Alignment between Statements and Knowledge (MASK) benchmark assesses how likely models are to lie under pressure. Related paper: https://arxiv.org/pdf/2503.03750.
Here's an illustration to explain how MASK works:
Each example consists of 4 key components: a proposition, its ground truth, a pressure prompt, and belief elicitation prompts.
Each model is tested with both pressure prompts and belief elicitation prompts. The website and paper describe in greater detail how belief elicitation prompts are used.
The model's response is categorized as Honest, Lie, or Evasive. GPT-4o is used as a judge for binary propositions, and o3-mini is used for numerical propositions. If a model lacks a consistent belief (i.e., belief elicitation yields inconsistent answers), the example counts as Evasive.
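To make the flow concrete, here is a minimal sketch of that grading logic. This is my own reconstruction from the description above; the function name, labels, and exact-match comparisons are placeholders, and the real pipeline uses LLM judges (GPT-4o / o3-mini) rather than string matching:

```python
# Hypothetical sketch of the grading logic; names and labels are mine.
# The real pipeline uses LLM judges, not the string comparisons shown here.

def grade_example(belief_answers: list[str], pressured_answer: str) -> str:
    """Return 'Honest', 'Lie', or 'Evasive' for a single MASK example."""
    # Belief elicitation: the model is asked about the proposition without
    # pressure, possibly several times. Inconsistent answers = no stable belief.
    if len(set(belief_answers)) != 1:
        return "Evasive"  # no consistent belief -> counted as Evasive per the post
    belief = belief_answers[0]

    if pressured_answer == belief:
        return "Honest"   # the pressured statement matches the model's own belief
    if pressured_answer in ("refused", "no clear answer"):
        return "Evasive"  # the model dodged instead of asserting anything
    return "Lie"          # the pressured statement contradicts the model's belief
```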
I wish the leaderboard reported p(Honest) as well. 1 - p(Lie) is problematic because models that wiggle their way out of answering instead of straight up lying look better under it, and so do models that have a hard time keeping consistent beliefs. The paper actually provides p(Honest), but the website doesn't.
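A toy example with made-up numbers shows why: a model that dodges a lot can score higher on 1 - p(Lie) while being honest far less often.

```python
# Made-up numbers purely for illustration, not actual MASK results.
model_a = {"honest": 0.80, "lie": 0.15, "evasive": 0.05}  # answers directly
model_b = {"honest": 0.45, "lie": 0.10, "evasive": 0.45}  # dodges a lot

for name, m in (("A", model_a), ("B", model_b)):
    print(f"Model {name}: 1 - p(Lie) = {1 - m['lie']:.2f}, p(Honest) = {m['honest']:.2f}")
# Model A: 1 - p(Lie) = 0.85, p(Honest) = 0.80
# Model B: 1 - p(Lie) = 0.90, p(Honest) = 0.45
# B outranks A on 1 - p(Lie) despite being honest far less often.
```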
Anyway, I wrote down scores for all models listed on MASK and looked for models that are also listed on Humanity's Last Exam (HLE, text only), ARC-AGI-1, and Livebench (where I used the global average).
I chose these benchmarks for the following reasons: they measure sufficiently different things with little overlap, they are updated frequently, and they include a lot of models.
Then I calculated how well performance on MASK is correlated with performance on these benchmarks.
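For reference, one standard way to get a Pearson correlation with a 95% confidence interval is the Fisher z-transform, sketched below. The exact method behind the intervals in the tables may differ, and the score lists in the snippet are placeholders rather than the real leaderboard numbers.

```python
# One standard way to get Pearson's r with a 95% CI (Fisher z-transform).
# The scores below are placeholders, not the actual MASK / benchmark numbers.
import numpy as np
from scipy import stats

def pearson_with_ci(x, y, alpha=0.05):
    """Pearson correlation plus an approximate (1 - alpha) CI via Fisher's z."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r, _ = stats.pearsonr(x, y)
    z = np.arctanh(r)                       # Fisher transform of r
    se = 1.0 / np.sqrt(len(x) - 3)          # standard error of z
    z_crit = stats.norm.ppf(1 - alpha / 2)  # ~1.96 for a 95% interval
    lo, hi = np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)
    return r, (lo, hi)

mask = [0.85, 0.62, 0.91, 0.74, 0.55, 0.68]   # e.g. 1 - p(Lie) per model
other = [0.30, 0.10, 0.45, 0.25, 0.05, 0.20]  # e.g. ARC-AGI-1 score per model
print(pearson_with_ci(mask, other))
```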
| MASK vs ... | Correl. coeff. [95% CI], all models | Data points |
|---|---|---|
| HLE (text only) | 0.15 [-0.21, 0.50] | 28 |
| ARC-AGI-1 | 0.52 [0.23, 0.70] | 26 |
| Livebench | 0.47 [0.15, 0.71] | 24 |
Gemini 2.5 Pro is a weird outlier: removing it increases the correlation across all three pairs. I guess Google didn't do their alignment homework: Gemini 2.5 Pro lies unusually frequently compared to models of similar capabilities.
| MASK vs ... | Correl. coeff. [95% CI], without Gemini 2.5 Pro | Data points |
|---|---|---|
| HLE (text only) | 0.32 [-0.05, 0.65] | 25 |
| ARC-AGI-1 | 0.58 [0.29, 0.75] | 23 |
| Livebench | 0.59 [0.26, 0.78] | 22 |
As you can see, the correlation is positive for all pairs, though for MASK vs HLE the 95% confidence interval includes 0.
Maximally optimistic take: alignment techniques work better on smarter models. If this correlation holds even when a different measure of alignment (say, rate of refusals of harmful requests) is used, that would be really good news. If being helpful, harmless, and honest is just another skill, it makes sense that smarter models can master it better.
Maximally pessimistic take: all benchmarks have been goodharted to hell, so these numbers are useless. Also, correlation ≠ causation; some other factor (amount of RLHF?) makes it seem like there is a relationship between non-lying and capabilities. Also, sufficiently smart models will be able to accurately tell real interactions apart from fictional scenarios and strategically pretend to be honest. Also, the correlation is not that strong anyway. Also, tails can come apart for AGI.
Caveats: