Two weeks ago, xAI finally published its Risk Management Framework and first model card. Unfortunately, the RMF effects very little risk reduction and suggests that xAI isn't thinking seriously about catastrophic risks. (The model card and strategy for preventing misuse are disappointing but much less important because they're mostly just relevant to a fraction of misuse risks.)

On misalignment, "Our risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK. We plan to add additional thresholds tied to other benchmarks." MASK has almost nothing to do with catastrophic misalignment risk, and upfront benchmarking is not a good approach to misalignment risk. On security, "xAI has implemented appropriate information security standards sufficient to prevent its critical model information from being stolen by a motivated non-state actor." This is not credible, xAI doesn't justify it, and xAI doesn't mention future security plans.

All this—along with xAI's lack of capacity to do safety work and lack of any good signs on safety—makes me bearish on security and AI takeover risk if xAI is among the first to develop critical AI systems.

Misalignment

"Our risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK. We plan to add additional thresholds tied to other benchmarks."

This is very silly. There are several huge problems here. Most importantly, benchmarks like this don't address the biggest category of misalignment risk: the model is deceptively aligned, sometimes pursuing its own secret goals, but generally acting honest and aligned so that it will be trusted and deployed. By default models may strategically fake alignment to preserve their goals or just notice that they're likely being tested and choose to act aligned. Benchmarks like this can't distinguish models being aligned from faking it. And MASK is about models straightforwardly prioritizing helpfulness over honesty — it measures models' propensities to lie due to requests (or system prompts) instructing the model to support a specific conclusion;^[1] this doesn't seem closely related to models' propensities to pursue their own goals. Additionally, even if MASK measured something relevant, a dishonesty threshold of 50% would be far too high. (And it's even higher than it s...

Nick Baldwin
