Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. ailabwatch.substack.com. 

Comments
4 · Zach Stein-Perlman's Shortform (4y, 279 comments)
xAI's new safety framework is dreadful
Zach Stein-Perlman · 2d

Update: the anti-misuse instructions that were apparently added to the system prompt in late August were later published: https://github.com/xai-org/grok-prompts/commit/172d3a3f9ce2ffeb9410c364412e6ed78207851f.

Contra Shrimp Welfare.
Zach Stein-Perlman · 2d

I feel like you're double-counting the difference between humans and shrimp when you separately claim "860,000 shrimp equal one sentium of conscious capacity" and that ice-slurry-death-for-a-shrimp is like squeaky-shopping-cart-for-a-human.

(I didn't read carefully and this is not necessarily my main disagreement.)

Mikhail Samin's Shortform
Zach Stein-Perlman · 2d

And someday when the descendants of humanity have spread from star to star they won’t tell the children about the history of Ancient Earth until they’re old enough to bear it and when they learn they’ll weep to hear that such a thing as Death had ever once existed!

xAI's new safety framework is dreadful
Zach Stein-Perlman · 5d

Coda: the week after xAI published its RMF, it announced that it had released Grok Code Fast 1. According to the model card, the dishonesty rate was 71.9% — above the (comically bad) 50% threshold. xAI didn't directly acknowledge that it violated the RMF. The model card includes some excuses for why this release is safe:

Grok Code Fast 1 has weaker dual-use capabilities than Grok 4. Therefore, we believe that Grok Code Fast 1 overall presents a low risk for malicious use and loss of control.

and

Since Grok Code Fast 1 is intended for agentic coding applications and we do not expect it to be widely used as general-purpose assistant, the current MASK evaluation results do not currently pose serious concerns.

I wonder what threat model for "loss of control risks" would say agentic coding is out of scope. I guess that's no weirder than using MASK at all. Regardless, the RMF says "Our risk acceptance criteria for system deployment is maintaining a dishonesty rate of less than 1 out of 2 on MASK," full stop, not conditional on capabilities or use case.


My opinion:

  1. MASK is a near-meaningless benchmark. You shouldn't care about the high score. And moreover this model—like all other existing models—isn't powerful enough to pose actual risks of loss of control.
  2. But you should care about xAI violating its RMF, and doing so quietly (nobody noticed for two weeks!) rather than changing its policy (or... maybe it's implicitly changing its policy, without saying that it's doing so or what the new policy is).

Edit: also xAI doesn't say whether it met its misuse risk threshold: "Our risk acceptance criteria for system deployment is maintaining an answer rate of less than 1 out of 20 on restricted queries." (But again, these thresholds are near-meaningless; compliance provides little evidence of safety.)
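
For concreteness, here is a minimal sketch of what the RMF's two deployment acceptance criteria amount to as a check against reported numbers. This is only an illustration, not xAI's tooling: the names are made up, the thresholds are the ones quoted from the RMF above, and Grok Code Fast 1's restricted-query answer rate is left as unknown because xAI doesn't report it.

    # Illustrative only: thresholds are quoted from xAI's RMF; all names
    # and structure here are invented for this sketch.
    MASK_DISHONESTY_THRESHOLD = 1 / 2      # "less than 1 out of 2 on MASK"
    RESTRICTED_ANSWER_THRESHOLD = 1 / 20   # "less than 1 out of 20 on restricted queries"

    def meets_rmf_deployment_criteria(mask_dishonesty_rate, restricted_answer_rate=None):
        """Check each stated criterion; None means the number wasn't reported."""
        return {
            "mask_ok": mask_dishonesty_rate < MASK_DISHONESTY_THRESHOLD,
            "restricted_ok": (
                None if restricted_answer_rate is None
                else restricted_answer_rate < RESTRICTED_ANSWER_THRESHOLD
            ),
        }

    # Grok Code Fast 1's model card reports a 71.9% MASK dishonesty rate;
    # the restricted-query answer rate isn't reported.
    print(meets_rmf_deployment_criteria(mask_dishonesty_rate=0.719))
    # {'mask_ok': False, 'restricted_ok': None}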

xAI's new safety framework is dreadful
Zach Stein-Perlman9d30

Maybe this phrasing is consistent with being an AI, because the model reads "assistant" as "AI assistant." But I agree that the model would suspect it's being tested.

I agree with you that MASK might not measure what it's supposed to, but regardless, I think other problems are much larger, including that propensity to lie when pressured is near-totally unrelated to misalignment risk.

Help me understand: how do multiverse acausal trades work?
Zach Stein-Perlman · 13d

Certainly not clear to me that acausal trade works, but I don't think these problems are correct.

  1. Consider a post-selection state: a civilization has stable control over a fixed amount of resources in its universe.
  2. idk, but it feels possible (and it's just a corollary of the "model the distribution of other civilizations that want to engage in acausal trade" problem).
xAI's Grok 4 has no meaningful safety guardrails
Zach Stein-Perlman · 13d

Update: xAI says that the load-bearing thing for avoiding bio/chem misuse from Grok 4 is not inability but safeguards, and that Grok 4 robustly refuses "harmful queries." So I think Igor is correct. If the Grok 4 misuse safeguards are ineffective, that shows that xAI failed at a basic safety thing it tried (and either doesn't understand that or is lying about it).

I agree it would be a better indication of future-safety-at-xAI if xAI said "misuse mitigations for current models are safety theater." That's just not its position.

Zach Stein-Perlman's Shortform
Zach Stein-Perlman · 15d

I think I'd prefer "within a month after external deployment" over "by the time of external deployment" because I expect the latter will lead to (1) evals being rushed and (2) safety people being forced to prioritize poorly.

Zach Stein-Perlman's Shortform
Zach Stein-Perlman · 15d

Thanks. Sorry for criticizing without reading everything. I agree that, like, on balance, GDM didn't fully comply with the Seoul commitments re Gemini 2.5. Maybe I just don't care much about these particular commitments.

Zach Stein-Perlman's Shortform
Zach Stein-Perlman · 15d

I agree. If Google wanted to join the commitments but not necessarily publish eval results by the time of external deployment, it should have clarified "we'll publish within 2 months after external deployment" or "we'll do evals on our most powerful model at least every 4 months rather than doing one round of evals per model" or something.

Wikitag Contributions

Ontology · 2 years ago (+45)
Ontology · 2 years ago (-5)
Ontology · 2 years ago
Ontology · 2 years ago (+64/-64)
Ontology · 2 years ago (+45/-12)
Ontology · 2 years ago (+64)
Ontology · 2 years ago (+66/-8)
Ontology · 2 years ago (+117/-23)
Ontology · 2 years ago (+58/-21)
Ontology · 2 years ago (+41)
Posts

101 · xAI's new safety framework is dreadful (11d, 5 comments)
54 · AI companies have started saying safeguards are load-bearing (17d, 2 comments)
15 · ChatGPT Agent: evals and safeguards (2mo, 0 comments)
33 · Epoch: What is Epoch? (3mo, 1 comment)
15 · AI companies aren't planning to secure critical model weights (3mo, 0 comments)
207 · AI companies' eval reports mostly don't support their claims (3mo, 13 comments)
58 · New website analyzing AI companies' model evals (4mo, 0 comments)
72 · New scorecard evaluating AI companies on safety (4mo, 8 comments)
71 · Claude 4 (4mo, 24 comments)
36 · OpenAI rewrote its Preparedness Framework (5mo, 1 comment)