AI strategy & governance. ailabwatch.org. ailabwatch.substack.com.
Briefly:
Edit 2: after checking, I now believe the data strongly suggest FTX had a large negative effect on EA community metrics. (I still agree with Buck: "I don't like the fact that this essay is a mix of an insightful generic argument and a contentious specific empirical claim that I don't think you support strongly; it feels like the rhetorical strength of the former lends credence to the latter in a way that isn't very truth-tracking." And I disagree with habryka's claims that the effect of FTX is obvious.)
"practically all metrics of the EA community's health and growth have sharply declined, and the extremely large and negative reputational effects have become clear."
I want more evidence on your claim that FTX had a major effect on EA reputation. Or: why do you believe it?
Edit: relevant thing habryka said that I didn't quote above:
For the EA surveys, these indicators looked very bleak:
"Results demonstrated that FTX had decreased satisfaction by 0.5-1 points on a 10-point scale within the EA community"
"Among those aware of EA, attitudes remain positive and actually maybe increased post-FTX —though they were lower (d = -1.5, with large uncertainty) among those who were additionally aware of FTX."
"Most respondents reported continuing to trust EA organizations, though over 30% said they had substantially lost trust in EA public figures or leadership."
Another possible best practice for evals: use human+AI rather than [edit: just] AI alone. Many threat models involve human+AI, and sometimes human+AI is substantially stronger than either human alone or AI alone.
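A minimal sketch of what that comparison could look like, assuming a generic binary task-success metric; the condition names and functions below are illustrative, not taken from any particular eval suite:

```python
# Illustrative three-arm uplift comparison: the point is to measure human+AI
# as its own condition rather than inferring it from AI-alone results.
# All names and the binary success metric here are hypothetical.
from statistics import mean

def success_rate(outcomes: list[bool]) -> float:
    """Fraction of trials in which the task was completed."""
    return mean(outcomes) if outcomes else 0.0

def uplift_report(human_alone: list[bool],
                  ai_alone: list[bool],
                  human_plus_ai: list[bool]) -> dict[str, float]:
    """Threat-model-relevant uplift is human+AI vs. human alone,
    which an AI-alone eval can understate."""
    rates = {
        "human_alone": success_rate(human_alone),
        "ai_alone": success_rate(ai_alone),
        "human_plus_ai": success_rate(human_plus_ai),
    }
    rates["uplift_over_human_alone"] = rates["human_plus_ai"] - rates["human_alone"]
    rates["uplift_over_ai_alone"] = rates["human_plus_ai"] - rates["ai_alone"]
    return rates
```

If the human+AI arm substantially beats both single-agent arms, an AI-alone eval would understate the uplift that the threat model actually cares about.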
Claim by SemiAnalysis guy: DeepSeek has over 50,000 H100s. (And presumably they don't spend much compute on inference.)
I want to distinguish two versions of the intervention (in multi-step settings): (1) hide CoT on just the suspicious step vs (2) hide all prior CoT (and hide all CoT partway through some fraction, say 5%, of non-suspicious trajectories so the model can't infer from the CoT being hidden that the trajectory was flagged, or do a control eval with results demonstrating that this is unnecessary). (Rough sketch of both variants below.)
I'm over 75% that a lab would do (1), conditional on doing a "replace suspicious outputs" thing (but under 75% that they'd say they're doing so, much less demonstrate it to a skeptical audience; they might just not publish details, especially if this protocol is only used in the internal deployment context). I'm under 75% that a lab would do (2).
It would be crazy to show the untrusted model the old CoT when resampling.
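Here's a rough sketch of how (1) and (2) could differ in a resample-on-suspicion loop. Everything in it (Step, untrusted_step, monitor_suspicion, the thresholds) is a made-up stand-in, not any lab's actual protocol or a real API:

```python
# Sketch of a "replace suspicious outputs" loop with the two CoT-hiding variants.
import random
from dataclasses import dataclass

SUSPICION_THRESHOLD = 0.8   # illustrative
DECOY_HIDE_RATE = 0.05      # variant (2): also hide CoT partway through ~5% of benign trajectories

@dataclass
class Step:
    action: str
    cot: str | None  # None = this CoT is hidden from the untrusted model's later context

def untrusted_step(task: str, history: list[Step]) -> Step:
    """Stub for a call to the untrusted model; `history` is all it gets to see."""
    return Step(action=f"step {len(history) + 1} toward: {task}", cot="(reasoning)")

def monitor_suspicion(step: Step) -> float:
    """Stub for the trusted monitor's suspicion score in [0, 1]."""
    return random.random()

def run_trajectory(task: str, n_steps: int, variant: int) -> list[Step]:
    history: list[Step] = []
    # Variant (2) only: occasionally hide all CoT partway through a non-suspicious
    # trajectory, so hidden CoT doesn't by itself reveal that something was flagged.
    decoy_at = (random.randrange(n_steps)
                if variant == 2 and n_steps > 0 and random.random() < DECOY_HIDE_RATE
                else None)
    for i in range(n_steps):
        if i == decoy_at:
            history = [Step(s.action, None) for s in history]
        step = untrusted_step(task, history)
        if monitor_suspicion(step) > SUSPICION_THRESHOLD:
            if variant == 2:
                history = [Step(s.action, None) for s in history]  # (2) hide all prior CoT
            # Resample; the untrusted model never sees the flagged attempt or its CoT.
            resampled = untrusted_step(task, history)
            step = Step(resampled.action, None)  # (1) hide the suspicious step's CoT
        history.append(step)
    return history
```

The decoy hiding (or a control eval showing it's unnecessary) is what keeps hidden CoT from signaling to the model that the trajectory was flagged.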
What's a bad thing that the agent can do with the jailbreak that it can't just do in the unmonitored scaffold?
Edit: dumb question, oops: obviously the schemer will have affordances in deployment that it doesn't have within the unmonitored scaffold, i.e., there are actions where jailbreaking the monitor is helpful. E.g. submitting a pull request, where the code will be read by the monitor. So finding a jailbreak is helpful to the schemer.
This early control post introduced super important ideas: trusted monitoring plus the general point