One thing that could explain the apparent lack of safety blog-posts from Anthropic is that a lot of their blog-posts appear only on https://alignment.anthropic.com/ and not on https://www.anthropic.com/research. So it seems like your scraper (which I think only goes through the latter) undercounts Anthropic's safety blog-posts.
It would be useful if you could share the dataframe/csv you generated which had the blog-posts and Gemini's classification!
Here are other reasons why I think adopting the policy of "by default all meetings (in-person and virtual) are recorded" is worth it:
Since my org is small and high-trust, I don't see any good reasons to not do this. That's why I'm going to push for it at my org.
It's worth the cost of being less in-sync with the whole organization to enable the fluidity and bandwidth that in-person spoken communication enables.
I think small orgs should record most in-person spoken communication. These recordings can easily be turned into transcripts using AIs and be stored centrally.
This would preserve the benefits of fluidity and bandwidth but also reduce the cost of being less in-sync.
It would be cool if LessWrong had a feature that automatically tracked when predictions are made.
Every time someone wrote a post, quick take, or comment, an LLM could scan the content for any predictions. It could add predictions which had identifiable resolution criteria/dates to a database (or maybe even add them to a prediction market). It would then be cool to see how calibrated people are.
We could also do this retrospectively by going through every post ever written and asking an LLM to extract predictions (maybe we could even do this right now; I think it would cost on the order of $100).
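A minimal sketch of what the extraction and storage step could look like, assuming a hypothetical `call_llm` helper that wraps whatever LLM API is used (the prompt, JSON schema, and SQLite table are all illustrative, not an existing LessWrong feature):

```python
import json
import sqlite3

# Hypothetical wrapper around whatever LLM API is used; not a real library.
from my_llm_client import call_llm  # assumption: takes a prompt, returns the model's text

EXTRACTION_PROMPT = """Extract every forecast-like claim from the text below.
Return a JSON list of objects with keys:
  "prediction"          - the claim, quoted or lightly paraphrased
  "resolution_criteria" - how one could tell if it came true, or null
  "resolution_date"     - ISO date by which it should resolve, or null
Only include items with an identifiable resolution criterion or date.

TEXT:
{text}
"""

def extract_predictions(post_text: str) -> list[dict]:
    """Ask the LLM for structured predictions contained in a post/comment."""
    raw = call_llm(EXTRACTION_PROMPT.format(text=post_text))
    return json.loads(raw)

def store_predictions(db_path: str, post_id: str, predictions: list[dict]) -> None:
    """Append extracted predictions to a small local database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS predictions "
        "(post_id TEXT, prediction TEXT, resolution_criteria TEXT, resolution_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO predictions VALUES (?, ?, ?, ?)",
        [
            (post_id, p["prediction"], p.get("resolution_criteria"), p.get("resolution_date"))
            for p in predictions
        ],
    )
    conn.commit()
    conn.close()
```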
> We naïvely expected to see a linear trend between the two extremes, but we observed a curve which was worse than linear, with usefulness dropping faster than safety increases at low safety values. We are curious to understand this phenomenon better.
I think you can solve this analytically.
The protocol seems to be:
1. With probability p use the Untrusted model to generate code (otherwise use the Trusted model)
2. If the untrusted model generated the code, audit it with some probability (see below for what I think this probability should be)
Then vary p between [auditing budget, 1] to get different (safety,usefulness) points.
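For concreteness, here's a minimal sketch of one step of that protocol (the model and audit functions are caller-supplied placeholders I'm making up for illustration; `audit_prob` is worked out in the next section):

```python
import random

def run_step(task, p, audit_prob, untrusted_model, trusted_model, audit):
    """One step of the trusted/untrusted sampling protocol.

    With probability p, use the untrusted model and audit its output with
    probability audit_prob; otherwise just use the trusted model.
    untrusted_model, trusted_model and audit are caller-supplied functions.
    """
    if random.random() < p:
        code = untrusted_model(task)
        if random.random() < audit_prob:
            audit(code)  # e.g. human review; catches a backdoor if one is present
        return code
    return trusted_model(task)
```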
Optimal audit probability:
The blue team should basically audit as much as they can given their auditing budget. Since they can focus all of their auditing budget on the untrusted model's outputs, the audit probability is auditing_budget/p.
So for example if the auditing budget is 0.5% and p=0.5 (so you are using the untrusted model half the time), you can afford to audit 1% of the untrusted model's outputs.
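In code, the audit probability on untrusted outputs is just the budget divided by p, clamped at 1 (which is why p only needs to range over [auditing budget, 1]):

```python
def audit_probability(auditing_budget: float, p: float) -> float:
    """Audit rate on untrusted-model outputs so that overall audits match the budget."""
    return min(1.0, auditing_budget / p)

# The example above: 0.5% budget, untrusted model used half the time -> audit 1%.
assert abs(audit_probability(0.005, 0.5) - 0.01) < 1e-12
```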
Optimal red team protocol:
I think the optimal red team protocol is to submit a blatant, obvious backdoor on every untrusted task. Always attacking minimises the chance you never submit a backdoor during the deployment, and there's no penalty for a high attack rate because all that matters is whether the first backdoor is audited or not. Blatant backdoors are better because there is no monitor to flag them, and blatancy maximises the chance you can generate a working exploit for the backdoor.
(this is technically not optimal because the red team should try and predict whether their code will actually succeed on the main task, and only submit a backdoor if they are confident of this, but we can ignore this for simplicity)
Safety calculation:
I think https://blog.redwoodresearch.org/p/wincontinuelose-scenarios-and-executereplaceaudi applies here.
Safety is the probability that the first of the following two events happens rather than the second, on a step where the red team attacks:
1. The backdoor is audited (the blue team wins).
2. The backdoor is not audited and succeeds (the red team wins).
This would simplify to
safety = (h/y) / (h/y + (1 - h/y)c)
where h is the auditing budget, y = p is the probability of using the untrusted model, and c is the probability that a submitted backdoor succeeds on the main task. h/y is the chance that a backdoor is audited and (1 - h/y)c is the chance that a backdoor is not audited and also successful.
Note that when c=1 (the red team can solve all the tasks in deployment) the curve is simply safety = auditing budget / usefulness (assuming usefulness is normalised so that it equals p, i.e. the trusted model scores 0 and the untrusted model scores 1).
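Putting this together, here's a short sketch of the analytic curve (again assuming usefulness is normalised to equal p, and treating c and the auditing budget as fixed constants; the 2% budget is just an illustrative number):

```python
import numpy as np

def safety(p: float, auditing_budget: float = 0.02, c: float = 1.0) -> float:
    """Analytic safety against the always-attack red team.

    h/y = auditing_budget / p is the chance an attack is audited (blue wins);
    (1 - h/y) * c is the chance it is missed and succeeds (red wins).
    """
    audit_prob = min(1.0, auditing_budget / p)
    win = audit_prob
    lose = (1.0 - audit_prob) * c
    return win / (win + lose)

# Sweep p (equal to usefulness under this normalisation) to trace the curve.
for usefulness in np.linspace(0.02, 1.0, 8):
    print(f"usefulness={usefulness:.2f}  safety={safety(usefulness):.3f}")
```

With c=1 this reduces to safety = auditing budget / usefulness, which lies below the straight line between the two endpoints, i.e. a sub-linear trade-off.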
What you can't solve analytically here:
Potentially a good red team can make c go up by trying to only attack when they are very confident that the code will succeed on the main task (and hence be a successful backdoor). Maybe the red team has to be careful about not being too selective here especially with low values of p because they risk never submitting a backdoor at all.
I think the analytical solution gives a good lower bound on how well the red team can do, and also explains why you would expect the safety-usefulness curve here to be sub-linear.
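And a quick Monte Carlo sanity check of the always-attack analysis (my own simplified simulation of the setup above: red attacks on every untrusted step, blue audits untrusted outputs at budget/p, and the deployment has a finite number of steps):

```python
import random

def simulate_safety(p, auditing_budget, c, n_steps=2000, n_games=20000):
    """Estimate safety by simulating deployments against an always-attack red team."""
    audit_prob = min(1.0, auditing_budget / p)
    blue_wins = 0
    for _ in range(n_games):
        outcome = "blue"  # reaching the end with no successful backdoor is a blue win
        for _ in range(n_steps):
            if random.random() >= p:
                continue  # trusted model used on this task; nothing can go wrong
            if random.random() < audit_prob:
                break  # the backdoor is audited: blue catches it and wins
            if random.random() < c:
                outcome = "red"  # unaudited backdoor succeeds on the main task
                break
        blue_wins += outcome == "blue"
    return blue_wins / n_games

analytic = (0.02 / 0.5) / (0.02 / 0.5 + (1 - 0.02 / 0.5) * 0.8)
print("simulated:", simulate_safety(p=0.5, auditing_budget=0.02, c=0.8))
print("analytic: ", analytic)
```

The finite horizon only pushes the simulated number slightly above the analytic value, since a deployment that ends with no successful backdoor also counts as a blue win.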
Claude 4.5 generated a list of blog-posts that only appeared in the former link: