Claude 4.5 generated a list of blog posts that appear only on the alignment blog (https://alignment.anthropic.com/) and not on the main research page. Examples include:
- Activation Oracles
- Towards training-time mitigations for alignment faking in RL
- Open Source Replication of the Auditing Game Model Organism
- Anthropic Fellows Program announcements
- Evaluating honesty and lie detection techniques
- Strengthening Red Teams
- Stress-testing model specs
- Inoculation Prompting
- Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise
- Subliminal Learning
- Do reasoning models use their scratchpad like we do?
- Publicly Releasing CoT Faithfulness Evaluations
- Putting up Bumpers
- How to Replicate and Extend our Alignment Faking Demo
- Three Sketches of ASL-4 Safety Case Components
- And many others
One thing that could explain the apparent lack of safety blog posts from Anthropic is that a lot of their posts appear only on https://alignment.anthropic.com/ and not on https://www.anthropic.com/research. So it seems like your scraper (which I think only goes through the latter) undercounts Anthropic's safety blog posts.
It would be useful if you could share the dataframe/CSV you generated with the blog posts and Gemini's classifications!
Here are other reasons why I think adopting the policy of "by default, all meetings (in-person and virtual) are recorded" is a good idea:
Since my org is small and high-trust, I don't see any good reason not to do this. That's why I'm going to push for it there.
It's worth the cost of being less in-sync with the whole organization to get the fluidity and bandwidth that in-person spoken communication enables.
I think small orgs should record most in-person spoken communication. These recordings can easily be turned into transcripts using AI and stored centrally.
This would preserve the benefits of fluidity and bandwidth while reducing the cost of being less in-sync.
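A minimal sketch of what the transcription step could look like, using the open-source whisper package (the directory layout and audio format are placeholders, not a recommendation):

```python
# Sketch: turn meeting recordings into transcripts and drop them in a shared folder.
# Assumes the open-source `whisper` package (pip install openai-whisper) plus ffmpeg.
# The directory names and audio format are placeholders.
from pathlib import Path
import whisper

RECORDINGS_DIR = Path("recordings")    # where raw meeting audio gets dumped
TRANSCRIPTS_DIR = Path("transcripts")  # the central, searchable store
TRANSCRIPTS_DIR.mkdir(exist_ok=True)

model = whisper.load_model("base")     # larger models trade speed for accuracy

for audio_path in RECORDINGS_DIR.glob("*.m4a"):
    out_path = TRANSCRIPTS_DIR / f"{audio_path.stem}.txt"
    if out_path.exists():
        continue  # skip recordings that are already transcribed
    result = model.transcribe(str(audio_path))
    out_path.write_text(result["text"])
    print(f"Transcribed {audio_path.name} -> {out_path.name}")
```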
It would be cool if LessWrong had a feature that automatically tracked when predictions are made.
Every time someone wrote a post, quick take, or comment, an LLM could scan the content for predictions. It could add predictions that have an identifiable resolution criterion/date to a database (or maybe even add them to a prediction market). It would then be cool to see how calibrated people are.
We could also do this retrospectively by going through every post ever written and asking an LLM to extract predictions (maybe we could even do this right now; I think it would cost on the order of $100).
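Something like this could work for the extraction step (a sketch assuming the `anthropic` Python client; the model name, prompt, and output schema are placeholders rather than a tested pipeline):

```python
# Sketch: ask an LLM to pull checkable predictions out of a post/comment.
# Assumes the `anthropic` client and an API key in the environment; the prompt,
# model name, and output schema are placeholders, not a tested pipeline.
import json
import anthropic

client = anthropic.Anthropic()

PROMPT = """Extract every prediction from the text below that has an identifiable
resolution criterion and/or resolution date. Return a JSON list of objects with keys:
"claim", "resolution_criterion", "resolution_date" (ISO 8601 or null), and
"stated_probability" (0-1 or null). Return [] if there are none.

TEXT:
{text}"""

def extract_predictions(post_text: str) -> list[dict]:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=2000,
        messages=[{"role": "user", "content": PROMPT.format(text=post_text)}],
    )
    return json.loads(response.content[0].text)

# For the live feature, run this on each new post/quick take/comment and append the
# results to a database; for the retrospective version, loop it over an archive dump.
```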
> We naïvely expected to see a linear trend between the two extremes, but we observed a curve which was worse than linear, with usefulness dropping faster than safety increases at low safety values. We are curious to understand this phenomenon better.
I think you can solve this analytically.
The protocol seems to be:
1. With probability p use the Untrusted model to generate code (otherwise use the Trusted model)
2. If the untrusted model generates the code, with some probability (see below for what I think this probability should be) audit the code
Then vary p over [auditing budget, 1] to get different (safety, usefulness) points.
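For concreteness, here is a toy analytic version of that curve. These are my assumptions, not the post's exact setup: the red team attacks on every untrusted-routed task, a single uncaught backdoor loses the game, usefulness is normalized so trusted-only = 0 and untrusted-only = 1, and the blue team audits untrusted outputs with probability audit_budget / p.

```python
# Toy analytic model of the (safety, usefulness) trade-off under the assumptions
# stated above; not the original post's simulation.
import numpy as np

audit_budget = 0.02  # fraction of tasks the blue team can afford to audit

for p in np.linspace(audit_budget, 1.0, 11):
    audit_prob = min(1.0, audit_budget / p)  # audit as often as the budget allows
    safety = audit_prob      # P(the first attack gets audited and caught)
    usefulness = p           # fraction of tasks handled by the untrusted model
    print(f"p={p:.2f}  safety={safety:.2f}  usefulness={usefulness:.2f}")

# Under these assumptions safety = audit_budget / usefulness, so usefulness falls off
# hyperbolically as safety rises: qualitatively the worse-than-linear curve observed.
```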
Optimal audit probability:
The blue team should basically audit as much as they can...
People may find this post interesting as well: Gemini 3 is evaluation paranoid and contaminated