I'm of a like mind. I did not know what "hyperstition" meant until recently. While there is a chance I was uniquely uninformed, the fact that I had to consult LessWrong to familiarize myself with the term motivated my collaborators and me to intentionally avoid using it in our Alignment Pretraining paper. It sounds cool, but we thought it would make our results harder to communicate.
Anthropic has done some interesting semi-public research on data filtering (Chen et al., 2025). That report gave quite a positive impression of data filtering, so I'm curious what changed in their latest results.
I helped write a paper on pretraining data filtering last year (speaking for myself, not EleutherAI or UK AISI). We focused on misuse in an open-weight setting for 7B models we pretrained from scratch, finding quite positive results. I'm very excited to see more discourse on filtering. :)
Anecdotes from folks at frontier companies can be quite useful, and I'm glad Jerry shared their thoughts, but I think the community would benefit from more evidence here before updating for or against filtering. Not to be a Reviewer #2, but there are some pretty important questions that are left unanswered. These include:
All that said, I do expect that naive pretraining data filtering will struggle with precise interventions. However, our public understanding of data filtering and pretraining dynamics is so nascent that it is hard to tell just how precise we can be. Using a simple multi-stage filtering pipeline, we were able to drop the WMDP benchmark to near-random performance without notably degrading MMLU biology for our 7B model. While MCQA benchmarks like WMDP and MMLU have numerous limitations, these results suggest that we can still have models with a college-level understanding of these subjects while reducing dangerous capabilities.
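For concreteness, a pipeline in that spirit can be as simple as a cheap keyword pre-filter followed by an ML classifier over the flagged documents. The sketch below is a minimal illustration with placeholder blocklist terms and an assumed classifier callable, not the exact components from our paper:

```python
from typing import Callable, Iterable, Iterator

# Hypothetical stage 1: a cheap blocklist applied to every document.
BLOCKLIST = {"placeholder dual-use term a", "placeholder dual-use term b"}

def keyword_flag(doc: str) -> bool:
    """Return True if the document contains any blocklisted term."""
    lowered = doc.lower()
    return any(term in lowered for term in BLOCKLIST)

def filter_corpus(
    docs: Iterable[str],
    classifier: Callable[[str], float],  # assumed: returns P(dangerous)
    threshold: float = 0.5,
) -> Iterator[str]:
    """Two-stage filter: keyword pre-filter, then a classifier on flagged docs."""
    for doc in docs:
        if not keyword_flag(doc):
            yield doc  # the vast majority of documents pass the cheap stage
        elif classifier(doc) < threshold:
            yield doc  # flagged by keywords but judged benign by the classifier
        # otherwise the document is dropped from the pretraining mix
```

The staging is what makes this tractable at pretraining scale: the expensive classifier only ever sees the small slice of the corpus that the keyword pass flags.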
One's threat model here is quite load-bearing. A dual-use capability like CBRN knowledge might primarily benefit a relatively small set of scientists, whereas the misuse risk comes largely from the general public. If so, one could sidestep much of the dual-use tradeoff by pretraining on filtered data and then branching the base models after midtraining:
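One plausible reading of that idea is sketched below, with illustrative stage names and data mixes rather than a real training recipe: the public open-weight release stays on the filtered trunk, while a gated branch for vetted scientists gets the sensitive material added back behind access controls.

```python
from dataclasses import dataclass

@dataclass
class TrainingStage:
    name: str
    data_mix: list[str]

# Shared trunk: both branches pretrain and midtrain on the filtered corpus.
shared_trunk = [
    TrainingStage("pretraining", ["web_filtered", "code", "books_filtered"]),
    TrainingStage("midtraining", ["high_quality_filtered", "instruction_data"]),
]

# Branch A: the open-weight release never sees the sensitive data.
open_weight_branch = shared_trunk + [
    TrainingStage("post_training", ["chat", "safety"]),
]

# Branch B: a gated model for vetted scientists, with the dual-use material
# reintroduced via continued training and served behind access controls.
gated_expert_branch = shared_trunk + [
    TrainingStage("continued_pretraining", ["dual_use_corpus"]),
    TrainingStage("post_training", ["chat", "safety"]),
]
```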
All that said, we can have reasonable debates about dual-use trade-offs for closed-source models like Claude only because they have post-training and deployment safeguards. Open-weight models have no such safeguards: they are deployed locally, and their safety training can easily be removed. There is little to prevent an attacker from extracting sensitive information if it is in the model's weights. In the absence of tamper-resistant post-training, we still need advances in capability-prevention measures, such as data filtering, to mitigate misuse risks from ever-improving open-weight models. See Casper et al. (2025) for more on this.
EleutherAI and Geodesic Research have wanted to study scaling laws for pretraining filtering, but have been bottlenecked by compute constraints. I think a well-executed paper on scaling laws would answer many of these questions. We're optimistic that we'll be able to study this in 2026, but perhaps not until H2. If Anthropic did this research internally, I'd be over the moon if they wrote the paper.
My experience with pretraining data filtering has felt like my subjective impression of CoT monitoring. It is a simple and intuitive intervention that many folks ruled out. It is far from perfect, and there are ways it may not scale. Yet, it might be an important part of our safety stack and remains understudied.
Really enjoying this — about two hours in. I've been thinking a lot about research management lately, so your points on small project teams (2-3 people with full ownership) especially resonated. It really does seem like a lot of research projects take way longer than necessary in hindsight. I think a dedicated post on this would be pretty impactful, or perhaps the main topic of a future episode!
Thanks for the interest! We plan to publish the "main release" of our paper on arXiv in the coming weeks. This release will include several new experiments and revisions based on the excellent community feedback we've received.
November 3rd!
Thank you for looking into this. My understanding is that one of the takeaways is that emergent misalignment can be undone in your setup by fine-tuning on positive AI discourse, as posited in Self-Fulfilling Misalignment. I'm concerned about a missing ablation here: it is unclear whether you would get the same effect by fine-tuning on general text unrelated to AI. It is plausible that this final fine-tuning stage acts as catastrophic forgetting of the EM training rather than teaching the LLM that AI is good and that it should therefore behave well. Unless I'm missing something, I'm quite skeptical of the results in the absence of more ablations.
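Concretely, the ablation I have in mind is something like the comparison below, with a fixed token budget for the final stage across conditions; the condition names and corpora are placeholders, not your actual setup:

```python
# Hypothetical ablation grid: every condition starts from the same
# emergently-misaligned (EM) checkpoint and fine-tunes on the same
# number of tokens; only the corpus differs.
conditions = {
    "none": None,                                     # EM checkpoint, no further tuning
    "positive_ai_discourse": "positive_ai_corpus",    # the intervention tested in the post
    "unrelated_general_text": "generic_text_corpus",  # control for catastrophic forgetting
}

# Interpretation: if unrelated general text restores aligned behavior about as
# well as positive AI discourse does, the recovery is better explained by
# forgetting the EM fine-tune than by the model internalizing "AI is good".
```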
Anthropic recently released a pretraining data filtering paper with similar results to Deep Ignorance. It is very exciting that both teams arrived at the same broad conclusion despite differences in methodology. It does become harder, though, to square these positive data filtering results with OpenAI's negative ones. We need more public and fully transparent research into pretraining interventions. I'm especially excited to study scaling laws for pretraining filtering.
https://alignment.anthropic.com/2025/pretraining-data-filtering/
I'm also interested in these questions! I'm particularly interested in exploring how effectively we can filter out offensive cyber knowledge without compromising non-security software engineering skills. Since so many of our proposed control protocols are cyber-based, limiting the cyber knowledge of our models seems like it would help with AI control. This is plausibly difficult, but I also thought the same about biology. Regardless, I think that it would be such a big win if we pulled it off that it's worth studying.
I'm actively studying the degree to which AI safety discourse, including discussions of control and evaluation protocols, is present in popular pretraining datasets. The hope is that we could then train models with this knowledge filtered out and see how their behavior changes. This work is still in its early stages, though. An ambitious version of this would be to run empirical experiments studying Self-Fulfilling Misalignment.
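As a rough illustration, the first pass looks something like the keyword scan below over a sample of the corpus; the phrase list is a stand-in for a much broader taxonomy, and the toy documents are made up:

```python
import re
from collections import Counter

# Hypothetical phrase list; a real taxonomy would be far broader.
AI_SAFETY_PHRASES = [
    r"\bAI control\b",
    r"\balignment evaluation\b",
    r"\bchain[- ]of[- ]thought monitoring\b",
    r"\breward hacking\b",
]
PATTERNS = [re.compile(p, re.IGNORECASE) for p in AI_SAFETY_PHRASES]

def count_safety_mentions(docs):
    """Count how many documents in a corpus sample mention each phrase."""
    hits = Counter()
    for doc in docs:
        for phrase, pattern in zip(AI_SAFETY_PHRASES, PATTERNS):
            if pattern.search(doc):
                hits[phrase] += 1
    return hits

# Toy example; in practice you would stream a sample of a pretraining dataset.
sample = ["We evaluate an AI control protocol.", "A recipe for banana bread."]
print(count_safety_mentions(sample))
```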
I think my main point is that there is a lot of low-hanging fruit here! It's plausible that there are more conceptually simple interventions we can make, like Deep Ignorance, that in aggregate buy us some safety. I think it's worth it for the community to put a lot more effort into pretraining research.
In episode 1, you and Ryan discussed how you both came close to disbanding Redwood after the initial AI Control paper. I think folks would benefit from hearing more of your thoughts on why you decided to remain an external research organization, especially since my understanding is that you want to influence the practices of the frontier labs. This is a consideration that many folks should grapple with in their own research efforts.