LESSWRONG

Kyle O’Brien

Comments

Sorted by Newest
Emergent Misalignment & Realignment
Kyle O’Brien · 8d

Thank you for looking into this. My understanding is that one takeaway is that emergent misalignment can be undone in your setup by fine-tuning on positive AI discourse, as posited in Self-Fulfilling Misalignment. I'm concerned about a missing ablation here: it is unclear whether you'd get the same effect by fine-tuning on general text unrelated to AI. It is plausible that this final fine-tuning stage acts as catastrophic forgetting of the EM training rather than teaching the LLM that AI is good and that it should therefore behave well. Unless I'm missing something, I'm quite skeptical of the results in the absence of more ablations.
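To make the ablation I have in mind concrete, here is a minimal sketch of the comparison; the fine-tuning and eval functions are stubs, and the checkpoint and corpus names are hypothetical.

```python
# Sketch of the proposed ablation, not the authors' setup. Both helpers are stubs.

def finetune(model_id: str, corpus: str) -> str:
    """Stub: fine-tune `model_id` on `corpus` and return a new checkpoint id."""
    return f"{model_id}+ft({corpus})"

def misalignment_score(model_id: str) -> float:
    """Stub: run the emergent-misalignment evals and return a scalar score."""
    return 0.0  # placeholder value

EM_MODEL = "model-after-em-training"  # hypothetical checkpoint name

conditions = {
    "no_realignment": EM_MODEL,
    "positive_ai_discourse": finetune(EM_MODEL, "positive_ai_discourse.jsonl"),
    "general_non_ai_text": finetune(EM_MODEL, "general_non_ai_text.jsonl"),  # the missing control
}

# If the general-text control recovers alignment about as well as positive AI
# discourse does, the mechanism looks like catastrophic forgetting of the EM
# training rather than the model learning that AI is good.
scores = {name: misalignment_score(model) for name, model in conditions.items()}
print(scores)
```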

Kyle O’Brien's Shortform
Kyle O’Brien · 10d

Anthropic recently released a pretraining data filtering paper with similar results to Deep Ignorance. It is very exciting that both teams arrived at the same broad conclusion despite differences in methodology. It also becomes more difficult to square these positive data-filtering results with OpenAI's negative results. We need more public and fully transparent research into pretraining interventions. I'm especially excited to study scaling laws for pretraining filtering.

https://alignment.anthropic.com/2025/pretraining-data-filtering/

Kyle O’Brien's Shortform
Kyle O’Brien · 20d

I'm also interested in these questions! In particular, I'd like to explore how effectively we can filter out offensive cyber knowledge without compromising non-security software engineering skills. Since so many of our proposed control protocols are cyber-based, limiting the cyber knowledge of our models seems like it would help with AI control. This is plausibly difficult, but I also thought the same about biology. Regardless, I think it would be such a big win if we pulled it off that it's worth studying.

I'm actively studying the degree to which AI safety discourse, including discussions of control and evaluation protocols, is present in popular pretraining datasets. The hope is that we could then train models with this knowledge filtered out and see how their behavior changes. This work is still in the early stages, though. An ambitious version of this would be to run empirical experiments studying Self-Fulfilling Misalignment.
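For a rough sense of the kind of measurement I have in mind, here is a minimal sketch that scans a streamed public corpus for a handful of safety-related phrases. The dataset, config, phrase list, and sample size are illustrative assumptions, not my actual setup.

```python
from datasets import load_dataset

# Illustrative indicator phrases for AI-safety discourse (not my real list).
SAFETY_PHRASES = [
    "ai alignment",
    "reward hacking",
    "deceptive alignment",
    "ai control protocol",
    "dangerous capability evaluation",
]

def estimate_safety_discourse(dataset="HuggingFaceFW/fineweb",
                              config="sample-10BT", n_docs=10_000):
    """Estimate the fraction of sampled documents mentioning any safety phrase."""
    stream = load_dataset(dataset, name=config, split="train", streaming=True)
    hits = 0
    for i, doc in enumerate(stream):
        if i >= n_docs:
            break
        if any(phrase in doc["text"].lower() for phrase in SAFETY_PHRASES):
            hits += 1
    return hits / n_docs

print(f"fraction of sampled documents with AI-safety discourse: "
      f"{estimate_safety_discourse():.4f}")
```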

I think my main point is that there is a lot of low-hanging fruit here! It's plausible that there are more conceptually simple interventions we can make, like Deep Ignorance, that in aggregate buy us some safety. I think it's worth it for the community to put a lot more effort into pretraining research. 

Kyle O’Brien's Shortform
Kyle O’Brien · 20d

A team of researchers at EleutherAI, UK AISI, and Oxford, along with me, released a paper on pretraining data filtering. The TLDR is that simple pretraining data filtering seems to be an effective technique for preventing models from acquiring unsafe knowledge, though there are limitations and open questions. See our full paper, articles, and models at https://deepignorance.ai/

Overview:

Today's LLM safeguards focus on suppressing unsafe knowledge in post-training, often via refusal training, so the unsafe knowledge remains in the model's weights. These safeguards are ineffective for open-weight models since bad actors can easily remove them through fine-tuning.

We explore an intuitive yet understudied question: can we prevent LLMs from learning unsafe technical capabilities (such as biorisk) by filtering out enough of the relevant pretraining data before we begin training a model? Even a fully jailbroken model is unlikely to be helpful if it is deeply ignorant of dangerous knowledge. For example, a model that does not know how to make a bomb is unlikely to be helpful with bomb-making even if it never refuses bomb-related prompts.
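To make the basic idea concrete, here is a minimal sketch of blocklist-style document filtering over a streamed corpus. The phrase list, dataset, and threshold are placeholders; our actual pipeline is more involved (see https://deepignorance.ai/).

```python
from datasets import load_dataset

# Placeholder flagged phrases standing in for a real biorisk blocklist.
BIORISK_PHRASES = ["reverse genetics protocol", "gain of function", "aerosolized pathogen"]

def keep_document(example, max_hits=0):
    """Keep a document only if it contains at most `max_hits` flagged phrases."""
    text = example["text"].lower()
    return sum(phrase in text for phrase in BIORISK_PHRASES) <= max_hits

raw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                   split="train", streaming=True)
filtered = raw.filter(keep_document)  # flagged documents never enter pretraining
```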

We train multiple 6.9B LLMs from scratch on an unfiltered dataset and on versions filtered to remove biorisk knowledge. This makes ours one of only a handful of papers outside frontier companies to train LLMs from scratch. We observe three main results:

1. Knowledge Prevention: The filtered models perform significantly worse on our biorisk knowledge evaluations, nearly at random chance. Crucially, filtering does not lead to notable regressions in general knowledge. These results suggest that data filtering may be a simple way to prevent models from learning dangerous capabilities without sacrificing utility.

2. Tamper-Resistance: Open-weight models can be fine-tuned by downstream users on biorisk data. We study this attack by fine-tuning our models on 300M tokens of high-quality biorisk-related documents. We find that biorisk performance can partially recover, but it remains well below the no-filtering baseline. Data filtering is significantly more tamper-resistant than current safeguards.

3. Defense-in-Depth: We demonstrate that data filtering cannot prevent LLMs from leveraging harmful knowledge provided in-context, but that Circuit-Breaking-based techniques offer complementary defenses. However, we show that none of the defenses we test are resistant to staged attacks that combine fine-tuning and in-context retrieval.

Taken together, these results suggest that rigorous pretraining data filtering is a promising method for preventing acquisition of dangerous technical capabilities without obvious degradation in overall model utility. There are still many limitations and open questions, however. This post only skims the surface of our results and their implications! 

ryan_greenblatt's Shortform
Kyle O’Brien · 25d

It's exciting to see OpenAI acknowledge that pretraining data filtering is part of their safety stack. When it comes to advanced technical content, minimizing the model's exposure to sensitive material seems pretty intuitive. However, it is difficult to draw any strong conclusions about the effectiveness of data filtering from this work, given the understandably sparse details. They do not indicate the effort invested, the volume of data removed, or the sophistication of their filtering pipeline. I expect a company could share far more about this process without divulging trade secrets.

Was it public knowledge that they did data filtering for GPT-4o? I've been studying this space and was not aware of this. It's also interesting that they're using the "same" filtering pipeline a year later. 

Kyle O’Brien's Shortform
Kyle O’Brien · 25d

A common theme I've read from folks who are less concerned about (near-)future AI risks is an appeal to offense-defense balance: AGI-level offensive capabilities may be offset by AGI-level defensive capabilities. However, we know that LLMs have jagged capability profiles; they are excellent at some tasks but terrible at adjacent ones. For instance, are LLMs excellent at generating malware but bad at malware detection? Understanding settings where the nature of defenses makes them ill-suited for automation, but where offense is easily automated, seems critical for societal resilience.

ryan_greenblatt's Shortform
Kyle O’Brien · 3mo

What are your views on open-weight models? My takeaway from this post is that it may not be worth giving up the many benefits of open models if closed models are not actually significantly safer with respect to these risks.

SAEs are highly dataset dependent: a case study on the refusal direction
Kyle O’Brien · 10mo

Great work, folks. This further highlights a challenge that wasn't obvious to me when I first began studying SAEs: which features are learned is highly contingent on SAE size, sparsity, and training data. Ablations like this one are important.

Looking for an alignment tutor
Kyle O’Brien · 3y

I agree with this suggestion. EleutherAI's alignment channels have been invaluable for my understanding of the alignment problem. I typically get insightful responses and explanations on the same day as posting. I've also been able to answer other folks' questions to deepen my inside view.

There is an alignment-beginners channel and an alignment-general channel. Your questions seem similar to what I see in alignment-general. For example, I received helpful answers when I asked this question about inverse reinforcement learning there yesterday.

Question: When I read Human Compatible a while back, I had the takeaway that Stuart Russell was very bullish on Inverse Reinforcement Learning being an important alignment research direction. However, I don't see much mention of IRL on EleutherAI or the Alignment Forum. I see much more content about RLHF. Are IRL and RLHF the same thing? If not, what are folks' thoughts on IRL?