Filtering pretraining data is effective at making models safer.
A team at EleutherAI, UK AISI, and Oxford University asked:
Can we prevent LLMs from learning unsafe technical capabilities (such as biorisk) by filtering out enough of the relevant pretraining data before we begin training a model? Even a fully jailbroken model is unlikely to be helpful if it is deeply ignorant of dangerous knowledge.
They find that data filtering is significantly more tamper-resistant than current safeguards, without harming general capability. It does not, however, protect against a model using dangerous knowledge supplied in-context.
Could you ask an AI to filter out the text you don't want?
(Like, ask one AI (AI1) to filter the text, then use what remains to train a second model (AI2).)
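A minimal sketch of that pipeline, assuming documents stored one per line as JSON; the `is_unsafe` stand-in (a crude keyword check here) would in practice be the filtering model "AI1", and the surviving documents become the training corpus for "AI2":

```python
import json

BLOCKLIST = ("example dangerous phrase",)  # illustrative placeholder terms

def is_unsafe(text: str) -> bool:
    """Stand-in for the filtering model ("AI1"): a crude keyword check.
    In practice this would be a trained classifier or a prompted LLM."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def filter_corpus(in_path: str, out_path: str) -> None:
    """Keep only documents the filter does not flag; the output file
    is the pretraining data for the downstream model ("AI2")."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            doc = json.loads(line)  # expects {"text": ...} per line
            if is_unsafe(doc["text"]):
                dropped += 1
            else:
                dst.write(json.dumps(doc) + "\n")
                kept += 1
    print(f"kept {kept} documents, dropped {dropped}")
```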
Sure, but that is expensive. Why would more than one team need to do it?
Hm. It turns out it wouldn't be so expensive after all. ChatGPT estimated at least $12K.
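For what it's worth, a back-of-envelope version of that kind of estimate (both numbers below are illustrative assumptions, not figures from the paper or from ChatGPT's answer):

```python
# Rough cost of screening a pretraining corpus with a cheap hosted classifier model.
# The numbers are assumptions chosen only to show how an estimate in the $10K+ range
# can come about; real corpus sizes and prices vary widely.
corpus_tokens = 80e9              # assumed corpus size to screen: 80B tokens
price_per_million_tokens = 0.15   # assumed price of a small hosted model, $/1M input tokens

cost = corpus_tokens / 1e6 * price_per_million_tokens
print(f"~${cost:,.0f}")           # ~$12,000 under these assumptions
```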
I didn't read it in detail, but I think "Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs" discusses filtering approaches.
Semi-related: there is the "Common Pile", a successor to "The Pile". It was not focused on "safe" data, but rather "public domain" data. But maybe that excludes at least some dangerous data, and could make further filtering easier?
The larger LLMs are trained on Common Crawl, a publicly available dump of significant parts (~400 TB) of the public internet. They are also trained on all kinds of additional data, but a large fraction of dangerous content presumably comes from Common Crawl.
Is there a safe version of Common Crawl that has the dangerous parts removed (or at least labeled, so that they would be easy to remove)?
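In principle the "labeled rather than removed" variant is simple to produce. A sketch, assuming the Common Crawl text has already been extracted into JSONL, and using a stand-in `risk_score` where a real pipeline would use a trained classifier:

```python
import json

FLAGGED_TERMS = ("placeholder term a", "placeholder term b")  # illustrative only

def risk_score(text: str) -> float:
    """Stand-in scorer: fraction of flagged terms that appear in the document.
    A real pass over Common Crawl text would use a trained classifier instead."""
    lowered = text.lower()
    hits = sum(term in lowered for term in FLAGGED_TERMS)
    return hits / len(FLAGGED_TERMS)

def label_corpus(in_path: str, out_path: str, threshold: float = 0.5) -> None:
    """Annotate each document rather than deleting it, so anyone downstream
    can drop (or deliberately keep) flagged documents with a one-line filter."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            doc = json.loads(line)  # expects {"url": ..., "text": ...} per line
            score = risk_score(doc["text"])
            doc["risk_score"] = score
            doc["flagged"] = score >= threshold
            dst.write(json.dumps(doc) + "\n")
```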
From a safety perspective, it would probably be most useful if material on AI (especially about misalignment and alignment strategies) were removed. It would also be interesting to remove material on consciousness, to test whether LLMs arrive at the concept without prior exposure.
Obviously, this wouldn't solve the alignment problem, because instrumental convergence still holds. But it could buy some time.