Filtering pretraining data is effective at making models safer.
A team at EleutherAI, UK AISI, and Oxford University asked:
Can we prevent LLMs from learning unsafe technical capabilities (such as biorisk) by filtering out enough of the relevant pretraining data before we begin training a model? Even a fully jailbroken model is unlikely to be helpful if it is deeply ignorant of dangerous knowledge.
They find that data filtering is significantly more tamper-resistant than current safeguards, without harming general capability. It does not, however, protect against a model using dangerous knowledge supplied in-context.
Could you ask an AI to filter out the text you don't want?
(Like, ask one AI (AI1) to filter the text, then use what remains to train a second model (AI2).)
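A minimal sketch of that pipeline, assuming documents stored one per line as JSON; the `is_unsafe` stand-in (a crude keyword check here) would in practice be the filtering model "AI1", and the surviving documents become the training corpus for "AI2":

```python
import json

BLOCKLIST = ("example dangerous phrase",)  # illustrative placeholder terms

def is_unsafe(text: str) -> bool:
    """Stand-in for the filtering model ("AI1"): a crude keyword check.
    In practice this would be a trained classifier or a prompted LLM."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

def filter_corpus(in_path: str, out_path: str) -> None:
    """Keep only documents the filter does not flag; the output file
    is the pretraining data for the downstream model ("AI2")."""
    kept = dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            doc = json.loads(line)  # expects {"text": ...} per line
            if is_unsafe(doc["text"]):
                dropped += 1
            else:
                dst.write(json.dumps(doc) + "\n")
                kept += 1
    print(f"kept {kept} documents, dropped {dropped}")
```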
Sure, but that is expensive. Why would more than one team need to do it?
Hm. It turns out it wouldn't be so expensive after all. ChatGPT estimated at least $12K.
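For what it's worth, a back-of-envelope version of that kind of estimate (both numbers below are illustrative assumptions, not figures from the paper or from ChatGPT's answer):

```python
# Rough cost of screening a pretraining corpus with a cheap hosted classifier model.
# The numbers are assumptions chosen only to show how an estimate in the $10K+ range
# can come about; real corpus sizes and prices vary widely.
corpus_tokens = 80e9              # assumed corpus size to screen: 80B tokens
price_per_million_tokens = 0.15   # assumed price of a small hosted model, $/1M input tokens

cost = corpus_tokens / 1e6 * price_per_million_tokens
print(f"~${cost:,.0f}")           # ~$12,000 under these assumptions
```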
I didn't read it in detail, but I think "Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs" discusses filtering approaches.
Semi-related: there is the "Common Pile", a successor to "The Pile". It was not focused on "safe" data, but rather "public domain" data. But maybe that excludes at least some dangerous data, and could make further filtering easier?
The larger LLMs are trained on Common Crawl, a publicly available dump of significant parts (~400 TB) of the public internet. They are also trained on all kinds of additional data, but a large fraction of dangerous content presumably comes from Common Crawl.
Is there a safe version of Common Crawl that has the dangerous parts removed (or at least labeled, so that they would be easy to remove)?
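In principle the "labeled rather than removed" variant is simple to produce. A sketch, assuming the Common Crawl text has already been extracted into JSONL, and using a stand-in `risk_score` where a real pipeline would use a trained classifier:

```python
import json

FLAGGED_TERMS = ("placeholder term a", "placeholder term b")  # illustrative only

def risk_score(text: str) -> float:
    """Stand-in scorer: fraction of flagged terms that appear in the document.
    A real pass over Common Crawl text would use a trained classifier instead."""
    lowered = text.lower()
    hits = sum(term in lowered for term in FLAGGED_TERMS)
    return hits / len(FLAGGED_TERMS)

def label_corpus(in_path: str, out_path: str, threshold: float = 0.5) -> None:
    """Annotate each document rather than deleting it, so anyone downstream
    can drop (or deliberately keep) flagged documents with a one-line filter."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            doc = json.loads(line)  # expects {"url": ..., "text": ...} per line
            score = risk_score(doc["text"])
            doc["risk_score"] = score
            doc["flagged"] = score >= threshold
            dst.write(json.dumps(doc) + "\n")
```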
From a safety perspective, it would probably be most useful if material on AI (especially about misalignment and alignment strategies) were removed. It would also be interesting to remove material on consciousness, to test whether LLMs arrive at the concept without prior exposure.
Obviously, this wouldn't solve the alignment problem, because instrumental convergence still holds. But it could buy some time.