We Built a Tool to Protect Your Dataset From Simple Scrapers

Edward Turner; Dipika Khullar; royrin

Have you seen Anubis?

Ugh, I hate Anubis. Any website with Anubis is spectacularly annoying or, in some cases, outright impossible to browse, if you have an even slightly outdated browser/system. It’s a really user-hostile solution.

[-]ProgramCrafter3mo10

If we are protecting against AI-building labs specifically, there are two outcomes:

They do not care to set up a special way of data extraction for those protected sites. Then, we can set Anubis to whatever low difficulty that it does not hinder users.
They do care and want to get the data behind. We lose.^[1]

^{^}
Because AI labs have a large number of... what they are called, again... massive parallel computation units... ah, GPUs, they have an advantage at proof-of-work. And if they choose to use them against humanity (a sleight of hand here, I admit), it will not go well.

[-]Said Achmiz3mo101

Then, we can set Anubis to whatever low difficulty that it does not hinder users.

… unless it breaks entirely because someone has an old browser / weird browser / text browser / screen reader / NoScript enabled / etc., and the difficulty setting doesn’t matter entirely; and the result is that the user still can’t get through.

[-]Thane Ruthenis3mo80

Hmm. This approach relies partly on the AGI labs being cooperative and wary of violating the law, and partly on creating minor inconveniences for accessing the data which inconvenient human users as well. In addition, any data shared this way would have to be shared via the download portal, impoverishing the web experience.

I wonder if it's possible to design some method of data protection that (1) would be deployable on arbitrary web pages, (2) would not burden human users, (3) would make AGI labs actively not want to scrape pages protected this way.

Here's one obvious idea. It's pretty hostile and might potentially have net-negative results (for at least some goals), but I think it's worth discussing.

We could automatically seed the text with jailbreaks, spam, false information, and other low-quality/negative-quality training data, in a way that is invisible to users but visible to LLMs. Pliny-style emoji bombs, invisible/very tiny text, and other techniques along those lines, randomly inserted into the human-readable text between paragraphs or words.

How easy would it be for AGI labs to clean this up? Might be pretty easy, if we only have select few methods of hiding the text: then they can just automatically look for text within invisible tags/added via Unicode variation selectors, and remove it. But it might be possible to create a diverse, ever-growing family of text-hiding methods, such that static countermeasures don't work. Tasking an LLM with cleaning the document, instead of manually designed methods, might backfire, with the cleaner LLM itself getting jailbroken by the embedded cognitohazards.

Making the hidden text actually harmful seems easier: jailbreaks are invented faster than countermeasures against them, I think.

Ideally, the whole setup would be continuously updated: instead of individual writers having to inject this stuff on their own, there would be a centralized public API or GitHub repo which web developers could use, embedding this functionality into websites. This centralized API/repo can then be continuously updated with new jailbreaks and counters to whatever countermeasures AGI labs come up with (AdBlock-style).

Again, it's obviously pretty hostile, but if paired with a canary string, any actor that decides to ignore the canary string and scrape the page anyway deserves what they get.

Any obvious reasons this is a bad idea which I'm missing? I guess the obvious failure mode is people deploying this without the canary string, meaning even cooperative AGI labs might accidentally train on the data poisoned this way. If the goal is to prevent training on bad data (because of e. g. misalignment concerns), that's obviously counterproductive.

[-]ProgramCrafter3mo20

You have to make that poison inactive in accessibility cases, or a person using screen reader would hear all that. However, if a correctly configured screen reader skips the invisible data, then labs will just use it (assuming they can be bothered with cleaning dataset at all).

Also, training-time jailbreaks are likely quite different from inference-time jailbreaks. The latter will tend to hit Operator-style stuff harder.

[-]anaguma3mo10

This seems like a great idea! However, I think it might degrade the usefulness of the dataset, especially if it’s meant to later be used to evaluate LLMs since any jailbreaks etc. would apply in that setting as well. If you provide utilities to clean up the text before evaluation, these could be used for scraping as well.

[-]Thane Ruthenis3mo30

Yeah, I guess the use-case I had in mind is generally people who don't want LLMs trained on (particular pieces of) their writing, rather than datasets specifically.

[-]winstonBosan3mo*50

go-away is my personal choice.

Doesn’t require weird js and text mode browsing like Anubis. Widely(ish) used. Not nuclear like anubis.

A download portal in minutes

We reduce the friction of serving data in a scraper-resistant fashion.

At most, users click a box. They don't have to complete any annoying tasks.

The protected download page after verification.

Generally, you'd just download a single .zip – possibly created by our easy-dataset-share command-line tool.

While you'll need to click some buttons on the GitHub, Vercel, and Cloudflare websites, our guide and data-share-vercel-setup command automate the tricky parts, like creating API keys and configuring environment variables.

A CLI tool

The underlying command-line tool^[2] (easy-dataset-share) wraps the dataset in several ways:

robots.txt tells well-behaved crawlers to leave your data be and reserves rights against commercial Text and Data Mining in jurisdictions like the EU and UK.

Terms of Service prohibit training on AI data and require that the terms be provided alongside copies of your data.

Canary strings – these are not for preventing scraping, but for detecting it. We know that labs don't seem to reliably filter out data containing canary strings. If models can complete a canary string, that provides strong evidence that your dataset was trained upon.

Hash verification guarantees data integrity.

Possible improvements

Because of the pseudo-volunteer nature of this effort, we are releasing this tool with obvious improvements left on the table. We wanted to provide a v1.0 and perhaps invite further collaboration.

Use OAuth2 to deanonymize crawlers, requiring them to use a Google-verified account on the record. We hope to force scrapers to overcome Google’s sophisticated bot-detection apparatus in order to access the dataset.

Include a clickwrap Terms of Service. Currently, a user can download the dataset without explicitly agreeing to the Terms of Service. We could require users to check a box stating "I accept the Terms of Service" before revealing the download link. Clickwrap agreements seem to be more legally enforceable and a stronger legal deterrent.

Think you see a better way to do things or just want to help out? Feel free to join our collaborative Discord or submit a pull request. If needed, Neel Nanda has volunteered to pay someone to work on this full-time until the project is done.^[3]

Please protect datasets

After the alignment faking leakage, Anthropic took a positive step by committing^[4] to add canary strings to their transcripts in the future. But rather than trusting AI labs to properly filter canary-tagged data, be proactive. If you host your own data, use this tool to put it behind a Turnstile. By taking these steps, you somewhat protect against train-set leakage, making your dataset more valuable in the long-run. Plus, we can all rest a teeny bit easier about the alignment of future models. To get started, follow the README.

Thank you to the core contributors: Dipika Khullar, Ed Turner, and Roy Rinberg. They also maintain the repository. While I put out the original $500 bounty, I was then joined by Anna Wang ($500), James Aung ($500), and Girish Sastry ($1,000).

This article is viewable in a prettier form on my website

Find out when I post more content: newsletter & RSS

Thoughts? Email me at alex@turntrout.com

^{^}

Anthropic conducted measurements to test whether the alignment faking data had broader impacts:

We conducted several small exploratory experiments to assess whether the use of this data influenced the model’s behavior more broadly, and now believe that this is very unlikely. For example, on multiple measures of alignment, post-mitigation snapshots of the model act no less aligned when prompted to use <SCRATCHPAD_REASONING> tags, rather than ordinary <antml:thinking> tags.
— Claude 4 system card

^{^}

When you just need to share a file directly, use easy-dataset-share to produce a single file that is safer than a standard .zip.

^{^}

The full-time opportunity is separate from the bounties, which have already been claimed by the current contributors.

^{^}

Anthropic committed to add canary strings on the bottom of page 38 of the Claude 4 system card.

LESSWRONG
LW

LESSWRONG
LW

60

We Built a Tool to Protect Your Dataset From Simple Scrapers

60

Ω 29

60

Ω 29

A download portal in minutes

What we provide

A web portal

A CLI tool

Possible improvements

Please protect datasets