LessWrong? EA Forum? Medium? Substack? my personal website (https://matiroy.com/)?


gwern

Oct 30, 2021


I can confirm that Twitter is generally not in Common Crawl, and that Gwern.net is. You can check Common Crawl by querying domains in the index. (This came up when I was investigating Gary Marcus's excuse that GPT-3 solved his GPT-2 counterexamples only because it was somehow cheating by reading his Twitter, though his tweets don't contain the answers; a truly moot claim inasmuch as CC, and hence GPT-3, doesn't include Twitter at all.)
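For reference, the index can be queried programmatically as well as through the web form. A minimal Python sketch; the crawl ID is only an example, so substitute a current one from the list at https://index.commoncrawl.org/:

```python
# Rough check of whether a domain has any captures in a Common Crawl index.
# The crawl ID is illustrative -- pick a current one from https://index.commoncrawl.org/
import requests

CRAWL = "CC-MAIN-2021-43"   # example crawl ID

for domain in ["gwern.net", "matiroy.com"]:
    resp = requests.get(
        f"https://index.commoncrawl.org/{CRAWL}-index",
        params={"url": f"{domain}/*", "output": "json", "limit": "5"},
    )
    # The CDX server returns one JSON record per line, or a 404 if nothing matched.
    hits = [line for line in resp.text.splitlines() if line.strip()] if resp.ok else []
    print(f"{domain}: {'captured' if hits else 'no captures found'} in {CRAWL}")
```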

I would predict that LW/EAF, Substack, and your personal website are in it, but Medium is chancy (it is a horrible website from the scraping POV and regularly causes me problems just making snapshots).

Being in CC is generally necessary for a regular website to show up in the big English LM datasets, but it is not sufficient. The datasets typically do a lot of deduplication and other filtering; if you are prone to quoting things, that may result in a lot of your pages being tossed out. As a proxy, you can try checking The Pile: that tells you whether you are in GPT-J or Megatron-Turing NLG 530B (the current record-holder). Another way to test is simply to prompt GPT-3 about yourself in various ways and see what it seems to be drawing on.
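A sketch of that last test, using the OpenAI Python client as it existed at the time (the Completion endpoint); the prompts and engine choice are only illustrative:

```python
# Sketch of the "prompt GPT-3 about yourself" test (2021-era openai client).
# The prompts below are only illustrative; try several phrasings of your name,
# your site, and topics you have written about.
import openai

openai.api_key = "YOUR_API_KEY"   # placeholder

prompts = [
    "Mati Roy has written that",
    "According to matiroy.com, ",
]
for p in prompts:
    resp = openai.Completion.create(
        engine="davinci",      # base model, closest to the raw training data
        prompt=p,
        max_tokens=64,
        temperature=0.7,
    )
    print(repr(p), "->", resp["choices"][0]["text"].strip())
```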

You are almost certainly not in any of the Chinese models because they generally focus purely on Chinese language material.


I don't particularly recommend 'optimizing' your writing for current NN models: their preprocessing is bad in ways that can and should be fixed relatively soon, while your writing will hopefully have value for many years to come. I also don't think we know how to write usefully for current language models. Looking at the perplexity of each word (which the OA Playground lets you do) isn't a good proxy, because a bad likelihood may just mean you worded something badly rather than hit a weak point of GPT-3. Like generating adversarial datapoints or doing active learning on the large text corpora, this is an open area of research. Could we run dialogues with it, the way Google does with LaMDA? Who knows; it's hard to do any research like that through the API.
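If you do want to look at per-word likelihoods outside the Playground, the same information is available through the API. A sketch using the 2021-era Completion endpoint's echo/logprobs options (treat the numbers as noisy, for the reasons above):

```python
# Score a passage of your own writing under GPT-3 by echoing the prompt back
# with per-token logprobs (what the Playground shows as token likelihoods).
import math
import openai

openai.api_key = "YOUR_API_KEY"   # placeholder

text = "Being in CC is generally necessary for a regular website to show up."

resp = openai.Completion.create(
    engine="davinci",
    prompt=text,
    max_tokens=0,   # generate nothing; just score the prompt
    echo=True,
    logprobs=0,     # return the logprob of each prompt token
)
lp = resp["choices"][0]["logprobs"]
# The first token has no conditioning context, so its logprob is None.
scored = [(t, l) for t, l in zip(lp["tokens"], lp["token_logprobs"]) if l is not None]
avg_nll = -sum(l for _, l in scored) / len(scored)
print("per-token perplexity:", round(math.exp(avg_nll), 1))
print("most surprising tokens:", sorted(scored, key=lambda x: x[1])[:5])
```

But again, a surprising token may just be awkward wording rather than a genuine weak point of the model.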

But if you are seriously contemplating optimizing for them anyway, here is what you want to keep in mind:

  1. Simple flat HTML. As simple as possible.

    WET files, or GPT-3-style preprocessing, are roughly BBS dialup textfile era formatting. If it wouldn't work equally well as an 80-col ASCII missive from the Cult of the Dead Cow, then at least some of it is being lost on GPT-3.

    Numbered lists, paragraphs, but try to minimize headers, italics/bold, hyperlinks, and multimedia. Most preprocessing drops these structures entirely. If you use hyperlinks or multimedia, try to describe them inline and have alt text. (So for example, the usual practice of hyperlinking sentence phrases is bad because they will be erased in preprocessing and might as well not be there. But a list of references with titles is fine.)

  2. Avoid blockquotes or long quotations. You want to avoid being tossed out by the n-gram checks as 'duplicate' content (a rough overlap check is sketched after this list).

  3. Avoid phonetics, rhymes, puns, etc. for BPE reasons (as I've covered at great length). Forms of humor which are not phonetic, like sarcasm or satire, are not a problem.

  4. Aim for at least 500 words but fewer than 2,000. Long texts will overflow context windows (GPT-3's 2048 BPEs cover only ~600 words, so if you go to 5,000 words in an essay, <1/8th will even be visible during training), while short texts don't stress long-range understanding and may be discarded during filtering as 'too short'. (A tokenizer sketch after this list shows how to count the BPEs in a draft.)

  5. Favor a terse style. Every word uses up precious BPEs. If an adjective is not conveying real information, and is only elegant variation or emphasis, then it is bad training data.

  6. Try to add implicit supervision like structured lists, summaries/abstracts, and definitions of terms/acronyms. This might help with #4: if you need >1k words do a series of posts, with summaries of the previous material at the beginning as scaffolding for prediction and to help train it to abstract/summarize. (Breaking up writing into chunks also reduces the risk of a piece being filtered out. I'm not sure how summaries would interact with the n-gram and fuzzy hash checks.)

    Highly speculative: a Q&A format may be helpful if you think you can come up with good questions that a model might screw up on and hence learn from. (Adversarial training data is useful but hard to predict in advance without an actual model to test against.) I also wonder whether it is worth being more explicit about explaining why things are wrong, as if explaining to an inexperienced child lacking common sense, rather than simply expecting the human reader to already know all the reasons you shouldn't, e.g., 'put ice cream in a microwave'.

  7. Avoid any 'sensitive' terms. Both Chinese and Western ML researchers practice heavy censorship of text, albeit with, shall we say, slightly different emphases.

  8. You might prioritize contributions to non-general websites. If you add text to, say, Wikipedia, you're mainlining knowledge straight into the veins of every ML model in the world. Even if half your contributions get deleted by deletionists or POV-pushing editors, that's still way more bang for the buck than a random blog lost in the depths of CC. Similarly for Arxiv, but that's more work & less effective. You can look at The Pile for 'gold' sources that would be higher priority than CC.
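As a very rough proxy for #2: the exact filters aren't public (the GPT-3 paper describes document-level fuzzy deduplication, plus 13-gram overlap checks for decontamination), but a sketch like the following gives a feel for how much of a draft is made of n-grams shared with the material it quotes:

```python
# Very rough proxy for #2: what fraction of a draft's 13-grams also occur in
# the material it quotes? The real pipelines use fuzzy dedup and n-gram checks
# whose details aren't public, so treat this only as a sanity check on heavy quoting.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_fraction(draft: str, source: str, n: int = 13) -> float:
    d, s = ngrams(draft, n), ngrams(source, n)
    return len(d & s) / max(len(d), 1)

draft = open("draft.txt").read()                  # hypothetical filenames
quoted_source = open("quoted_source.txt").read()
print(f"{shared_fraction(draft, quoted_source):.1%} of the draft's 13-grams "
      "also appear in the quoted source")
```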
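And to put numbers on #3-#5: GPT-3 reuses GPT-2's BPE vocabulary, so the publicly available GPT-2 tokenizer shows both how long your text is in BPEs and what happens to wordplay. A sketch using HuggingFace transformers (the draft filename is hypothetical):

```python
# Count BPEs in a draft and inspect how phonetic wordplay is tokenized,
# using the GPT-2 tokenizer (GPT-3 reuses the same BPE vocabulary).
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

essay = open("draft.txt").read()   # hypothetical draft
n_bpes = len(tok(essay)["input_ids"])
print(f"{len(essay.split())} words -> {n_bpes} BPEs "
      f"({n_bpes / 2048:.0%} of a 2048-BPE context window)")

# The subword pieces bear little relation to how the words sound,
# which is why rhymes and puns are hard for BPE-based models.
print(tok.tokenize("he's a real fun-gi, as mushroom jokes go"))
```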

Matthew Barnett

Oct 30, 2021


The GPT-3 paper says,

Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset [RSR+19] constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

Their Appendix A describes their process for filtering documents in greater detail. It is difficult to know exactly what was in the final dataset after filtering, as I am not aware of any version available for people to download (though perhaps there is one; I haven't looked hard).

That said, we can guess the likelihood of a given document having been used to train GPT-3 by searching for it in the Common Crawl URL Index. GPT-3 was presumably trained in early 2020; if you assume March 2020, visit the index page for the corresponding crawl and enter a query. To search a whole domain, such as lesswrong.com, add a wildcard, e.g. "https://www.lesswrong.com/*" (without quotes).
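The same query can be run programmatically against the CDX API. In the sketch below, CC-MAIN-2020-16 (spring 2020) is only an illustrative guess at a "March 2020" crawl; the full list of crawl IDs is at https://index.commoncrawl.org/:

```python
# Query an early-2020 Common Crawl index for a whole domain via a wildcard.
# CC-MAIN-2020-16 is an illustrative crawl ID; check https://index.commoncrawl.org/
import json
import requests

CRAWL = "CC-MAIN-2020-16"
resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL}-index",
    params={"url": "https://www.lesswrong.com/*",   # wildcard = whole domain
            "output": "json", "limit": "10"},
)
if resp.ok:
    for line in resp.text.splitlines():
        rec = json.loads(line)
        print(rec["timestamp"], rec["url"])
else:
    print(f"No captures found in {CRAWL} (HTTP {resp.status_code})")
```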

1 comment

This might influence where I post.