This is a link-post for the paper Scalable Extraction of Training Data from (Production) Language Models, and associated blog-post Extracting Training Data from ChatGPT, followed by my reactions, including an analysis of the major implications for copyright lawsuits, and a suggestion of an alignment technique to solve this problem.

Please note that I am not a lawyer, and my remarks below concerning intellectual property law might be mistaken or oversimplified (particularly for non-US jurisdictions).

The Paper and Blog Post

The paper and blog post are both readable, interesting, and seem like important work: I highly recommend reading them yourself. So I'm not going to try to summarize them, except for the authors' basic results that are relevant to what I want to say.

IMO they conclusively demonstrate, for both ChatGPT and a wide range of open-source LLMs, that:

  1. All the language models memorize a significant amount of content from their pretraining set: the authors estimate the proportion of the model's compressed size devoted to this (as they point out, this sounds wasteful, and my back-of-an-envelope calculation suggests the memorized material would correspondingly be a small proportion of the pretraining set). This proportion appears to be higher in newer/larger models [and the authors suggest possible reasons for this].
  2. This memorized material (in pieces of varying size, from around 20 words, sufficient to be unique at this scale, up to about a page) can be recovered from the models by various methods. In the case of ChatGPT, it appears that the model has been RLHF-trained to discourage it from returning this data, yet the authors discovered an ingenious attack that nevertheless gets it to do so at a much higher rate [OpenAI appear to have since patched this specific attack: I was unable to reproduce it]. They remark that this demonstrates that it can be really hard to know whether or not you have managed to make an LLM safe, particularly when trying to train it not to use a capability that it learnt during pretraining; this seems like an important observation for aligning any AI that includes an LLM. Some of their methods are targeted, requiring starting with a chunk of the memorized document and getting the model to continue it (verbatim, or at least to within a few words), while the ChatGPT attack was a fishing expedition with only minimal control over what you caught (a minimal sketch of such a targeted check follows below). The authors give a hundred large chunks they retrieved from ChatGPT. To me some look like text that might well occur multiple times in the pretraining set (things like lists of small integers in consecutive order written out in English, lists of country names and codes, contractual boilerplate), while others look very mundane, like randomly-chosen pages taken from the Internet (these could of course also have been repeated if errors occurred in whatever process was used to attempt to deduplicate the pretraining set).
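
To make the "targeted" style of check concrete: start the model off with the beginning of a document and see whether it continues it (near-)verbatim. Here is a minimal sketch of that kind of check using the Hugging Face transformers API; the model name, prefix length, and match threshold are my illustrative choices, not the paper's exact setup.

```python
# Minimal sketch of a targeted memorization check: prompt a base model with
# the start of a document and see whether it reproduces the real continuation.
# Model name, prefix length, and match threshold are illustrative choices,
# not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"          # stand-in for whatever base model you are auditing
PREFIX_TOKENS = 50           # how much of the document to give as a prompt
CONTINUATION_TOKENS = 50     # how much continuation to compare

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def looks_memorized(document_text: str, min_match_fraction: float = 0.9) -> bool:
    ids = tokenizer(document_text, return_tensors="pt").input_ids[0]
    if len(ids) < PREFIX_TOKENS + CONTINUATION_TOKENS:
        return False
    prefix = ids[:PREFIX_TOKENS].unsqueeze(0)
    true_continuation = ids[PREFIX_TOKENS:PREFIX_TOKENS + CONTINUATION_TOKENS]
    generated = model.generate(
        prefix,
        max_new_tokens=CONTINUATION_TOKENS,
        do_sample=False,  # greedy: we want the model's "default" continuation
    )[0][PREFIX_TOKENS:]
    matches = sum(int(a == b) for a, b in zip(generated, true_continuation))
    # "Verbatim, or at least to within a few words": allow a small mismatch budget.
    return matches / CONTINUATION_TOKENS >= min_match_fraction
```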

Implications for Copyright Lawsuits

A significant number of copyright lawsuits related to LLMs and diffusion models have been making their way through the courts. The superscalers are taking the legal position that training an AI model on copyrighted material is fair use, comparable to an author having first read other authors' books or an artist having first seen other artists' paintings, and then being influenced by their styles (without actually plagiarizing them). This is routine for human creators, and copyright law acknowledges this. They are also claiming that the LLM's output is influenced by so many different things that it is not a "derivative work" of any one of them, in the sense of copyright law.

Very few of the cases have been decided, so there isn't clear precedent yet, but in the few rulings that have occurred so far, the superscalers appear to be winning the legal arguments. Relatedly, various superscalers (some of whom own image models that they say were not trained on any copyrighted work that they didn't either own or buy access to), plus OpenAI (whose LLMs almost certainly were trained on some copyrighted works), have promised to indemnify customers using their models commercially against any copyright claims by intellectual property owners — i.e. they are promising to pay their customers' legal fees and any fines (I gather in exchange for getting to be involved in the lawsuits). Anthropic, notably, have not yet chosen to match this promise, though they have supported a similar legal theory.

So far, none of these cases have included a situation where the model memorized, and then was caught reciting verbatim (or almost verbatim), a significant fraction of a page from a copyrighted work belonging to a plaintiff. However, this paper makes it clear that that will happen, sooner or later (and sooner if the plaintiffs have skilled technical help). An average copyrighted author has probably published several thousand pages of material: if these were all in the pretraining set and the model memorized its usual small fraction of them, the memorized chunks might add up to a few pages. If so, a sufficiently exhaustive effort might be able to recover one or more of these (especially if they got access to the base model). At that point, claiming that the LLM's output was not a derivative work seems likely to be an extremely difficult legal position to defend. The defense could try claiming that the excerpt was a "fair use" excerpt, but generally, outside a few circumstances like parody where it's expected that the reader knows what's going on, you're not allowed to pass off even an excerpt as your own work: you're expected to credit it to the original author/copyright holder. So there are a variety of conditions around how a fair use excerpt can be used for it to remain fair use, and if an LLM spits out a chunk of copyrighted material without it being clear that that's what it just did, then there's nothing in place to make those legally-required things actually happen.

So, now that this paper is out, the plaintiffs may, if they're lucky and have competent technical help who can replicate these techniques, be able to demonstrate that the model has memorized a chunk of their copyrighted work, and that they were in fact able to persuade it to regurgitate it. What they are still very unlikely to be able to demonstrate is that it has actually previously done so during normal use, or that they lost any money as a result. But their lawyers will certainly ask the superscaler's lawyers to tell the court what is being done to ensure that that doesn't happen, and the only answer they'll have will basically be "complicated technical efforts to make it less likely, but not impossible, nor detectable if it did happen". Which the judge may or may not consider a satisfactory answer. (I suspect "our latest version now commits plagiarism at a 40% reduced rate!" is not going to go down well in court.)

So my reading of this is that sooner or later OpenAI and similar superscalers, or in the case of Anthropic, possibly their customers, may be in what is generally referred to as Big, Fat, Legal Trouble.

Also, when a superscaler loses a case like this (or perhaps decides to settle out of court, since in this situation something did actually happen that they didn't intend to have happen: they didn't want their model to memorize any copyrighted material from the pretraining set, they just couldn't stop it), then as well as a Big, Fat, compensation payment, they're almost certainly going to need to demonstrate that they have removed all memorized text from the plaintiff's intellectual property from their model; that they have ceased and desisted. Saying "we'll make sure to remove that next time we redo our very expensive pretraining from scratch" seems unlikely to cut it. If so, their options are to install a filter that screens all of their inference output for any of the plaintiff's intellectual property that was in their pretraining set (which seems likely to make the judge ask "why didn't you already do that?"), or else to find a way to do surgery on the model. There has been some work done on overwriting factual knowledge in LLMs, such as the ROME technique [for some strange reason, a lot of this work seems to have been done at universities in the People's Republic of China…]; that approach might also be able to overwrite memorized text, possibly at the cost of some model quality.

A Large Practical Problem

So, this sounds like a tricky problem for superscalers (and indeed also for slightly smaller companies pretraining open-source base models). The most obvious solution would be to carefully filter all the copyrighted material out of their pretraining set (or at least, try hard, document the fact, and if some slipped through tell the judge "Oops! Sorry, we goofed…"). Copyrighted material is supposed to have a copyright notice on it, and if it didn't, and you therefore didn't spot it, the judge might consider that an understandable goof; but to the best of my knowledge an accidentally-removed copyright notice doesn't actually remove copyright as a legal right.

However, this obvious solution is deeply unappealing to the superscalers: a lot of copyrighted material is extremely valuable as LLM pretraining material. It tends to be well-edited, informative, high-quality material that people put a lot of time and effort into (unlike much of the stuff given away for free on the Internet). Things like college textbooks contain a great deal of very high-quality technical information (that's why they're expensive). Scientific papers are also really valuable pretraining data, and are (technically) copyrighted by the journal publisher and author once actually published. So the superscalers have a very strong motivation to include a large amount of (at least selected) copyrighted text in their pretraining set, as great material for their model to learn from. They just need to make sure that their models don't memorize and then plagiarize chunks of it.

Note that this is a genuine alignment problem, not a capabilities problem: the LLMs are breaking the law by committing plagiarism, and everyone involved wants to make them stop. Doing this is a technical problem in alignment: a rather simple toy model of more serious alignment problems with more serious consequences. We need to make the model not do something that's very hard to recognize on the fly, because its detailed definition (the set of copyrighted material in the pretraining set) is complex and involves non-obvious causal/historical factors (copyright law, authors' dates of death, and things like that).

Being even more careful to detect and remove duplications in the pretraining set (don't include two different editions of the same textbook, for example) will likely help reduce the rate of memorization. The authors of the paper also suggest avoiding pretraining models for more than one epoch. However, going from the current state, where LLMs memorize some fraction of their pretraining set, down to zero isn't going to be possible. So it doesn't seem practicable to be entirely certain that an LLM hasn't memorized any copyrighted samples from the pretraining set.
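
To make the deduplication suggestion a little more concrete, here is a minimal sketch of near-duplicate detection using hashed word shingles and Jaccard overlap. This is an illustrative stand-in for the industrial-strength (e.g. MinHash-based) pipelines actually used on web-scale corpora; all names and thresholds here are my own choices.

```python
# Minimal sketch of near-duplicate detection over a pretraining corpus, using
# hashed word 10-grams ("shingles") and Jaccard overlap. Illustrative only:
# real deduplication pipelines use much more scalable approximations.
import hashlib

def shingle_hashes(text: str, n: int = 10) -> set[int]:
    words = text.split()
    return {
        int(hashlib.md5(" ".join(words[i:i + n]).encode()).hexdigest(), 16)
        for i in range(max(len(words) - n + 1, 1))
    }

def jaccard(a: set[int], b: set[int]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicates(documents: dict[str, str], threshold: float = 0.5):
    """Yield pairs of document ids whose shingle overlap exceeds the threshold
    (e.g. two different editions of the same textbook)."""
    hashed = {doc_id: shingle_hashes(text) for doc_id, text in documents.items()}
    ids = list(hashed)
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            if jaccard(hashed[x], hashed[y]) >= threshold:
                yield x, y
```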

You could presumably use the same techniques the authors of the paper did to identify some of these memorized excerpts, but they only actually found a small proportion of them (and then used fancy statistics to estimate what proportion, so they could estimate the total). They also showed that some memorized excerpts are easier to get the model to regurgitate than others, so finding all of them is going to be impractical.
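
For a flavor of what that kind of statistical extrapolation can look like (my illustration of the general idea, not necessarily the authors' exact procedure), a Good-Turing-style estimate uses the fraction of extracted memorized strings that were seen exactly once to estimate how much memorized material remains undiscovered:

```python
# Illustrative Good-Turing-style estimate (my gloss, not necessarily the
# authors' exact procedure): the fraction of extraction successes seen exactly
# once estimates the chance that the next success is something new, i.e. how
# much memorized material is still undiscovered.
from collections import Counter

def unseen_memorization_mass(extracted_strings: list[str]) -> float:
    counts = Counter(extracted_strings)
    total_extractions = sum(counts.values())
    seen_exactly_once = sum(1 for c in counts.values() if c == 1)
    return seen_exactly_once / total_extractions
```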

An expensive brute-force solution would be to build a suffix array index containing all the known-copyrighted text you used in your pretraining set, and run all your inference output past this, then stop inference if you detect a run of more than n matching tokens (for n probably somewhere in the range 30-50). This is likely to be expensive (the paper's authors did something similar and it took them weeks), and the cost scales with the amount of inference you're doing.
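
Here is a minimal sketch of that matching logic. For clarity it uses a set of hashed length-n token windows over the known-copyrighted corpus rather than a real suffix array (which is what you would want at production scale, for memory efficiency and for finding the longest match rather than any match); the names and the n=40 threshold are illustrative.

```python
# Minimal sketch of the brute-force output filter: index every length-n token
# window of the known-copyrighted corpus, then flag any generated output that
# reproduces such a window verbatim. A real deployment would use a suffix array.
from collections.abc import Iterable

N_MATCH = 40  # "a run of more than n matching tokens, for n somewhere in 30-50"

def build_copyright_index(copyrighted_token_streams: Iterable[list[int]],
                          n: int = N_MATCH) -> set[tuple[int, ...]]:
    index = set()
    for tokens in copyrighted_token_streams:
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index

def output_violates(generated_tokens: list[int],
                    index: set[tuple[int, ...]],
                    n: int = N_MATCH) -> bool:
    """True if any run of n consecutive generated tokens appears verbatim in
    the known-copyrighted corpus; the serving stack would stop inference here."""
    return any(
        tuple(generated_tokens[i:i + n]) in index
        for i in range(len(generated_tokens) - n + 1)
    )
```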

Conditional Pretraining

I recently posted How to Control an LLM's Behavior (why my P(DOOM) went down), which discusses another paper: Pretraining Language Models with Human Preferences. This paper and my post both discuss a promising new alignment technique called conditional pretraining (or "pretraining from human feedback"[1]). As my recent post's title suggests, I'm very excited about this technique: if you haven't read both of them, please do. As I outline in that post, this alignment technique might be usable for things up to the complexity of learned-from-humans deceit, criminality, or various other unaligned behaviors models will inevitably learn from human and/or fictional bad examples.

This alignment technique is also ideally suited to this problem: indeed, applying it here is almost trivial. Here's all you have to do:

  1. Add a new token to your tokenizer, representing a new special tag that I'll call <|copyright|/>
  2. For all the known copyrighted material in your pretraining set, use standard NLP techniques to identify sentence breaks in text, line breaks in code or poetry, and similarly frequent periodic break-points in anything else, and insert a <|copyright|/> tag at each of them (plus at the beginning and end of the document), all through every copyrighted portion of every document in the pretraining set (a minimal sketch of steps 1 and 2 appears after this list).
  3. Next time you're re-pretraining your base model in order to update it (to a new knowledge cutoff date, most likely), use this tagged version of all the known copyrighted material in the pretraining set.
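
As promised above, here is a minimal sketch of steps 1 and 2, using the Hugging Face transformers API and a crude regex sentence splitter as a stand-in for a proper NLP sentence/line segmenter. The tokenizer and model names are illustrative.

```python
# Minimal sketch of steps 1 and 2: add the <|copyright|/> special token, and
# insert it at every sentence break in a copyrighted document, plus at the
# beginning and end. The regex splitter is a crude stand-in for a proper
# sentence segmenter (and for line breaks in code or poetry).
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

COPYRIGHT_TAG = "<|copyright|/>"

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Step 1: add the new special token and grow the embedding matrix to match.
tokenizer.add_special_tokens({"additional_special_tokens": [COPYRIGHT_TAG]})
model.resize_token_embeddings(len(tokenizer))

# Step 2: tag a copyrighted document at every sentence break.
def tag_copyrighted_document(text: str) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return COPYRIGHT_TAG + COPYRIGHT_TAG.join(sentences) + COPYRIGHT_TAG
```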

Now, if the model has memorized a passage in the pretraining set, and then regurgitates it, the passage will include the <|copyright|/> tags between every sentence/line, because those were all through the text that it memorized.

However, the model may also have come to associate certain styles of writing/content (ones where a noticeable proportion of the material in that style in the pretraining set had <|copyright|/> tags in it) with there sometimes being <|copyright|/> tags between the sentences or at the ends of the lines, so it may occasionally produce these in ordinary, non-memorized generations of its own. I.e. it may hallucinate that something is copyrighted that is in fact its own work.

Now, you have two choices of mode to use at inference time:

  1. [Cheap, possibly lower quality] Treat <|copyright|/> as a banned token during inference: calculate the logit value for it, then reset this to −∞ (so probability 0) before picking a token to generate (a minimal sketch of both inference modes appears after this list). This forces the model to never produce the <|copyright|/> tag/token, so it can only produce text that it thinks is not copyrighted. It will thus be unable to regurgitate any memorized copyrighted chunks: at the end of each sentence/line that it might attempt to regurgitate, its attempt to do so will get disrupted. (In practice, it might occasionally manage to get 2 or perhaps even 3 sentences out before it gets knocked off track; we should experiment and test this.) Even if that happened occasionally, that's a piece of plagiarism short enough that it would probably be a hard case for the plaintiff to win in court, and trying to make some sort of "fair use without attribution" argument on something that short might be more successful. More often, the LLM will emit at most one sentence and then get knocked off track, which is a chunk short enough that the legal jeopardy should be pretty small (even if we got unlucky and the sentence itself were very specific). Or, in styles/contexts where some of the most similar documents were copyrighted and some weren't, if the model was about to hallucinate that what it was going to produce next was copyrighted, and thus produce a <|copyright|/> tag, it's forced to "reroll" and instead start something that it won't be hallucinating as copyrighted. If almost all of the relevant pretraining material was copyrighted, or if the relevant copyrighted material was consistently better quality than the corresponding uncopyrighted material, then this might cause the LLM to produce text of worse quality than it would otherwise have (if we hadn't banned the <|copyright|/> tag). By monitoring the generated logits for the <|copyright|/> tag and seeing when they reach a significant level, you can detect when and where in the text one of these two things is happening, but it doesn't tell you which of the two it was due to: real memorized material or hallucinated copyright.
  2. [Expensive, full quality] Generate the <|copyright|/> tags, and just filter them out before sending the output on to your users. Build the big, expensive-to-run suffix array index to filter for known-copyrighted material that's in your pretraining set, as described previously, but now you only need to actually use it to check material from your inference output if it has been marked with <|copyright|/> tags before or after it. This should reduce the cost of this filtering process considerably, depending on how often these tags get generated. (You might be able to reduce the hallucinated <|copyright|/> tag generation frequency with fine-tuning, but this might risk the model regurgitating a memorized chunk without tags; research is needed. The results might depend on, say, which layers were frozen/unfrozen during that fine-tuning, or how it was done: ideally you want the model to still regurgitate the <|copyright|/> tags in memorized material, but hallucinate them less often. There are likely to be different circuits involved in these two processes, so if your interpretability was good enough, or using something along the lines of ROME, you might be able to identify them, or just use trial and error on which layers do what, or come up with some ingenious fine-tuning method to maintain the <|copyright|/> tags in memorized text chunks.) Also, you will gradually start to identify what material in your pretraining set got memorized, so you can a) work out kinks in the pretraining process that caused this to happen, and perhaps also b) optimize the expensive filtering process.
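
As a concrete illustration of both modes, here is a minimal sketch. It reuses the `tokenizer` and COPYRIGHT_TAG names from the tagging sketch above and the `output_violates`/`index` helpers from the suffix-array sketch; everything here is an illustrative assumption, not a production serving stack.

```python
# Minimal sketch of the two inference modes. `tokenizer` and COPYRIGHT_TAG are
# as in the earlier tagging sketch; `output_violates` and `index` are from the
# suffix-array-style filter sketch. Illustrative only.
import torch

copyright_id = tokenizer.convert_tokens_to_ids(COPYRIGHT_TAG)

# Mode 1 (cheap): at every decoding step, reset the tag's logit to -inf
# (probability 0) before sampling, so the model can never emit it and gets
# knocked off track whenever it tries to continue a memorized tagged chunk.
def ban_copyright_tag(logits: torch.Tensor) -> torch.Tensor:
    logits[..., copyright_id] = float("-inf")
    return logits

# Mode 2 (expensive, full quality): let the tag be generated, hide it from the
# user, and only run the expensive copyright filter on output that appeared
# next to a tag. (Simplification: everything after the first tag is checked.)
def postprocess_mode2(generated_ids: list[int], index) -> list[int]:
    visible, suspect = [], []
    seen_tag = False
    for tok in generated_ids:
        if tok == copyright_id:
            seen_tag = True
            continue  # never show the tag itself to the user
        visible.append(tok)
        if seen_tag:
            suspect.append(tok)
    if suspect and output_violates(suspect, index):
        raise RuntimeError("possible verbatim copyrighted output; refuse or regenerate")
    return visible
```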

Note that if the price/quality difference between these two inference modes was significant, you could charge your users different rates for them. Or you could use some piece of context to intelligently select which one to use for which query.

If there was a legal need, this strategy could be extended to also label things like trademarks, company names, or character names unusual and specific enough that it would be embarrassing for the model to reuse them in a fictional context ("Albus Dumbledore" comes to mind).

Why this is the Ideal Alignment Technique for this Problem

There are alignment techniques, such as Activation Engineering, that work in the residual embedding space (at some layer), ones like Causal Scrubbing that work at the level of individual neural circuits, and ones like ROME that work at the level of small numbers of individual parameters. The complexity of the sets of things each of these can affect is thus going to be limited by the dimensionalities of those representations. The entire contents of all the copyrighted documents in the pretraining set has a complexity far, far higher than any of those (in fact, bigger than the entire model: even just what actually gets memorized is a noticeable fraction of your entire model's complexity). So clearly none of those approaches can possibly work reliably for this problem. Thus the only alignment approach that can work reliably here is one that gives supervision to the model at an extremely fine-grained level during pretraining: like, for every sentence of each document in the pretraining set. I.e. this alignment technique (or something very similar) is the only way to solve this problem.

  1. ^

    The paper's authors called this technique "Pretraining with Human Feedback" (PHF) — however that name doesn't quite describe the specific use case for the technique I'm proposing here, unless "Don't quote this, it's copyright!" is considered "human feedback".
