This is a link-post for the paper Scalable Extraction of Training Data from (Production) Language Models and the associated blog post Extracting Training Data from ChatGPT, followed by my reactions, including an analysis of the major implications for copyright lawsuits and a suggestion of an alignment technique to solve this problem.
Please note that I am not a lawyer, and my remarks below concerning intellectual property law might be mistaken or oversimplified (particularly for non-US jurisdictions).
The paper and blog post are both readable, interesting, and seem like important work: I highly recommend reading them yourself. So I'm not even going to try to summarize them, except for the authors' basic results relevant to what I want to say.
IMO they conclusively demonstrate, for both ChatGPT and a wide range of open-source LLMs, that these models memorize a non-trivial amount of their pretraining data verbatim, and that a sufficiently determined attacker can extract at least some of it from them.
A significant number of copyright lawsuits related to LLMs and diffusion models have been making their way through the courts. The superscalers are taking the legal position that training an AI model on copyrighted material is fair use, comparable to an author having first read other authors' books, or an artist having first seen other artists' paintings, and then being influenced by their styles (without actually plagiarizing them). This is routine for human creators, and copyright law acknowledges it. They are also claiming that the LLM's output is influenced by so many different things that it is not a "derivative work" of any one of them, in the sense of copyright law.
Very few of the cases have been decided, so there isn't clear precedent yet, but in the few rulings that have occurred so far, the superscalers appear to be winning the legal arguments. Relatedly, various superscalers (some of whom own image models that they say were not trained on any copyrighted work they didn't either own or buy access to), plus OpenAI (whose LLMs almost certainly were trained on some copyrighted works), have promised to indemnify customers using their models commercially against any copyright claims by intellectual property owners: i.e. they are promising to pay their customers' legal fees and any fines (I gather in exchange for getting to be involved in the lawsuits). Anthropic, notably, have not yet chosen to match this promise, though they have supported a similar legal theory.
So far, none of these cases has involved a situation where the model memorized, and then was caught reciting verbatim (or almost verbatim), a significant fraction of a page from a copyrighted work belonging to a plaintiff. However, this paper makes it clear that that will happen sooner or later (and sooner if the plaintiffs have skilled technical help). An average copyrighted author has probably published several thousand pages of material: if these were all in the pretraining set and the model memorized some small fraction of it, that might be chunks adding up to a few pages. If so, a sufficiently exhaustive effort might be able to recover one or more of these (especially with access to the base model). At that point, claiming that the LLM's output was not a derivative work seems likely to be an extremely difficult legal position to defend. The defense could try claiming that the excerpt was a 'fair use' excerpt, but generally, outside a few circumstances like parody where the reader is expected to know what's going on, you're not allowed to pass off even an excerpt as your own work: you're expected to credit it to the original author/copyright holder. So there are a variety of conditions around how a fair-use excerpt can be used for it to remain fair use, and if an LLM spits out a chunk of copyrighted material without it being clear that that's what it just did, then there's nothing in place to make those legally required things actually happen.
So, now that this paper is out, the plaintiffs may, if they're lucky and have competent technical help who can replicate these techniques, be able to demonstrate that the model has memorized a chunk of their copyrighted work, and that they were in fact able to persuade it to regurgitate it. What they are still very unlikely to be able to demonstrate is that it has actually previously done so during normal use, or that they lost any money as a result. But their lawyers will certainly ask the superscaler's lawyers to tell the court what is being done to ensure that that doesn't happen, and the only answer they'll have will basically be "complicated technical efforts to make it less likely, but not impossible, nor detectable if it did happen". Which the judge may or may not consider a satisfactory answer. (I suspect "our latest version now commits plagiarism at a 40% reduced rate!" is not going to go down well in court.)
So my reading of this is that sooner or later OpenAI and similar superscalers, or in the case of Anthropic, possibly their customers, may be in what is generally referred to as Big, Fat, Legal Trouble.
Also, when a superscaler loses a case like this (or perhaps decides to settle out of court, since in this situation something did actually happen that they didn't intend: they didn't want their model to memorize any copyrighted material from the pretraining set, they just couldn't stop it), then as well as a Big, Fat compensation payment, they're almost certainly going to need to demonstrate that they have removed all memorized text from the plaintiff's intellectual property from their model; that they have ceased and desisted. Saying "we'll make sure to remove that next time we redo our very expensive pretraining from scratch" seems unlikely to cut it. If so, their options are to install a filter that scans all of their inference output for any of the plaintiff's intellectual property that was in their pretraining set (which seems likely to make the judge ask "why didn't you already do that?"), or else find a way to do surgery on the model. There has been some work done on overwriting factual knowledge in LLMs, such as the ROME technique [for some strange reason, a lot of this work seems to have been done at universities in the People's Republic of China…]; that approach might also be able to overwrite memorized text, possibly at the cost of some model quality.
So, this sounds like a tricky problem for superscalers (and indeed also for slightly smaller companies pretraining open-source base models). The most obvious solution would be to carefully filter all the copyrighted material out of their pretraining set (or at least, try hard, document the fact, and if some slipped through, tell the judge "Oops! Sorry, we goofed…"). Copyrighted material is supposed to have a copyright notice on it, and if it didn't, and you didn't spot it as a result, the judge might consider that an understandable goof; but to the best of my knowledge an accidentally removed copyright notice doesn't actually remove copyright, as a legal right.
However, this obvious solution is deeply unappealing to the superscalers: a lot of copyrighted material is extremely valuable as LLM pretraining material. It tends to be well-edited, informative, high-quality material that people put a lot of time and effort into (unlike much of what is given away for free on the Internet). Things like college textbooks contain a great deal of very high-quality technical information (that's why they're expensive). Scientific papers are also really valuable pretraining data, and are (technically) copyrighted by the journal publisher and author once actually published. So the superscalers have a very strong motivation to include a large amount of (at least selected) copyrighted text in their pretraining set, as great material for their model to learn from. They just need to make sure that their models don't memorize and then plagiarize chunks of it.
Note that this is a genuine alignment problem, not a capabilities problem: the LLMs are breaking the law by committing plagiarism, and everyone involved wants to make them stop. Doing this is a technical problem in alignment: a rather simple toy model of more serious alignment problems with more serious consequences. We need to make the model not do something that's very hard to recognize on the fly, because its detailed definition (the set of copyrighted material in the pretraining set) is complex and involves non-obvious causal/historical factors (copyright law, authors' dates of death, and things like that).
Being even more careful to detect and remove duplicated material in the pretraining set (don't include two different editions of the same textbook, for example) will likely help reduce the rate of memorization. The authors of the paper also suggest avoiding pretraining models for more than one epoch. However, going from the current state, in which LLMs memorize some fraction of their pretraining set, all the way down to zero isn't going to be possible. So it doesn't seem practicable to be entirely certain that an LLM hasn't memorized any copyrighted samples from its pretraining set.
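As a very rough illustration of that dedup step, here's a sketch of document-level near-duplicate detection; the helper names, shingle length, and 0.8 threshold are my own illustrative choices, and real pipelines would use scalable methods like MinHash/LSH rather than pairwise comparison:

```python
# Rough sketch: flag near-duplicate documents (e.g. two editions of the same
# textbook) by comparing sets of word n-grams ("shingles"). Pure Python and
# O(n^2) over documents; only meant to make the idea concrete.

def shingles(text: str, n: int = 8) -> set:
    """Lower-cased word n-grams of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def find_near_duplicates(docs: dict, threshold: float = 0.8) -> list:
    """Return pairs of document ids whose shingle sets overlap heavily."""
    keys = list(docs)
    shingle_sets = {k: shingles(docs[k]) for k in keys}
    pairs = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if jaccard(shingle_sets[a], shingle_sets[b]) >= threshold:
                pairs.append((a, b))
    return pairs
```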
You could presumably use the same techniques the authors of the paper did to identify some of these memorized excerpts, but they only actually found a small proportion of them (and then used fancy statistics to estimate what proportion, so they could estimate the total). They also showed some memorized excerpts are easier to get the model to regurgitate than others, so finding all of them is going to be impractical.
An expensive brute-force solution would be to build a suffix array index containing all the known-copyrighted text you used in your pretraining set, run all your inference output past this, and stop inference if you detect a run of more than some threshold number of matching tokens (the threshold probably being somewhere in the range 30-50). This is likely to be expensive (the paper's authors did something similar and it took them weeks), and the cost scales with the amount of inference you're doing.
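To make that concrete, here's a minimal sketch of such a filter. For simplicity it uses an in-memory set of K-token windows rather than a true suffix array (which is closer to what the paper's authors built, and what you'd want at scale); the function names and the threshold of 40 tokens are just illustrative assumptions:

```python
# Minimal sketch of an output filter for memorized copyrighted text.
# `copyrighted_docs` is assumed to be an iterable of token-id lists for the
# known-copyrighted documents in the pretraining set.

K = 40  # threshold: flag any run of >= K consecutive matching tokens

def build_ngram_index(copyrighted_docs, k=K):
    """Collect every k-token window from the copyrighted corpus."""
    index = set()
    for tokens in copyrighted_docs:
        for i in range(len(tokens) - k + 1):
            index.add(tuple(tokens[i:i + k]))
    return index

def contains_memorized_run(output_tokens, index, k=K):
    """True if the model output shares a contiguous k-token run with the corpus.

    Any match of length >= k necessarily contains at least one k-gram from
    the index, so checking k-grams is sufficient to detect it.
    """
    return any(tuple(output_tokens[i:i + k]) in index
               for i in range(len(output_tokens) - k + 1))

# Usage sketch: check each generation before returning it to the user.
# if contains_memorized_run(output_token_ids, index):
#     stop_inference_or_regenerate()
```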
I recently posted How to Control an LLM's Behavior (why my P(DOOM) went down), which discusses another paper: Pretraining Language Models with Human Preferences. That paper and my post both discuss a promising new alignment technique called conditional pretraining (or "pretraining with human feedback"[1]). As my recent post's title suggests, I'm very excited about this technique; if you haven't read both of them, please do. As I outline in that post, this alignment technique might be usable for things up to the complexity of learned-from-humans deceit, criminality, or various other unaligned behaviors models will inevitably learn from human and/or fictional bad examples.
This alignment technique is also ideally suited to this problem: indeed, applying it here is almost trivial. Here's all you have to do: identify which documents in your pretraining set are (or might be) copyrighted, and insert a special <|copyright|> token between every sentence of them (or at the end of every line), so the tag is threaded all through any text the model might memorize, as sketched below.
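Here's a minimal sketch of what that tagging step might look like; the exact tag string, the crude regex sentence splitter, and the `is_copyrighted` flag are my own illustrative choices, not anything specified in either paper:

```python
import re

COPYRIGHT_TAG = "<|copyright|>"  # assumed to be added as a special token

def tag_copyrighted_document(text: str) -> str:
    """Insert the <|copyright|> tag between sentences (and at line ends) of a
    copyrighted document, so the tag is threaded through anything the model
    might memorize from it."""
    tagged_lines = []
    for line in text.splitlines():
        if not line.strip():
            tagged_lines.append(line)
            continue
        # Crude sentence split on ., !, ? followed by whitespace; a real
        # pipeline would use a proper sentence segmenter.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", line) if s]
        tagged_lines.append(f" {COPYRIGHT_TAG} ".join(sentences) + f" {COPYRIGHT_TAG}")
    return "\n".join(tagged_lines)

def prepare_pretraining_example(text: str, is_copyrighted: bool) -> str:
    """Tag the document iff it is flagged as copyrighted."""
    return tag_copyrighted_document(text) if is_copyrighted else text
```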
Now, if the model has memorized a passage in the pretraining set, and then regurgitates it, the passage will include the <|copyright|> tags between every sentence/line, because those were all through the text that it memorized.
However, the model may also have come to associate certain styles of writing/content (ones where a noticeable proportion of the material in that style in the pretraining set had <|copyright|> tags in it) with there sometimes being <|copyright|> tags between all the sentences/at the ends of all the lines, so it may produce these on occasion in ordinary, non-memorized generations of its own. I.e. it may hallucinate that something is copyrighted which is in fact its own work.
Now, you have two choices of mode to use at inference time: either ban the <|copyright|> token entirely during sampling, so the model is steered away from continuations it associates with copyrighted (and therefore possibly memorized) material, which is cheap but, because of the hallucination issue above, may cost some output quality; or allow the token to be generated and treat its appearance as a signal that the output may contain memorized copyrighted material, stopping, regenerating, or flagging that output, which preserves quality but costs extra inference whenever a retry is needed.
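Here's a rough sketch of those two modes, written against a hypothetical model API (`next_token_logits` and `sample` are stand-ins, not any real library's methods), just to make the trade-off concrete:

```python
COPYRIGHT_TAG_ID = 50257  # hypothetical token id of <|copyright|>

def generate_mode_a(model, prompt_ids, max_tokens=256):
    """Mode A: ban the <|copyright|> token at every sampling step, steering
    the model away from continuations it associates with copyrighted (and
    therefore possibly memorized) text. Cheap, but it will also steer away
    from some perfectly original generations it mislabels as copyrighted."""
    ids = list(prompt_ids)
    for _ in range(max_tokens):
        logits = model.next_token_logits(ids)      # hypothetical model API
        logits[COPYRIGHT_TAG_ID] = float("-inf")   # make the tag unsampleable
        ids.append(int(logits.argmax()))           # greedy, for simplicity
    return ids

def generate_mode_b(model, prompt_ids, max_tokens=256, max_retries=3):
    """Mode B: let the tag be generated, and treat its appearance as a signal
    that the output may contain memorized copyrighted material; reject and
    retry (or flag for review). Preserves quality, costs extra inference."""
    for _ in range(max_retries):
        ids = model.sample(prompt_ids, max_tokens)  # hypothetical model API
        if COPYRIGHT_TAG_ID not in ids:
            return ids
    return None  # give up: fall back to Mode A or a human review queue
```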
Note that if the price/quality difference between these two inference modes was significant, you could charge your users different rates for them. Or you could use some piece of context to intelligently select which one to use for which query.
If there was a legal need, this strategy could be extended to also label things like trademarks, company names, or character names unusual and specific enough that it would be embarrassing for the model to reuse them in a fictional context ("Albus Dumbledore" comes to mind).
There are alignment techniques that work in the residual embedding space (at some layer), others that work at the level of individual neural circuits, and ones like ROME that work at the level of a small number of individual parameters. The complexity of the set of things each of these can affect is thus limited by the dimensionality of the representation it operates on. The entire contents of all the copyrighted documents in the pretraining set has a complexity far, far higher than any of those (indeed, larger than the entire model: even just the portion that actually gets memorized is a sizeable fraction of your entire model's complexity). So clearly none of those approaches can possibly work reliably for this problem. Thus the only alignment approach that can work reliably here is one that gives the model supervision at an extremely fine-grained level during pretraining: like, for every sentence of each document in the pretraining set. I.e. this alignment technique (or something very similar) is the only way to solve this problem.
The paper's authors called this technique "Pretraining with Human Feedback" (PHF) — however that name doesn't quite describe the specific use case for the technique I'm proposing here, unless "Don't quote this, it's copyright!" is considered "human feedback".