For years, researchers have trained machine learning systems on whatever data they could find. People mostly haven't cared about this or paid attention, I think because the systems hadn't been very good. Recently, however, some very impressive systems have come out, including ones that answer questions, complete code, and generate images from prompts.

Because these are so capable, a lot more people are paying attention now, and there are big questions around whether it's ok that these systems were trained this way. Code that I uploaded to GitHub and the writing that I've put into this blog went into training these models: I didn't give permission for this kind of use, and no one asked me if it was ok. Doesn't this violate my copyrights?

The machine learning community has generally assumed that training models on some input and using them to generate new output is legal, as long as the output is sufficiently different from the input. This relies on the doctrine of "fair use", which does not require any sort of permission from the original author as long as the new work is sufficiently "transformative". For example, if I took a book and replaced every instance of the main character's name with my own, I doubt any court would consider that sufficiently transformative, and so my book would be considered a "derivative work" of the original book. On the other hand, if I took the words in the book and painstakingly reordered them to tell a completely unrelated story, there's a sense in which my book was "derived" from the original one, but I think it would pretty clearly be transformative enough that I wouldn't need any permission from the copyright holder.

These models can be used to create things that are clearly derivative works of their input. For example, people very quickly realized that Copilot would complete the code for Greg Walsh's fast inverse square root implementation verbatim, and if you ask any of the image generators for the Mona Lisa or Starry Night you'll get something close enough to the original that it's clearly a knock-off. This is a major issue with current AI systems, but it's also a relatively solvable one. It's already possible to slowly check that the output doesn't excessively resemble any input, and I think it's likely the developers of these systems will soon figure out how to do that efficiently. On the other hand, all of the examples of this I've seen (and I just did some looking) have been people trying to elicit plagiarism.
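To make the "slow check" concrete, here is a minimal sketch of one naive way to do it. This is my own illustration and not how Copilot or any image generator actually works: the function names, the whitespace tokenization, the 5-word window, and the 0.5 threshold are all assumptions picked for the example. It compares the output against every training document in turn, which is exactly why it is slow.

```python
def ngrams(tokens, n):
    """Return the set of all n-token windows in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def max_overlap(output, corpus, n=5):
    """Fraction of the output's n-grams that also appear in the single
    most similar training document.

    This is a linear scan over the whole corpus, so the cost grows with
    the size of the training set.
    """
    out = ngrams(output.split(), n)
    if not out:
        # Output shorter than n tokens: nothing to compare.
        return 0.0
    best = 0.0
    for doc in corpus:
        shared = out & ngrams(doc.split(), n)
        best = max(best, len(shared) / len(out))
    return best


def resembles_training_data(output, corpus, n=5, threshold=0.5):
    """Flag output that shares too many n-grams with any one input.

    The threshold is an arbitrary illustrative value, not a legal standard.
    """
    return max_overlap(output, corpus, n) >= threshold
```

A verbatim completion would score 1.0 and be flagged, while output that shares no five-word run with any training document would score 0.0. Doing this efficiently at training-set scale would need some kind of index rather than a scan, which is the part I expect to get solved.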

The normal use case is much more interesting, and more controversial. While the transformative fair use justification I described above is widely assumed within the machine learning community, as far as I can tell it hasn't been tested in court. There is currently a large class action lawsuit over Copilot, and it's possible this kind of usage will turn out not to qualify. Speculating, I think it's pretty unlikely that the suit will succeed, but I've created a prediction market on it to gather information.

Aside from the legal question, however, there is also a moral or social question: is it ok to train a model on someone's work without their permission? What if this means that they and others in their profession are no longer able to earn a living?

On the second question, you could imagine someone creating a model where they used only data that was either in the public domain or which they'd purchased appropriate licenses for. While that's great for the particular people who agree and get paid, a much larger number would still be out of work without compensation. I do think there's potentially quite a bad situation, where, as these systems get better, more and more people are unable to add much over an automated system, and we get massive technological unemployment. Now, historically worries here proved unfounded, and technology has consistently been much more of a human complement than a human substitute. As the saying goes, however, that was also the case for horses until it wasn't. I think a Universal Basic Income is probably the best approach here.

On the first question, learning from other people's work without their consent is something humans do all the time. You can't draw too heavily on any one thing you've seen without following a complex set of rules about permission and acknowledgement, but human creative work generally involves large amounts of borrowing. These machine learning systems are not humans, but they are fundamentally doing a pretty similar thing when they learn from examples, and I don't see a strong reason to treat their work differently here. Because these systems don't currently understand how much borrowing is ok, we do need to apply our own judgment to avoid technologically-facilitated plagiarism, but the normal case of creating something relatively original that pulls from a wide range of prior work is fine for us to do with our brains and should be equally ok for us to do with our tools.




Aside from the legal question, however, there is also a moral or social question: is it ok to train a model on someone's work without their permission? What if this means that they and others in their profession are no longer able to earn a living?

Every invention meant that someone lost a job. And although the classical reply is that new jobs were created, that doesn't necessarily mean that the people who lost the old job had an advantage at the new job. So they still lost something, even if not everything. But their loss was outweighed by the gain of many others.

I don't even think that an ideal society would compensate those people, because that would create perverse incentives -- instead of avoiding the jobs that will soon be obsolete, people would hurry to learn them, to become eligible for the compensation.

Universal Basic Income seems okay, but notice that it still implies a huge status loss for the artists. And that is ok.

A more complicated question is what if the AI can in some sense only "remix" the existing art, so even the AI users would benefit from having as many learning samples as possible... but now it is no longer profitable to create those samples? Then, artists going out of business becomes everyone's loss.

Perhaps the free market will solve this. If there is no way to make the AI generate some X that you want, you can pay a human to create that X. That on one hand creates a demand for artists (although far fewer than now), and on the other hand creates more art the AI can learn from. "But what about poor people? They can't simply buy their desired X!" Well, today they can't either, so this is not making their situation worse. Possibly better, if some rich person wants the same X, and will pay for introducing it to the AI's learning set.

(Or maybe the market solution will fail, because it simply requires too much training to become so good at art that someone would pay you, and unlike now, you won't be able to make money when you're just halfway there. In other words, becoming an artist will be an incredibly risky business, because you spend a decade or more of your life learning something that ultimately maybe someone will pay you for... or maybe no one will. Or would the market compensate by making good hand-made art insanely expensive?)

The permissions are only a temporary solution, anyway. Copyrights expire. People can donate their work to the public domain. Even with 100% legal oversight, the set of freely available training art will keep growing. Then again, slowing down a change can prevent social unrest. The old artists can keep making money for another decade or two, and the new ones will grow up knowing that artistic AIs exist.

Is it okay for a human to look at someone else's work and learn from it?

Yes; that's what my last paragraph ("learning from other people's work without their consent is something humans do all the time...") covers.

The human usually won't reproduce the original work too closely. And if they do, the human will be accused of plagiarism.

follow-up question in my mind: is it okay for a game playing agent to look at someone else's work and learn from it? we are guessing at the long-term outcomes of the legal system here, so I would also like to answer what the legal system should output, not merely what it is likely to. should game playing agents be treated more like humans than like supervised agents? My sense is that they should, because reinforcement learners trained from scratch in an environment have an overwhelming amount of their own knowledge, and only a small blip of their training data is the moment where they encounter another agent's art.

Competitive multiplayer games already have a situation where things are "discovered" and where you have to literally limit the flow of information if you want to control what others do with the information. I guess the modifier that money flows are often not involved might explain why it has not been scrutinised that much. "History of strats" is already a YouTube genre.

It is kinda sad that for many games now you will "look up how it is supposed to be played", i.e. you first "learn the meta" and then go on your merry way forward.

I guess for computer agents it could be practical for the agents to have amnesia about the actual games that they play. But for humans any information of that kind is going to be shared when it is applied in the game. And there is the issue of proving that you didn't cheat by providing a plausible method.

no, I mean, if the game playing agent is highly general, and is the type to create art as a subquest/communication like we are - say, because of playing a cooperative game - how would an ideal legal system respond differently to that vs to a probabilistic model of existing art with no other personally-generated experiences?

Here are two artists exploring the issues of AI in art, and here is another artist arguing against it.

The former includes a few comments on AI in general and what is coming in the near future. "AI is not human. You play with a lion cub and it's fun, but that is before it's tasted human blood. So we may be entertaining something that is a beast that will eat us alive, and we cannot predict, we can speculate but we cannot predict, where this is going. And so there is a legitimate concern that it's going to do what it does in ways that we don't know yet."

maybe this is neither here nor there, but I'd love to see models that fully trace the impact of each individual training example through a model.

This is an interesting thought, but it seems very hard to realize as you have to distill the unique contribution of the sample, as opposed to much more widespread information that happens to be present in the sample.

Weight updates depend heavily on training order of course, so you're really looking for something like the Shapley value of the sample, except that "impact" is liable to be an elusive, high-dimensional quantity in itself.
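For concreteness, here is a minimal sketch of what the Shapley value means in this setting. It is a toy illustration under loud assumptions: `toy_value` is a made-up stand-in for "how good is a model trained on this subset of samples", and the exact computation below enumerates every ordering, so it is only feasible for a handful of samples. Nothing here is a real attribution method for neural networks.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every possible ordering of the players.

    The number of orderings grows factorially with len(players).
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen = set()
        for p in order:
            before = value(frozenset(seen))
            seen.add(p)
            # Marginal contribution of p given the players already seen.
            totals[p] += value(frozenset(seen)) - before
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(subset):
    """Hypothetical 'model quality' for a set of training samples.

    Samples "a" and "b" carry the same widespread information (worth 1.0
    once, no matter how many of them are present), while "c" carries
    something unique (worth another 1.0).
    """
    score = 0.0
    if "a" in subset or "b" in subset:
        score += 1.0
    if "c" in subset:
        score += 1.0
    return score
```

Calling `shapley_values(["a", "b", "c"], toy_value)` gives 0.5 each to "a" and "b" and 1.0 to "c": the two redundant samples split the credit for the shared information, and the unique sample keeps all of its own. Averaging over orderings is also what removes the dependence on training order, at least in this idealized form.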

hmmmm. yeah, essentially what I'm asking for is certified classification... and intuitively I don't think that's actually too much to ask for. there has been some work on certifying neural networks, and it has led me to believe that the current bottleneck is that models are too dense by several orders of magnitude. concerningly, more sparse models are also significantly more capable. One would need to ensure that the update is fully tagged at every step of the process such that you can always be sure how you are changing decision boundaries...

How likely is it that this becomes a legal problem rendering models unable to be published? Note that using models privately (even within a firm) will always be an option, as copyright only applies to distribution of the work.

I think it's pretty likely that the distribution of models trained on unlicensed copyrighted works that are capable of regurgitating close matches for those works is already a copyright violation. If the fair use defense relies on the combination of the model and how you use it being sufficiently transformative, that doesn't mean that the model itself qualifies.

Code that I uploaded to GitHub and the writing that I've put into this blog went into training these models: I didn't give permission for this kind of use, and no one asked me if it was ok. Doesn't this violate my copyrights?

GitHub requires that you set licence terms for your code. And you can't let outside parties access the code by accident; you have to specifically allow access. Either the use is or is not permitted by the set licences. And you published your blog. Would you go after people that apply things mentioned in your blog? You did in fact give the permission.

Now it is a little bit murky when there are novel uses which the licensor didn't have in mind. But it is not like we should assume that everything is banned by default if quite wide permissions have been granted. Old licences have to mean something in the new world.