I was recently talking with Daniel Kokotajlo about AI art. It turned out that he and I initially disagreed about the ethical questions, but by the end of the conversation, I had somewhat won him over to my position.
I have the vague impression that a lot of people (on the technology side) haven't thought through these issues so much, or (like me) have only recently thought these issues through (as a result of artists making a lot of noise about it!).
So I thought I would write a post. Maybe it will be persuasive to some readers.
Is this the most important conversation to be having about AI?
No. Copyright-adjacent issues with AI art are less important than AI-induced unemployment, which is in turn less important than the big questions about the fate of the human race.
However, it's possible that copyright-adjacent issues around intellectual property and AI will be one of the first major issues thrusting AI into the political sphere, in which case this discussion may help to shape public policy around AI for years to come.
The basic issues.
Large language models such as GPT, and AI image generators such as DALL-E, Imagen, Stable Diffusion, etc., are (very often) trained on copyrighted works without the permission of the copyright holder. This hasn't proven to be a legal problem, yet, but "legal" doesn't mean "ethical".
When models like GPT and DALL-E started coming out, I recall having the thought: oh, it's nice how these models don't really need to worry about copyright, because (I thought) deep learning turns out to generalize quite well, which means deep-learning-based systems aren't liable to regurgitate copyrighted material.
This turns out to be simply false; these systems are in fact quite liable to reproduce, or very nearly reproduce, copyrighted material when prompted in the right way.
Whether or not copyrighted material is precisely reproduced, or nearly reproduced, or not reproduced at all, there is, in any case, an argument to be made that these AI systems (if/when they charge for use) are turning a profit based on copyrighted material in an illegitimate way.
After all: the purpose of copyright law is, to a very large extent, to preserve the livelihood of intellectual property creators, who would otherwise have limited ability to profit from their own works due to the ease of reproducing them once made. Modern AI systems are threatening this, whether or not they technically violate copyright.
But I want to firmly distinguish between a few different issues:
- AI systems training on copyrighted data without the consent of the copyright holder. This is the main issue I will discuss.
- AI systems being capable of reproducing copyrighted works exactly or almost exactly. This is a consequence of the first bullet point, plus properties of modern ML systems, plus the absence of safeguards specifically preventing this from happening.
- AI systems imitating work in a more general sort of way, such as copying the style of specific artists who never consented to their work being used as training data. This is one of the main reasons to think that training on copyrighted work (without permission) has occurred, in cases where there isn't much public information about what data was used to train an AI. It is also one of the main reasons (I have seen) that artists want these systems to stop training on copyrighted works.
- AI putting artists and writers out of work. This is not the main topic of the post, but is an obvious underlying reason why people might be upset.
Some initial arguments.
It's not illegal.
Artists who take a position against AI art will sometimes describe the situation as follows: AI programmers steal our art, and use it to train AIs, which can then steal our artistic style, and thereby deprive us of business and livelihood (because AI can do it cheaper).
Several months ago, A.R. Stone made a LessWrong comment somewhat along these lines (quoted in part):
I'm having real trouble finding out about Dall E and copyright infringement. There are several comments about how Dall E can "copy a style" without it being a violation to the artist, but seriously, I'm appalled.
Some defenders of AI art then object, saying the law does not consider it theft, therefore no theft has taken place.
I am not a lawyer, and confess to ignorance about how the law currently treats AI or the likely outcome of court cases about it.
However, it seems clear to me that the current legal system is an attempt to codify reasonable rules in the absence of significant AI technology. The fact(?) that it's not legally considered theft doesn't mean it's not morally theft in a significant sense.
It seems to me like we're at a point where it would be very reasonable to have a society-wide conversation about what should and shouldn't be allowed.
It's what humans do.
Human artists "train on copyrighted works" (i.e., look at what other artists do and take inspiration from it). Furthermore, "fair use" allows humans to make significant use of copyrighted works, so long as the new work is "transformative" of the copyrighted material (amongst a short list of other fair-use conditions, including educational use).
Shouldn't we just treat AI the same way? So isn't "training on copyrighted material" fine?
In the same thread as the A.R. Stone comment I mentioned earlier, gbear605 makes an argument along these lines (quoted in part):
It seems to me that the only thing that seems possible is to treat it like a human that took inspiration from many sources. In the vast majority of cases, the sources of the artwork are not obvious to any viewer (and the algorithm cannot tell you one). Moreover, any given created piece is really the combination of the millions of pieces of the art that the AI has seen, just like how a human takes inspiration from all of the pieces that it has seen. So it seems most similar to the human category, not the simple manipulations (because it isn’t a simple manipulation of any given image or set of images).
Again, I would argue that this is a new situation which very well may call for different norms from the human case. Here are a few differences which we might consider relevant:
| Human artist learning from (copyrighted) works: | AI learning from (copyrighted) works: |
|---|---|
| Not very output-scalable. One human can only do so much work. | Very very output-scalable. Once you've trained a network, producing work is relatively inexpensive. One AI can disrupt the whole market. This is much less of a "level playing field". |
| Not very input-scalable. One human can only see so much media. | Much more input-scalable. Modern systems are trained on a significant fraction of human-produced media. Again, less of a "level playing field". |
| Humans form rich generalizations from a small number of examples. | Deep learning systems require huge amounts of data to approach human-level generalizations. This indicates, to an extent, that what's learned from a single example is "shallow". Perhaps this could be seen as closer to plagiarism. |
| Humans can understand and avoid the idea of copyright violation, and are often cautious to "not steal ideas" even beyond the legal requirements. With some notable exceptions, humans are really trying to create unique works. | Most current AI systems have no safeguards with respect to copyright violations, and certainly don't have the human idea of "not stealing ideas". Indeed, to a large extent, these systems are being trained to mimic their input data as closely as possible. |
| It's a human, gosh darn it! | It's not a human, gosh darn it! As anthropocentric as the idea may be, it's pretty standard for the law to treat humans differently. |
My opinion would be that this calls for a civilization-wide discussion of what the new norms should be.
There's no precedent for calling this immoral.
"Sure", you say, "There's no precedent for AI creativity at the level we're now seeing. But I'm afraid that argument cuts both ways. You can't call modern training methods 'unethical' out of the blue. If there had been previous illustrations of this kind of dilemma in science fiction, for example, with a clear consensus amongst sci-fi authors that training AIs on copyrighted works would be considered unethical, fine. But prior to current complaints, there was no such consensus against these techniques! Artists are clearly making up new ethical rules because they are upset about losing jobs."
Counter #1: But I vaguely felt like there was a consensus on this?!
You could easily accuse me of hindsight bias and/or constructed memory, but as I've already mentioned, I recall assuming that OpenAI and other companies had done their due diligence to make sure that they weren't stepping over the line.
I imagine a lot of other AI-oriented grad students have thought about trying to train image-generation stuff at one point or another in their career. I certainly did. I have the impression that, say, 2014-me included in such plans steps such as "obtain permission from the artists, or otherwise, seek out training material that has fallen out of copyright."
This is definitely more like "academic caution" than "legal caution"; but it's standard practice in academia to make attribution clear, just as it is in art. It seems like just a mistake to think that caution about proper attribution should go away when those two worlds cross over.
For example, I think there's a clear academic consensus that you should obtain permission (and properly attribute) if you reproduce someone else's figure in your paper. It doesn't make a difference whether it's publicly available on the web.
It's not a logical deduction or anything, but it seems to me like natural academic caution about attribution extends to the point where you ask copyright holders before using copyrighted data to train an AI.
I also seem to recall a very early writing-assistance tool based on Markov chains (I'm not claiming it was commercially successful or anything), which advertised, as an explicit feature, that it had a filter to make absolutely sure that it would not auto-suggest sections from copyrighted works. This isn't a precedent for "don't train on copyrighted works without permission", but it is a precedent for "be cautious around copyright", and in particular "put precautions in place to make sure your AI doesn't reproduce copyrighted work".
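To make the idea of such a filter concrete, here is a minimal sketch of how a Markov-chain suggester with a reproduction filter might look. This is not the tool I am remembering (I don't know how it worked); the function names, the word-level chain, and the choice of a 4-word overlap threshold are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict

def ngrams(words, n):
    """Return the set of all n-word sequences that occur in `words`."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_chain(words):
    """First-order word-level Markov chain: word -> list of observed next words."""
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def suggest(chain, start, length, protected, n=4, seed=0, max_tries=100):
    """Sample a suggestion; reject any sample sharing an n-gram with `protected`.

    `protected` is a set of n-grams taken from works we must not reproduce
    (hypothetical input; a real tool would need a much larger index).
    Returns a list of words, or None if every attempt was rejected.
    """
    rng = random.Random(seed)
    for _ in range(max_tries):
        out = [start]
        while len(out) < length and chain.get(out[-1]):
            out.append(rng.choice(chain[out[-1]]))
        if not (ngrams(out, n) & protected):
            return out
    return None

# Toy usage: the chain was trained only on the protected text, so the only
# thing it can produce is that text, and the filter refuses to suggest it.
protected_words = "a b c d e f g".split()
chain = build_chain(protected_words)
blocked = suggest(chain, "a", 6, ngrams(protected_words, 4))
unfiltered = suggest(chain, "a", 6, set())
```

In this toy case `blocked` is `None`, because every sample repeats a 4-word run from the protected text, while `unfiltered` is the verbatim reproduction the filter exists to prevent. Note that this only addresses exact reproduction of outputs; it says nothing about whether the training itself was done with permission.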
Counter #2: There's a clear moral consensus about user data.
Another argument which Daniel Kokotajlo pointed out to me is that in recent years, there has been a growing consensus that there's something skeezy about harvesting user data and using it for things in general, especially without transparency about what's happening.
Harvesting data to train AI, without consent from the original creators, seems like it falls under this.
The Case For Dialogue?
In discussions like this, it's easy for one side to demonize or dismiss the other side. I think a lot of the problems here are arising because programmers weren't really thinking of artists at all when they made certain decisions. (Of course, this is only a guess.)
I was really glad to see a dialogue between a San Francisco techie and a prominent YouTube art channel. However, I was also disappointed by some aspects of the conversation.
I could write a long rant about my exact critique of that discussion, but I guess it would not be very interesting to read.
Basically, I think it could be done better. However, I worry that if I had a super-public conversation with an artist like this, my personal views would inevitably get attributed to MIRI, and this doesn't seem so good. I think other people who work for prominent organizations are in a similar position.
So I guess I'm saying: consider whether it might be worth a little of your time to reach out to artists, or (if you're an artist) reach out to AI programmers, or otherwise facilitate such conversations?
I think it's moderately plausible that this becomes an important issue in another election cycle or two, and a little plausible that conversations which take place now could help.
I'm not sure exactly which systems were and were not trained on copyrighted material; and in some cases, I think the information is not publicly available. The fact that most/all modern deep-learning image-generation tools I am aware of can copy the styles of a broad variety of specific artists when asked seems like significant evidence that most/all of these systems have been trained on copyrighted material.
But at least we know that Stable Diffusion has been, since its data-set is public.
I initially thought that modern ML (meaning, very very large transformer networks) was safe from this kind of risk because it showed an ability to generalize very well, and be very creative when output was generated by random sampling.
However, it turns out that modern ML memorizes its data quite well, meaning that it achieves extremely low loss when the same work is shown to it again during training. This means it's possible for it to generate stuff directly from its training data, just by sampling.
On the pro-AI-art side, I've seen the argument made that modern ML can't be memorizing its training data, since the size of the neural network (in bytes) is far far smaller than the size of the data-set. But this seems to be wrong.
Obviously, it's possible to compress the training data a lot. Obviously, it's possible for the network to memorize some things but not all.
But the most persuasive argument is when we re-generate images almost precisely, with only a text prompt.
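The size argument can be checked with back-of-envelope arithmetic. The sketch below uses made-up placeholder numbers (they are not measurements of any real model or data-set) to show why "the network is much smaller than the data" rules out memorizing everything, but does not rule out memorizing some things.

```python
def bytes_per_example(model_bytes, num_examples):
    """Average storage budget per training example if capacity were spread evenly."""
    return model_bytes / num_examples

def fraction_used_by_memorized(model_bytes, memorized_count, bytes_each):
    """Fraction of the model's bytes needed to store a subset of examples."""
    return memorized_count * bytes_each / model_bytes

# Hypothetical placeholder values, chosen only to illustrate the arithmetic.
MODEL_BYTES = 4_000_000_000        # assumed: a 4 GB network
NUM_EXAMPLES = 2_000_000_000       # assumed: 2 billion training images
MEMORIZED = 10_000                 # assumed: number of memorized images
COMPRESSED_BYTES_EACH = 20_000     # assumed: 20 KB per compressed image

even_budget = bytes_per_example(MODEL_BYTES, NUM_EXAMPLES)
subset_cost = fraction_used_by_memorized(
    MODEL_BYTES, MEMORIZED, COMPRESSED_BYTES_EACH
)
print(f"even budget: {even_budget} bytes per image")
print(f"memorizing the subset uses {subset_cost:.0%} of the model")
```

Under these assumed numbers, an even split leaves about 2 bytes per image, which is clearly too little to memorize them all; yet storing ten thousand images at 20 KB each would take only 5% of the network. So the size comparison alone does not show that nothing is memorized.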
I'm not sure exactly which systems have safeguards, or lack them. There was discussion of DALL-E
Being able to reproduce the style of a specific artist hurts the livelihood of that artist in ways that AI art in general does not. It allows scammers to pretend to be that artist, for example on social media websites. It also allows companies to produce products which use the style, where previously they would be forced to pay the original artist (or a good human imitator, which can be harder to find and might not save any money).
Of course, the reality is we don't yet know what's legal or illegal, because this hasn't yet been tested in court.
Daniel Kokotajlo made an argument similar to this.