Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

OpenAI developed DALLE-2. Then StabilityAI made an open source copycat. This is a concerning case study for AI alignment.

Stephen Casper (scasper@mit.edu)

Phillip Christoffersen (philljkc@mit.edu)

Rui-Jie Yew (rjy@mit.edu)

Thanks to Tan Zhi-Xuan and Dylan Hadfield-Menell for feedback. 

A different version of this post has been posted on the EA Forum. It is somewhat longer, focuses more on policy, and is written for a slightly more general audience. 

This post talks about NSFW content but does not contain any. All links from this post are SFW.

Abstract

Since OpenAI published their work on DALLE-2 (an AI system that produces images from text prompts) in April, several copycat text-to-image models have been developed including StabilityAI’s Stable Diffusion. Stable Diffusion is open-source and can be easily misused, including for the almost-effortless development of NSFW images of specific people for blackmail or harassment. We argue that OpenAI and StabilityAI’s efforts to avoid misuse have foreseeably failed and that both share responsibility for harms from these models. And even if one is not concerned about issues specific to text-to-image models, this case study raises concerns about how copycatting and open-sourcing could lead to abuses of more dangerous systems in the future. We discuss design principles that developers should abide by when designing advanced AI systems to reduce risks. We conclude that this case study highlights issues with working on risky capabilities and repudiates attempts to achieve AI alignment via racing to advance capabilities. 

What’s wrong?

Recent developments in AI image generation have made text-to-image models very effective at producing highly realistic images from captions. For some examples, see the paper from OpenAI on their DALLE-2 model or the release from Stability AI of their Stable Diffusion model. Deep neural image generators like StyleGAN and manual image editing tools like Photoshop have been on the scene for years. But today, DALLE-2 and Stable Diffusion (which is open source) are uniquely effective at rapidly producing highly-realistic images from open-ended prompts. 

There are a number of risks posed by these models, and OpenAI acknowledges this. Unlike conventional art and Photoshop, today’s text-to-image models can produce images from open-ended prompts by a user in seconds. Concerns include (1) copyright and intellectual property issues, (2) sensitive data being collected and learned, (3) demographic biases, e.g. producing images of women when given the input “an image of a nurse,” (4) the use of these models for disinformation by creating images of fake events, and (5) the use of these models to produce non-consensual, intimate deepfakes. 

These are all important, but producing intimate deepfakes is where abuse of these models seems to be the most striking and possibly where we are least equipped to effectively regulate misuse. Stable Diffusion is already being used to produce realistic pornography. Reddit recently banned several subreddits dedicated to AI-generated porn, including r/stablediffusionnsfw, r/unstablediffusion, and r/porndiffusion, for violating Reddit’s rules against non-consensual intimate media.

This is not to say that violations of sexual and intimate privacy are new. Even before the introduction of models such as DALLE-2 and Stable Diffusion, individuals were victims of non-consensual deepfakes. Perpetrators often make this content to discredit or humiliate people from marginalized groups, taking advantage of the negative sociocultural attitudes that already surround them. An estimated 96% of deepfake videos online are porn, almost all featuring women. In one case, when a deepfake video of a journalist committing a sex act she never performed went viral on the Internet, she was met with death threats. Her home address was leaked alongside false advertisements that she was available for sex. She could not eat, and she stopped writing for months. Other forms of sexual privacy violations have had similar consequences for victims, leading to economic injuries from damaged reputations in job searches and even to suicide.

The unique danger posed by today’s text-to-image models stems from how they can make harmful, non-consensual content production much easier than before, particularly via inpainting and outpainting, which allow a user to interactively build realistic synthetic images from natural ones, and via DreamBooth or other easily used tools, which allow for fine-tuning on as few as 3-5 examples of a particular subject (e.g. a specific person). More such tools are rapidly becoming available following the open-sourcing of Stable Diffusion. It is clear that today’s text-to-image models have capabilities distinct from those of methods like Photoshop, RNNs trained on specific individuals, or “nudifying” apps. These previous methods all require a large amount of subject-specific data, human time, and/or human skill. And no, you don’t need to know how to code to interactively use Stable Diffusion, uncensored and unfiltered, including in/outpainting and DreamBooth.

If Photoshop is like a musket, Stable Diffusion is like an assault rifle. And we can expect that issues from the misuse of these models will only become more pressing over time as they get steadily better at producing realistic content. Meanwhile, new graphical user interfaces will also make using them easier. Some text-to-video models are even beginning to arrive on the scene from Meta and Google, and a new version of Stable Diffusion will be released soon. New applications, capabilities, and interfaces for diffusion models are being released daily. So to the extent that it isn’t already easy, it will become easier and easier for these models to be used as tools for targeted harassment.

Unfortunately, current institutions are poorly equipped to adapt to increased harms in a way that protects those who are the most vulnerable. Concerted political action and research are often focused on the capacity for deepfakes to spread misinformation. This makes sense in light of how those in positions of political power stand to be affected the most by deepfake news. On the other hand, a combination of lack of oversight and sociocultural attitudes has often led victims of deepfake sex crimes to be met with indifference; law enforcement has told victims to simply “go offline”.

But even if one does not view the risks specific to text-to-image models as a major concern, the fact that these models have quickly become open-source and easily-abused does not bode well for the arrival of more capable AI systems in the future. There has been a slippery slope from DALLE-2’s release to today’s environment where Stable Diffusion can be easily used to cause devastating harm to people. And this offers a worrying case study on the difficulty of keeping risky AI capabilities out of the control of people who will misuse them. 

How did we get here?

On April 13, 2022, OpenAI released the paper on DALLE-2, which set the state of the art for producing realistic images from text. Along with the paper, OpenAI also created a website that allows users to query the model for image generation and editing. OpenAI did a great deal of work to avoid misuse of the model and wrote an extensive technical report on it. Their measures include (1) curating training data to avoid offensive content, (2) testing their own model for issues, (3) having an independent red team try to find problems with it, (4) not releasing the architecture or weights, (5) requiring users to sign up, provide an email, and explain their motivations for using the model, (6) having a waiting period for access (although a waiting period was no longer required as of late September 2022), (7) filtering user prompts that contain explicit content, famous people’s names, etc., (8) filtering images from the model, (9) suspending or banning users who enter too many suspicious prompts, and (10) continually updating their backend to respond to issues. 
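To make concrete what measures (7)-(9) amount to in practice, here is a minimal sketch of a server-side gate that rejects flagged prompts and suspends repeat offenders. The blocklist, thresholds, and function names are our own illustrative placeholders, not OpenAI’s actual moderation system; the point is that checks like these only bind because the model itself never leaves the provider’s servers.

```python
# Illustrative sketch only: placeholder blocklist, thresholds, and backend,
# not OpenAI's actual moderation pipeline.
from dataclasses import dataclass

BLOCKED_TERMS = {"example_blocked_term", "example_celebrity_name"}  # placeholders
STRIKE_LIMIT = 3

@dataclass
class UserState:
    strikes: int = 0
    banned: bool = False

users: dict[str, UserState] = {}

def handle_request(user_id: str, prompt: str) -> str:
    """Reject flagged prompts and suspend users who submit too many of them."""
    state = users.setdefault(user_id, UserState())
    if state.banned:
        return "error: account suspended"
    if any(term in prompt.lower() for term in BLOCKED_TERMS):
        state.strikes += 1
        if state.strikes >= STRIKE_LIMIT:
            state.banned = True
        return "error: prompt rejected by content policy"
    return generate_image(prompt)  # stand-in for the server-side model call

def generate_image(prompt: str) -> str:
    # Placeholder for the actual image-generation backend.
    return f"<image generated for: {prompt}>"
```

Checks of this kind constrain users only as long as the weights stay server-side; the same logic shipped alongside open weights would be trivially removable.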

To their credit, these measures seem to have been very successful at preventing the use of DALLE-2 for creating offensive content. We have seen anecdotal posts on Reddit from users who have reportedly tried to generate porn with DALLE-2 using crafty prompts like “transparent swimsuit” to no avail, often getting banned in the process. We are not aware of any clearly successful examples of anyone getting DALLE-2 to produce particularly offensive content at all, much less systematically.

So what’s the problem? Despite all of OpenAI’s efforts to avoid misuse of DALLE-2, they still provided the proof of concept for this type of model, and they still wrote about many of the details of their approach in their paper. This enabled others to fund and develop copycat models which can be more easily misused. OpenAI’s technical report on risks had no discussion of problems from copycat models other than a cursory mention that “DALLE-2…may accelerate both the positive and negative uses associated with generating visual content”. It seems odd that OpenAI did not meaningfully discuss copycats given the thoroughness of the report and the fact that past systems of theirs such as GPT-3 have been copycatted before (e.g. BLOOM). 

Two notable DALLE-2 copycats are Midjourney and eDiffi. But most relevant to this case study is Stable Diffusion from StabilityAI. StabilityAI is a startup whose homepage says it is “a company of builders who care deeply about real-world implications and applications.” It was founded in 2020 but came into the spotlight only recently, upon entering the image-generation scene; for example, it only created a Twitter account in July, a few months after the DALLE-2 paper. Their copycat, Stable Diffusion, is comparable to DALLE-2, and they have confirmed that DALLE-2 was a principal source of inspiration.

Relative to OpenAI, StabilityAI did a very poor job of preventing misuse of Stable Diffusion. In August, they announced that the model would be open-sourced. This release was accompanied by some mostly-ineffective measures to reduce harms from the model, such as a safety classifier for images that both works poorly and can simply be disabled by users. They also tried to restrict access to people who signed up with an email and provided a justification. The plan was to make Stable Diffusion available via HuggingFace on August 22, 2022 to those who were approved for access. This mattered very little, though, because the weights were leaked online a few days earlier. Then, predictably, people either used the model directly or fine-tuned versions of it to produce the type of offensive content that can be used for targeted harassment of individuals. Later in September, HuggingFace also made access to Stable Diffusion available to anyone on the internet with no signup, albeit with automated filtering for NSFW content built into this particular interface.  
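For intuition about why a bundled safety classifier adds so little friction, consider the general structure of a post-hoc filter (a generic sketch of the pattern, not StabilityAI’s actual code; `pipeline` and `nsfw_classifier` are placeholders): the unrestricted model runs first, and the check is ordinary code on the user’s own machine.

```python
# Generic sketch of the post-hoc "safety checker" pattern, not StabilityAI's
# actual implementation; `pipeline` and `nsfw_classifier` are placeholders.
import numpy as np

def generate_with_safety_filter(pipeline, nsfw_classifier, prompt: str) -> np.ndarray:
    image = pipeline(prompt)            # the full, unrestricted model runs first
    if nsfw_classifier(image) > 0.5:    # the check only happens afterwards, client-side
        return np.zeros_like(image)     # flagged outputs are replaced with a blank image
    return image

# Because the weights and this wrapper both run locally, the filter is a
# convention rather than a control: the underlying generator is unrestricted,
# and the wrapper is code the user can edit or remove.
```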

Overall, the slippery slope from the carefully-guarded DALLE-2 to the fully open-source Stable Diffusion took less than 5 months. On one hand, AI generators for offensive content were probably always inevitable. However, (1) not this soon. Delays in advancements like these increase the chances that regulation and safety work won’t be so badly outpaced by capabilities. (2) Not necessarily in a way that was enabled by companies like OpenAI and StabilityAI, who made ineffective efforts to avoid harms yet claim to have clean hands while profiting greatly off these models. And (3) other similar issues with more powerful models and higher stakes might be more avoidable in the future. What will happen if and when video generators, GPT-N, advanced generalist agents, or other potentially very impactful systems are released and copycatted? 

What do we want?

There are general principles that any AI system with any degree of public exposure ought to obey. Specifically, we propose three design principles as conditions for the responsible development of such systems. The key theme behind all of them is that companies should ideally be accountable not just for what their AI does in its intended use cases, but for all of the foreseeable consequences of the systems they release.

Scoping of function

Both the power and the risk of general-purpose AI systems lie in their broad applicability. For example, general-purpose text, image, or video generation could be used in a wide array of contexts, making it much harder to reason about their safety and impact. Therefore, it is useful to more precisely scope down technologies so that safety assurances can more readily be given. Note, however, that scoping of function requires that this scope is fixed. In other words, it means ensuring that, once an AI system is scoped down, it is not meaningfully usable for other purposes. For instance, a language model released for a relatively harmless purpose should not be easily hackable or fine-tunable to perform another, more harmful one.

Simple examples of implementing this could include only developing narrow versions of powerful models, either fine-tuned on narrower data or trained with a penalty for out-of-scope outputs in the training objective. This allows fulfillment of the scientific goals of releasing such models (i.e. demonstrating strong capability over a specified domain) while making misuse more difficult. But if restrictions on general AI models cannot work in practice, this criterion means one ought not to develop, deploy, or release unwieldy general models in the first place.
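As a rough illustration of the out-of-scope-penalty idea, here is a sketch of a training step that adds a penalty for outputs flagged by a scope classifier. This is our own illustration under the assumption that such a classifier is available; `generator`, `scope_classifier`, and the data format are placeholders, not a reference implementation from any released system.

```python
# Illustrative sketch of the "penalty for out-of-scope outputs" idea.
# `generator`, `scope_classifier`, and the batch format are assumptions.
import torch
import torch.nn.functional as F

def scoped_training_step(generator, scope_classifier, batch, optimizer, lam=1.0):
    optimizer.zero_grad()
    outputs = generator(batch["inputs"])               # in-scope generation task
    task_loss = F.mse_loss(outputs, batch["targets"])  # ordinary training objective
    # scope_classifier estimates the probability that an output falls outside
    # the intended domain; penalizing it discourages out-of-scope capability.
    scope_penalty = scope_classifier(outputs).mean()
    loss = task_loss + lam * scope_penalty
    loss.backward()
    optimizer.step()
    return loss.item()
```

The weight `lam` trades off task performance against how strongly out-of-scope behavior is suppressed; such a penalty is, of course, only as reliable as the scope classifier itself.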

Limitations for access

Limiting access could include things as simple as forbidding screenshotting or copy-pasting of outputs from APIs. It could also include measures like filters on prompts or outputs. Stronger versions of this, for particularly capable technologies, might include restricting the set of people who can access these models or keeping as much about the model secret as possible in order to slow efforts at replication. Even if some of these measures are circumventable with effort, they may be able to meaningfully hinder abuses by adding friction.
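As one concrete illustration of “adding friction,” consider an API-side gate that combines manual approval of users with per-user rate limits. This is a hypothetical sketch; the names, approval list, and limits are ours, not any provider’s actual system.

```python
# Hypothetical sketch of access friction at the API layer: manual approval
# plus per-user rate limiting. Names and limits are illustrative only.
import time
from collections import defaultdict, deque

APPROVED_USERS = {"researcher_a", "artist_b"}   # granted after manual review
MAX_REQUESTS_PER_HOUR = 50

request_log: dict[str, deque] = defaultdict(deque)

def gated_request(user_id: str, prompt: str) -> str:
    if user_id not in APPROVED_USERS:
        return "error: access requires an approved application"
    now = time.time()
    log = request_log[user_id]
    while log and now - log[0] > 3600:          # drop requests older than an hour
        log.popleft()
    if len(log) >= MAX_REQUESTS_PER_HOUR:
        return "error: rate limit exceeded"
    log.append(now)
    return run_model(prompt)                    # server-side generation (not shown)

def run_model(prompt: str) -> str:
    # Placeholder for the actual generation backend.
    return f"<image generated for: {prompt}>"
```

None of this is hard to design; the point of the next paragraph is that such measures only matter if the model itself cannot simply be replicated elsewhere.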

This requirement is strong, and not at all an afterthought that can be tacked on after development. For example, as mentioned above, deployed models (e.g. DALLE-2) often come with some restrictions on access. However, even just publishing the training details of these models makes them replicable and therefore completely annuls any other effort to scope access. In this respect, the weakest access link determines how accessible a given AI system is. For this reason, whenever developing or deploying a model, one must thoroughly consider how accessible it truly is. 

Complete cost/benefit analysis

Even if a system is reasonably scoped in function and access, given state-of-the-art techniques in AI, it ultimately remains hard to totally rule out potential abuses. Therefore, since such abuses are to some degree an inherent risk, it is incumbent on the creators of such systems to clearly articulate the set of all possible costs and benefits of their models. They should also faithfully argue why those benefits outweigh the costs and why deployment is worth the inherent risk.

Especially in the case of DALLE-2 and Stable Diffusion, we are not convinced of any fundamental social benefits that access to general-purpose image generators provide aside from, admittedly, being entertaining. But this does not seem commensurate with the potentially-devastating harms that deepfakes can have on victims of sex crimes. Thus, it seems that these models, as they have been rolled out, fail this basic cost/benefit test. 

Takeaways

Originators are the key bottleneck in the slippery slope: The slippery slope from DALLE-2 to text-to-image model anarchy demonstrates a rapid pathway for this technology from originators (e.g. OpenAI), to copycatters (e.g. StabilityAI, with the help of platforms like HuggingFace and GitHub for sharing models), to distributors of content (e.g. social media). Ideally, the norms and laws around AI governance should recognize these steps and work to add friction where possible. Originators play a unique role for several reasons. First, origination is much more difficult than copycatting, so originators represent a smaller and more easily-targeted bottleneck in the pipeline. Second, originators tend to have and require immense resources and talent. Third, originators have a huge say over how these technologies are proliferated, both through the knowledge and systems they generate and through their influence in granting broader access to these technologies. Given these resources and the impact of their decisions, originators make a natural target for regulatory or non-regulatory reform. Among other regulatory avenues, the FTC in the United States in particular may have useful power in this case. See the EA Forum version of this post for further discussion. 

It is important to deprioritize risky capabilities work. Researchers should take great care in what they work on and how they release it. The best solutions to avoiding harms from copycat models may be (1) to curtail work on advanced capabilities in general, such as DALLE-2, the soon-to-be-released GPT-4, or video generators, and (2) to invest in work that incorporates harm mitigation at a systems level and in infrastructure for recourse. Even if no details at all are provided about how something was done, simply knowing that it can be done makes copycats easier to fund and work on: proofs of concept make the choice to invest in building a technology much easier to justify. Rather than working on models with broad-domain capabilities like DALLE-2 and GPT-4, non-capabilities work or models with narrow capabilities are safer directions. An exception would be if certain progress on risky capabilities is inevitable within a certain timeframe and the only choice is between less dangerous and more dangerous models. And as discussed above, those who build abusable systems should carefully scope their function, limit access, and honestly articulate costs and benefits. 

There are deep problems with the “let’s build transformative AI in order to make sure it’s safe” strategy. In particular, OpenAI and DeepMind have both expressed that they want to race to develop highly transformative intelligent systems. The goal they both profess is to be the first to develop such systems so that they can exercise responsible stewardship and ensure that the technology is as aligned and beneficial as possible. This is a benevolent form of what Nick Bostrom refers to in Superintelligence as gaining a “decisive strategic advantage,” which may make the first developer of particularly transformative AI too powerful to compete with. There are many problems with this strategy, including: (1) It is entirely based on racing to develop transformative AI, and faster timelines exacerbate AI risks. This is especially perverse if multiple actors are competitively racing to do so. (2) Nobody should trust a small set of people like Sam Altman and Demis Hassabis to unilaterally exercise benevolent stewardship over transformative AI. Arguably, under any tenable framework for AI ethics, a regime in which a small technocratic set of people unilaterally controlled transformative AI would be inherently unethical. Meaningful democratization is needed. (3) OpenAI’s approach to DALLE-2 should further erode confidence in them in particular. Their overly-convenient technical report on risks, which failed to meaningfully discuss copycatting, and the speed with which they worked to profit off of DALLE-2 are worrying signs. (4) Copycatting makes racing to build transformative AI strictly more risky. Even if one fully trusted a single actor like OpenAI or DeepMind to exercise perfect stewardship over transformative AI if they monopolized it, how quickly DALLE-2 was copycatted multiple times suggests that copycatting may undermine attempts at benevolent strategic dominance. Copycatting would most likely serve to broaden the set of technocrats who control transformative AI but still fail to democratize it. So if a company like OpenAI or DeepMind races to build transformative AI, and it is copycatted anyway, we get the worst of all worlds: insecure, non-democratized transformative AI on a faster timeline. If a similar story plays out with highly transformative AI as has played out with DALLE-2, humanity may be in trouble.

Conclusion

With text-to-image models, Pandora's box has already been opened. It is extremely easy to abuse Stable Diffusion, and it will only get easier over time. Some people, particularly victims of sex crimes, will be devastatingly harmed while OpenAI and StabilityAI make large profits. This offers a compelling case study on the risks from text-to-image models in particular and from open-sourcing risky models in general. The troubling story of DALLE-2 should serve to repudiate originators of risky AI systems and the strategy of ensuring AI safety by racing to be the first to build transformative AI. 
 

Comments
ecrows:

Especially in the case of DALLE-2 and Stable Diffusion, we are not convinced of any fundamental social benefits that access to general-purpose image generators provide aside from, admittedly, being entertaining.

 

It's fairly clear that widespread access to these models has dramatically accelerated a broader transformation of creative industries.  Early users are fascinated with the novelty of playing with the AI model as a "new toy", but the model and related models are rapidly becoming integrated into other tools and creative workflows.  Artists, game developers, interior designers, and other creatives will likely use these models extensively in the coming decade.  Free, open-source software reduces the barrier to entry, allowing models to be more easily brought to other languages, and to be provided to individuals who don't have the resources to use paid services.  To borrow the analogy -- many amateur artists use alternatives to Photoshop due to financial limitations, or due to a desire to be able to customize their own tools.

I suppose the argument could be that innovation in artistic creation is not a social benefit -- that Photoshop has limited social benefit, that film VFX have limited social benefit, and that it is ultimately too dangerous for humans to possess powerful creative tools without a council of elders controlling how such tools are used.  Is it a good thing that more humans have the tools to create more beautiful things more easily?  There will certainly be harms -- I cannot imagine what targeted ad content that is synthesized by companies to be perfectly tailored to each individual viewer will do to our minds -- but I believe this technology ought to be handled through legislation and law enforcement, rather than strangling the technology in the crib.

More fundamentally, I think it's clear that general-purpose AI models have an immense possibility for transformative positive impact on human society.  General purpose language models such as BERT are a clear example.  A staggering 1 in 3 internet users aged 16 to 64 have used an online translation tool in the last week, a figure representing over 1 billion people. Text summarization can create understandable summaries of complex legal text or medical records.  Open-source models like BERT are vital to this kind of development.

re "anarchy", you'll have a hard time convincing me that lack of an ai monarch is bad. use of a word that describes lack of central authority to describe safety failure seems to me to imply a fundamental misunderstanding of what it is we need to protect, which is everyone's individual agency over their own shape.

safety needs to be a solution to an incentive problem; if your solution to safety is to not give individuals the means of intelligent computation, then you aren't able to solve safety at all. we need to be thinking about safety from a multi-agent systems perspective because I think at this point we can conclude that AI safety is just a new interspecies version of the same conflicts of intention that lead to all human and animal fights and wars. The difficulty is how to ensure that, in a world where everyone can imagine other people in compromising positions and then share that imagination, we can protect each other's dignity. What content filter would you choose to apply to your own view of the world in order to respect people who don't want certain images of them to be seen? can we make good tools for automatically filtering out content by your own choice, without relying on the incredibly fragile authority of platforms?

I would propose that what we need to be optimizing for is to end up in a pro-social anarchy rather than a destructive anarchy. authority cannot last; so authority needs to be aiming to build structures that will preserve each individual's rights at least as well as the authorities did, but keeping in mind that the network of co-protective agents need to be able to operate entirely open source.

I would suggest checking out open source game theory, a subset of game theory research that focuses specifically on what you can do when you can verify that you have both read each other's minds perfectly in a game theory situation. it's not a full solution to the safety problem and it doesn't plug directly in to current generation high noise models, but it's relevant to the fact that it is not possible to ban AI as long as people have computers; we need to aim for a world where publishing new capability developments that will be cloned open source will change the game such that the price of anarchy decreases to 1.

Did you read the post?

I think it is clear that we are not advocating for centralized authority. All three points in "Takeaways" lead into this. The questions you asked in the second paragraph are ones that we discuss in the "What do we want?" section, with some additional material in the EA Forum version of the post.

Without falling into the trap of debating definitions: anarchy can be used colloquially to refer to chaos in general, and it was intended here to mean a lack of barriers to misuse in the regulatory and dev ecosystems, not a lack of someone's monopoly on something. If you are in favor of people not monopolizing capabilities, I'm sure you would agree with our third "Takeaways" point. 

The "what do we want" section is about solutions that don't involve banning anything. We don't advocate for banning anything. The "banning" comments are strawpersoning. 

fair enough. I read much of the post but primarily skimmed it like a paper, it seems I missed some of the point. Sorry to write an unhelpful comment!

You didn't provide any evidence that the almost-effortless development of NSFW images of specific people for blackmail or harassment is possible, or even that it's easier to create them with the new technology than with older technology. 

Why make your argument without providing evidence for that claim?

One of your sources even explicitly argues that you are wrong:

Right now, the results are still much too rough to even begin to trick anyone into thinking they’re real snapshots of nudes

Slander was always easy, and you can likely cite a lot of cases where it harmed people. What should we conclude from LessWrong allowing you to post slander? That we somehow need to delete your post?

The Vice article came out on August 24th. That was 5 days after the SD leak and 2 days after its official open-source release. The claim it made that SD couldn't "begin to trick anyone into thinking they're real snapshots of nudes" did not stand the test of time. We linked the Vice article in the context of the discussion of deepfake porn in general, not the specific photorealistic capabilities of SD. 

Speaking of which, DreamBooth does allow for this. See this SFW example. This is the type of thing that would not be possible with older methods. https://www.reddit.com/r/StableDiffusion/comments/y1xgx0/dreambooth_completely_blows_my_mind_first_attempt/

And bear in mind that new updates, GUIs, APIs, capabilities, etc. are arriving almost daily.

I will not link NSFW examples. But I have seen them. They are just as realistic. Others have agreed. I've gotten several people banned from social media platforms after reporting them. 

The key argument is that StableDiffusion is more accessible, meaning more people can create deepfakes with fewer images of their subject and no specialized skills. From above (links removed):

“The unique danger posed by today’s text-to-image models stems from how they can make harmful, non-consensual content production much easier than before, particularly via inpainting and outpainting, which allow a user to interactively build realistic synthetic images from natural ones, and via DreamBooth or other easily used tools, which allow for fine-tuning on as few as 3-5 examples of a particular subject (e.g. a specific person). More such tools are rapidly becoming available following the open-sourcing of Stable Diffusion. It is clear that today’s text-to-image models have capabilities distinct from those of methods like Photoshop, RNNs trained on specific individuals, or “nudifying” apps. These previous methods all require a large amount of subject-specific data, human time, and/or human skill. And no, you don’t need to know how to code to interactively use Stable Diffusion, uncensored and unfiltered, including in/outpainting and DreamBooth.”

If what you’re questioning is the basic ability of StableDiffusion to generate deepfakes, here [1] is an NSFW link to a post on www.mrdeepfakes.com whose author says, “having played with this program a lot in the last 48 hours, and personally SEEN what it can do with NSFW, I guarantee you it can 100% assist not only in celeb fakes, but in completely custom porn that never existed or will ever exist.” He then provides links to NSFW images generated by StableDiffusion, including deepfakes of celebrities. This is apparently facilitated by the LAION-5B dataset, which Stability AI admits contains about 3% unsafe images and which he claims has “TONS of captioned porn images in it”.

[1] Warning, NSFW: https://mrdeepfakes.com/forums/threads/guide-using-stable-diffusion-to-generate-custom-nsfw-images.10289/

“having played with this program a lot in the last 48 hours, and personally SEEN what it can do with NSFW, I guarantee you it can 100% assist not only in celeb fakes, but in completely custom porn that never existed or will ever exist.” 

Completely custom porn is not necessarily porn that actually looks like existing people in a way that would fool a critical observer. 

More importantly, the person who posted this is not someone without specialized skills. Your claim is that basically anyone can just use the technology at present to create deepfakes. There might be a future where it's actually easy for someone without skills to create deepfakes but that link doesn't show that this future is here at present.

With previous technology, you create a deepfake porno image by taking a photo of someone, cropping out the head, and then putting that head into a porn image. You don't need countless images of them to do so. For your charge to be true, the present Stable Diffusion-based tech would have to either be much easier than existing Photoshop-based methods or produce more convincing images than low-skill Photoshop deepfakes.

The thread in the forum demonstrates that neither of these is the case at present.
