The Hacker Learns to Trust

by Ben Pace, 22nd Jun 2019



This is a linkpost for some interesting discussions of info security norms in AI. I threw the post below together in 2 hours, just to have a bunch of quotes and links for people, and to have the context in one place for a discussion here on LW (this makes it easier to have common knowledge of what commenters have and haven't seen). I didn't want to assume people follow any news on LW, so for folks who've read a lot about GPT-2, much of the post is skimmable.

Background on GPT-2

In February, OpenAI wrote a blogpost announcing GPT-2:

We’ve trained a large-scale unsupervised language model which generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization—all without task-specific training.

This has been a very important release, not least due to it allowing fans to try (and fail) to write better endings to Game of Thrones. Gwern used GPT-2 to write poetry and anime. There have been many Medium posts on GPT-2, some very popular, and at least one Medium post on GPT-2 written by GPT-2. There is a subreddit where all users are copies of GPT-2, and they imitate other subreddits. It got too meta when the subreddit imitated another subreddit about people play-acting robots-pretending-to-be-humans. Stephen Woods has lots of examples including food recipes.

Here in our rationality community, we created a user, GPT-2, trained on the entire corpus of LessWrong comments and posts, and released it into the comment section on April 1st (a user whom we warned and then banned). And Nostalgebraist created a Tumblr trained on the entire writings of Eliezer Yudkowsky (the Sequences and HPMOR), to which Nostalgebraist added their favourite outputs.

There was also very interesting analysis on LessWrong and throughout the community. The post that made me think most about this subject is Sarah Constantin's Humans Who Are Not Concentrating Are Not General Intelligences. Also see SlateStarCodex's Do Neural Nets Dream of Electric Hobbits? and GPT-2 As Step Toward General Intelligence, plus my teammate jimrandomh's Two Small Experiments on GPT-2.

However, these were all using a nerfed version of GPT-2, which had only 117 million parameters, rather than the fully trained model with 1.5 billion parameters. (If you want to see examples of the full model, see the initial announcement posts for examples with unicorns and more.)
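For a rough sense of the size gap, the parameter count of a GPT-2-style transformer can be computed from its hyperparameters. The sketch below is my own illustration and assumes the configuration values from OpenAI's gpt-2 repository (a 50,257-token vocabulary, a 1,024-token context, 12 layers of width 768 for the smallest model, and 48 layers of width 1,600 for the full one) and input/output embeddings that share weights. The function name `gpt2_param_count` is hypothetical. It comes out to roughly 124M and 1.56B. OpenAI's repository later noted that its originally announced counts were off, and relabelled the smallest model from 117M to 124M.

```python
def gpt2_param_count(n_layer: int, d_model: int,
                     vocab_size: int = 50257, n_ctx: int = 1024) -> int:
    """Count weights and biases in a GPT-2-style decoder.

    Assumes tied input/output token embeddings, learned positional
    embeddings, and a 4x-wide MLP in every block (hypothetical helper).
    """
    token_emb = vocab_size * d_model                 # shared with the output layer
    pos_emb = n_ctx * d_model                        # one vector per position
    attn = d_model * 3 * d_model + 3 * d_model       # fused Q, K, V projection
    attn += d_model * d_model + d_model              # attention output projection
    mlp = d_model * 4 * d_model + 4 * d_model        # expand to 4x width
    mlp += 4 * d_model * d_model + d_model           # project back down
    layer_norms = 2 * (2 * d_model)                  # two LayerNorms: scale + bias
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * d_model                           # LayerNorm after the last block
    return token_emb + pos_emb + n_layer * per_layer + final_ln


if __name__ == "__main__":
    small = gpt2_param_count(n_layer=12, d_model=768)
    full = gpt2_param_count(n_layer=48, d_model=1600)
    print(f"small: {small:,}  full: {full:,}  ratio: {full / small:.1f}x")
```

Almost all of the growth comes from the per-layer weights, which scale with the square of the width and linearly with depth, so the full model is about 12.5 times the size of the small one.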

Reasoning for only releasing a nerfed GPT-2 and response

OpenAI writes:

Due to our concerns about malicious applications of the technology, we are not releasing the trained model.

While the post includes some discussion of how specifically GPT-2 could be used maliciously (e.g. automating false clickbait news, automated spam, fake accounts), the key line is this:

This decision, as well as our discussion of it, is an experiment: while we are not sure that it is the right decision today, we believe that the AI community will eventually need to tackle the issue of publication norms in a thoughtful way in certain research areas.

Was this a surprising, out-of-character decision for OpenAI? Not really.

Nearly a year ago we wrote in the OpenAI Charter: “we expect that safety and security concerns will reduce our traditional publishing in the future, while increasing the importance of sharing safety, policy, and standards research,” and we see this current work as potentially representing the early beginnings of such concerns, which we expect may grow over time.
Other disciplines such as biotechnology and cybersecurity have long had active debates about responsible publication in cases with clear misuse potential, and we hope that our experiment will serve as a case study for more nuanced discussions of model and code release decisions in the AI community.

Public response to decision

There has been discussion in the news and on Twitter; see here for an overview of what some people in the field/industry have said, and what the news media has written. The main response that news and Twitter have selected for is that OpenAI did this primarily as a publicity stunt.

For a source with a different bias than the news and Twitter (which select heavily for anger and for calling out norm violations), I've searched through all Medium articles on GPT-2 and copied here any 'most highlighted comments'. Most posts actually didn't have any, which I think means they haven't had many viewers. Here are the three I found, in chronological order.

OpenAI's GPT-2: The Model, The Hype, and the Controversy

As ML researchers, we are building things that affect people. Sooner or later, we’ll cross a line where our research can be used maliciously to do bad things. Should we just wait until that happens to decide how we handle research that can have negative side effects?

OpenAI GPT-2: Understanding Language Generation through Visualization

Soon, these deepfakes will become personal. So when your mom calls and says she needs $500 wired to the Cayman Islands, ask yourself: Is this really my mom, or is it a language-generating AI that acquired a voice skin of my mother from that Facebook video she posted 5 years ago?

GPT-2, Counting Consciousness and the Curious Hacker

If we have a system charged with detecting what we can and can’t trust, we aren’t removing our need to invest trust, we are only moving our trust from our own faculties to those of the machine.

I wrote this linkpost to discuss the last one. See below.

Can someone else just build another GPT-2 and release the full 1.5B parameter model?

From the initial OpenAI announcement:

We are aware that some researchers have the technical capacity to reproduce and open source our results. We believe our release strategy limits the initial set of organizations who may choose to do this, and gives the AI community more time to have a discussion about the implications of such systems.

Since the release, one researcher has tried to reproduce and publish OpenAI's result. Google has a program called TensorFlow Research Cloud that gives loads of free compute to researchers affiliated with various universities, and this let the researcher train an attempted copy of GPT-2 with 1.5 billion parameters. They say:

I really can’t express how grateful I am to Google and the TFRC team for their support in enabling this. They were incredibly gracious and open to allowing me access, without requiring any kind of rigorous, formal qualifications, applications or similar. I can really only hope they are happy with what I’ve made of what they gave me.
...I estimate I spent around 200 hours working on this project.... I ended up spending around 600–800€ on cloud resources for creating the dataset, testing the code and running the experiments

That said, it turned out that the copy did not match up in skill level, and is weaker even than the nerfed model OpenAI released. The person who built it says (1) they think they know how to fix it and (2) releasing it as-is may still be a helpful "shortcut" for others interested in building a GPT-2-level system; I don't have the technical knowledge to assess these claims, and am interested to hear from others who do.

During the period when people didn't yet know that the attempted copy was unsuccessful, its creator wrote a long and interesting post explaining their decision to release it (with multiple links to LW posts). It discussed reasons why releasing this specific technology might help us better grapple with the misinformation we encounter on the internet. The author had a strong object-level disagreement with the policy people at OpenAI, and had thought pretty carefully about it. Even so, the post opened thus:

Disclaimer: I would like it to be made very clear that I am absolutely 100% open to the idea that I am wrong about anything in this post. I don’t only accept but explicitly request arguments that could convince me I am wrong on any of these issues. If you think I am wrong about anything here, and have an argument that might convince me, please get in touch and present your argument. I am happy to say “oops” and retract any opinions presented here and change my course of action.
As the saying goes: “When the facts change, I change my mind. What do you do?”
TL;DR: I’m a student that replicated OpenAI’s GPT2–1.5B. I plan on releasing it on the 1st of July. Before criticizing my decision to do so, please read my arguments below. If you still think I’m wrong, contact me on Twitter @NPCollapse or by email ( and convince me. For code and technical details, see this post.

And they later said:

[B]e assured, I read every single comment, email and message I received, even if I wasn’t able to respond to all of them.

On reading the initial post, I was genuinely delighted to see such pro-social and cooperative behaviour from the person who believed OpenAI was wrong. They considered unilaterally overturning OpenAI's decision, but instead chose to spend 11,000 words explaining their views and a month reading others' comments and talking to people. This, I thought, is how one avoids falling prey to Bostrom's unilateralist's curse.
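The unilateralist's curse is a statistical point: if any one of several actors can take an irreversible action on their own, the action gets taken whenever the most optimistic of them misjudges it, so it happens far more often than any single actor's error rate would suggest. The toy model below is my own simplified illustration, not code from Bostrom or from the posts quoted here. It assumes that releasing is net-harmful (true value -1), that each of n independent actors sees that value plus Gaussian noise with standard deviation 1, and that an actor releases whenever its own estimate is positive. The function names are hypothetical.

```python
import math
import random


def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def p_release_exact(n_agents: int, true_value: float, noise_sd: float) -> float:
    """Probability that at least one of n independent agents acts.

    Each agent sees true_value + Normal(0, noise_sd) and acts iff that
    estimate is positive, so one agent abstains with probability
    Phi(-true_value / noise_sd), and all n abstain with that to the n-th power.
    """
    p_abstain = normal_cdf(-true_value / noise_sd)
    return 1.0 - p_abstain ** n_agents


def p_release_simulated(n_agents: int, true_value: float, noise_sd: float,
                        trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability, as a cross-check."""
    rng = random.Random(seed)
    releases = 0
    for _ in range(trials):
        # The action happens if any single agent's noisy estimate is positive.
        if any(true_value + rng.gauss(0.0, noise_sd) > 0 for _ in range(n_agents)):
            releases += 1
    return releases / trials


if __name__ == "__main__":
    for n in (1, 5, 20):
        exact = p_release_exact(n, true_value=-1.0, noise_sd=1.0)
        sim = p_release_simulated(n, true_value=-1.0, noise_sd=1.0)
        print(f"{n:>2} agents: exact {exact:.3f}, simulated {sim:.3f}")
```

Under these assumptions the chance that someone releases climbs from roughly 16% with one actor to roughly 58% with five and 97% with twenty, even though every actor follows the same reasonable-looking rule. Talking to others before acting, as the replicator did, effectively pools those noisy estimates instead of letting the single most optimistic one decide.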

Their next post, The Hacker Learns to Trust, was released 6 days later; in it, they announced that they would not release the model. Note that they did not substantially change their opinions on the object-level decision.

I was presented with many arguments that have made me reevaluate and weaken my beliefs in some of the arguments I presented in my last essay. There were also many, maybe even a majority of, people in full support of me. Overall I still stand by most of what I said.
...I got to talk to Jack Clark, Alec Radford and Jeff Wu from OpenAI. We had a nice hour long discussion, where I explained where I was coming from, and they helped me to refine my beliefs. They didn’t come in accusing me in any way, they were very clear in saying they wanted to help me gain more important insight into the wider situation. For this open and respectful attitude I will always be grateful. Large entities like OpenAI often seem like behemoths to outsiders, but it was during this chat that it really hit me that they were people just like me, and curious hackers to boot as well.
I quickly began to understand nuances of the situation I wasn’t aware of. OpenAI had a lot more internal discussion than their blog post made it seem. And I found this reassuring. Jack in particular also gave me a lot of valuable information about the possible dangers of the model, and a bit of insight into the workings of governments and intelligence agencies.
After our discussion, I had a lot to think about. But I still wasn’t really convinced to not release.

They then talked with Buck from MIRI (author of this great post). Talking with Buck led them to their new view.

[T]his isn’t just about GPT2. What matters is that at some point in the future, someone will create something truly dangerous and there need to be commonly accepted safety norms before that happens.

We tend to live in an ever accelerating world. Both the industrial and academic R&D cycles have grown only faster over the decades. Everyone wants “the next big thing” as fast as possible. And with the way our culture is now, it can be hard to resist the pressures to adapt to this accelerating pace. Your career can depend on being the first to publish a result, as can your market share.
We as a community and society need to combat this trend, and create a healthy cultural environment that allows researchers to take their time. They shouldn’t have to fear repercussions or ridicule for delaying release. Postponing a release because of added evaluation should be the norm rather than the exception. We need to make it commonly accepted that we as a community respect others’ safety concerns and don’t penalize them for having such concerns, even if they ultimately turn out to be wrong. If we don’t do this, it will be a race to the bottom in terms of safety precautions.

We as a community of researchers and humans need to trust one another and respect when one of us has safety concerns. We need to extend understanding and offer help, rather than get caught in a race to the bottom. And this isn’t easy, because we’re curious hackers. Doing cool things fast is what we do.

The person also came to believe that the AI (and AI safety) community was much more helpful and cooperative than they'd expected.

The people at OpenAI and the wider AI community have been incredibly helpful, open and thoughtful in their responses to me. I owe to them everything I have learned. OpenAI reached out to me almost immediately to talk and they were nothing but respectful and understanding. The same applies to Buck Shlegeris from MIRI and many other thoughtful and open people, and I am truly thankful for their help.
I expected a hostile world of skepticism and competition, and there was some of that to be sure. But overall, the AI community was open in ways I did not anticipate. In my mind, I couldn’t imagine people from OpenAI, or MIRI, or anywhere else actually wanting to talk to me. But I found that was wrong.
So this is the first lesson: The world of AI is full of smart, good natured and open people that I shouldn’t be afraid of, and neither should you.

Overall, the copy turned out not to be strong enough to meaningfully increase malicious actors' ability to automate spam or clickbait, but I am pretty happy with the public dialogue and process that occurred. It is the kind of process by which, in a genuinely dangerous situation, the AI world could avoid falling prey to Bostrom's unilateralist's curse. It's encouraging to see that process starting to happen in the field of ML.

I'm interested to know if anyone has any different takes, info to add, or broader thoughts on information-security norms.

Edit: Thanks to 9eB1 for pointing out how nerfed the copy was; I've edited the post to reflect that.