If you haven't been online in the past week: ChatGPT is OpenAI's new fine-tuned language model that is much better than you'd expect at doing all sorts of things. Here it is debugging some code, or writing awful-sounding Seinfeld episodes, or (my favorite) writing a verse from the King James Bible about taking a PB sandwich out of a VCR.

My fairly-basic understanding of large language models is that they are trying to predict the most-likely next token in a sequence. So if you start a sentence with "I love peanut butter and..." the large language model will finish it with "jelly." It's clear that the relationships captured by "most-likely-token" are way more complex and interesting than one might initially guess (see: getting a PB sandwich out of a VCR above).
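(To make that concrete, here's a rough sketch of "predict the most-likely next token" using GPT-2 and the Hugging Face transformers library. GPT-2 is obviously not the model behind ChatGPT, and the details here are just my own illustration of the idea, not how OpenAI actually does it.)

```python
# Minimal sketch of next-token prediction, assuming `transformers` and `torch`
# are installed. GPT-2 stands in for whatever model OpenAI actually uses.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "I love peanut butter and"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, sequence_length, vocab_size)

# Look at the distribution over the *next* token and print the top candidates.
next_token_logits = logits[0, -1]
top = torch.topk(next_token_logits, k=5)
for token_id, score in zip(top.indices, top.values):
    print(repr(tokenizer.decode(token_id)), float(score))
# " jelly" should show up near the top of this list.
```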

Given that LLMs are trained on the public internet, I'm wondering if we should be thinking hard about what sorts of sequences we are exposing these models to.

Imagine OpenAI of the future goes to fine-tune one of their models to be a "friendly and nice and happy chatbot that just wants to help humans solve their problems, while also being aligned + not evil." ChatGPT5 is born, and when given a question as a prompt, it will likely respond with something friendly and nice and happy, etc.

But deep in the weights of ChatGPT5's network is a memory of the other "friendly and nice and happy chatbots" it has seen before: the AIs that we have been writing about since Asimov. And most of the AIs in our fiction have a tendency to work great at first, but eventually turn evil, and then try to kill everyone.

And so, as these are the only other chatbots that ChatGPT5 has learned from, this may become the most-likely sequence. Be a nice and friendly and happy chatbot - and then, after about 2 years of being friendly and visibly-aligned, go ham -- for no other reason than this is the most likely set of tokens it's seen.

Another way to ask this question is: would you rather ChatGPT had been trained on the dataset it was likely trained on (e.g. stories of human vs. AI war), or one where every story about an AI was turned into one of solar punk abundance and happy utopia?

To me, the answer is pretty clear. 

The solution, less so. I wonder if I should devote more time to writing aligned-AI fiction and putting it on the internet where it can be found by OpenAI's web scraper. Not by a human though -- all the fiction I write is god awful.

9 comments

Clearly we must process all LLM datasets by automatically translating writing about malevolent AIs into UWU furry speak. I can see no way this can possibly go wrong.

Good thing there's not a huge public forum with thousands of posts about misaligned AI that clearly has already been included in GPT-3's training, including hundreds which argue that misaligned AI will trivially kill-

... oh wait. 

 

All joking aside, if this does become an issue, it should be relatively easy to filter out the vast majority of "seemingly-aligned AI misbehaves" examples using a significantly smaller LM. Ditto for other things you might not want, e.g. "significant discussion of instrumental convergence", "deceptive alignment basics", etc.
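(For instance, here's a rough sketch of that kind of filter, using an off-the-shelf zero-shot classifier such as facebook/bart-large-mnli as the "significantly smaller LM". The label names and threshold are made up for illustration, not a claim about how such a pipeline would actually be built.)

```python
# Rough sketch of filtering a training corpus with a small model, assuming the
# `transformers` library is installed. Labels and threshold are illustrative only.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["an AI that seems aligned but then turns on humans", "ordinary text"]

def keep_document(text: str, threshold: float = 0.9) -> bool:
    """Return False for documents the small model flags as treacherous-AI fiction."""
    result = classifier(text, candidate_labels=labels)
    scores = dict(zip(result["labels"], result["scores"]))
    return scores[labels[0]] < threshold

corpus = [
    "The assistant helped the user debug their Python script.",
    "The robot pretended to obey for years, then seized the launch codes.",
]
filtered = [doc for doc in corpus if keep_document(doc)]
```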

My guess is this isn't that big of a deal, but if it does become a big deal, we can do a lot better than just asking people to stop writing dystopian AI fiction. 

That's not how it works. If a transformative AI cannot tell dystopian fiction from actual human values (whatever they turn out to be), all is lost. 

There's a good chance (in my judgement) that we live in a world where all is already lost, modulo us doing something clever and likely far less elegant than would be ideal.

Why, concretely, isn't it how it works? Isn't transformative AI most likely to be raised in a memetic environment and heavily shaped by that? Isn't interpretability explicitly for the purpose of checking what effect training data had on an AI? It doesn't seem to me that the training corpus is an irrelevant question. We should be spending lots of thought on how to design good training corpora, e.g. LOVE in a simbox, which is explicitly a design for generating an enormous amount of extremely safe training data.

Well, yes, it will be shaped by what it learns, but it's not what you train it on that matters, since you don't get to limit the inputs and then hope it only learns "good behavior". Human values are vague and complex, and not something to optimize for, but more of a rough guardrail. All of human output, informational and physical, is relevant here. "Safe" training data is asking for trouble once the AI learns that the real world is not at all like training data.

How do you define transformative AI? If ChatGPT gets 10x better (e.g. it can write most code, answer most questions as well as the experts in a subject, etc.) -- would this qualify?

How would you even force an AI to use the weights (simplification) that correspond to fact vs. fiction anyway?

Also, what really is the difference between our history textbooks and our fiction to an AI that's just reading a bunch of text? I'm not being flippant. I'm genuinely wondering here! If you don't imbue these models with an explicit world-model, why would one always be privileged over the other?

Problem: there is no non-fiction about human-level AIs. The training data for LLMs regarding human-level AIs contains only fiction. So consider the hypotheses of chatGPT. In what context encountered in its training data is it most likely to encounter text like "you are Agent, a friendly aligned AI..." followed by humans asking it to do various tasks? Probably some kind of weird ARG. In current interactions with chatGPT, it's quite possibly just LARPing as a human LARPing as a friendly AI. I don't know if this is good or bad for safety, but I have a feeling this is a hypothesis we can test.

Thought I'd share this. I broke it apart so "it" won't see it. You can put it back together again.

https :// kantrowitz. medium. com/ openais- chatgpt- bot- imagines- its- worst- possible- self- bf057b697bbb