Some claim that large AI companies like OpenAI aren’t doing enough to make AI safer, or are causing harm by speeding up AI progress. But what is OpenAI doing to make AI safer?
In general, OpenAI is very optimistic about using AI to help with alignment research.
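Their basic approach is based on three concepts:
1. Training AI using human feedback
2. Training AI to assist human evaluation
3. Training AI to do alignment research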
In other words, they hope to make AI safer by using a number of techniques involving humans rating AI’s performance, and using AI to help humans rate the performance of other AI. Eventually, they hope this kind of approach will develop until they have an AI which is sufficiently intelligent (and safe) that it can do AI safety research on its own.
The basic concept here is ‘reinforcement learning from human feedback’ (RLHF). RLHF is very simple: we reward the AI when it does what we want. Usually, training an AI involves a defined ‘loss function’ which tells the AI what we want. With RLHF, however, we train it more like a dog: we look at what it’s doing, and reward it whenever it’s doing something that looks more like what we want it to be doing. For example, we show a human two clips of the AI’s behaviour, which starts out random, and ask them which looks more like a backflip. Repeat that often enough, and the AI learns to do a backflip.
[Figure: AI doing a backflip]
In practice, this is a little more complicated than just telling the AI what we want, because there are actually two AIs. One of them is learning what we want. The other one is asking the first ‘hey, what am I supposed to be doing?’ and then trying to do that. So model 1 makes guesses based on the human feedback (maybe you’re meant to jump?), and then trains model 2 until it’s able to consistently achieve what model 1 thinks it’s meant to. Eventually, model 1 figures out what the humans want, and model 2 figures out how to achieve that goal. Using model 1 saves time: humans don’t have to rate every single thing the AI does, and only need to be asked for feedback intermittently.
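The two-model setup described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not OpenAI's implementation: clips are reduced to two made-up numeric features, the human rater is simulated by a function that prefers more of feature 0, model 1 is a linear scorer fitted with a logistic (Bradley-Terry style) loss on pairwise comparisons, and model 2 is collapsed to "pick the behaviour model 1 scores highest". All names are hypothetical.

```python
import math
import random

def reward(weights, clip):
    """Model 1's current guess at how much the human likes a clip."""
    return sum(w * x for w, x in zip(weights, clip))

def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit weights so that preferred clips score higher than rejected ones."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            diff = reward(weights, preferred) - reward(weights, rejected)
            p_correct = 1.0 / (1.0 + math.exp(-diff))
            # Gradient ascent on log(p_correct): push the preferred clip's score up.
            for i in range(n_features):
                weights[i] += lr * (1.0 - p_correct) * (preferred[i] - rejected[i])
    return weights

def simulated_human(clip_a, clip_b):
    """Stand-in for a human rater (assumption): prefers more of feature 0."""
    return (clip_a, clip_b) if clip_a[0] > clip_b[0] else (clip_b, clip_a)

rng = random.Random(0)
clips = [(rng.random(), rng.random()) for _ in range(40)]
# Each comparison is a (preferred, rejected) pair, as judged by the simulated human.
comparisons = [simulated_human(*rng.sample(clips, 2)) for _ in range(100)]
weights = train_reward_model(comparisons, n_features=2)

# Model 2, heavily simplified: choose the candidate behaviour model 1 rates highest.
candidates = [(rng.random(), rng.random()) for _ in range(20)]
best = max(candidates, key=lambda clip: reward(weights, clip))
```

Here the human is only consulted for the 100 comparisons; after that, model 1 can score as many candidate behaviours as needed without further human input, which is the time saving the paragraph above describes.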
This can be useful for ‘fine-tuning’ large models as well. Large Language Models (LLMs) are trained on lots and lots of data, like text from the internet, until they’re able to reliably output convincing text. However, they just output text that’s similar to the text they’ve taken in, meaning they’re often rude, dangerous, or unhelpful. One way of fixing this is to use RLHF: have lots of people look at text the LLM outputs, judge whether it’s harmful, then train a model on those judgements.
However, this approach only goes so far. Although rating performance is often easier than actually performing at that level, at some point it becomes impossible for humans to tell how well an AI is doing. AI have been able to become above
OpenAI’s main solution to this problem? Train an AI to help humans rate AI! They demonstrated this in a number of papers. As an example, they look at the problem of summarising books. It’s pretty hard to train AI to summarise books, because training an AI takes a lot of training runs, and you’d have to ask somebody to read an entire book for each one. To help with this, they train an AI to summarise chapters of books, which is much easier: for each training run, somebody just has to read a chapter. They then use that AI to assist people in judging the AI that summarises whole books. People can read the chapter summaries instead of the whole book, and use them to judge the summary of the whole book.
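To make the saving concrete, here is a toy sketch of the book example. Everything in it is an assumption for illustration: `summarize` is a hypothetical stand-in for a trained model (it just keeps the first few words), the chapters are invented filler text, and the whole-book summary is built from the chapter summaries purely to keep the example short.

```python
def summarize(text, max_words=12):
    """Hypothetical stand-in for a trained summariser: keeps the first few words."""
    return " ".join(text.split()[:max_words])

def summarize_book(chapters):
    """Summarise each chapter, then produce a book-level summary.

    Assumption: the book summary is derived from the chapter summaries.
    """
    chapter_summaries = [summarize(chapter) for chapter in chapters]
    book_summary = summarize(" ".join(chapter_summaries), max_words=20)
    return chapter_summaries, book_summary

def words_to_read(chapters, chapter_summaries):
    """Reading load for a rater without and with the chapter-level assistant."""
    unassisted = sum(len(chapter.split()) for chapter in chapters)
    assisted = sum(len(summary.split()) for summary in chapter_summaries)
    return unassisted, assisted

# Invented chapters, padded with filler so they are much longer than a summary.
chapters = [
    "The crew leaves port at dawn and sails north " + "through calm water " * 50,
    "A storm breaks the mast and the crew argues " + "about turning back " * 50,
    "They repair the ship and reach the island " + "after many days " * 50,
]
chapter_summaries, book_summary = summarize_book(chapters)
unassisted, assisted = words_to_read(chapters, chapter_summaries)
```

A rater judging `book_summary` against `chapter_summaries` reads `assisted` words rather than `unassisted` words. The catch, of course, is that the rater is now only as reliable as the chapter summaries they were given.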
As yet, they haven’t taken this a step further and tested it with AI which is truly superhuman, but they say they’re planning to release data soon from a similar experiment on judging coding tasks which are hard for unassisted humans to judge reliably.
Eventually, OpenAI hopes to be able to automate the discovery of new concepts and approaches in AI Safety.
They suggest that large language models are a good basis for doing so, because they contain a lot of knowledge already. They also argue that training increasingly efficient research assistants will automate more and more of AI safety research, and that narrow systems (which are less likely to be dangerous) may be suitable for this kind of task.
Their new superalignment team is focussed on this goal. They also outline three principles they’re focussing on to make sure that an automated research assistant will be safe:
We already discussed 1, and they haven’t published work on 3 yet, so we’ll focus on 2 here. For an example of what 3 looks like, you can look at some of Anthropic’s recent work.
OpenAI outlines two key ways in which they hope to validate that the assistant is safe.
Firstly, they hope to use AI to find edge cases where other AI behaves unacceptably. We’ll be posting soon with a more in-depth explanation of how this works.
Their other approach is based on interpretability. Right now, it’s very hard to figure out how an AI works. Because we basically teach a bunch of numbers how to do a task, the internals look like… well, a bunch of numbers, which appear random. OpenAI uses GPT-4 (an advanced language model) to explain what some of those numbers do in GPT-2 (a less advanced language model), and compares its explanations with those written by humans doing the same task manually. They show that though the explanations are imperfect, they are sometimes right and potentially useful enough as a starting point. They hope that future language models will be even better at this task, or that they can change how they train the simpler model to make this kind of task easier. Understanding what AI is ‘thinking about’ is useful for spotting things like deception, or the AI expecting that its actions will cause harm.
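One way to make "sometimes right" measurable is to check how well an explanation predicts what a given number (a neuron) actually does. The sketch below is an assumption for illustration rather than something stated above: an explanation is reduced to a set of keywords, a hypothetical `simulate` function predicts the neuron's activation on each token from that explanation alone, and the score is the correlation between predicted and real activations. The activation values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate(explanation_keywords, tokens):
    """Hypothetical simulator: predicts the neuron fires on tokens the explanation names."""
    return [1.0 if token.lower() in explanation_keywords else 0.0 for token in tokens]

tokens = ["The", "cat", "sat", "near", "a", "dog", "and", "a", "bird"]
# Made-up activations for one neuron, one value per token.
actual = [0.0, 0.9, 0.1, 0.0, 0.0, 0.8, 0.0, 0.0, 0.2]

# Two candidate explanations: "fires on animals" versus "fires on articles".
animal_prediction = simulate({"cat", "dog", "bird"}, tokens)
article_prediction = simulate({"the", "a"}, tokens)

score_animals = pearson(animal_prediction, actual)
score_articles = pearson(article_prediction, actual)
```

In this toy data the "animals" explanation scores higher than the "articles" one, but it is still imperfect: it predicts a strong response to "bird", where the real activation is weak.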
OpenAI also outlines a number of weaknesses in their plan. Their current plan is based on developing safety techniques on current AI systems, but making larger, more intelligent systems safe will likely involve very different kinds of problems. They’re also concerned that the simplest, narrowest systems that can help with AI safety might already be dangerous. Their new superalignment team is focussed on the second kind of problem.