(Basic familiarity with ELK required.)
I noticed that some people who haven’t thought that much about ELK yet (including my former self) are confused about (1) why the ELK setup is structured how it is structured and what we want to achieve, and relatedly (2) what's the underlying problem behind the "human simulator".
TL;DR: The problem with the "human simulator" isn't that the AI wants to "look good to a human" instead of "correctly answering questions". The "human simulator" is just one specific example where the reporter learns to reason on its own, instead of looking at the predictor. When a reporter learns to reason on its own from answering human-level questions, its capabilities won't generalize to harder questions. This is why we need a predictor with a superhuman world model. The primary goal of ELK isn’t to train a specific objective into the reporter; we want the reporter to have a certain structure, the direct-translator structure.
(EDIT: As I'm using the term "human simulator", I mean the class of AIs that reason very similarly to humans. It is possible that other people instead mean the broader class of self-reasoning reporters, even if the reporter reasons with different concepts than humans, but I'm not sure. Let me know if most other people (or just Paul) consistently use the term "human simulator" differently from me.)
As mentioned, I think some people mistakenly believe that the difficulty of getting a "direct translator" rather than a "human simulator" is that the AI may learn "look good to humans" rather than "answer questions honestly" as its objective, and that for some reason the former is fundamentally more likely by default than the latter.
That isn't right. It's not that the human simulator wants to look good to humans, it just reasons on its own to find answers, similar to how humans reason.
The problem is that in the ELK setup, a reporter that reasons for itself in this way may be much simpler to learn than one that finds the correct way to extract answers from the predictor.
While making sure that the AI doesn't just learn to "look good to humans" as its objective is a very important and difficult problem in AI safety, and possibly (though not necessarily) a problem we still need to solve for ELK, it isn't the main problem we are trying to solve right now.
We aren’t primarily trying to train the reporter to have the right objective; we are trying to train it to have the right structure.
Let me explain...
Forget for a second what you know about ELK.
You want to create an AI that is superhuman at answering questions, i.e. one that can answer questions humans cannot find an answer to (at least not without years of scientific work).
So let's just try to train an AI to answer questions honestly, and we train it on a dataset of (q,a) pairs, where "q" is a question and "a" is the correct answer to the question that humans have determined. (So we don’t have a predictor like in ELK, just train question answering directly.)
(EDIT: Assume the training dataset consists of questions that humans can indeed answer in advance. Actually, it is also possible to have (q,a) pairs where humans only observed the answer, in which case the AI may learn more impressive capabilities and generalize more, but we assume that humans can find the answer "a" given "q".)
So after training for a while, our AI has learned to answer human-level questions correctly.
Assume our training set was perfect and contained no mistakes. Do you think the AI is more likely to have learned "predict what humans think" or "just figure out the answer" as its objective? (Ok, actually it may not really have such an explicit objective and may instead have learned behavior patterns, but which objective sounds more right to you?)
I would say "just figure out the answer", because it seems to me to be the simpler objective: even if human reasoning is the simplest way to answer the questions, why would the AI learn the extra step of explicitly predicting what humans think, instead of "reason (like humans) to figure out the answer"? It may sound to us as if "predict what humans think" is simpler, but that is only because it has a short description. You still need to learn to actually reason like humans, which isn't simpler.
In practice, we won't have a perfect dataset, and the more human biases show through, the more likely it is that the AI will pick up on that and instead learn "predict what humans think". But I hope the example showed that the objective of "answer correctly" isn't in general much more complex / harder to achieve than "learn what humans want". The problem in ELK is that the direct-translator structure may be much more complex than the human-simulator structure.
Say we now ask the AI (that we just trained on our (q,a) dataset) to answer a question q that is significantly harder than what humans can answer, will it be able to answer the question correctly? Why/why not?
No, it won't. Where would the AI have learned the capability to do that? It doesn't matter whether the AI wants to achieve the objective "just look good to humans" or "reason to try to find the right answers".
So how do we get an AI that is superhuman at answering questions?
One possibility is amplification schemes, and there may be other options, but here we are interested in the approach where we train an AI not only on human-labeled data, e.g. by making it predict sensor data, so that it isn't bounded by what humanity can label.
We call the part that predicts sensor data "predictor" and the part that should answer questions "reporter", and there we are in the ELK setting!
(Ok in the ELK setting we kind of strictly train the predictor before the reporter, which isn’t strictly required, but we’re roughly in an ELK setting.)
So now we are in the ELK setup, we've trained our predictor, and now train our reporter on the training dataset.
The hope is that if the reporter looks at the predictor to answer our questions, it may generalize to answer questions correctly even when we humans don't know the answer, because it can still easily look up the answer in the predictor.
(One possibly pretty important obstacle to ELK that I see is that it may be harder to translate answers to questions where we humans don't know the answer. A special case of this is that the predictor may have the information about "what humans think" encoded inside and those are easier to translate than the whole ground-truth state. But that's not the topic of this post.)
The problem now is that when the predictor reasons with different concepts than humans and is really large, doing the translation may be really complicated, and so it may be simpler for the reporter to learn to answer the questions itself, using only the input the predictor gets.
One concrete possibility of how that might happen is that the reporter learns to reason like a human, so in the ELK report they just take this concrete scenario and call the reporter the "human simulator".
But there are other ways the reporter might learn to reason for itself to achieve perfect performance on the training set.
(In retrospect, we see that "human simulator" is kind of a bad name as it may lead to the confusion that it wants to simulate the human, which is wrong.)
To see that there are other reasoners/simulators than the human simulator, take a look at this diagram, which shows the space of questions that can be answered by an AI.
As you can see, the direct translator can answer many more questions than the human simulator, but there are many other possibilities that can be learned that aren't exactly the human simulator, but still can answer all the questions a human can answer.
The problem when we get such a reasoner is, as we already saw in the last section, that it won't become capable enough to answer harder questions just from being trained on questions humans can answer.
If we get a reporter that isn't exactly the human simulator, i.e. one that reasons a bit differently in some ways, it is possible that its answers may generalize slightly into some domains, but not much. And overall it seems like a good heuristic that it is easier to learn to answer fewer questions rather than more, so it really won't generalize much. (In practice we may indeed get a self-reasoning reporter that is pretty close to being a human simulator, but the critical point is just that it is self-reasoning.)
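To make the structural difference concrete, here is a minimal toy sketch in Python. It is only an illustration under strong assumptions, not how ELK is set up in the report: the world is a list of bits, a question is "what is bit q?", humans can only read the first few bits off the observation, and the predictor's latent is the world XORed with a fixed key (a stand-in for "different concepts than humans"). All names (`Predictor`, `KEY`, `N_VISIBLE`, the two reporter functions) are hypothetical. Note that the toy makes translation trivially easy, so it does not capture the part of the problem where the direct translator is the more complex thing to learn.

```python
from dataclasses import dataclass
from typing import List

N_FACTS = 8      # size of the toy world state (assumption)
N_VISIBLE = 3    # facts a human can read off the observation (assumption)
KEY = [1, 0, 1, 1, 0, 0, 1, 0]  # stand-in for the predictor's non-human encoding


@dataclass
class Predictor:
    """Toy predictor: its latent holds the full world state, but encoded."""
    latent: List[int]
    observation: List[int]  # what the sensors (and therefore humans) see

    @classmethod
    def from_world(cls, world: List[int]) -> "Predictor":
        latent = [bit ^ k for bit, k in zip(world, KEY)]
        return cls(latent=latent, observation=world[:N_VISIBLE])


def self_reasoning_reporter(p: Predictor, q: int) -> int:
    """Ignores the latent and reasons only from the observation."""
    if q < len(p.observation):
        return p.observation[q]
    return 0  # nothing to reason from, so it falls back to a fixed guess


def direct_translator(p: Predictor, q: int) -> int:
    """Decodes the predictor's latent and looks the answer up there."""
    return p.latent[q] ^ KEY[q]


def accuracy(reporter, p: Predictor, world: List[int], questions: List[int]) -> float:
    correct = sum(reporter(p, q) == world[q] for q in questions)
    return correct / len(questions)


world = [1, 0, 1, 1, 1, 0, 1, 1]
predictor = Predictor.from_world(world)
human_level = list(range(N_VISIBLE))       # questions like those in the training set
harder = list(range(N_VISIBLE, N_FACTS))   # questions humans cannot answer

for name, reporter in [("self-reasoning", self_reasoning_reporter),
                       ("direct translator", direct_translator)]:
    print(name,
          accuracy(reporter, predictor, world, human_level),
          accuracy(reporter, predictor, world, harder))
```

Both reporters are perfect on the human-level questions, so the training set cannot tell them apart. Only the one that actually reads the predictor stays correct on the harder questions, and that is independent of what either reporter "wants".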
So what is the main lesson of the post? (You may want to try answering this for yourself first; the below is just one possible short summary.)
The "human simulator" isn't about the reporter wanting to look good to humans, but about the reporter reasoning for itself in a similar way to how humans reason. The main problem here isn't that those self-reasoning reporters don't *want* to answer hard questions (that humans cannot answer) honestly, it's that they aren't *capable* enough to do it. We aren’t trying to get a reporter with the right objective, but one with a nice structure that gives us some hope that its question-answering capability generalizes.