Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'll try to clarify what I was doing with the AI truth setup in a previous post. First I'll explain the nature of the challenge, and then how the setup tries to solve it.

The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that that truth can be completely misleading, or, more likely, incomprehensible.


For misleading you have old staples like "If you implement this strategy, the incidence of cancer will go down" (neglecting to mention that this will be because everyone will be dead). For incomprehensible, there are statements like "ratios of FtsZ to Pikachurin plus C3-convertase to Von Willebrand Factor will rise in proportion to the average electronegativity in the Earth's orbit". That might be true, but is incomprehensible to a layman and not much better to an expert.

In fact, the AI's answer might be a confusing mix of misleading and incomprehensible. If we've managed to find a good formal definition of "humans have cancer" but not of "humans are still alive", then we'll confront that mix: we'll know this will solve cancer, but we can't figure out whether humanity will survive or not.

Then there are the omnipresent risks from human biases: confirmation bias, backfire effects, optimism or pessimism bias... Should the AI take these into account, and present us with inaccurate information that nonetheless leaves us with a more accurate impression at the end? And how are we going to measure that "accurate impression" anyway?

Measuring true understanding with an exam

Ignore the second AI in the setup for the moment, and focus simply on the first. One mark of true understanding in a human is that they can accurately answer questions about what they've just learnt. This suggests an exam as a true measure of understanding. And thus the goal for the first AI: it must prep its human to successfully pass the exam.

Now, many have argued that exams are not the be all and end all of understanding. But it does seem that passing exams is a necessary component of true understanding. If you understand something better but can't answer anything new about it, I'd be very curious as to what your understanding consists of.

Now, if there are only a few questions in the exam, the first AI can teach to the test, and simply have the human memorise a few meaningless answers. Technically, it could also do this for billions and billions of questions, but the frailty of human memory partially protects us here: it's much more plausible that we have a process for figuring out the answers to that many questions (i.e. a sort of understanding) than a long list to follow.

However, it's not plausible for normal humans to sit down and answer a billion questions with any degree of competence or speed. Fortunately, that will not be necessary: instead of answering N questions, the human must simply be capable of answering one question chosen randomly from among N. This means that N can be very large indeed, combinatorially large, far too large for even a machine to memorise. Some sort of compression or understanding or simplification will be needed.
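To make the memorisation point concrete, here is a minimal sketch of the arithmetic. It assumes, purely for illustration, that memorised answers are always right and that every other yes/no question is answered by a fair coin flip; the function name and the sizes used are hypothetical, not part of the setup.

```python
from fractions import Fraction

def expected_accuracy(memorised: int, total: int) -> Fraction:
    """Probability of answering one uniformly drawn yes/no question correctly.

    Assumes (for illustration) that memorised answers are always right and
    every other question is answered by a fair coin flip.
    """
    if total <= 0 or not 0 <= memorised <= total:
        raise ValueError("need 0 <= memorised <= total and total > 0")
    known = Fraction(memorised, total)
    return known + (1 - known) * Fraction(1, 2)

# Hypothetical sizes: a billion memorised answers out of 2**100 questions.
edge = expected_accuracy(10**9, 2**100) - Fraction(1, 2)
print(float(edge))  # the advantage over pure guessing
```

Under these assumptions, a lookup table of any feasible size barely moves the expected score away from one half once N is combinatorially large, so only some compressed rule for answering can help.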

So, formally, the first AI is rewarded when the human chooses the right answer to a randomly chosen question. Obviously the AI must not know the question in advance, and must not be able to advise the human on the answer once the question is given, or the whole setup falls apart.

The roles of the second AI and the surprise-test counterfactual

So, what's the point of having the second AI? Well, the setup above will work if we have a list of N questions that are a) formally defined for the first AI, and b) understandable for the human.

But that's rarely the case. And if it were the case, we might have alternatives, like simply having the AI directly answer a whole bunch of the understandable questions.

So typically, the questions are going to be like that "ratios of FtsZ..." statement above, and completely beyond human comprehension. Enter the second AI. Its only role is to take these formally defined questions, and translate them into terms the human is likely to understand -- and hence, answer correctly. Then both AIs are given the motivation to have the human choose the correct answer.

So the second AI is like an exam invigilator clarifying the questions on the exam sheet. Now, if the invigilator knows the answer to the question and really wants the student to pass, it can simply tell the human the answer. So the second AI must be ignorant of the answers itself: only the first AI knows. Similarly, if the second AI has a long conversation with the human, it can extract the info the first AI gave the human, and then use that to figure out the answer, and tell the human what to answer.

Finally, if the questions are biased (for instance the answer is "no" for most of the questions), then the first AI could give up on giving genuine understanding and simply tell the human to always answer "no".

All these points can be addressed by question selection and information control. Both AIs start with a prior P over whatever is being analysed. Then N boolean questions are selected, where N is a very large number. These questions are selected so that they are almost 50-50, given P. That means that they might have a definite answer, given P, but the slightest bit of extra evidence is likely to push them towards "yes" or "no" with (almost) equal ease. Furthermore, for every question among the N, its negation is also among the N, so there will always be half of the questions with a "yes" answer and half with a "no", whatever happens.
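The negation trick can be sketched in a few lines. This is a toy illustration only: the `Question` class, the integer "world state" and the helper names are assumptions made for the example, not part of the original setup.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Question:
    text: str
    predicate: Callable[[int], bool]  # evaluated on a toy integer world state

    def negated(self) -> "Question":
        inner = self.predicate
        return Question(f"NOT ({self.text})", lambda world: not inner(world))

def close_under_negation(base: List[Question]) -> List[Question]:
    """Return the base questions with each one's negation added alongside it."""
    return [item for q in base for item in (q, q.negated())]

def yes_fraction(questions: List[Question], world: int) -> float:
    """Fraction of questions whose true answer is "yes" in the given world."""
    return sum(q.predicate(world) for q in questions) / len(questions)

base = [
    Question("world is even", lambda w: w % 2 == 0),
    Question("world exceeds 10", lambda w: w > 10),
    Question("world is divisible by 3", lambda w: w % 3 == 0),
]
questions = close_under_negation(base)
```

Whatever the world turns out to be, exactly half of `questions` have a "yes" answer, so a blanket "always say no" strategy scores exactly one half.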

Then the first AI gets extra evidence, and computes a posterior P'. It then talks with the human, trying to get them to understand P'. The human goes over to the second AI, and a question is drawn randomly from the N. The second AI only knows the prior P, so it knows that the question it's asking is roughly equally likely to be answered by "yes" or "no". It doesn't get to see or hear or interact with the human in any way; it simply creates an explanation for the question, and leaves it up to the human to act on. Thus, it cannot update P in any way, so can't choose to pass on extra information about P' -- because it doesn't have any.

And then the human, armed with the knowledge of the posterior P' from the first AI and the explanation of the question from the second, will hopefully be able to answer that question with better than random accuracy.
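The information flow described above can be written down as a small sketch. Everything here is a hypothetical stand-in: the lesson is a plain string, and `explain` and `human_answer` are placeholder callables for the second AI and the human. The point is only to show who gets to see what, and when.

```python
import random
from typing import Callable, Sequence, Tuple

# Each exam entry pairs a formal question with its true answer
# under the first AI's posterior.
Exam = Sequence[Tuple[str, bool]]

def run_exam(
    exam: Exam,
    lesson: str,                               # what the first AI taught, fixed before the draw
    explain: Callable[[str], str],             # second AI: sees only the formal question
    human_answer: Callable[[str, str], bool],  # human: sees the lesson and the explanation
    rng: random.Random,
) -> int:
    """Return the shared reward: 1 if the human answers correctly, else 0."""
    formal_question, truth = rng.choice(exam)  # drawn only after teaching is over
    explanation = explain(formal_question)     # no access to lesson, human or truth
    return int(human_answer(lesson, explanation) == truth)
```

In this sketch the second AI's function never receives the lesson, the human's state or the true answer, which mirrors the requirement that it has nothing extra to pass on.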

Finally, it would be ponderous to run this exam every single time the first AI communicates anything. So instead this exam is done as a surprise test -- with some small probability it will happen, so the first AI had better be on its toes all the time, while freeing up the human to just receive the understanding and act freely on it, most of the time.
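A surprise test of this kind might be sketched as follows. The function name and the `test_probability` parameter are illustrative assumptions; the exam itself is passed in as an opaque callable that returns the reward.

```python
import random
from typing import Callable, Optional

def maybe_examine(
    run_exam: Callable[[], int],
    test_probability: float,
    rng: random.Random,
) -> Optional[int]:
    """With probability `test_probability`, run the exam and return its reward.

    Otherwise return None: no exam happens and the human simply acts on
    what they were taught.
    """
    if not 0.0 <= test_probability <= 1.0:
        raise ValueError("test_probability must lie in [0, 1]")
    if rng.random() < test_probability:
        return run_exam()
    return None
```

Since the first AI cannot tell in advance which conversations will be examined, it has the same incentive to teach well in every one of them.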

5 comments

Does the following strategy for both AIs work:

The first AI compresses advice that would allow answering the questions well on average into a short string. Then, when "giving the human the information", it gets the human to memorize this short string using e.g. a memory palace technique, perhaps with error-correcting codes. Then the second AI extracts this string from the human and uses it to answer the questions.

The second AI doesn't get to extract anything, nor does it answer anything - it gets literally no information from the human.

What it would have to do is instruct the human (sight unseen) on how to use the hidden string to answer the question.

I don't think that scenario quite works out for the AIs in most situations, but I am wary that there is something like that, something the AIs can do to allow the human to answer correctly without understanding.

Why wouldn't it work? The second AI takes the question “ratios of FtsZ to Pikachurin plus C3-convertase to Von Willebrand Factor will rise in proportion to the average electronegativity in the Earth’s orbit” and translates it to something like "is the third object in the memory palace a warm color".

Yes, if a human is able and willing to do better with a memory palace rather than true understanding, this could work. Note that it's rather brittle, though (it needs to have the two AIs aware of the exact sequence), and we might be able to break it that way (increasing uncertainties between the two AIs in some way).

The big unknown is whether, with a billion+ questions to choose from, memorised data will perform better than genuine-ish understanding.

Interesting, but it is hard for me to think of an example where (i) we can mechanically generate a huge database of formally defined questions, (ii) the questions are not human-understandable a priori and (iii) the questions can be made human-understandable. Any example of i+ii that I can think of involves things like "given a dynamical system of type T whose initial conditions are described by the following huge table of numbers, will the 2356th parameter in the system's state after 746.892 units of evolution time be larger than 21.783?"

Also, I guess the implicit assumption is that there is an additional safety mechanism in place which prevents the AI from modifying the human's mind in some terrible way?