Crossposted at the Intelligent Agent Forum

I'll try to clarify what I was doing with the AI truth setup in a previous post. First I'll explain the nature of the challenge, and then how the setup tries to solve it.

The nature of the challenge is to have an AI give genuine understanding to a human. Getting the truth out of an AI or Oracle is not that hard, conceptually: you get the AI to report some formal property of its model. The problem is that that truth can be completely misleading, or, more likely, incomprehensible.

For misleading you have old staples like "If you implement this strategy, the incidence of cancer will go down" (neglecting to mention that this will be because everyone will be dead). For incomprehensible, there are statements like "ratios of FtsZ to Pikachurin plus C3-convertase to Von Willebrand Factor will rise in proportion to the average electronegativity in the Earth's orbit". That might be true, but is incomprehensible to a layman and not much better to an expert.

In fact, the AI's answer might be a confusing mix of misleading and incomprehensible. If we've managed to find a good formal definition of "humans have cancer" but not of "humans are still alive", then we'll confront that mix: we'll know this will solve cancer, but we can't figure out whether humanity will survive or not.

Then there are the omnipresent risks from human biases: confirmation bias, backfire effects, optimism or pessimism bias... Should the AI take these into account, and present us with inaccurate information that nonetheless leaves us with a more accurate impression at the end? And how are we going to measure that "accurate impression" anyway?


Measuring true understanding with an exam

Ignore the second AI in the setup for the moment, and focus simply on the first. One mark of true understanding in a human is that they can accurately answer questions about what they've just learnt. This suggests an exam as a true measure of understanding. And thus the goal for the first AI: it must prep its human to successfully pass the exam.

Now, many have argued that exams are not the be-all and end-all of understanding. But it does seem that passing exams is a necessary component of true understanding. If you understand something better but can't answer anything new about it, I'd be very curious as to what your understanding consists of.

Now, if there are only a few questions in the exam, the first AI can teach to the test, and simply have the human memorise a few meaningless answers. Technically, it could also do this for billions and billions of questions, but the frailty of human memory partially protects us here: it's much more plausible that we have a process for figuring out the answers to that many questions (i.e. a sort of understanding) than a long list to follow.

However, it's not plausible for normal humans to sit down and answer a billion questions with any degree of competence or speed. Fortunately, that will not be necessary: instead of answering N questions, the human must simply be capable of answering one question chosen randomly from among N. This means that N can be very large indeed, combinatorically large, far too large for even a machine to memorise. Some sort of compression or understanding or simplification will be needed.
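As a rough illustration of why a single random draw from a huge pool defeats memorisation, here is a minimal sketch. The function name and the modelling assumptions are hypothetical and not part of the original setup: memorised questions are assumed to be answered correctly, and every other boolean question is assumed to be a coin flip.

```python
def expected_accuracy(n_questions: int, n_memorised: int) -> float:
    """Chance of a correct answer when one question is drawn uniformly at random.

    Illustrative assumptions: a memorised question is always answered
    correctly, and any other boolean question is answered by a fair coin flip.
    """
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    if not 0 <= n_memorised <= n_questions:
        raise ValueError("n_memorised must lie between 0 and n_questions")
    memorised_fraction = n_memorised / n_questions
    # Memorised questions score 1, the rest score 1/2 on average.
    return memorised_fraction + (1 - memorised_fraction) * 0.5
```

Under these assumptions, memorising a thousand answers out of a trillion leaves the expected accuracy indistinguishable from chance, so only something that generalises across the pool can move the score.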

So, formally, the first AI is rewarded when the human chooses the right answer to a randomly chosen question. Obviously the AI must not know the question in advance, nor be able to advise the human on the answer once the question is given, or the whole setup falls apart.


The roles of the second AI and the surprise-test counterfactual

So, what's the point of having the second AI? Well, the setup above will work if we have a list of N questions that are a) formally defined for the first AI, and b) understandable for the human.

But that's rarely the case. And if it were the case, we might have alternatives, like simply having the AI directly answer a whole bunch of the understandable questions.

So typically, the questions are going to be like that "ratios of FtsZ..." statement above, and completely beyond human comprehension. Enter the second AI. Its only role is to take these formally defined questions, and translate them into terms the human is likely to understand -- and hence, answer correctly. Then both AIs are given the motivation to have the human choose the correct answer.

So the second AI is like an exam invigilator clarifying the questions on the exam sheet. Now, if the invigilator knows the answer to the question and really wants the student to pass, it can simply tell the human the answer. So the second AI must be ignorant of the answers itself: only the first AI knows. Similarly, if the second AI has a long conversation with the human, it can extract the info the first AI gave the human, and then use that to figure out the answer, and tell the human what to answer.

Finally, if the questions are biased (for instance the answer is "no" for 99% of the questions), then the first AI could give up on giving genuine understanding and simply tell the human to always answer "no".

All these points can be addressed by question selection and information control. Both AIs start with a prior P over whatever is being analysed. Then N boolean questions are selected, where N is a very large number. These questions are selected so that they are almost 50-50, given P. That means that they might have a definite answer, given P, but the slightest bit of extra evidence is likely to push them towards "yes" or "no" with (almost) equal ease. Furthermore, for every question among the N, its negation is also included, so half of the questions will always have a "yes" answer and half a "no", whatever happens.
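The selection rule can be sketched in a few lines. This is a toy model under stated assumptions, not the actual mechanism: statements are identified by hypothetical string ids, the prior P is reduced to a table `prior_yes` of per-statement "yes" probabilities, and `tolerance` is an invented parameter for how close to 50-50 a question must be.

```python
from typing import Dict, List, Tuple

# A question is (statement id, negated?). Its answer is the truth value of
# the statement, flipped when `negated` is True.
Question = Tuple[str, bool]


def select_questions(prior_yes: Dict[str, float],
                     tolerance: float = 0.05) -> List[Question]:
    """Keep statements that are near 50-50 under the prior, plus their negations."""
    pool: List[Question] = []
    for statement, p_yes in prior_yes.items():
        if abs(p_yes - 0.5) <= tolerance:
            pool.append((statement, False))  # the question itself
            pool.append((statement, True))   # its negation
    return pool


def answer(question: Question, truth: Dict[str, bool]) -> bool:
    """Correct answer to a question, given the actual truth values."""
    statement, negated = question
    return truth[statement] != negated
```

Because each selected statement enters the pool together with its negation, exactly half of the pool has a "yes" answer whatever the truth turns out to be, so a blanket "always answer no" strategy scores exactly 50%.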

Then the first AI gets extra evidence, and computes a posterior P'. It then talks with the human, trying to get them to understand P'. The human goes over to the second AI, and a question is drawn randomly from N. The second AI only knows P, so it knows that the question it's asking is roughly equally likely to be answered by "yes" or "no". It doesn't get to see or hear or interact with the human in any way; it simply creates an explanation for the question, and leaves it up to the human to act on. Thus, it cannot update P in any way, so it can't choose to pass on extra information about P' -- because it doesn't have any.

And then the human, armed with the knowledge of P' from the first AI and the explanation of the question from the second, will hopefully be able to answer that question with better than random accuracy.
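The information flow of one exam round can be sketched as follows. All names here (`run_exam`, `teach`, `explain`, `human_answer`) are hypothetical, and the posterior P' is simplified to a table of definite truth values; the only point is who gets to see what, and in which order.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Question = Tuple[str, bool]  # (statement id, negated?)


def run_exam(questions: List[Question],
             truth: Dict[str, bool],
             teach: Callable[[Dict[str, bool]], Any],
             explain: Callable[[Question], Any],
             human_answer: Callable[[Any, Any], bool],
             rng: random.Random) -> float:
    """Run one exam round and return the shared reward (1.0 or 0.0)."""
    # 1. The first AI briefs the human before any question has been drawn.
    briefing = teach(truth)
    # 2. Only now is a question drawn uniformly at random from the pool.
    question = rng.choice(questions)
    # 3. The second AI sees the question alone: no briefing, no truth, no human.
    explanation = explain(question)
    # 4. The human answers from the briefing plus the explanation.
    guess = human_answer(briefing, explanation)
    statement, negated = question
    correct = truth[statement] != negated
    return 1.0 if guess == correct else 0.0
```

The constraints in the text show up as function signatures: `teach` never receives the question, and `explain` never receives the briefing or the truth, so neither AI can simply hand over the answer.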

Finally, it would be ponderous to run this exam every single time the first AI communicates anything. So instead the exam is done as a surprise test -- with probability ε it will happen, so the first AI had better be on its toes all the time, while freeing up the human to just receive the understanding and act freely on it, most of the time.
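A minimal sketch of the surprise-test gate, assuming (as an illustration only) that the first AI's reward comes solely from the exam and that `run_exam` is some hypothetical zero-argument callable that runs one round and returns its reward:

```python
import random
from typing import Callable, Optional


def surprise_exam(epsilon: float,
                  run_exam: Callable[[], float],
                  rng: random.Random) -> Optional[float]:
    """With probability epsilon run the exam and return its reward, else None."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be a probability")
    if rng.random() < epsilon:
        return run_exam()
    return None  # no exam this time; the human simply acts on what they learnt


def expected_reward(epsilon: float, accuracy: float) -> float:
    """Expected reward per communication if exams are the only reward source."""
    return epsilon * accuracy
```

Under that assumption the expected reward is just ε times the human's accuracy, so for any ε > 0 the ranking of teaching strategies is the same as if the exam were run every time.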


5 comments

Thank you, this is clearer than it was before, and it does seem like a potentially useful technique. I see a couple of limitations:

First, it still seems that the whole plan rests on having a good selection of questions, and the mechanism for choosing them is unclear. If they are chosen by some structured method that thoroughly covers the AI's representation of the prior, the questions asked of the human are unlikely to capture the most important aspects of the update from new evidence. Most of the differences between the prior and the posterior could be insignificant from a human perspective, and so even if the human "understands" the posterior in a broad sense they will not be likely to have the answers to all of these. Even if they can figure out those answers correctly, it does not necessarily test whether they are aware of the differences that are most important.

Second, the requirement for the two AIs to have a common prior, and differ only by some known quantum of new evidence, seems like it might restrict the applications considerably. In simple cases you might handle this by "rolling back" a copy of the first AI to a time when it had not yet processed the new evidence, and making that the starting point for the second AI. But if the processing of the evidence occurred before some other update that you want included in the prior, then you would need some way of working backward to a state that never previously existed.

Your first point is indeed an issue, and I'm thinking about it. The second is less of a problem, because now we have a goal description, so implementing the goal is less of an issue.

Possibly a third adversarial AI? Have an AI that generates the questions based on P, is rewarded if the second AI evaluates their probability as close to 50%, is rewarded for the first AI being able to get them right based on P', and for the human getting them wrong.

That's probably not quite right; we want the AI to generate hard but not impossible questions. Possibly some sort of term about the AIs predicting whether the human will get a question right?

Imagine doing this with one AI: It reformulates each question, then it gets the new prior, then it talks to the human. Ignore that it has to do N work in the first step. That might make this easier to see: Why do you think bringing the questions into a form that allows for easy memorization by humans has anything to do with understanding? It could just do the neural-net equivalent of zip compression of a hashmap from reformulated questions to probabilities.

"It could just do the neural-net equivalent of zip compression of a hashmap from reformulated questions to probabilities."

But that hashmap has to run on a human mind, and understanding helps us run things like that.
