(See also: strong HCH.)
Consider a human Hugh who has access to a question-answering machine. Suppose the machine answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering machine.
That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh…
Let’s call this process HCH, for “Humans Consulting HCH.”
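The recursion can be made concrete with a small sketch. This is only an illustration under stated assumptions: real HCH has no depth limit, so the `depth` cutoff here is an approximation device, and the names `hch`, `hugh`, and `ask` are hypothetical. "Hugh" is a toy who can only add two numbers himself and delegates everything else to copies.

```python
from typing import Callable

# A "human" is modeled as a function that receives a question plus an `ask`
# callback standing in for the question-answering machine (illustrative only).
Ask = Callable[[object], object]
Human = Callable[[object, Ask], object]


def hch(human: Human, question, depth: int):
    """Depth-limited approximation of HCH (the real process is unbounded)."""

    def ask(subquestion):
        # Each consultation is answered by a fresh copy of the same human,
        # who has one less level of consultation available.
        if depth == 0:
            raise RuntimeError("consultation budget exhausted")
        return hch(human, subquestion, depth - 1)

    return human(question, ask)


def hugh(question, ask):
    """Toy Hugh: sums a list by delegating each half to a copy of himself."""
    numbers = question
    if len(numbers) <= 1:
        return numbers[0] if numbers else 0
    mid = len(numbers) // 2
    return ask(numbers[:mid]) + ask(numbers[mid:])


total = hch(hugh, [1, 2, 3, 4, 5], depth=3)
```

With `depth=3` the toy tree is deep enough to sum five numbers; with a smaller budget some copy of Hugh would try to consult the machine and fail, which is one way to see why the unbounded version is an ideal rather than something we can run.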
I’ve talked about many variants of this process before, but I find it easier to think about with a nice handle. (Credit to Eliezer for proposing using a recursive acronym.)
HCH is easy to specify very precisely. For now, I think that HCH is our best way to precisely specify “a human’s enlightened judgment.” It’s got plenty of problems, but for now I don’t know anything better.
Elaborations
We can define realizable variants of this inaccessible ideal:
- For a particular prediction algorithm P, define HCHᴾ as:
  “P’s prediction of what a human would say after consulting HCHᴾ”
- For a reinforcement learning algorithm A, define max-HCHᴬ as:
  “A’s output when maximizing the evaluation of a human after consulting max-HCHᴬ”
- For a given market structure and participants, define HCHᵐᵃʳᵏᵉᵗ as:
  “the market’s prediction of what a human will say after consulting HCHᵐᵃʳᵏᵉᵗ”
Note that e.g. HCHᴾ is totally different from “P’s prediction of HCH.” HCHᴾ will generally make worse predictions, but it is easier to implement.
Hope
The best case is that HCHᴾ, max-HCHᴬ, and HCHᵐᵃʳᵏᵉᵗ are:
- As capable as the underlying predictor, reinforcement learner, or market participants.
- Aligned with the enlightened judgment of the human, e.g. as evaluated by HCH.
(At least when the human is suitably prudent and wise.)
It is clear from the definitions that these systems can’t be any more capable than the underlying predictor/learner/market. I honestly don’t know whether we should expect them to match the underlying capabilities. My intuition is that max-HCHᴬ probably can, but that HCHᴾ and HCHᵐᵃʳᵏᵉᵗ probably can’t.
It is similarly unclear whether the system continues to reflect the human’s judgment. In some sense this is in tension with the desire to be capable — the more guarded the human, the less capable the system but the more likely it is to reflect their interests. The question is whether a prudent human can achieve both goals.
This was originally posted here on 29th January 2016.
It seems to me like there are two separate ideas in this post.
One is HCH itself. Actually, HCH (defined as an infinitely large tree of humans) is a family of schemes with two parameters, so it's really
$\mathrm{HCH}_{h,t}$
where $h$ is a human, and $t$ the amount of time each human has to think. This is impossible to implement since we don't have infinitely many copies of the same human -- and also because the scheme requires that, every time a human consults a subtree, we freeze time for that human until the subtree has computed the answer. But it's useful insofar as it can be approximated.
It is unclear whether there exists any human $h$ such that, if we set $t$ to an hour, $\mathrm{HCH}_{h,t}$ has superintelligent performance on a question-answering task; this relies on some version of the Factored Cognition hypothesis.
Separately, the $\mathrm{HCH}^x$ schemes are about implementations, but they still define targets that can only be approximated, not literally implemented.
Taking just the first one -- if $P$ is a prediction algorithm, then each $\mathrm{HCH}^P_n$ for $n \in \mathbb{N}$ can be defined recursively. Namely, we set $\mathrm{HCH}^P_0$ to be $P$'s learned prediction of $h$'s output, and $\mathrm{HCH}^P_n$ to be $P$'s learned prediction of $\mathrm{HCH}^P_{n-1}$'s output. Each step requires a training process. We could then set $\mathrm{HCH}^P := \lim_{n \to \infty} \mathrm{HCH}^P_n$, but this is not literally achievable since it requires infinitely many training steps.
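The level-by-level construction can be sketched in a toy setting. Everything here is an assumption made for illustration: the "predictor" simply memorizes its target on a finite question set (a stand-in for a training process), the human is a toy who can compute a triangular number only when given the previous one, and each level is read as the prediction of the human consulting the previous level. The names `train_predictor`, `human`, and `hch_p_levels` are hypothetical.

```python
def train_predictor(target, questions):
    """Stand-in for P: 'learn' the target by memorizing it on a finite set."""
    table = {q: target(q) for q in questions}
    return lambda q: table[q]


def human(n, assistant=None):
    """Toy h asked for 0 + 1 + ... + n.

    Unaided, h only knows the answers for n <= 1. With an assistant, h asks
    for the answer to n - 1 and adds n. None means "I don't know".
    """
    if n <= 1:
        return n
    if assistant is None:
        return None
    previous = assistant(n - 1)
    return None if previous is None else previous + n


def hch_p_levels(questions, steps):
    """Return [level 0, ..., level `steps`], one training run per level."""
    level = train_predictor(lambda q: human(q), questions)
    levels = [level]
    for _ in range(steps):
        prev = level
        # Level n predicts the human consulting level n - 1 (assumed reading).
        level = train_predictor(lambda q, prev=prev: human(q, prev), questions)
        levels.append(level)
    return levels


levels = hch_p_levels(range(6), steps=4)
```

In this toy, each additional level can answer one more question than the last, so any finite number of training runs covers only part of what the limit would. That matches the point that the limit is a target to approximate rather than something literally reachable.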