Suppose that ELK was solved, and we could train AIs to answer unambiguous human-comprehensible questions about the consequences of their actions. How could we actually use this to guide a powerful AI’s behavior? For example, how could we use it to select amongst many possible actions that an AI could take?
The natural approach is to ask our AI “How good are the consequences of action A?” but that’s way outside the scope of “narrow” ELK as described in Appendix: narrow elicitation.
Even worse: in order to evaluate the goodness of very long-term futures, we’d need to know facts that narrow elicitation can’t even explain to us, and to understand new concepts and ideas that are currently unfamiliar. For example, determining whether an alien form of life is morally valuable might require concepts and conceptual clarity that humans don’t currently have.
We’ll suggest a very different approach:
1. I can use ELK to define a local utility function over what happens to me over the next 24 hours. More generally, I can use ELK to interrogate the history of potential versions of myself and define a utility function over who I want to delegate to—my default is to delegate to a near-future version of myself because I trust similar versions of myself, but I might also pick someone else, e.g. in cases where I am about to die or think someone else will make wiser decisions than I would.
2. Using this utility function, I can pick my “favorite” distribution over people to delegate to, from amongst those that my AI is considering. If my AI is smart enough to keep me safe, then hopefully this is a pretty good distribution.
3. The people I prefer to delegate to can then pick the people they want to delegate to, who can then pick the people they want to delegate to, etc. We can iterate this process many times, obtaining a sequence of smarter and smarter delegates.
4. This sequence of smarter and smarter delegates will gradually come to have opinions about what happens in the far future. Me-of-today can only evaluate the local consequences of actions, but me-in-the-future has grown enough to understand the key considerations involved, and can thus evaluate the global consequences of actions. Me-of-today can thus define utilities over “things I don’t yet understand” by deferring to me-in-the-future.
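The four steps above can be read as a simple loop: score the candidates with a local utility function, hand off to the favorite, and repeat. The following is only a toy sketch of that loop, not an implementation of ELK or of the proposal itself. Every name in it (`Delegate`, `wisdom`, `local_utility`, `propose_candidates`, `iterate_delegation`) is hypothetical. It also simplifies in two ways: it picks a single delegate rather than a distribution over delegates, and it stands in for "what ELK lets me learn about a candidate" with a single made-up number.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Delegate:
    name: str
    # Hypothetical scalar standing in for "how much this delegate understands".
    wisdom: float


def local_utility(current: Delegate, candidate: Delegate) -> float:
    """Toy local preference of `current` over who to delegate to.

    Assumption for illustration: the current delegate prefers wiser
    candidates, but refuses anyone too different from itself to trust.
    """
    gap = candidate.wisdom - current.wisdom
    if gap > 1.0:
        return float("-inf")  # too unfamiliar to evaluate locally
    return gap


def propose_candidates(current: Delegate) -> List[Delegate]:
    """Hypothetical stand-in for the options the AI is considering."""
    return [
        Delegate(current.name + "+", current.wisdom + step)
        for step in (0.0, 0.5, 2.0)
    ]


def iterate_delegation(start: Delegate, steps: int) -> List[Delegate]:
    """Each delegate picks its favorite successor; repeat `steps` times."""
    chain = [start]
    for _ in range(steps):
        current = chain[-1]
        candidates = propose_candidates(current)
        favorite = max(candidates, key=lambda c: local_utility(current, c))
        chain.append(favorite)
    return chain


chain = iterate_delegation(Delegate("me", 0.0), steps=3)
wisdom_trace = [d.wisdom for d in chain]
```

In this toy setup no single step jumps to the much wiser but unfamiliar candidate, yet the chain as a whole still moves well beyond where it started. That is the point of step 4: each hand-off is judged only by a local evaluation, while the sequence gradually reaches delegates who can evaluate things the first one could not.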
What happens if this finds a way to satisfy values that the human actually has, but would not have if they had been able to do ELK on their own brain? For example, I'm pretty sure I don't want to want some things I want, and I'm worried about s-risks from the scaled version of locking in networks of conflicting things people currently truly want but truly wouldn't want to truly want. E.g., I'm pretty sure mine are milder than this, but some people truly want to hurt others in ways the other doesn't want in order to get ahead, and would resist any attempt to...