I'm now in a position where I can see a possible route to a safe/survivable/friendly Artificial Intelligence being developed. I'd give a 10+% chance of it being possible this way, and a 95% chance that some of these ideas will be very useful for other methods of alignment. So I thought I'd encode the route I'm seeing as a research agenda; this is the first public draft of it.

Clarity, rigour, and practicality: that's what this agenda needs. Writing this agenda has clarified a lot of points for me, to the extent that some of it now seems, in retrospect, just obvious and somewhat trivial - "of course that's the way you have to do X". But more clarification is needed in the areas that remain vague. And, once these are clarified enough for humans to understand, they need to be made mathematically and logically rigorous - and ultimately, cashed out into code, and tested and experimented with.

So I'd appreciate any comments that could help with these three goals, and welcome anyone interested in pursuing research along these lines over the long term.

Note: I periodically edit this document, to link it to more recent research ideas/discoveries.

0 The fundamental idea

This agenda fits into the broad family of Inverse Reinforcement Learning: delegating most of the task of inferring human preferences to the AI itself. Only most of the task, since it has been shown that humans need to build the right assumptions into the AI, or else the preference learning will fail.
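To see why some assumptions have to be built in, here is a minimal toy sketch. It is not part of the agenda itself; the action names, the two planners, and the reward values are all made up for illustration. It shows that a single observed choice is explained equally well by a rational agent with one reward and by an anti-rational agent with the opposite reward, so behaviour alone cannot separate the two.

```python
# Toy illustration (all names and numbers are hypothetical): the same observed
# behaviour is consistent with opposite rewards, depending on which planner
# (model of the human's rationality) we assume.

ACTIONS = ["tea", "coffee"]


def rational_planner(reward):
    """Choose the action with the highest reward."""
    return max(ACTIONS, key=lambda a: reward[a])


def anti_rational_planner(reward):
    """Choose the action with the lowest reward."""
    return min(ACTIONS, key=lambda a: reward[a])


# What we actually observe the human doing.
observed_action = "tea"

# Two opposite hypotheses about what the human wants.
reward_likes_tea = {"tea": 1.0, "coffee": 0.0}
reward_hates_tea = {a: -r for a, r in reward_likes_tea.items()}

# Each (planner, reward) pair predicts an action.
explanations = {
    "rational + likes tea": rational_planner(reward_likes_tea),
    "anti-rational + hates tea": anti_rational_planner(reward_hates_tea),
}

# Keep the explanations that match the observation.
consistent = [name for name, action in explanations.items()
              if action == observed_action]
print(consistent)
```

Both explanations survive, even though they disagree completely about what the human wants. Something beyond the observed behaviour is needed to prefer one over the other.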

To get these "right assumptions", this agenda will look into what preferences actually are, and how they may be combined together. There are hence four parts to the research agenda:

1. A way of identifying the (partial^[1]) preferences of a given human $H$.
2. A way of ultimately synthesising a utility function $U_{H}$ that is an adequate encoding of the partial preferences of a human $H$.
3. Practical methods for estimating this $U_{H}$, and how one could use the definition of $U_{H}$ to improve other suggested methods for value-alignment.
4. Limitations and lacunas of the agenda: what is not covered. These may be avenues of future research, or issues that cannot fit into the $U_{H}$ paradigm.

There have been a myriad of small posts on this topic, and most will be referenced here. Most of these posts are stubs that hint at a solution, rather than spelling it out fully and rigorously.

The reason for that is to check for impossibility results ahead...

Research Agendas
