Aug 11, 2018
(Disclaimer: this summary is incomplete and does not accurately represent all the content presented at the summer school, but only what I remember and seem to have understood from the lectures. Don't hesitate to mention important ideas I missed or apparent confusion.)
Last week, I attended the first edition of the human-aligned AI summer school in Prague. After three days, my memories are already starting to fade, and I am unsure about what I will retain in the long term.
Here, I try to remember the content of about 15h of talks. It serves the following purposes:
Value Learning aims at inferring human values from human behavior. Paul Christiano distinguishes ambitious value learning vs. narrow value learning:
Inverse Reinforcement Learning (IRL) studies which reward best explains a behaviour. Two methods of IRL were discussed (the state-of-the-art builds on top of those two, for instance using neural networks):
Why not to do value learning:
The main problem of traditional IRL is that it does not take into account the deliberate interactions between a human and an AI (e.g. the human could deliberately slow down their behaviour to make it easier to learn from).
Cooperative IRL solves this issue by introducing a two-player game between the human and the AI, where both are rewarded according to the human's reward function. This incentivizes the human to teach the AI their preferences (if the human only chooses their best action, the AI would learn the wrong distribution). Using a similar dynamic, the off-switch game encourages the AI to allow itself to be switched off.
Another difficulty when implementing IRL is that the reward function is hard to specify completely, and will often not capture everything the designer wants. Inverse reward design makes the AI quantify its uncertainty about states. If the AI is risk-averse, it will avoid uncertain states, for instance situations where it believes the humans did not completely define the reward function because they did not know much about those situations.
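To make the risk-aversion idea concrete, here is a toy sketch (my own illustration, not the algorithm from the inverse reward design paper). It assumes the AI already holds a few posterior samples over the true reward weights, and compares a risk-neutral planner (maximizing expected return) with a risk-averse one (maximizing worst-case return). The trajectories, the features and the "lava" terrain are all hypothetical.

```python
import numpy as np

# Hypothetical feature counts for three candidate trajectories.
# Columns: [grass, dirt, lava]. Lava never showed up when the designer
# wrote the reward, so the AI is very uncertain about its weight.
trajectories = {
    "short_through_lava": np.array([1.0, 0.0, 2.0]),
    "long_on_dirt": np.array([0.0, 5.0, 0.0]),
    "medium_on_grass": np.array([4.0, 0.0, 0.0]),
}

# Hypothetical posterior samples over the true reward weights.
# They agree on grass and dirt but disagree wildly on lava.
posterior_weights = np.array([
    [-1.0, -0.5, 10.0],
    [-1.0, -0.5, 0.0],
    [-1.0, -0.5, -10.0],
])

def expected_return(features, weight_samples):
    """Risk-neutral score: average return over the posterior samples."""
    return float(np.mean(weight_samples @ features))

def worst_case_return(features, weight_samples):
    """Risk-averse score: lowest return over the posterior samples."""
    return float(np.min(weight_samples @ features))

def pick(trajs, weight_samples, score):
    """Return the name of the trajectory with the highest score."""
    return max(trajs, key=lambda name: score(trajs[name], weight_samples))

risk_neutral_choice = pick(trajectories, posterior_weights, expected_return)
risk_averse_choice = pick(trajectories, posterior_weights, worst_case_return)
print("risk-neutral:", risk_neutral_choice)
print("risk-averse:", risk_averse_choice)
```

With these made-up numbers, the risk-neutral planner takes the shortcut through the poorly understood terrain, while the risk-averse planner takes the longer path whose reward all the posterior samples agree on.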
Abram's first talk was about his post "Probability is Real, and Value is Complex". At the end of the talk, several people (including me) were confused about the "magic correlation" between probabilities and expected utility, and asked Abram about the meaning of his talk.
From what I understood, the point was to show a counter-intuitive consequence of choosing the Jeffrey-Bolker axioms over the Savage axioms in decision theory. Because Bayes' algorithm can be formalized using the Jeffrey-Bolker axioms, this counter-intuitive result challenges potential agent designs that would use Bayesian updates.
The second talk was more general, and addressed several problems faced by embedded agents (e.g. naturalized induction).
To make sure an AI would be able to understand humans, we need to make sure it understands their bounded rationality, i.e. how sparse information and bounded computational power limit rationality.
The first talk on the topic introduced a decision-complexity C(A|B) that expresses the "cost" of going from the reference B to the target A (proportional to the Shannon information of A given B). Intuitively, it represents the cost of the search process when going from a prior B to a posterior A. After some mathematical manipulations, a concept of "information cost" is introduced, and the final framework highlights a trade-off between some "information utility" and this "information cost" (for more details see here, pp. 14-18).
Humans seem to exhibit a strong preference for planning hierarchically, and are "irrational" in that sense, or at least not "Boltzmann-rational" (Cundy & Filan, 2018).
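One common way to write down this kind of trade-off (an assumption on my part; the talk's exact formulation may differ) is to pick the action distribution p that maximizes expected utility minus (1/beta) times the KL divergence from a prior q. The maximizer is p(a) proportional to q(a) * exp(beta * U(a)). The sketch below uses that closed form; the prior, the utilities and the values of beta are made up for illustration.

```python
import numpy as np

def bounded_rational_policy(prior, utility, beta):
    """Maximizer of E_p[U] - (1/beta) * KL(p || prior): p ∝ prior * exp(beta * U)."""
    logits = np.log(prior) + beta * utility
    logits -= logits.max()  # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def kl_divergence(p, q):
    """KL(p || q) in nats, skipping entries where p is zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical prior over four actions and their utilities.
prior = np.array([0.25, 0.25, 0.25, 0.25])
utility = np.array([1.0, 0.0, 0.5, 0.2])

for beta in (0.0, 1.0, 10.0):
    p = bounded_rational_policy(prior, utility, beta)
    print(
        f"beta={beta:5.1f}",
        "policy:", p.round(3),
        "expected utility:", round(float(p @ utility), 3),
        "information cost (KL):", round(kl_divergence(p, prior), 3),
    )
```

At beta = 0 the agent cannot afford any search and stays at the prior; as beta grows, it pays a larger information cost to concentrate on the best action.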
Hierarchical RL is a framework used in planning that introduces "options" in Markov Decision Processes where Bellman Equations still hold.
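For reference, this is the form the Bellman optimality equation takes in the standard options framework (Sutton, Precup & Singh, 1999); I am assuming this is the setting the talk referred to. Here O_s is the set of options available in state s, k is the (random) number of steps option o runs for, and the expectation is over trajectories in which o is initiated in s at time t.

```latex
V^{*}_{\mathcal{O}}(s) = \max_{o \in \mathcal{O}_s}
  \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k}
  + \gamma^{k} V^{*}_{\mathcal{O}}(s_{t+k}) \,\middle|\, o \text{ initiated in } s \text{ at time } t \right]
```

When every option lasts exactly one step (k = 1), this reduces to the usual Bellman optimality equation over primitive actions.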
Techniques aiming at minimizing negative side effects include minimizing unnecessary disruptions when achieving a goal (turning Earth into paperclips being an extreme example of such a disruption) or designing low-impact agents (avoiding large side effects in general).
To correctly measure impact, several questions must be answered:
A "side-effect measure" should penalize unnecessary actions (necessity), understand what was caused by the agent vs. caused by the environment (causation) and penalize irreversible actions (asymmetry).
Hence, an agent may be penalized for an outcome different from an "inaction baseline" (where the agent would not have done anything) or for any irreversible action.
However, those penalties introduce bad incentives: the agent may avoid an irreversible action only to let it happen anyway (for instance, preventing a vase from being broken to gain a reward, then breaking the vase anyway to return to the "inaction baseline"). Relative reachability provides an answer to this behaviour by penalizing the agent for making states less reachable than they would be by default (for instance, breaking a vase makes the states with an unbroken vase unreachable), and it leads to safe behaviors in the Sokoban-like and conveyor belt gridworlds.
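Here is a minimal sketch of the reachability idea, under strong simplifying assumptions of mine: a tiny deterministic world, and binary reachability (a state is either reachable or not) instead of a discounted measure. The four states and their transitions are hypothetical. The penalty is the fraction of states that are reachable from the baseline state but no longer reachable from the agent's current state.

```python
from collections import deque

# Hypothetical toy world: each state lists its possible successor states.
# Once the vase is broken, no "intact" state can ever be reached again.
transitions = {
    "intact_start": ["intact_start", "intact_goal", "broken_goal"],
    "intact_goal": ["intact_goal", "intact_start"],
    "broken_start": ["broken_start", "broken_goal"],
    "broken_goal": ["broken_goal", "broken_start"],
}

def reachable_from(state):
    """Set of states reachable from `state` (breadth-first search)."""
    seen = {state}
    queue = deque([state])
    while queue:
        current = queue.popleft()
        for nxt in transitions[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def relative_reachability_penalty(current, baseline):
    """Fraction of all states reachable from `baseline` but not from `current`."""
    lost = reachable_from(baseline) - reachable_from(current)
    return len(lost) / len(transitions)

baseline = "intact_start"  # where the agent would be had it done nothing
print(relative_reachability_penalty("intact_goal", baseline))  # walked around the vase
print(relative_reachability_penalty("broken_goal", baseline))  # walked through the vase
```

Reaching the goal while keeping the vase intact loses no reachable states, so it is not penalized; breaking the vase cuts off both "intact" states, so it is penalized whatever the agent does afterwards.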
Open questions about this approach are:
I thank Daniel Filan and Jaime Molina for their feedback, and apologize for the talks I did not summarize.