adamShimi's Shortform

by adamShimi · 22nd Jul 2020 · 2 comments
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I can't find a reference for how to test whether an inferred (or simply given) reward function for a system can be used to predict, reasonably well, what that system actually does.

What I can find are references about the usual IRL/preference learning setting, where there is a true reward function, known to us but unknown to the inference system; the inferred reward function is then evaluated by training a policy on it and measuring how much reward it gets (or how much regret it incurs) according to the true reward function.
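To make that evaluation loop concrete, here is a minimal sketch on a toy tabular MDP. The MDP, the reward vectors, and the helper functions are all made up for illustration; real benchmarks would use an actual environment and a reward actually produced by an IRL or preference-learning method.

```python
import numpy as np

# Toy tabular MDP; every number and name here is illustrative, not from any benchmark.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution

true_reward = rng.normal(size=n_states)                            # known to us, hidden from the learner
inferred_reward = true_reward + 0.3 * rng.normal(size=n_states)    # stand-in for an IRL output

def optimal_policy(reward, n_iter=500):
    """Value iteration; returns a greedy deterministic policy for `reward`."""
    V = np.zeros(n_states)
    for _ in range(n_iter):
        V = (reward[:, None] + gamma * P @ V).max(axis=1)
    return (reward[:, None] + gamma * P @ V).argmax(axis=1)

def policy_value(policy, reward):
    """Expected discounted return of a deterministic policy under `reward`, per start state."""
    P_pi = P[np.arange(n_states), policy]                           # transitions induced by the policy
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, reward)

# The usual evaluation: train on the inferred reward, score on the true one.
pi_inferred = optimal_policy(inferred_reward)
pi_true = optimal_policy(true_reward)
reward_obtained = policy_value(pi_inferred, true_reward).mean()
regret = policy_value(pi_true, true_reward).mean() - reward_obtained
print(f"true reward obtained: {reward_obtained:.3f}, regret: {regret:.3f}")
```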

But that's a good setting for checking if the reward is good for learning to do the task, not for checking if the reward is good for predicting what this specific system will do.

The best thing I have in mind right now is to pick a bunch of different initial conditions, train policies on the inferred reward from each of these conditions, mix all the resulting policies together to get a distribution over actions at each state, and compare that with what the system actually does. It seems decent enough, but I would really like to know whether someone has done something similar in the literature.
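Here is roughly what that test could look like, again on a toy tabular MDP. The Q-learning runs stand in for "training from different initial conditions", and the observed behaviour of the system is just a made-up distribution; all names and numbers are placeholders.

```python
import numpy as np

# Toy setup; everything here is a placeholder for illustration.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a] = next-state distribution
inferred_reward = rng.normal(size=n_states)                        # stand-in for an inferred reward

def train_from_seed(seed, steps=2000, eps=0.2, lr=0.1):
    """Tabular Q-learning on the inferred reward, starting from one random initial condition."""
    r = np.random.default_rng(seed)
    Q = r.normal(size=(n_states, n_actions))                       # the "initial condition"
    s = int(r.integers(n_states))
    for _ in range(steps):
        a = int(r.integers(n_actions)) if r.random() < eps else int(Q[s].argmax())
        s_next = int(r.choice(n_states, p=P[s, a]))
        Q[s, a] += lr * (inferred_reward[s_next] + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q.argmax(axis=1)                                        # greedy policy for this run

# Mix the policies from many initial conditions into an action distribution per state.
policies = np.array([train_from_seed(seed) for seed in range(50)])          # (runs, states)
predicted = np.stack([np.bincount(policies[:, s], minlength=n_actions)
                      for s in range(n_states)]) / len(policies)            # (states, actions)

# Observed action frequencies of the actual system (made up here).
observed = rng.dirichlet(np.ones(n_actions), size=n_states)

# Compare prediction and observation, e.g. with total variation distance per state.
tv = 0.5 * np.abs(predicted - observed).sum(axis=1)
print("per-state total variation distance:", np.round(tv, 3))
```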

(Agents and Devices points in the right direction, but it's focused on predicting whether the agent mixture or the device mixture is more probable in the posterior, which is a different problem.)

A month after writing my post on Focus as a crucial component of goal-directedness, I think I see its real point more clearly. You can decompose the proposal in my post into two main ideas:

  • How much a system S is trying to accomplish a goal G can be captured by the distance of S to the set of policies maximally goal-directed towards G.
  • The set of policies maximally directed towards G is the set of policies trained by RL (for every amount of resource above a threshold) on the reward corresponding to G.

The first idea is what focus is really about. The second doesn't work as well, and multiple people pointed out issues with it. But I still find powerful the idea that focus on a goal measures the similarity of a system to a set of policies that only try to accomplish that goal.
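As a toy illustration of that first idea (with the maximally goal-directed set, the state distribution, and the policy distance all being placeholders rather than the definitions from the post), focus could be computed as one minus the distance from the system's policy to the closest policy in that set:

```python
import numpy as np

# Placeholder policies: pi[s] is a distribution over actions in state s.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
system_policy = rng.dirichlet(np.ones(n_actions), size=n_states)        # the system S
goal_directed_set = [rng.dirichlet(np.ones(n_actions), size=n_states)   # stand-in for the set of
                     for _ in range(10)]                                 # policies maximally directed at G

def policy_distance(pi1, pi2):
    """Mean total variation distance between two policies, under a uniform state distribution."""
    return 0.5 * np.abs(pi1 - pi2).sum(axis=1).mean()

# Focus of S towards G: one minus the distance to the closest maximally goal-directed policy.
focus = 1 - min(policy_distance(system_policy, pi) for pi in goal_directed_set)
print(f"focus towards G: {focus:.3f}")
```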

Now the big question left is: can we define the set of policies maximally goal-directed towards G in a clean way that captures our intuitions?