I'm planning to start two research projects on model splintering/reward generalisation and learning the preferences of irrational agents.
Within those projects, I'm aiming to work on subprojects that are:
The point is not just to solve the sub-problems, but to solve them in ways that generalise or point to a general solution.
The aim is to iterate on and improve these ideas quickly before implementing them. Because of that, these posts should be considered dynamic and likely to be re-edited, potentially often. Suggestions and modifications to the design are valuable and may be incorporated into the top post.
Parent project: this is a subproject of model-splintering.
Imagine an agent acting in a 3D world - something very similar to DeepMind's XLand (the images here have been taken from that paper).
The agent has a laser that it can use to pick up and move nearby objects:
Initially, the agent is completely fixed in position: it can move the laser across its field of vision, but it can't move around or change its field of vision. It is trained in that situation, and is rewarded for moving black cubes to the bottom right of its field of view (where another agent will pick them up). These black cubes are dropped irregularly in front of it. In fact, it is part of a chain gang of agents moving the black cubes across the map.
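To make the training setup concrete, here is a minimal sketch of what the initial reward signal could look like. All names and the observation format are my own illustrative assumptions, not taken from XLand or the paper:

```python
# Hypothetical sketch of the initial reward signal: the fixed agent is
# rewarded whenever a black cube ends up in the bottom-right region of
# its field of view. Observation format is an assumption for illustration.

def initial_reward(observation):
    """observation: list of (object_type, x, y) tuples in normalised
    field-of-view coordinates, with (1.0, 1.0) the bottom-right corner."""
    reward = 0.0
    for obj_type, x, y in observation:
        if obj_type == "black_cube" and x > 0.8 and y > 0.8:
            reward += 1.0  # cube delivered to the pickup zone
    return reward
```

The key feature for what follows is that this reward is defined purely in terms of the agent's field of view, which is unambiguous only while the agent is immobile.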
Then the agent is given full mobility, so it can walk around and explore its 3D world:
The agent will continue to learn in the full 3D situation (similarly to the agents in DeepMind's paper, which learn through play), but it won't do any further learning about its reward function.
There are two obvious extensions of its initial reward function:
Challenge 1 is a traditional ontology change or, in ML terms, transfer learning. Seeing how 2. plays out is the key aim of this sub-project: can an agent generate useful reward extensions as well as wireheaded ones? 3. depends mainly on what comes out of 2., and asks whether it's possible to explicitly guard against wireheading (the idea is to identify what wireheading looks like, and explicitly seek to avoid it). Meanwhile, 4. is an analysis of model splintering that prepares for further subprojects.
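One way of making the wireheading contrast concrete (names and coordinate conventions are illustrative assumptions): a view-relative extension of the original reward can be satisfied by the mobile agent manipulating what it looks at, while a world-state extension cannot.

```python
def view_relative_reward(view_objects):
    """Extends the original reward literally: cubes in the bottom-right of
    the *current* field of view. A mobile agent can 'wirehead' this by
    simply turning so that a pile of cubes fills that corner of its view."""
    return sum(1.0 for t, x, y in view_objects
               if t == "black_cube" and x > 0.8 and y > 0.8)

def world_state_reward(world_objects, pickup_zone):
    """Extends the reward via the world model: cubes actually located in a
    fixed pickup zone on the map, independent of where the agent looks."""
    x0, y0, x1, y1 = pickup_zone
    return sum(1.0 for t, x, y in world_objects
               if t == "black_cube" and x0 <= x <= x1 and y0 <= y <= y1)
```

Both functions agree with the original reward in the immobile training distribution; they splinter apart only once the agent can move.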
This seems very related to Inverse Reward Design. There was at least one project at CHAI trying to scale up IRD, but it proved challenging to get working -- if you're thinking of similar approaches it might be worth pinging Dylan about it.
My sense is that Stuart's assumption of an initially specified reward function is a simplification, not a key part of the plan, and that he'd also be interested in e.g. generalizing a reward function learned from other sources of human feedback, like preference comparison.
IRD would do well on this problem because it maintains an explicit distribution over possible reward functions, but this isn't really unique to IRD -- Bayesian IRL or preference comparison would have the same property.
Yeah, I agree with that.
(I don't think we have experience with deep Bayesian versions of IRL / preference comparison at CHAI, and I was thinking about advice on who to talk to)