I googled "model-based RL Atari" and the first hit was this which likewise tries to learn the reward function by supervised learning from observations of past rewards (if I understand correctly)
Ah, the "model-based using a model-free RL algorithm" approach :) They learn a world model using supervised learning, and then use PPO (a model-free RL algorithm) to train a policy in it. It sounds odd but it makes sense: you hopefully get much of the sample efficiency of model-based training, while still retaining the state-of-the-art results of model-free RL. You're right that in this setup, as the actions are being chosen by the (model-free RL) policy, you don't get any zero-shot generalization.
I added a new sub-bullet at the top to clarify that it's hard to explain by RL unless you assume the planner can query the ground-truth reward function in arbitrary hypothetical states. And then I also added a new paragraph to the "other possible explanations" section at the bottom saying what I said in the paragraph just above. Thank you.
Thanks for updating the post to clarify this point -- I agree with you with the new wording.
In ML today, the reward function is typically a function of states and actions, not "thoughts". In a brain, the reward can depend directly on what you're imagining doing or planning to do, or even just what you're thinking about. That's my proposal here.
Yes indeed, your proposal is quite different from RL. The closest I can think of to rewards over "thoughts" in ML would be regularization terms that take into account weights or, occasionally, activations -- but that's very crude compared to what you're proposing.
Thanks for the clarification! I agree if the planner does not have access to the reward function then it will not be able to solve it. Though, as you say, it could explore more given the uncertainty.Most model-based RL algorithms I've seen assume they can evaluate the reward functions in arbitrary states. Moreover, it seems to me like this is the key thing that lets rats solve the problem. I don't see how you solve this problem in general in a sample-efficient manner otherwise.
One class of model-based RL approaches is based on [model-predictive control](https://en.wikipedia.org/wiki/Model_predictive_control): sample random actions, "rollout" the trajectories in the model, pick the trajectory that had the highest return and then take the first action from that trajectory, then replan. That said, assumptions vary. [iLQR](https://en.wikipedia.org/wiki/Linear%E2%80%93quadratic_regulator) makes the stronger assumption that reward is quadratic and differentiable.
I think methods based on [Monte Carlo tree search](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) might exhibit something like the problem you discuss. Since they sample actions from a policy trained to maximize reward, they might end up not exploring enough in this novel state if the policy is very confident it should not drink the salt water. That said, they typically include explicit methods for exploration like [UCB](https://en.wikipedia.org/wiki/Thompson_sampling#Upper-Confidence-Bound_(UCB)_algorithms) which should mitigate this.
I'm a bit confused by the intro saying that RL can't do this, especially since you later on say the neocortex is doing model-based RL. I think current model-based RL algorithms would likely do fine on a toy version of this task, with e.g. a 2D binary state space (salt deprived or not; salt water or not) and two actions (press lever or no-op). The idea would be:
- Agent explores by pressing lever, learns transition dynamics that pressing lever => spray of salt water.
- Planner concludes that any sequence of actions involving pressing lever will result in salt water spray. In a non salt-deprived state this has negative reward, so the agent avoids it.
- Once the agent becomes salt deprived, the planner will conclude this has positive reward, and so take that action.
I do agree that a typical model-free RL algorithm is not capable of doing this directly (it could perhaps meta-learn a policy with memory that can solve this).
Thanks for the post, this is my favourite formalisation of optimisation so far!
One concern I haven't seen raised so far, is that the definition seems very sensitive to the choice of configuration space. As an extreme example, for any given system, I can always augment the configuration space with an arbitrary number of dummy dimensions, and choose the dynamics such that these dummy dimensions always get set to all zero after each time step. Now, I can make the basin of attraction arbitrarily large, while the target configuration set remains a fixed size. This can then make any such dynamical system seem to be an arbitrarily powerful optimiser.
This could perhaps be solved by demanding the configuration space be selected according to Occam's razor, but I think the outcome still ends up being prior dependent. It'd be nice for two observers who model optimising systems in a systematically different way to always agree within some constant factor, akin to Kolmogorov complexity's invariance theorem, although this may well be impossible.
As a less facetious example, consider a computer program that repeatedly sets a variable to 0. It seems again we can make the optimising power arbitrarily large by making the variable's size arbitrarily large. But this doesn't quite map onto the intuitive notion of the "difficulty" of an optimisation problem. Perhaps including some notion of how many other optimising systems would have the same target set would resolve this.
I feel like there are three facets to "norms" v.s. values, which are bundled together in this post but which could in principle be decoupled. The first is representing what not to do versus what to do. This is reminiscent of the distinction between positive and negative rights, and indeed most societal norms (e.g. human rights) are negative, but not all (e.g. helping an injured person in the street is a positive right). If the goal is to prevent catastrophe, learning the 'negative' rights is probably more important, but it seems to me that most techniques developed could learn both kinds of norms.
Second, there is the aspect of norms being an incomplete representation of behaviour: they impose some constraints, but there is not a single "norm-optimal" policy (contrast with explicit reward maximization). This seems like the most salient thing from an AI standpoint, and as you point out this is an underexplored area.
Finally, there is the issue of norms being properties of groups of agents. One perspective on this is that humans are realising their values through constructing norms: e.g. if I want to drive safely, it is good to have a norm to drive on the left or right side of the road, even though I may not care which norm we establish. Learning norms directly therefore seems beneficial to neatly integrate into human society (it would be awkward if e.g. robots drive on the left and humans drive on the right). If we think the process of going from values to norms is both difficult and important for multi-agent cooperation, learning norms also lets us sidestep a potentially thorny problem.
Thanks for the informative post as usual.
Full-disclosure: I'm a researcher at UC Berkeley financially supported by CHAI, one of the organisations reviewed in this post. However, this comment is just my personal opinion.
Re: location, I certainly agree that an organization does not need to be in the Bay Area to do great work, but I do think location is important. In particular, there's a significant advantage to working in or near a major AI hub. The Bay Area is one such place (Berkeley, Stanford, Google Brain, OpenAI, FAIR) but not the only one; e.g. London (DeepMind, UCL) and Montreal (MILA, Brain, et al) are also very strong.
I also want to push back a bit on the assumption that people working for AI alignment organisations will be involved with EA and rationalist communities. While it may be true in many cases, at CHAI I think it's only around 50% of staff. So whether these communities are thriving or not in a particular area doesn't seem that relevant to me for organisational location decisions.
Description of CHAI is pretty accurate. I think it's a particularly good opportunity for people who are considering grad school as a long-term option: we're in an excellent position to help people get into top programs, and you'll also get a sense of what academic research culture is like.
We'd like to hire more than one engineer, and are currently trialling several hires. We have a mixture of work, some of which is more ML oriented and some of which is more infrastructure oriented. So we'd be willing to consider applicants with limited ML experience, but they'd need to have strengths in other areas to compensate.
If anyone is considering any of these roles and is uncertain whether they're a good fit, I'd encourage you to just apply. It doesn't take much time for you to apply or for the organisation to do an initial screening. I've spoken to several people who didn't think they were viable candidates for a particular role, and then turned out to be one of the best applicants we'd received.