PhD student at MIT (ProbComp / CoCoSci), working on probabilistic programming for agent understanding and value alignment.
Yup! And yeah I think those are open research questions -- inference over certain kinds of non-parametric Bayesian models is tractable, but not in general. What makes me optimistic is that humans in similar cultures have similar priors over vast spaces of goals, and seem to do inference over that vast space in a fairly tractable manner. I think things get harder when you can't assume shared priors over goal structure or task structure, both for humans and machines.
Belatedly reading this and have a lot of thoughts about the connection between this issue and robustness to ontological shifts (which I've written a bit about here), but I wanted to share a paper which takes a very small step in addressing some of these questions by detecting when the human's world model may diverge from a robot's world model, and using that as an explanation for why a human might seem to be acting in strange or counter-productive ways:
Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior
Siddharth Reddy, Anca D. Dragan, Sergey Levine
https://arxiv.org/abs/1805.08010
Inferring intent from observed behavior has been studied extensively within the frameworks of Bayesian inverse planning and inverse reinforcement learning. These methods infer a goal or reward function that best explains the actions of the observed agent, typically a human demonstrator. Another agent can use this inferred intent to predict, imitate, or assist the human user. However, a central assumption in inverse reinforcement learning is that the demonstrator is close to optimal. While models of suboptimal behavior exist, they typically assume that suboptimal actions are the result of some type of random noise or a known cognitive bias, like temporal inconsistency. In this paper, we take an alternative approach, and model suboptimal behavior as the result of internal model misspecification: the reason that user actions might deviate from near-optimal actions is that the user has an incorrect set of beliefs about the rules -- the dynamics -- governing how actions affect the environment. Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user's internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent. We demonstrate in simulation and in a user study with 12 participants that this approach enables us to more accurately model human intent, and can be used in a variety of applications, including offering assistance in a shared autonomy framework and inferring human preferences.
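To make the core idea of that abstract concrete, here is a toy Python sketch. It is not the paper's method: the 1-D world, the two candidate internal models ("correct" and "inverted" controls), the Boltzmann action model, and every name below are illustrative assumptions. It first infers which dynamics the user believes in from a demonstration with a known goal, then reuses that belief to infer the goal behind a new demonstration.

```python
import math

ACTIONS = ("L", "R")
# Hypothetical internal models: how the user BELIEVES each action moves them.
# In the real world, "L" moves -1 and "R" moves +1 (the "correct" model).
MODELS = {
    "correct": {"L": -1, "R": +1},
    "inverted": {"L": +1, "R": -1},
}

def action_prob(state, action, goal, model, beta=2.0):
    """Boltzmann-rational choice under the user's believed dynamics."""
    def score(a):
        predicted = state + MODELS[model][a]
        return math.exp(-beta * abs(predicted - goal))
    return score(action) / sum(score(a) for a in ACTIONS)

def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}

def infer_model(demos, goal):
    """Posterior over internal models, given (state, action) pairs and a known goal."""
    post = {m: 1.0 / len(MODELS) for m in MODELS}  # uniform prior
    for state, action in demos:
        for m in post:
            post[m] *= action_prob(state, action, goal, m)
    return normalize(post)

def infer_goal(demos, goals, model_belief):
    """Posterior over goals (uniform prior), marginalizing over internal models."""
    post = {}
    for g in goals:
        likelihood = 0.0
        for m, p_m in model_belief.items():
            traj = p_m
            for state, action in demos:
                traj *= action_prob(state, action, g, m)
            likelihood += traj
        post[g] = likelihood
    return normalize(post)

# Calibration: the goal is known to be 0, yet the user keeps pressing "R"
# and (under the true dynamics) drifts away from it.
calibration = [(2, "R"), (3, "R"), (4, "R")]
model_belief = infer_model(calibration, goal=0)

# New task with an unknown goal: the user presses "L" twice.
new_demo = [(2, "L"), (1, "L")]
goal_belief = infer_goal(new_demo, goals=(0, 4), model_belief=model_belief)

# An observer who assumes the user's beliefs are correct reads the same actions differently.
naive_belief = infer_goal(new_demo, goals=(0, 4),
                          model_belief={"correct": 1.0, "inverted": 0.0})
```

In this toy setup the calibration data strongly favors the "inverted" model, so pressing "L" is read as heading for goal 4, whereas the naive observer concludes the user wants goal 0. Note that without the known-goal calibration step, the goal and the internal model would not be separately identifiable from these actions alone.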
Belatedly seeing this post, but I wanted to note that probabilistic programming languages (PPLs) are centered around this basic idea! Some useful links and introductions to PPLs as a whole:
- Probabilistic models of cognition (web book)
- WebPPL
- An introduction to models in Pyro
- Introduction to Modeling in Gen
And here's a really fascinating paper by some of my colleagues that tries to model causal interventions that go beyond Pearl's do-operator, by formalizing causal interventions as (probabilistic) program transformations:
Bayesian causal inference via probabilistic program synthesis
Sam Witty, Alexander Lew, David Jensen, Vikash Mansinghka
https://arxiv.org/abs/1910.14124
Causal inference can be formalized as Bayesian inference that combines a prior distribution over causal models and likelihoods that account for both observations and interventions. We show that it is possible to implement this approach using a sufficiently expressive probabilistic programming language. Priors are represented using probabilistic programs that generate source code in a domain specific language. Interventions are represented using probabilistic programs that edit this source code to modify the original generative process. This approach makes it straightforward to incorporate data from atomic interventions, as well as shift interventions, variance-scaling interventions, and other interventions that modify causal structure. This approach also enables the use of general-purpose inference machinery for probabilistic programs to infer probable causal structures and parameters from data. This abstract describes a prototype of this approach in the Gen probabilistic programming language.
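As a rough illustration of "interventions as program transformations", here is a small Python sketch. It is not the Gen prototype from the paper: the dictionary-based "DSL" of linear-Gaussian equations and the function names `atomic`, `shift`, `scale_variance`, and `sample` are all hypothetical, and the Bayesian inference over causal structures is left out entirely. The point is only that each intervention is a function that edits the model's description and returns a new model.

```python
import copy
import math
import random

# A toy "DSL": each variable is a linear-Gaussian structural equation,
#   value = bias + sum(coef * parent_value) + Normal(0, noise_std)
model = {
    "x": {"parents": {}, "bias": 0.0, "noise_std": 1.0},
    "y": {"parents": {"x": 2.0}, "bias": 1.0, "noise_std": 0.5},
}

def atomic(m, var, value):
    """Atomic (do-style) intervention: cut all parents and fix the value."""
    m = copy.deepcopy(m)
    m[var] = {"parents": {}, "bias": value, "noise_std": 0.0}
    return m

def shift(m, var, delta):
    """Shift intervention: add a constant, keeping the causal parents."""
    m = copy.deepcopy(m)
    m[var]["bias"] += delta
    return m

def scale_variance(m, var, factor):
    """Variance-scaling intervention: multiply the noise variance by `factor`."""
    m = copy.deepcopy(m)
    m[var]["noise_std"] *= math.sqrt(factor)
    return m

def sample(m, rng):
    """Run the generative process (assumes the dict is in topological order)."""
    values = {}
    for var, eq in m.items():
        mean = eq["bias"] + sum(c * values[p] for p, c in eq["parents"].items())
        values[var] = rng.gauss(mean, eq["noise_std"])
    return values

rng = random.Random(0)
do_x = atomic(model, "x", 3.0)               # x is pinned to 3.0
shifted_y = shift(model, "y", 1.0)           # y's bias becomes 2.0
noisy_y = scale_variance(model, "y", 4.0)    # y's noise std doubles
draw = sample(do_x, rng)
```

Because each transformation returns an edited copy, the original `model` is untouched and interventions compose by ordinary function application, e.g. `shift(atomic(model, "x", 3.0), "y", 1.0)`.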
Replying to the specific comments:
This still seems like a fair way to evaluate what the alignment community thinks about, but I think it is going to overestimate how parochial the community is. For example, if you go by "what does Stuart Russell think is important", I expect you get a very different view on the field, much of which won't be in the Alignment Newsletter.
I agree. I intended to gesture a little bit at this when I mentioned that "Until more recently, It’s also been excluded and not taken very seriously within traditional academia", because I think one source of greater diversity has been the uptake of AI alignment in traditional academia, leading to slightly more inter-disciplinary work, as well as a greater diversity of AI approaches. I happen to think that CHAI's research publications page reflects more of the diversity of approaches I would like to see, and wish that more new researchers were aware of them (as opposed to the advice currently given by, e.g., 80K, which is to skill up in deep learning and deep RL).
Reward functions are typically allowed to depend on actions, and the alignment community is particularly likely to use reward functions on entire trajectories, which can express arbitrary views (though I agree that many views are not "naturally" expressed in this framework).
Yup, I think purely at the level of expressivity, reward functions on a sufficiently extended state space can express basically anything you want. That still doesn't resolve several worries I have though:
(I'd cite deep learning generally, not just deep RL.)
You're right, that's what I meant, and have updated the post accordingly.
If you start with an uninformative prior and no other evidence, it seems like you should be focusing a lot of attention on the paradigm that is most successful / popular. So why is this influence "undue"?
I agree that if you start with a very uninformative prior, focusing on the most recently successful paradigm makes sense. But once you take into account slightly more information, I think there's reason to believe the AI alignment community is currently overly biased towards deep learning:
My own view is that the success of deep learning should be taken in perspective. It's good for certain things, and certain high-data training regimes, and will remain good for those use cases. But in a lot of other use cases, where we might care a lot about sample efficiency and rapid + robust generalizability, most of the recent progress has, in my view, been made by cleverly integrating symbolic approaches with neural networks (even AlphaGo can be seen as a version of this, if one views MCTS as symbolic). I expect future AI advances to occur in a similar vein, and for me that lowers the relevance of ensuring that end-to-end DL approaches are safe and robust.
Thanks for this summary. Just a few things I would change:
Thanks for these thoughts! I'll respond to your disagreement with the framework here, and to the specific comments in a separate reply.
First, with respect to my view about the sources of AI risk, the characterization you've put forth isn't quite accurate (though it's a fair guess, since I wasn't very explicit about it). In particular:
Second, with respect to your characterization of AI development and AI risk, I believe that points 1 and 2 above suggest that our views don't actually diverge that much. My worry is that the difficulty of building machines that "follow common sense" is on the same order of magnitude as "defining the good", and just as beset by the meta-ethical and meta-normative worries I've raised above. After all, "common sense" is going to include "common social sense" and "common moral sense", and this kind of knowledge is irreducibly normative. (In fact, I think there's good reason to think that all knowledge and inquiry is irreducibly normative, but that's a stronger and more contentious claim.)
Furthermore, given that AI is already deployed in social domains which tend to have open scope (personal assistants, collaborative and caretaking robots, legal AI, etc.), I think it's a non-trivial possibility that we'll end up with powerful misaligned AI systems applied to those contexts, systems that either violate their intended scope or require wide scope to function well (e.g., personal assistants). No doubt, "follow common sense" is a lower bar than "solve moral philosophy", but on the view that philosophy is just common sense applied to itself, solving "common sense" is already most of the problem. For that reason, I think it deserves a plurality of disciplinary* and philosophical perspectives as well.
(*On this note, I think cognitive science has a lot to offer with regard to understanding "common sense". Perhaps I am overly partial given that I am in a computational cognitive science lab, but it does feel like there's insufficient awareness or discussion of cognitive scientific research within AI alignment spaces, despite its [IMO clear-cut] relevance.)
In exchange for the mess, we get a lot closer to the structure of what humans think when they imagine the goal of "doing good." Humans strive towards such abstract goals by having a vague notion of what it would look and feel like, and by breaking down those goals into more concrete sub-tasks. This encodes a pattern of preferences over universe-histories that treats some temporally extended patterns as "states."
Thank you for writing this post! I've had very similar thoughts for the past year or so, and I think the quote above is exactly right. IMO, part of the alignment problem involves representational alignment -- i.e., ensuring that AI systems accurately model both the abstract concepts we use to understand the world, as well as the abstract tasks, goals, and "reasons for acting" that humans take as instrumental or final ends. Perhaps you're already familiar with Bratman's work on Intentions, Plans, & Practical Reason, but to the extent that "intentions" feature heavily in human mental life as the reasons we cite for why we do things, developing AI models of human intention feels very important.
As it happens, one of the next research projects I'll be embarking on is modeling humans as hierarchical planners (most likely in the vein of Hierarchical Task & Motion Planning in the Now by Kaelbling & Lozano-Perez) in order to do Bayesian inference over their goals and sub-goals -- would be happy to chat more about it if you'd like!
Thanks for writing up this post! It's really similar in spirit to some research I've been working on with others, which you can find on arXiv here: https://arxiv.org/abs/2006.07532. We also model bounded goal-directed agents by assuming that the agent is running some algorithm given bounded compute, but our approach differs in the following ways:
The abstract for the paper is below -- hope it's useful to read, and I'd be curious to hear your thoughts:
Online Bayesian Goal Inference for Boundedly-Rational Planning Agents
People routinely infer the goals of others by observing their actions over time. Remarkably, we can do so even when those actions lead to failure, enabling us to assist others when we detect that they might not achieve their goals. How might we endow machines with similar capabilities? Here we present an architecture capable of inferring an agent's goals online from both optimal and non-optimal sequences of actions. Our architecture models agents as boundedly-rational planners that interleave search with execution by replanning, thereby accounting for sub-optimal behavior. These models are specified as probabilistic programs, allowing us to represent and perform efficient Bayesian inference over an agent's goals and internal planning processes. To perform such inference, we develop Sequential Inverse Plan Search (SIPS), a sequential Monte Carlo algorithm that exploits the online replanning assumption of these models, limiting computation by incrementally extending inferred plans as new actions are observed. We present experiments showing that this modeling and inference architecture outperforms Bayesian inverse reinforcement learning baselines, accurately inferring goals from both optimal and non-optimal trajectories involving failure and back-tracking, while generalizing across domains with compositional structure and sparse rewards.
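For readers who want a feel for online Bayesian goal inference, here is a deliberately simplified Python sketch. It is not SIPS: it swaps the paper's replanning agent model and sequential Monte Carlo for a Boltzmann-noisy action model and exact enumeration over two goals, and the grid world, goal names, and `beta` value are all illustrative assumptions. What it does share with the abstract is the online structure: the posterior over goals is updated incrementally as each new action is observed, and it survives a backtracking step.

```python
import math

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(pos, action):
    dx, dy = MOVES[action]
    return (pos[0] + dx, pos[1] + dy)

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def action_likelihood(pos, action, goal_pos, beta=1.5):
    """Noisy-rational agent: prefers moves that end up closer to its goal."""
    scores = {a: math.exp(-beta * manhattan(step(pos, a), goal_pos)) for a in MOVES}
    return scores[action] / sum(scores.values())

def online_goal_inference(start, actions, goals):
    """Return the goal posterior after each observed action (uniform prior)."""
    posterior = {name: 1.0 / len(goals) for name in goals}
    pos = start
    history = []
    for action in actions:
        # Bayes update with the likelihood of the newly observed action only.
        posterior = {name: p * action_likelihood(pos, action, goals[name])
                     for name, p in posterior.items()}
        total = sum(posterior.values())
        posterior = {name: p / total for name, p in posterior.items()}
        history.append(dict(posterior))
        pos = step(pos, action)
    return history

goals = {"A": (4, 0), "B": (0, 4)}  # hypothetical goal locations
observed = ["right", "right", "left", "right", "right"]  # includes one backtrack
history = online_goal_inference((0, 0), observed, goals)
```

In this run the posterior on goal "A" dips after the backtracking "left" step and then recovers as the agent resumes moving right. A replanning model like the one in the paper would explain such detours through the agent's bounded planning process rather than through independent action noise.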