AI Alignment Podcast: Inverse Reinforcement Learning and the State of AI Alignment with Rohin Shah (Lucas Perry and Rohin Shah): Lucas interviewed me and we talked about a bunch of different topics. Some quick highlights, without the supporting arguments:
- If we want to use inverse reinforcement learning (IRL) to infer a utility function that we then optimize, we would have to account for systematic biases, and this is hard, and subject to an impossibility result.
- Humans do seem to be good at inferring goals of other humans, probably because we model them as planning in a way similar to how we ourselves plan. It's reasonable to think that IRL could replicate this. However, humans have very different ideas on how the future should go, so this does not seem to be enough to get a utility function that can then be optimized over the long term.
- Another issue with having a utility function that is optimized over the long term is that it would have to somehow solve a whole lot of very difficult problems like the nature of identity and population ethics and metaphilosophy.
- Since human preferences seem to change as the environment changes, we could try to build AI systems whose goals are constantly changing by continuously running IRL. This sort of approach is promising but we don't know how to get it working yet.
- IRL, agency and optimization all seem to require a notion of counterfactuals.
- One view of agency is that it is about how a search process thinks of itself, or about other search processes. This gives it a feeling of "choice", even though the output of the search process is determined by physics. This can explain the debates over whether evolution is an optimization process -- on the one hand, it can be viewed as a search process, but on the other, we understand it well enough to think of it as a "deterministic" procedure.
- One way to view the AI alignment problem is to view it as a human-AI interaction problem, so that we get an AI that evolves over time along with us.
- Rather than building a function maximizer, we could aim to build an AI system that is corrigible, or one that follows norms.
- Both iterated amplification and debate operate on an exponential deliberation tree, though in different ways, using reasoning learned from humans. If a human would have some desirable property (such as good epistemics), so too should their amplification.
- Both iterated amplification and debate are based on explicit human reasoning, as opposed to intuitive reasoning.
- Value drift in the literal sense can be both positive and negative -- I certainly expect and want my stated preferences to change as I become more knowledgeable in the future.
- We only want the combined human-AI system to have a goal, which allows for a space of possibilities where the AI is not optimizing a goal.
- One of the problems that seems most troubling is the issue of inner optimizers, which will hopefully be described in a sequence soon.
Reinterpreting “AI and Compute” (Ben Garfinkel): Data from OpenAI showed that the amount of compute used by the most expensive projects had been growing exponentially with a doubling time of three months. While it is easy to interpret this trend as suggesting that we will get AGI sooner than expected, it is also possible to interpret this trend as evidence in the opposite direction. A surprisingly high rate of increase in amount of compute used suggests that we have been overestimating how helpful more compute is. Since this trend can't be sustainable over decades, we should expect that progress will slow down, and so this data is evidence against near-term AGI.
Rohin's opinion: The surprising part of the data is how fast compute has been growing. A common crux in AGI timelines is whether you think compute or algorithms are the bottleneck. Assuming you had a good sense of progress in AI, but were surprised by how fast compute grew, you should update against the relative benefits of compute.
This post seems to be about the way you relate compute to AGI timelines, ignoring algorithms altogether. If you think of AGI as requiring a specific amount of compute that is determined independently of current AI progress (for example, by estimating the compute used by a human brain), then the evidence should shorten your timelines. If you instead predict how close AGI is by looking at the rate of progress in AI and extrapolating over time, then to first order this data should not affect timelines (since compute is not part of the model), and to second order it should lengthen them for the reasons in this post.
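As a rough illustration of why the trend in the summary cannot be sustained for decades, the sketch below compounds the three-month doubling time quoted there. The function name `growth_factor` is hypothetical, and the calculation assumes the doubling time stays constant, which is exactly the assumption the post calls into question.

```python
def growth_factor(months: float, doubling_time_months: float = 3.0) -> float:
    """Multiplicative increase in compute after `months`, assuming a
    constant doubling time (an idealization, not a forecast)."""
    return 2.0 ** (months / doubling_time_months)


if __name__ == "__main__":
    for years in (1, 5, 10):
        factor = growth_factor(12 * years)
        print(f"{years:>2} years -> compute multiplied by {factor:.3g}")
```

Under this assumption compute grows 16x per year and by a factor of 2^40 (roughly a trillion) over a decade, which would require a comparable increase in spending or hardware efficiency.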
Read more: AI and Compute
Three AI Safety Related Ideas and Two Neglected Problems in Human-AI Safety (Wei Dai): Last week, I said that the problem of defending against intentional value corruption was an instance of the problem "Competing aligned superintelligent AI systems could do bad things", and I wasn't sure why we were focusing on value corruption in particular. In this comment, Wei Dai argues that superintelligent AI systems could be really good at cooperating with each other, which solves most of the problems. However, the terms of such cooperation will probably reflect the balance of power between the AI systems, which may tend to benefit simpler value systems rather than ones with a proper amount of value complexity and moral uncertainty. This seems plausible to me, though I'm not confident one way or the other.
Technical AI alignment
Technical agendas and prioritization
AI Alignment Podcast: Inverse Reinforcement Learning and the State of AI Alignment with Rohin Shah (Lucas Perry and Rohin Shah): Summarized in the highlights!
Anthropic paradoxes transposed into Anthropic Decision Theory (Stuart Armstrong)
Anthropic probabilities and cost functions (Stuart Armstrong)
Learning human intent
Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow (Xue Bin Peng et al): Adversarial learning techniques require a delicate balance between the generator and the discriminator. If the discriminator is too weak, it cannot tell the difference between generated samples and true samples, and it cannot provide a learning signal for the generator. If the discriminator is too strong, small changes to the generator are not going to fool the discriminator, and so again the gradient is uninformative. This paper proposes to control the power of the discriminator using an information bottleneck.
Instead of providing data points directly to the discriminator, the data points are first encoded into a new representation, and the discriminator must work with the new representation. The representation is learned to be helpful for the discriminator under the constraint of an upper bound on the mutual information between the representation and the original data points. The choice of upper bound determines how much information the discriminator is allowed to access, which in turn determines how powerful the discriminator is.
They apply this idea to imitation learning (GAIL), inverse reinforcement learning (AIRL), and image generation (GANs), and find that it improves results.
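The bottleneck described above can be sketched as a penalty on the discriminator's loss. This is a minimal illustration, not the paper's implementation: it assumes the encoder outputs a diagonal Gaussian over the representation, uses the KL divergence to a standard normal prior as the upper bound on mutual information, and adapts the penalty weight by dual ascent. All function and parameter names (`kl_to_standard_normal`, `update_beta`, `bottleneck_loss`, `info_limit`) are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one encoded sample."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def update_beta(beta, mean_kl, info_limit, step_size=1e-2):
    """Dual ascent on the penalty weight: beta rises while the average KL
    exceeds the information limit, falls otherwise, and never goes below 0."""
    return max(0.0, beta + step_size * (mean_kl - info_limit))


def bottleneck_loss(disc_loss, mus, log_vars, beta, info_limit):
    """Discriminator loss plus the weighted violation of the information limit.

    `disc_loss` is whatever classification loss the discriminator already
    uses, computed on the encoded representation (assumed given here).
    Returns the penalized loss and the batch-average KL.
    """
    kls = [kl_to_standard_normal(m, lv) for m, lv in zip(mus, log_vars)]
    mean_kl = sum(kls) / len(kls)
    return disc_loss + beta * (mean_kl - info_limit), mean_kl
```

A smaller `info_limit` lets less information through to the discriminator and so weakens it; a larger one approaches an unconstrained discriminator.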
Reinterpreting “AI and Compute” (Ben Garfinkel): Summarized in the highlights!
Reasons compute may not drive AI capabilities growth (Kythe): A common narrative (for example, at OpenAI) is that AI progress will be driven by improvements in compute, but there are a few reasons we may not expect this to be the case. First, there are many known techniques for faster training that require only some engineering effort, yet researchers often don't use them. Second, researchers still use grid searches to optimize hyperparameters rather than more efficient methods. These two points suggest that researchers spend compute in order to avoid engineering effort, and so compute must not be the bottleneck.
In addition, the trends that have previously powered increasing levels of compute may be slowing down. For example, we had one-time gains by moving to GPUs and then to custom accelerators like TPUs, which probably will not happen again. In addition, many RL experiments require simulations on CPUs, and CPU improvements appear to be slowing down. GPU memory is often a bottleneck as well, though this could start increasing now that there is demand for larger memories, or we could get faster hardware interconnects that allow you to split models across multiple chips.
Rohin's opinion: I think the evidence in the first part suggesting an abundance of compute is mostly explained by the fact that academics expect that we need ideas and algorithmic breakthroughs rather than simply scaling up existing algorithms, so you should update on that fact rather than this evidence which is a downstream effect. If we condition on AGI requiring new ideas or algorithms, I think it is uncontroversial that we do not require huge amounts of compute to test out these new ideas.
The "we are bottlenecked on compute" argument should be taken as a statement about how to advance the state of the art in big unsolved problems in a sufficiently general way (that is, without encoding too much domain knowledge). Note that ImageNet is basically solved, so it does not fall in this category. At this point, it is a "small" problem and it's reasonable to say that it has an overabundance of compute, since it requires four orders of magnitude less compute than AlphaGo (and probably Dota). For the unsolved general problems, I expect that researchers do use efficient training tricks where they can find them, and they probably optimize hyperparameters in some smarter way. For example, AlphaGo's hyperparameters were tuned via Bayesian optimization.
More details in this comment. I don't know much about trends in hardware so I won't comment on the second part.
The case for taking AI seriously as a threat to humanity (Kelsey Piper): This is an introduction to the problem of AI safety, from the perspective that it is hard to specify the "right" goal, and that goal-driven behavior leads to convergent instrumental subgoals that will likely be dangerous. It also addresses several common initial reactions that people have.
Rohin's opinion: I really like this introduction: it remains understandable while being technically accurate. It will probably be my new default article to introduce people to the problem.
AI strategy and policy
Scaling shared model governance via model splitting (Miljan Martic, Jan Leike et al): Suppose that two organizations want to develop a deep learning model together without allowing either one to unilaterally use the model. This can be done cryptographically using homomorphic encryption or secure multiparty computation, but this introduces several orders of magnitude of slowdown. What about the much simpler solution of letting each organization hold half of the parameters, which are not shared with the other organization? For this to be secure, it should be prohibitively difficult to find the other organization's parameters.
In the least convenient world, where each organization has access to all training data, hyperparameters etc., this reduces to the security of the model completion problem: given all of the normal setup for deep learning as well as half of the trained parameters for a model M, the goal is to create a new model that performs as well as M. Of course, we can simply rerun the training procedure that was used to create M, so the cost is bounded above by the cost to create M in the first place. We might be able to do better by leveraging the trained parameters that we know -- for example, by using those parameters as an initialization for the model instead of whatever initialization we normally use. The paper empirically investigates how well strategies like this can work. They find that it is relatively easy to create a model that achieves good performance (getting 80% of the way to the best performance), but quite difficult to achieve performance as good as that of M, typically requiring 40-100% of the time it took to create M.
Rohin's opinion: In the particular setting that they're considering, let's say that we require C compute to train M. Then one of the organizations had to contribute at least 0.5C, and that organization could defect by investing 1.5C. The first 0.5C is used to take part in the model splitting scheme so as not to arouse suspicion, and the remaining 1C is used to train a new version of M from scratch. So, security in this setting requires us to assume that the organization is unwilling to invest 3x the compute they are going to invest. This assumption seems questionable, but when it does hold, the evidence from the paper suggests that model splitting is relatively secure, since it typically takes an additional 0.4-1C in order to fully solve the model completion problem.
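When there are N >> 2 parties, each party only has to contribute C/N. A defecting party would then need C/N to take part in the scheme plus C to retrain the model from scratch, so the assumption that no party will use C compute to recreate the model now translates to an assumption that no party will invest (N+1)x the compute it was going to contribute, which seems more reasonable for sufficiently large N.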
I suspect we could get more mileage if each party had its own training data that it kept secret. It's not clear how to train an AI system such that the training data remains secret, but if we could do that, and the model was split across each group, it would probably be impossible for any one group to recover a new model M' that achieved performance as good as that of M.
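The compute accounting in the opinion above can be sketched as follows. This is a back-of-the-envelope helper, not anything from the paper: the name `defection_multiplier` and the parameter `extra_compute_fraction` are hypothetical, and it assumes every party contributes an equal share C/N of the training compute C.

```python
def defection_multiplier(num_parties: int, extra_compute_fraction: float = 1.0) -> float:
    """How many times its honest contribution a defecting party must spend.

    The honest contribution is C / num_parties. Defecting costs that share
    (to avoid suspicion) plus extra_compute_fraction * C to obtain a usable
    model alone: 1.0 means retraining from scratch, 0.4 to 1.0 is the range
    reported for model completion.
    """
    if num_parties < 2:
        raise ValueError("model splitting needs at least two parties")
    # (C/N + f*C) / (C/N) simplifies to 1 + f*N.
    return 1.0 + extra_compute_fraction * num_parties


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(f"N={n}: retraining from scratch costs {defection_multiplier(n):.0f}x the honest share")
```

With two parties and retraining from scratch this gives the 3x figure from the opinion, and with N parties it gives N+1.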
2018 AI Alignment Literature Review and Charity Comparison (Larks): This post summarizes relevant papers in AI alignment over the last year, and uses them to compare different organizations working on AI alignment in order to choose which one to donate to.
Rohin's opinion: It's a good roundup of papers, including several papers that I haven't covered in this newsletter.