Alignment Newsletter #21

Rohin Shah

Highlights

80K podcast with Katja Grace (Katja Grace and Rob Wiblin): Rob Wiblin interviewed Katja Grace of AI Impacts about her work predicting the future of AI. My main takeaway was that there are many important questions in this space that almost no one is trying to answer, and that we haven't made a good enough attempt yet to conclude that it's too hard to do, so we should put more time into it. If you haven't seen AI Impacts' work before, you can get some of the most interesting results (at a high level) from listening to this podcast. There's a ton of detail in the podcast -- too much for me to summarize here.

My opinion: I don't currently think very much about timelines, intelligence explosions, and other questions that AI Impacts thinks about, but it seems very plausible to me that these could be extremely important. (I do think about discontinuities in progress and am very glad I read the AI Impacts post on the subject.) One point that the interview brings up is that there are very few (perhaps two?) full time equivalents working on predicting the future of AI, while there are many people working on technical AI safety, so the former is more neglected. I'm not sure I agree with this -- the number of full time equivalents doing technical AI alignment research seems quite small (on the order of 50 people). However, I do see many people who are trying to skill up so that they can do technical AI alignment research, and none who want to do better prediction, and that seems clearly wrong. I would guess that there are several readers of this newsletter who want to do technical AI alignment research, but who would have more impact if they worked in an adjacent area, such as prediction as at AI Impacts, or policy and strategy work, or in better tools and communication. Even though I'm well-placed to do technical research, I still think that common knowledge of research is a big enough bottleneck that I spend a lot of time on this newsletter. It seems likely that there is someone else who would do a better job than me, but who is set on technical safety research even though they wouldn't be as good. So I guess if you are still trying to figure out how to best help with AI alignment, or are about to start training up to do technical research, please do listen to this podcast and consider that alternative route, and various others as well. The goal is not to figure out which question is the most important, so that you can try to solve it. You'll likely do better by considering the field as a whole, and asking which area you would be in if someone optimally assigned people in the field to tasks.

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review (Sergey Levine): I sent this out as a link in AN #5, but only just got around to reading it. This paper shows how you can fit the framework of reinforcement learning into the framework of inference within probabilistic graphical models. Specifically, the states s_t and actions a_t are now represented as nodes in the graphical model, and we add in new nodes O_t that represent whether or not an "event" happened at time t. By assigning the values of P(O_t | s_t, a_t) appropriately, we can encode a reward function. Then, by conditioning on the rewarding events happening, we can infer what actions must have been taken to get these events, which gives us a policy that achieves high reward. They later talk about the connection to variational inference, and how you can get IRL methods in this framework.

My opinion: Remarkably, this paper is both heavy on (useful) math, and very clear and well-explained. I actually didn't try to explain the technical details in my summary as much as I usually do, because you can just read the paper and actually understand what's going on, at least if you're familiar with probabilistic graphical models. Regarding the content, I've found the framework useful for one of my current projects, so I do recommend reading it.

Safety-first AI for autonomous data centre cooling and industrial control (Amanda Gasparik et al): Two years ago, DeepMind built an AI recommendation system that provided suggestions on how best to cool Google's data centers, leading to efficiency gains. Nine months ago, the AI was given autonomous control to take actions directly, rather than going through human operators, and it has been improving ever since, going from 12% savings at deployment to 30% now.

Of course, such a system must be made extremely reliable, since a failure could result in Google's data centers going down. They implemented several safety measures. They throw out any actions that the AI is not confident about. All actions are verified against a set of hand-coded safety rules, both when the actions are generated in the cloud, and at each local data center, for reliability through redundancy. There are human operators monitoring the AI to make sure nothing goes wrong, who can take over control whenever they want to. There is also an automated system that will fall back to the original system of heuristics and rules if the safety conditions are ever violated.

My opinion: This is a remarkable number of safety precautions, though in hindsight it makes total sense given how bad a failure could be. None of the precautions would stop a superintelligent agent in the classical sense (that is, the sort of superintelligent agent in paperclip maximizer stories), but they seem like a really good set of precautions for anything task-based. I am curious how they chose the threshold for when to discard actions that the AI is not confident enough in (especially since AI uncertainty estimates are typically not calibrated), and how they developed the safety rules for verification (since that is a form of specification, which is often easy to get wrong).

Technical AI alignment

Agent foundations

Reducing collective rationality to individual optimization in common-payoff games using MCMC (jessicata): Given how hard multiagent cooperation is, it would be great if we could devise an algorithm such that each agent is only locally optimizing their own utility (without requiring that anyone else change their policy), that still achieves the globally optimal policy. This post considers the case where all players have the same utility function in an iterated game. In this case, we can define a process where at every timestep, one agent is randomly selected, and that agent changes their action in the game uniformly at random with probability that depends on how much utility was just achieved. This depends on a rationality parameter α -- the higher α is, the more likely it is for the player to stick with a high utility action.

This process allows you to reach every possible joint action from every other possible joint action with some non-zero probability, so in the limit of running this process forever, you will end up visiting every state infinitely often. However, by cranking up the value of α, we can ensure that in the limit we spend most of the time in the high-value states and rarely switch to anything lower, which lets us get arbitrarily close to the optimal deterministic policy (and so arbitrarily close to the optimal expected value).

My opinion: I like this, it's an explicit construction that demonstrates how you can play with the explore-exploit tradeoff in multiagent settings. Note that when α is set to be very high (the condition in which we get near-optimal outcomes in the limit), there is very little exploration, and so it will take a long time before we actually find the optimal outcome in the first place. It seems like this would make it hard to use in practice, but perhaps we could replace the exploration with reasoning about the game and other agents in it? The author was planning to use reflective oracles to do something like this if I understand correctly.

Learning human intent

Shared Multi-Task Imitation Learning for Indoor Self-Navigation (Junhong Xu et al)

Preventing bad behavior

Safety-first AI for autonomous data centre cooling and industrial control (Amanda Gasparik et al): Summarized in the highlights!

Interpretability

Learning Explanations from Language Data (David Harbecke, Robert Schwarzenberg et al)

Miscellaneous (Alignment)

80K podcast with Katja Grace (Katja Grace and Rob Wiblin): Summarized in the highlights!

Book Review: AI Safety and Security (Michaël Trazzi): A review of the new AI Safety and Security book. It goes through each of the papers, giving a short summary of each and some comments (similar to this newsletter).

My opinion: I take a very different approach to AI safety, so it was nice to read a summary of what other people are thinking about. Based on the summaries, it sounds like most of the essays that focused on AGI were anthropomorphizing AGI more than I would like (though of course I haven't actually read the book).

Near-term concerns

Privacy and security

Are You Tampering With My Data? (Michele Alberti, Vinaychandran Pondenkandath et al)

AI capabilities

Reinforcement learning

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review (Sergey Levine): Summarized in the highlights!

The International 2018: Results (OpenAI): Two human teams beat OpenAI Five at The International. The games seemed much more like regular Dota, probably because there was now only one vulnerable courier for items instead of five invulnerable ones. This meant that OpenAI Five's strategy of a relentless team attack on the enemy was no longer as powerful, because they couldn't get the health regeneration items they needed to constantly stay alive to continue the attack. It's also possible (but less likely to me) that the matches were more normal because the teams were more even, or because the human teams knew about Five's strategy this time and were countering it in ways that I don't understand.

My opinion: There are still some things that the bots do that seem like bad decisions. You can interpret this a few ways. Five could have learned a large number of heuristics that make it good enough to beat almost all humans, but that break down in edge cases. In this story, Five is not good at learning logical or abstract reasoning, but can compensate for that in the average case with the sheer number of heuristics it can learn. Another interpretation is that Five learns a good representation of Dota which lets it come up with new, novel insights into the game, which we can't see or understand because the representation is alien to us. However, the representation makes it harder to come up with other insights about Dota that we have using our representations of Dota, and as a result Five makes some mistakes that humans can easily recognize as mistakes. I lean towards the first interpretation, but not very strongly.

Deep learning

Skill Rating for Generative Models (Catherine Olsson et al)

Neural Architecture Search: A Survey (Thomas Elsken et al)

Analyzing Inverse Problems with Invertible Neural Networks (Lynton Ardizzone et al)

Unsupervised learning

Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies (Alessandro Achille et al)

Learning deep representations by mutual information estimation and maximization (R Devon Hjelm et al)

Miscellaneous (Capabilities)

Winner's Curse? (D. Sculley et al): A short paper arguing that we need more empirical rigor in ML, identifying some structural incentives that push against this and suggesting solutions.

My opinion: While this isn't very relevant to technical alignment, it does seem important to have more rigor in ML, since ML researchers are likely to be the ones building advanced AI.

News

DeepMind job: Science Writer: According to the job listing, the role would involve creating content for the blog, videos, presentations, events, etc. and would require a reasonably technical background and strong writing skills. Vishal Maini at DeepMind notes that this person would likely have a significant impact on how AI research is communicated to various key strategic audiences around the world -- from the technical community to the broader public -- and would spend some of their time engaging with AI alignment research, among other areas.

Internship: The Future Society (Caroline Jeanmaire): An internship which will focus on AI policy research as well as support to organize two large AI governance events. To apply, send a CV and a short letter explaining ‘why you?’ to caroline.jeanmaire@thefuturesociety.org.

25