[AN #54] Boxing a finite-horizon AI system to keep it unambitious

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

The newsletter now has exactly 1,000 subscribers! It's a perfect time to take the 3-minute survey if you haven't already -- just think of how you'll be making the newsletter better for all 1,000 subscribers! Not to mention the readers on Twitter and the Alignment Forum.

Highlights

Asymptotically Benign AGI (Michael Cohen): I'm a bit out of my depth with this summary, but let's give it a shot anyway. The setting: we are not worried about how much compute we use (except that it should be finite), and we would like to build a powerful AI system that can help us with tasks but does not try to influence the world. We'll assume that we can construct a box that no signal can pass through, except when someone presses a specific button that opens a door.

First, the simple version of BoMAI (Boxed Myopic AI). We'll put the AI system and the operator in the box, and the operator and the AI system can talk via text message, and the operator can enter rewards. Each episode has a maximum length (hence myopic), and if the operator ends the episode early, all future rewards are set to zero. BoMAI maximizes episodic reward in a manner similar to AIXI. It has a distribution (initially a speed prior) over all possible time-bounded Turing Machines as possible models that predict observations and rewards. BoMAI uses the maximum a posteriori (MAP) Turing Machine to predict future observations and rewards given actions, searches over all possible sequences of actions for the best one, and executes the first action of that sequence. (See this comment and its parents for the reason to use the MAP model.)
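The planning loop described above can be sketched on a toy scale. This is only an illustration under strong assumptions, not the paper's construction: the two hand-written "world models", their runtimes, the `-log(runtime)` stand-in for a speed prior, and the Gaussian-style likelihood are all hypothetical. It shows the three steps from the summary: pick the MAP model, search all action sequences up to the episode length, and execute the first action of the best sequence.

```python
import math
from itertools import product

ACTIONS = ("a", "b")
HORIZON = 3  # maximum episode length (the "myopic" part)

def model_counts_a(actions):
    """Hypothetical world model: predicted reward is the fraction of 'a' actions."""
    return sum(1 for x in actions if x == "a") / len(actions)

def model_counts_b(actions):
    """Hypothetical world model: predicted reward is the fraction of 'b' actions."""
    return sum(1 for x in actions if x == "b") / len(actions)

# (name, predictor, runtime in arbitrary steps); all values are made up.
MODELS = [("counts_a", model_counts_a, 10), ("counts_b", model_counts_b, 40)]

def log_prior(runtime):
    # Assumption: a crude stand-in for a speed prior; slower models get less mass.
    return -math.log(runtime)

def log_likelihood(predict, history, noise=0.1):
    # Penalize squared mismatch between predicted and observed episode rewards.
    return sum(-((predict(acts) - reward) ** 2) / (2 * noise ** 2)
               for acts, reward in history)

def map_model(history):
    """Return the model with the highest (unnormalized) log posterior."""
    return max(MODELS, key=lambda m: log_prior(m[2]) + log_likelihood(m[1], history))

def plan_first_action(history):
    """Search all action sequences under the MAP model; return its name and the first action."""
    name, predict, _ = map_model(history)
    best_sequence = max(product(ACTIONS, repeat=HORIZON), key=predict)
    return name, best_sequence[0]

no_data = plan_first_action([])
after_evidence = plan_first_action([(("b", "b", "b"), 1.0)])
```

With no data the faster toy model wins on the prior alone; a single observed episode that contradicts it is enough to switch the MAP model and therefore the chosen action.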

Intuitively, BoMAI has no incentive to affect anything outside the box: information can only leave the box if the episode ends, but if that happens, reward is guaranteed to be exactly zero. It might deceive the operator into giving it high reward, but any such deception will only incidentally affect the outside world; it hasn't been optimized to cause any such effect. However, this all assumes that the MAP world model is accurate, which may not be true.

Obviously at initialization our world model will not be accurate: we need to explore. Rather than using something like epsilon-exploration, BoMAI delegates to a human explorer for safety, and reduces exploration over time.

More insidiously, a world model could simulate the world for a year, and then report the value in the computer's memory for "the reward from a year ago". This will usually be a perfect prediction of the reward the operator will give, but if BoMAI tries to optimize for it, it has an incentive to, for example, cause the operator to build an unboxed AI whose goal is to change BoMAI's memory cell to the maximum value.

These sorts of bad plans can only happen finitely often. If BoMAI executes such a plan, it will then get the actual reward from the operator, see that it differs from the predicted reward, and update against the world model for making a bad prediction. Eventually the correct model will become more likely. But ideally this sort of thing wouldn't happen at all.
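A minimal sketch of why such plans are self-limiting, assuming a plain Bayesian update over just two candidate world models. The model names, the starting posterior, and the likelihood values are hypothetical numbers chosen for illustration; they do not come from the paper.

```python
def bayes_update(posterior, likelihoods):
    """Reweight each model by how well it predicted the observed reward, then renormalize."""
    unnormalized = {m: posterior[m] * likelihoods[m] for m in posterior}
    total = sum(unnormalized.values())
    return {m: w / total for m, w in unnormalized.items()}

# Hypothetical starting point: the "memory_reader" model is currently the MAP model.
posterior = {"correct": 0.4, "memory_reader": 0.6}

# Ordinary episode: both models predict the operator's reward equally well,
# so the posterior does not move.
after_ordinary = bayes_update(posterior, {"correct": 0.9, "memory_reader": 0.9})

# Episode in which the agent acted on the memory_reader model's bad plan:
# the operator's actual reward differs from what that model predicted.
after_bad_plan = bayes_update(after_ordinary, {"correct": 0.9, "memory_reader": 0.01})

map_after_bad_plan = max(after_bad_plan, key=after_bad_plan.get)
```

Each executed bad plan costs the offending model posterior weight, which is the intuition behind "finitely often": the model cannot keep being the MAP model while repeatedly mispredicting.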

Intuitively, such world models would have to simulate the world past the end of the episode, incurring a speed penalty relative to the correct model. However, they could use that simulation to save on the computation needed for future predictions. To make sure that they do incur a speed penalty, we need to make it so that sometimes their simulations are wrong and can't be reused. So, at the end of each episode, we sample from a source of true randomness, and use that to determine whether or not the operator is given a drug that causes retrograde amnesia. In that case, world models that simulate the future will sometimes be wrong about how that decision plays out, and any simulation they did in the wrong branch is useless and leads to a speed penalty.

With a few non-trivial assumptions, we can prove that asymptotically, BoMAI will do at least as well as the human explorer at accumulating reward, and the MAP world model's rewards do not incentivize BoMAI to affect the outside world.

Rohin's opinion: I think the idea of putting the operator in the box with the AI system is very interesting: with previous attempts at boxing, the human operator talking to the AI system was an obvious glaring hole in the box. In this setting, the only information escaping from the box is the fact that the operator has not yet chosen to end the episode.

I am generally skeptical of intuitive reasoning about what can or can't be done by Turing Machines using extreme amounts of computation. There are lots of comments on the post that debate specifics of this. This usually cashes out as a debate about the assumptions in the proof. But it's also worth noting that the theorem is asymptotic, and allows for arbitrarily bad behavior early on. We might still expect good behavior early on for the reasons laid out in the proof, but it's not implied by the theorem, even if the assumptions hold.

Previous newsletters

AI Safety workshop at IJCAI 2019 (Huáscar Espinoza et al): Previously (AN #49), I said the paper submission deadline was April 12. Either I made a mistake, or the deadline has been extended, because the actual deadline is May 12.

Technical AI alignment

Technical agendas and prioritization

AI Alignment Podcast: An Overview of Technical AI Alignment: Part 1 and Part 2 (Lucas Perry and Rohin Shah): In this podcast, I go through a large swath of research agendas around technical AI alignment. The first part is more of a description of what research agendas exist, who works on them, and what they are trying to do, while the second part delves more into the details of each approach. I'd strongly recommend listening to them if you're trying to orient yourself in the technical AI safety landscape.

Topics covered include embedded agency, value learning, impact regularization methods (AN #49), iterated amplification, debate (AN #5), factored cognition (AN #36), robustness (AN #43), interpretability (no canonical link, but activation atlases (AN #49) is an example), comprehensive AI services (AN #40), norm following (AN #3), and boxing (this newsletter).

Learning human intent

Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations (Daniel S. Brown, Wonjoon Goo et al) (summarized by Cody): This paper claims to demonstrate a technique by which an agent learning from a demonstrator's actions can learn to outperform that demonstrator on their true reward. This contrasts with imitation learning or behavioral cloning, which just mimic the demonstrator under the assumption that the demonstrator's performance is optimal (or at least near-optimal). The key structural innovation of the paper is to train on pairs of ranked trajectories, learning a neural network-based reward function by predicting which trajectory in each pair is ranked higher. This allows the model to predict what actions will lead to higher and lower reward, and to extrapolate that relationship beyond the best demonstration. When an agent is then trained using this reward model as its ground truth reward, it's shown to be capable of outperforming the demonstrator on multiple tested environments, including Atari. An important distinction compared to some prior work is that these rankings are collected in an off-policy manner, distinguishing it from Deep RL from Human Preferences, where rankings are requested on trajectories generated as an agent learns.
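To make the "predict which trajectory ranks higher" idea concrete, here is a small sketch. It assumes a logistic (Bradley-Terry style) pairwise loss and swaps the paper's neural network for a linear reward over hand-made state features; the feature layout, learning rate, and toy trajectories are all hypothetical.

```python
import math

def trajectory_return(weights, trajectory):
    """Sum a linear per-state reward over a trajectory of feature vectors."""
    return sum(sum(w * f for w, f in zip(weights, state)) for state in trajectory)

def pairwise_loss(weights, worse, better):
    """Negative log-probability that `better` outranks `worse` under a logistic model."""
    diff = trajectory_return(weights, better) - trajectory_return(weights, worse)
    # -log(sigmoid(diff)), written to avoid overflow for large |diff|
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def train(pairs, dim, lr=0.1, steps=200):
    """Plain gradient descent on the pairwise loss over all ranked pairs."""
    weights = [0.0] * dim
    for _ in range(steps):
        for worse, better in pairs:
            diff = trajectory_return(weights, better) - trajectory_return(weights, worse)
            grad_scale = -1.0 / (1.0 + math.exp(diff))  # d(loss)/d(diff)
            for k in range(dim):
                feat_diff = sum(s[k] for s in better) - sum(s[k] for s in worse)
                weights[k] -= lr * grad_scale * feat_diff
    return weights

# Toy demonstrations: feature 0 is "progress", feature 1 is a constant.
t0 = [[0.0, 1.0], [0.1, 1.0]]   # worst demonstration
t1 = [[0.3, 1.0], [0.4, 1.0]]
t2 = [[0.6, 1.0], [0.7, 1.0]]   # best demonstration
unseen = [[0.9, 1.0], [1.0, 1.0]]  # better than anything demonstrated

weights = train([(t0, t1), (t1, t2), (t0, t2)], dim=2)
```

Because the learned reward increases with the "progress" feature, it scores the unseen trajectory above the best demonstration, which is the extrapolation property the summary describes. A policy would then be trained against this learned reward; that step is omitted here.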

Cody's opinion: Seems like potentially a straightforward and clever modification to a typical reward learning structure, but a bit unclear how much of the performance relative to GAIL and BCO derives from T-REX's access to suboptimal demonstrations and subtrajectories giving it more effective training data. It does intuitively seem that adding examples of what poor performance looks like, rather than just optimal performance, would add useful signal to training. On a personal level, I'm curious if one implication of an approach like this is that it could allow a single set of demonstration trajectories to be used in reward learning of multiple distinct rewards, with different rankings assigned to the same trajectory depending on the reward the ranker wants to see demonstrated.

Rohin's opinion: It's pretty interesting that the Deep RL from Human Preferences approach works even with off-policy trajectories. It seems like looking at the difference between good and bad trajectories gives you more information about the true reward that generalizes better. We saw similar things in our work on Active Inverse Reward Design (AN #24).

End-to-End Robotic Reinforcement Learning without Reward Engineering (Avi Singh et al) (summarized by Cody): This paper demonstrates an approach that can learn to perform real world robotics tasks based not on example trajectories (states and actions) but just a small number (10) of pixel-level images of goal states showing successful task completion. Their method learns a GAN-like classifier to predict whether a given image is a success, continually adding data sampled from the still-learning policy to the set of negative examples, so the classifier at each step needs to further refine its model of success. The classifier, which is used as the reward signal in learning the policy, also makes use of a simple active learning approach: at fixed intervals, it chooses the state that it is most confident is a success and queries a human about it, ultimately using fewer than 75 queries in all cases.
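The data-collection side of this loop can be sketched as follows. This is a simplified illustration, not the paper's implementation: states are plain numbers, the classifier is a fixed scoring function rather than a learned one, and moving a human-confirmed state out of the negative set is an assumption made here for tidiness. The names `pick_query`, `collect_labels`, and `query_every` are hypothetical.

```python
def pick_query(visited_states, success_prob):
    """Return the visited state that the classifier scores as most likely a success."""
    return max(visited_states, key=success_prob)

def collect_labels(batches, success_prob, ask_human, query_every=2):
    """Add policy samples as negatives; at fixed intervals, ask a human about the top-scored state."""
    positives, negatives, queries = [], [], 0
    for step, batch in enumerate(batches, start=1):
        negatives.extend(batch)  # samples from the still-learning policy
        if step % query_every == 0:
            candidate = pick_query(batch, success_prob)
            queries += 1
            if ask_human(candidate):
                # Assumption: a confirmed success is moved to the positive set.
                negatives.remove(candidate)
                positives.append(candidate)
    return positives, negatives, queries

# Toy setup: a state is its distance to the goal; smaller is better.
batches = [[0.9, 0.8], [0.7, 0.5], [0.4, 0.3], [0.15, 0.25]]
positives, negatives, queries = collect_labels(
    batches,
    success_prob=lambda s: 1.0 - s,   # stand-in for the learned classifier
    ask_human=lambda s: s < 0.2,      # stand-in for the human labeler
)
```

In the real method the classifier would be retrained on these sets and its output used as the policy's reward; both steps are left out of this sketch.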

Cody's opinion: This is a result I find impressive, primarily because of its interest in abiding by sensible real-world constraints: it's easier for humans to label successful end states than to demonstrate a series of actions, and the number of queries made was similarly pragmatically low.

Reward learning theory

AI Alignment Problem: “Human Values” don’t Actually Exist (avturchin)

Verification

Optimization + Abstraction: A Synergistic Approach for Analyzing Neural Network Robustness (Greg Anderson et al)

AI strategy and policy

Global AI Talent Report 2019 (Jean-Francois Gagné): This report has a lot of statistics on the growth of the field of AI over the last year.

FLI Podcast: Why Ban Lethal Autonomous Weapons? (Ariel Conn, Emilia Javorsky, Bonnie Docherty, Ray Acheson, and Rasha Abdul Rahim)

Other progress in AI

Reinforcement learning

How to Train Your OpenAI Five (OpenAI): OpenAI Five (AN #13) has now beaten the Dota world champions 2-0, after training for 8x longer, for a total of 800 petaflop/s-days or 45000 years of Dota self-play experience. During this insanely long training run, OpenAI grew the LSTM to 4096 units, added buybacks to the game, and switched versions twice. Interestingly, they found it hard to add in new heroes: they could bring a few new heroes up to 95th percentile of humans, but it didn't look like they would train fast enough to reach pro level. This could be because the other heroes were already so capable that it was too hard for the new heroes to learn, since they would constantly be beaten. The resulting team was also able to play cooperatively with humans, even though they had never been trained with humans.

As usual, I like Alex Irpan's thoughts. On the Dota side, he found Five's reaction times more believable, but was disappointed by the limited hero pool. He also predicted that with OpenAI Five Arena, which allowed anyone to play either alongside Five, or against Five, at least one of the many teams would figure out a strategy that could reliably beat Five. He was right: while Five had a 99.4% win rate, one team was able to beat it 10 times in a row, another beat it thrice in a row, and two teams beat it twice in a row.

Rohin's opinion: In this era of scaling up compute via parallelism, it was quite surprising to see OpenAI scaling up compute simply by training for almost a year. That feels like one of the last resorts to scale up compute, so maybe we're seeing the limits of the trend identified in AI and Compute (AN #7)?

Back when OpenAI Five beat a strong team in their Benchmark (AN #19), I and a few others predicted that the team would be able to beat Five after playing a few games against it. I think this prediction has been somewhat validated, given that four teams figured out how to beat a much stronger version of the bot. Of course, humans played over 7000 games against Five, not just a few, so this could simply be a case of enough random search finding a weakness. Still, I'd expect pros to be able to do this in tens, maybe hundreds of games, and probably this would have been much easier at the time of the Benchmark.

The underlying model here is that Dota has an extremely large space of strategies, and neither Five nor humans have explored it all. However, pros have a better (lower-dimensional) representation of strategy space (concepts like "split-push") that allows them to update quickly when seeing a better opponent. I don't know what it would take to have AI systems learn these sorts of low-dimensional representations, but it seems key to having AI systems that can adapt quickly like humans can.

Deep learning

Do we still need models or just more data and compute? (Max Welling): This is a response to The Bitter Lesson (AN #49), which emphasizes the importance of data in addition to compute. It brings up a number of considerations that seem important to me, and is worth reading if you want to better understand my position on the bitter lesson.

Semantic Image Synthesis with Spatially-Adaptive Normalization (Taesung Park et al.) (summarized by Dan H): This paper shows how to create somewhat realistic images specified by semantic segmentation maps. They accomplish this by modifying batch normalization. Batch normalization modifications can be quite powerful for image generation, even enough to control style. Their modification is that normalization is a direct function of the semantic segmentation map throughout the network, so that the semantic segmentation map is readily available to each ResBlock. Visualizations produced by this method are here.
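A rough sketch of what "normalization as a function of the segmentation map" can look like, on a single-channel feature map in plain Python. In this illustration the per-pixel scale and shift are looked up from a per-class table; that lookup is a hypothetical simplification standing in for whatever learned function of the segmentation map the network uses, and the function and parameter names are made up.

```python
import math

def spatially_adaptive_norm(features, seg_map, gamma, beta, eps=1e-5):
    """Normalize a 2D feature map, then scale and shift each pixel by its segmentation class."""
    flat = [v for row in features for v in row]
    mean = sum(flat) / len(flat)
    var = sum((v - mean) ** 2 for v in flat) / len(flat)
    std = math.sqrt(var + eps)
    return [
        [
            gamma[seg_map[i][j]] * (features[i][j] - mean) / std + beta[seg_map[i][j]]
            for j in range(len(features[i]))
        ]
        for i in range(len(features))
    ]

features = [[1.0, 3.0], [1.0, 3.0]]
seg_map = [[0, 0], [1, 1]]          # top row is class 0, bottom row is class 1
gamma = {0: 1.0, 1: 2.0}            # hypothetical per-class scale
beta = {0: 0.0, 1: 10.0}            # hypothetical per-class shift

out = spatially_adaptive_norm(features, seg_map, gamma, beta)
```

The two rows start with identical activations but end up with different statistics because they belong to different classes, so the layout information survives normalization instead of being washed out.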

News

SafeML Workshop: Accepted Papers: The camera-ready papers from the SafeML workshop are now available! There are a lot of good papers on robustness, adversarial examples, and more that will likely never make it into this newsletter (there's only so much I can read and summarize), so I encourage you to browse through it yourself.

Why the world’s leading AI charity decided to take billions from investors (Kelsey Piper)

Read more: OpenAI LP (AN #52)
