The Alignment Newsletter #1: 04/09/18

Rohin Shah

Highlights

Specification gaming examples in AI (Victoria Krakovna): A list of examples of specification gaming, where an algorithm figures out a way to literally satisfy the given specification which does not match the designer's intent.

Should you read it? There were several examples I hadn't heard of before, which were pretty entertaining. Also, if you have any examples that aren't already listed, it would be great to send them via the form so that we can have a canonical list of specification gaming examples.

My take on agent foundations: formalizing metaphilosophical competence (Alex Zhu): Argues that the point of Agent Foundations is to create conceptual clarity for fuzzy concepts that we can't formalize yet (such as logical uncertainty). We can then verify whether our ML algorithms have these desirable properties. It is decidedly not a goal to build a friendly AI using modules that Agent Foundations develops.

Should you read it? I don't know much about MIRI and Agent Foundations, but this made sense to me and felt like it clarified things for me.

Adversarial Attacks and Defences Competition (Alexey Kurakin et al): This is a report on a competition held at NIPS 2017 for the best adversarial attacks and defences. It includes a summary of the field and then shows the results from the competition.

Should you read it? I'm not very familiar with the literature on adversarial examples and so I found this very useful as an overview of the field, especially since it talks about the advantages and disadvantages of different methods, which are hard to find by reading individual papers. The actual competition results are also quite interesting -- they find that the best attacks and defences are both quite successful on average, but have very bad worst-case performance (that is, the best defence is still very weak against at least one attack, and the best attack fails to attack at least one defence). Overall, this paints a bleak picture for defence, at least if the attacker has access to enough compute to actually try out different attack methods, and has a way of verifying whether the attacks succeed.
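The gap between average-case and worst-case performance can be made concrete with a toy score matrix. This is only a sketch: the numbers below are invented for illustration and are not the competition's results, and the names `accuracy` and `summarize` are hypothetical.

```python
import numpy as np

# Hypothetical accuracy of each defence (rows) against each attack (columns).
# These values are made up for illustration; they are NOT competition results.
accuracy = np.array([
    [0.95, 0.92, 0.20],  # defence A: strong on average, weak against attack 3
    [0.60, 0.60, 0.60],  # defence B: mediocre but consistent
])

def summarize(scores):
    """Return (mean, worst-case) score for each row of a score matrix."""
    return scores.mean(axis=1), scores.min(axis=1)

# Defence view: higher accuracy is better, so the worst case is the row minimum.
defence_mean, defence_worst = summarize(accuracy)

# Attack view: an attack succeeds when the defence misclassifies, so its
# success rate is 1 - accuracy, and each attack becomes a row after transposing.
attack_mean, attack_worst = summarize((1.0 - accuracy).T)
```

In this toy setup defence A wins on the mean but loses badly on the worst case. An attacker who can try every attack and check which one works effectively faces the worst-case number, not the average.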

Technical AI alignment

Problems

Specification gaming examples in AI (Victoria Krakovna): Summarized in the highlights!

Metaphilosophical competence can't be disentangled from alignment (Alex Zhu): Would you be comfortable taking a single human, and making them a quadrillion times more powerful?

Should you read it? I am curious to see people's answers to this. I think it might be a good question for revealing major differences in worldview between optimistic and pessimistic safety researchers.

Reframing misaligned AGI's: well-intentioned non-neurotypical assistants (Alex Zhu): Another way to think about problems from AGI is to imagine the AI as a well-intentioned but neuroatypical friend, who learned all about humans from Wikipedia, and who has access to immense resources. You would worry a lot about principal-agent problems in such a scenario.

Should you read it? I like this framing. I'm not sure if it is actually a good model for act-based agents, but it's another way to think about what problems could arise from an AI system that is superintelligent in some domains and subhuman in others.

Read more: Act-based agents

Superintelligent messiahs are corrigible and probably misaligned (Alex Zhu)

Technical agendas and prioritization

My take on agent foundations: formalizing metaphilosophical competence (Alex Zhu): Summarized in the highlights!

Agent foundations

2018 research plans and predictions (Rob Bensinger): Scott and Nate from MIRI score their predictions for research output in 2017 and make predictions for research output in 2018.

Should you read it? I don't know enough about MIRI to have any idea what the predictions mean, but I'd still recommend reading it if you're somewhat familiar with MIRI's technical agenda to get a bird's-eye view of what they have been focusing on for the last year.

Prerequisites: A basic understanding of MIRI's technical agenda (eg. what they mean by naturalized agents, decision theory, Vingean reflection, and so on).

Musings on Exploration (Alex Appel): Decision theories require some exploration in order to prevent the problem of spurious counterfactuals, where you condition on a zero-probability event. However, there are problems with exploration too, such as unsafe exploration (eg. launching a nuclear arsenal in an exploration step), and a sufficiently strong agent seems to have an incentive to self-modify to remove the exploration, because the exploration usually leads to suboptimal outcomes for the agent.

Should you read it? I liked the linked post that explains why conditioning on low-probability actions is not the same thing as a counterfactual, but I'm not knowledgeable enough to understand what's going on in this post, so I can't really say whether or not you should read it.

Quantilal control for finite MDPs (Vadim Kosoy)

Miscellaneous (Alignment)

Papers from AI and Society: Ethics, Safety and Trustworthiness in Intelligent Agents

Guide Me: Interacting with Deep Networks (Christian Rupprecht, Iro Laina et al)

Near-term concerns

Adversarial examples

Adversarial Attacks and Defences Competition (Alexey Kurakin et al): Summarized in the highlights!

Security

Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks (Ali Shafahi, W. Ronny Huang et al): Demonstrates a data poisoning attack in which the adversary gets to choose a poison input to add to the training set, but does not get to choose its label. The goal is to misclassify a single test instance as a specific base class. They achieve this by creating a poison input that looks like the base class in pixel space but looks like the test instance in feature space (i.e. the activations in the penultimate layer). The poison input will be labeled by humans as the base class, and then when the network is retrained with the original dataset and the new poisoned input(s), it will classify the poison input as the base class, and with it the test instance as well (since they have very similar features).

Should you read it? I was pleasantly surprised at how understandable the paper was, and they do a good job of looking at exactly what their method is doing and how it accomplishes the attack in different ways under different settings.
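The feature-collision idea in the Poison Frogs summary can be sketched as an optimization: find an input that is close to the target in feature space while staying close to the base in input space. The code below is a minimal illustration under strong assumptions, not the authors' implementation: the feature extractor is a fixed random linear map standing in for a network's penultimate layer, and `features`, `craft_poison`, `beta`, `lr`, and `steps` are hypothetical names and values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in feature extractor: a fixed random linear map (an assumption; the
# paper works with penultimate-layer activations of a trained network).
W = rng.normal(size=(8, 32)) / np.sqrt(32)

def features(x):
    return W @ x

def craft_poison(base, target, beta=0.1, lr=0.1, steps=500):
    """Minimize ||f(x) - f(target)||^2 + beta * ||x - base||^2, starting at base."""
    x = base.copy()
    target_feat = features(target)
    for _ in range(steps):
        # Forward step: gradient descent on the feature-space distance.
        grad = 2.0 * W.T @ (features(x) - target_feat)
        x_hat = x - lr * grad
        # Backward step: closed-form proximal update for beta * ||x - base||^2,
        # which pulls the input back toward the base.
        x = (x_hat + 2.0 * lr * beta * base) / (1.0 + 2.0 * lr * beta)
    return x

base = rng.normal(size=32)     # plays the role of the base-class input
target = rng.normal(size=32)   # plays the role of the test instance
poison = craft_poison(base, target)
```

Raising `beta` keeps the poison closer to the base in input space at the cost of a looser match to the target's features.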

Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning (Matthew Jagielski et al)

AI strategy and policy

France's AI strategy: See Import AI's summary.

Initial Reference Architecture of an Intelligent Autonomous Agent for Cyber Defense (Alexander Kott et al): See Import AI's summary.

AI capabilities

Reinforcement learning

Retro Contest (Christopher Hesse et al): OpenAI has released Gym Retro, providing an interface to work with video games from SEGA Genesis, which are more complex than the ones from Atari. They want to use these environments to test transfer learning in particular, where the agent may be pretrained on initial levels for as long as desired, and then must learn how to complete a new test level with only 1 million timesteps (~18 hours) of gameplay. (Humans do well with 2 hours of pretraining and 1 hour of play on the test level.)

Should you read it? If you want to keep track of progress in deep RL, probably -- this seems quite likely to become the new set of benchmarks that researchers work on. There's also another example of specification gaming in the post.
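The "1 million timesteps (~18 hours)" conversion can be checked with a little arithmetic. The sketch below assumes 60 frames per second and 4 game frames per agent timestep; neither number is stated in the summary above, so treat both as assumptions.

```python
# Assumptions (not stated in the summary): the game runs at 60 frames per
# second, and each agent timestep advances the game by 4 frames.
FRAMES_PER_SECOND = 60
FRAMES_PER_TIMESTEP = 4

def timesteps_to_hours(timesteps,
                       frames_per_timestep=FRAMES_PER_TIMESTEP,
                       fps=FRAMES_PER_SECOND):
    """Convert a number of agent timesteps into hours of real-time gameplay."""
    seconds = timesteps * frames_per_timestep / fps
    return seconds / 3600

test_budget_hours = timesteps_to_hours(1_000_000)
```

Under these assumptions the test budget comes out between 18 and 19 hours, which matches the quoted figure and is well over ten times the single hour of test-level play that humans need.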

Learning to navigate in cities without a map (Piotr Mirowski et al)

Deep learning

Universal Planning Networks (Aravind Srinivas et al): This is an architecture that has a differentiable planning module, that is, a neural network that takes in (encodings of) states or observations and produces actions. You can use this in conjunction with eg. expert demonstrations (as in imitation learning) in order to learn features that are optimized for the purpose of planning, focusing only on the details relevant to the task, unlike an auto-encoder, which must reconstruct the entire image, including irrelevant details.

Should you read it? It's a good example of the push towards learning more and more complex algorithms using neural nets (in this case, planning). From a safety perspective, differentiable planning networks may be useful for modeling humans.
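One way to make planning differentiable is to treat the action sequence as parameters and run gradient descent on the distance between the predicted final state and a goal. The toy sketch below illustrates only that idea and is not the paper's architecture: the linear latent dynamics `A` and `B`, the hand-derived gradient, and the names `rollout` and `plan` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical linear latent dynamics z' = A z + B a. A learned model would
# replace these fixed matrices.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = 0.5 * np.eye(2)

def rollout(z0, actions):
    """Apply the dynamics once per action and return the final latent state."""
    z = z0
    for a in actions:
        z = A @ z + B @ a
    return z

def plan(z0, z_goal, horizon=3, lr=0.5, steps=200):
    """Gradient descent on the actions to minimize 0.5 * ||z_H - z_goal||^2."""
    actions = np.zeros((horizon, 2))
    for _ in range(steps):
        error = rollout(z0, actions) - z_goal
        for t in range(horizon):
            # d z_H / d a_t = A^(H-1-t) B for linear dynamics.
            jac = np.linalg.matrix_power(A, horizon - 1 - t) @ B
            actions[t] -= lr * (jac.T @ error)
    return actions

z0 = np.array([0.0, 0.0])
z_goal = np.array([1.0, -1.0])
actions = plan(z0, z_goal)
```

Because every step of `plan` is differentiable, an outer loss (for example, the distance between the planned actions and expert actions) could in principle be backpropagated through the planner to train the encoder and dynamics. That outer loop is omitted here.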
