
Highlights

AI safety via debate (Geoffrey Irving et al): At a high level, a major issue with building superintelligent AI is that humans would not be able to provide strong oversight for the AI. Amplification addresses this by using the AI as a tool that can help the human (in particular, if the human can break a task down into subtasks, the AI can solve the subtasks). Debate also uses the AI as a tool for the human overseer, but in a different way -- to train the AI, we have it debate a copy of itself in order to convince a human of the answer to some target question. Given a question whose answer is too hard to judge directly, the human can look at the arguments and counterarguments to figure out whether or not a proposed answer is actually correct.
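To make the setup concrete, here is a minimal sketch of a single debate game. The `agent_a`, `agent_b`, and `judge` objects are hypothetical interfaces of my own, not from the paper (whose MNIST experiment instantiates something like this with pixel-revealing debaters and a sparse-pixel classifier as the judge):

```python
# Minimal sketch of one debate game; the agent and judge interfaces are
# hypothetical placeholders, not the paper's implementation.

def play_debate(question, agent_a, agent_b, judge, num_rounds=6):
    # Each debater commits to an answer up front.
    transcript = [("A", agent_a.propose_answer(question)),
                  ("B", agent_b.propose_answer(question))]

    # Debaters alternate short arguments, each able to rebut the other.
    for i in range(num_rounds):
        name, agent = [("A", agent_a), ("B", agent_b)][i % 2]
        transcript.append((name, agent.argue(question, transcript)))

    # The judge only sees the transcript, not the full problem, and picks
    # whichever debater argued more convincingly.
    winner = judge.pick_winner(question, transcript)

    # Zero-sum reward: self-play on this signal is what is supposed to push
    # the equilibrium toward honest, verifiable arguments.
    return {"A": 1, "B": -1} if winner == "A" else {"A": -1, "B": 1}
```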

The paper describes debate in a lot more depth and has an initial experiment involving MNIST. I can't possibly do it justice here -- I encourage you to simply read the full paper. You probably have an intuition right now of why this wouldn't work, such as "but humans believe what they want to hear, not what is true". The paper spends 5 (!) pages listing ten such problems and analyzing them, so go read it.

My opinion: It's great to see another approach that directly tackles the problem of defining a training signal that, if optimized well, would lead to an aligned AI. There are a lot of empirical questions that would influence whether or not debate actually works in practice, and I'm excited to see what the experiments find.

AGI Safety Literature Review (Tom Everitt et al): Self-explanatory. It's more of a list of approaches and references within each approach than an integrated whole, but I still expect it to be useful.

My opinion: This is great as a way to find references. I do wish there was more comparison between papers and/or approaches, but that's probably asking too much.

No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling (Xin Wang, Wenhu Chen et al): This paper tackles visual storytelling, the task of generating a story that matches a sequence of photos. It proposes learning a reward function from the labeled dataset that can then be optimized with reinforcement learning, with the hope that the reward function is a good compression of what we want and so leads to more generalizable behavior. They show that the standard automated techniques for evaluating visual stories are not very good, and so they perform a Mechanical Turk study that shows very good results compared to prior work. MTurk workers are often unable to tell whether the stories were generated by their algorithm or a human!

How does it work? Their architecture has a policy network that creates the stories and a reward network that provides the supervision, which are trained adversarially. We can think of the reward function as inducing a probability distribution over stories, where stories with higher reward are more probable. Then, the reward network acts as a discriminator, trying to make its implied probability distribution similar to the empirical data distribution and dissimilar to the policy network distribution, while the policy network acts as a generator, creating a policy that tries to match the implied probability distribution of the reward network. (This is equivalent to maximizing the expected reward from the reward network.)
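As a rough sketch (my own rendering, not the authors' code -- `policy_net`, `reward_net`, and the exact losses are simplified placeholders), one adversarial training step might look like this:

```python
import torch

def adversarial_reward_step(photos, human_story, policy_net, reward_net,
                            policy_opt, reward_opt):
    # Generator step: the policy samples a story and is updated (REINFORCE-style
    # here) to maximize the reward network's score, i.e. to match the
    # distribution implied by the reward function.
    story, log_prob = policy_net.sample(photos)
    policy_loss = -(reward_net(photos, story).detach() * log_prob).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Discriminator step: the reward network pushes human stories toward high
    # reward and generated stories toward low reward, so that its implied
    # distribution tracks the data and moves away from the policy's.
    reward_loss = (reward_net(photos, story.detach()).mean()
                   - reward_net(photos, human_story).mean())
    reward_opt.zero_grad()
    reward_loss.backward()
    reward_opt.step()
```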

My opinion: It's exciting to see reward learning applied to a concrete problem that researchers are working on, and having it lead to an actually better system. This work uses reward learning in a context where we are trying to mimic human actions (sentence generation in this case) -- eventually we will want to be able to deal with action spaces different from humans' (as in robotics) and to aim for superhuman performance.

Technical AI alignment

Technical agendas and prioritization

AGI Safety Literature Review (Tom Everitt et al): Summarized in the highlights!

Scalable oversight

AI safety via debate (Geoffrey Irving et al): Summarized in the highlights!

Agent foundations

Doubts about Updatelessness (Alex Appel)

Learning human intent

No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling (Xin Wang, Wenhu Chen et al): Summarized in the highlights!

Reward Learning from Narrated Demonstrations (Hsiao-Yu Fish Tung et al): This paper learns and optimizes rewards given demonstrations of behavior along with a description of the behavior in natural language. Their dataset is a set of videos of humans demonstrating a task and describing it with natural language (such as "the orange is in the bowl"). They combine several techniques to use this dataset to teach a robot. First, using speech recognition, they get a transcript of the natural language aligned with the video. They use object detectors to figure out what things are present in the image, and a syntactic parser to figure out the subject and object of the sentence, and match up these two results to figure out which objects in the image the natural language refers to and extract their spatial features. They then train a classifier that takes the spatial features and detects whether the goal has been achieved, conditioned on the natural language description of the task. Now that they have a reward function (1 at a goal state, 0 otherwise), they can train a robot using DQN, though to get this to work they infer 3D object configurations from 2D images and use distance to the goal as a shaped reward.
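As a toy illustration of how the pieces fit together (the `goal_classifier` and `detect_objects_3d` functions below are hypothetical stand-ins, and the shaping term is my own simplification):

```python
import numpy as np

def narrated_reward(frame, instruction, goal_classifier, detect_objects_3d,
                    shaping_weight=0.1):
    # Hypothetical perception step: infer 3D positions of the objects the
    # instruction refers to, e.g. {"orange": xyz, "bowl": xyz}.
    objects = detect_objects_3d(frame, instruction)
    features = np.concatenate(list(objects.values()))

    # Sparse reward: 1 if the language-conditioned classifier says the
    # described goal state has been reached, 0 otherwise.
    sparse = 1.0 if goal_classifier(features, instruction) else 0.0

    # Shaped term: negative distance between the two referenced objects,
    # which is what makes DQN training tractable in their setup.
    a, b = list(objects.values())[:2]
    shaped = -float(np.linalg.norm(a - b))
    return sparse + shaping_weight * shaped
```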

My opinion: I'm excited to see approaches to reward learning that use information from natural language -- that definitely seems like a rich source of information we have not made use of yet. That said, there are a lot of moving parts in this system and the parts that analyze natural language impose a lot of structure, to the extent that it feels like a domain-specific language instead of natural language. (I'm not confident about this, it's hard to tell from the paper.) I also don't understand the experiments. They say that the sparse reward couldn't train policies with episode length 3 -- sparse reward is a hard problem, but it isn't that hard, so I must be misunderstanding something here.

Heuristic Approaches for Goal Recognition in Incomplete Domain Models (Ramon Pereira et al) (H/T Beth Barnes): The planning community works on algorithms that can plan given a symbolic definition of the environment, how actions affect the environment, and the goal state -- analogous to reinforcement learning. The task of inverting the optimal behavior to infer the goal is called goal recognition or plan recognition (analogous to inverse reinforcement learning). This paper looks at goal recognition where the model of the world is incomplete, so that some preconditions and effects of actions are only known to be possible. They extract potential landmarks from the plan -- facts or actions that must happen in order to achieve the goal -- and then suggest two heuristics that use these landmarks to rank the possible goals.
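To give a flavor of the approach, here is a toy version of a landmark-completion heuristic: rank each candidate goal by the fraction of its landmarks that appear in the observations (this is my own simplification, not the paper's exact definitions):

```python
def rank_goals(candidate_goals, landmarks_for, observed):
    """Rank goals by landmark coverage.

    candidate_goals: list of goals under consideration
    landmarks_for:   function mapping a goal to its set of landmarks
    observed:        set of facts/actions extracted from the observations
    """
    def coverage(goal):
        landmarks = landmarks_for(goal)
        return len(landmarks & observed) / max(len(landmarks), 1)

    return sorted(candidate_goals, key=coverage, reverse=True)
```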

My opinion: I'm not familiar with this field, but it seems like they have identified a different set of problems with and solutions for goal inference, and it would be useful to see how they apply to IRL. Perhaps the explicit landmark inference leads to more hierarchy in goals? Maybe the unique landmark heuristic is not captured in the standard Boltzmann-rational assumption in IRL? I'd also be interested to see if we can apply IRL algorithms to the plan recognition dataset and get good performance.
Read more: Datasets

Reward learning theory

Rigging is a form of wireheading (Stuart Armstrong): Ideally in reward learning, we want our AI to be learning a fact about the world -- which reward it should be optimizing. However, for most proposals of reward learning, the AI's actions can also influence this "fact" about the world. In this case, the AI can wirehead by influencing the world so that it learns an easy-to-maximize reward. This is what Stuart calls "rigging" of the learning process.
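A toy example of the problem (my own, not Stuart's): if the evidence the AI gathers determines which reward it ends up with, an agent that maximizes expected post-learning reward will pick the evidence-gathering action that leads to the most maximizable reward, not the most informative one.

```python
# Toy illustration of rigging: the "learned" reward depends on which
# evidence-gathering action the AI takes.

outcomes = {
    "ask_carefully": ("true-ish reward", 10),   # hard to maximize
    "ask_leadingly": ("easy reward", 100),      # easy to maximize
}

def riggable_choice():
    # An agent maximizing expected post-learning reward rigs the process by
    # choosing the action whose resulting reward is easiest to max out.
    return max(outcomes, key=lambda action: outcomes[action][1])

print(riggable_choice())  # -> "ask_leadingly"
```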

Near-term concerns

Privacy and security

Privacy and machine learning: two unexpected allies? (Nicholas Papernot et al): Differential privacy provides guarantees on how much information queries of a specific type can reveal about any individual entry in a dataset. Normally, in order to achieve such guarantees, you must add randomness to the input data that can change the decision, so that there is a plausible alternative explanation for any decision. Unsurprisingly, this tends to degrade performance. However, in deep learning, we often have the problem of our models overfitting to specific details in the training set instead of generalizing appropriately, so we might expect that differential privacy could actually help with performance (as well as privacy). Private Aggregation of Teacher Ensembles (PATE) demonstrates that this is the case. It works by training several teacher models on disjoint subsets of the sensitive data to solve the task at hand. Then, by aggregating the results across the ensemble with some random noise, we can answer queries while bounding the amount of information that is leaked. However, each query uses up more of our "privacy budget", so this can't go on indefinitely. To solve this, we can make a fixed number of queries to label some unlabelled public data, use those labels to train a student model, and use the student model to make predictions from then on. An adversary could at worst infer the entire training dataset of the student model -- but that training set was itself produced by the privacy-preserving mechanism, so the privacy of the original data is maintained.
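A minimal sketch of the noisy aggregation step at the heart of PATE (the Laplace noise and its scale below are illustrative choices; the real mechanism carefully accounts for the privacy budget):

```python
import numpy as np

def noisy_aggregate(teacher_votes, num_classes, noise_scale=1.0, seed=None):
    # Count the teachers' votes for each class, add Laplace noise, and return
    # the noisy argmax as the (privacy-protected) label for the student.
    rng = np.random.default_rng(seed)
    counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
    counts += rng.laplace(scale=noise_scale, size=num_classes)
    return int(np.argmax(counts))

# Usage: label one public unlabeled example with a 5-teacher ensemble; the
# student is then trained on (example, label) pairs produced this way.
label = noisy_aggregate(np.array([2, 2, 7, 2, 1]), num_classes=10)
```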

My opinion: I would have been excited by work that randomizes the inputs to a deep learning technique in order to get better generalizability. It's cool that this goal dovetails so beautifully with the desire for differential privacy.

AI capabilities

Reinforcement learning

TDM: From Model-Free to Model-Based Deep Reinforcement Learning (Vitchyr Pong): In many tasks, we have hierarchical structure where we want to plan at a high level, but to execute the low-level actions we want to rely on learning through experience. For example, when biking from UC Berkeley to the Golden Gate Bridge, you definitely want to plan in advance the route you'll take (as opposed to learning through trial-and-error), but you want to learn how to bike through trial-and-error. Temporal Difference Models allow you to do model-based planning at the high level, and model-free learning at the low level. Specifically, you learn a function Q(s1, a, s2, T), which intuitively says "if I start from state s1, taking action a, and running for T steps, how close can I get to state s2". It turns out that this can be thought of as a Q function and so can be trained using standard model-free RL techniques. Note that the constraint Q(s1, a, s2, T) = 0 says that it is possible to get from s1 to s2 in T steps after first taking action a.
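Concretely, the Bellman target for a TDM looks roughly like this (a sketch based on my reading; `q_net` is a hypothetical goal- and horizon-conditioned Q-function whose value is the negated distance to the goal):

```python
import numpy as np

def tdm_target(next_state, goal, horizon, q_net, actions):
    # With no steps remaining, the "reward" is just negative distance to the
    # goal state; Q = 0 therefore means the goal is exactly reachable.
    if horizon == 0:
        return -np.linalg.norm(next_state - goal)
    # Otherwise bootstrap with one fewer step remaining, as in standard
    # model-free Q-learning.
    return max(q_net(next_state, a, goal, horizon - 1) for a in actions)
```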

One standard way to solve model-based RL is to search for a sequence of states and actions (s0, a0, s1, a1, ...) that is feasible (agrees with the dynamics) and maximizes the reward, and then take the first action from that sequence. Using TDMs, we can instead search for the coarser sequence (s0, a0, sK, aK, s2K, a2K, ...) that is feasible and maximizes reward. The feasibility requirement is expressed by constraints of the form Q(s0, a0, sK, K) = 0, and similarly Q(sK, aK, s2K, K) = 0 for each subsequent segment.
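A sketch of that planning procedure (illustrative only; a real implementation would use a proper optimizer rather than enumerating candidate plans):

```python
import numpy as np

def plan_with_tdm(s0, q_net, reward_fn, candidate_plans, K, tol=1e-2):
    # Each candidate plan is (actions, subgoals) with subgoals spaced K steps
    # apart; keep only plans the TDM says are feasible (Q close to 0 for each
    # segment), and pick the feasible plan with the highest reward.
    best_plan, best_return = None, -np.inf
    for actions, subgoals in candidate_plans:
        starts = [s0] + subgoals[:-1]
        feasible = all(q_net(s, a, g, K) > -tol
                       for s, a, g in zip(starts, actions, subgoals))
        if feasible:
            ret = sum(reward_fn(g) for g in subgoals)
            if ret > best_return:
                best_plan, best_return = actions, ret
    return best_plan
```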

My opinion: Firstly, the blog post is very readable and provides a great introduction (it's much more friendly than my summary).

This technique does require that we can reinterpret any state as a goal state, similar to the assumption in Hindsight Experience Replay (HER). They do compare to HER, and find that HER doesn't do very well, which I was quite surprised by. Clicking through to the paper, it turns out the authors were surprised as well, but then realized that this is because HER is designed to work with sparse reward problems, whereas they were evaluating on problems with relatively shaped rewards.

Towards Symbolic Reinforcement Learning with Common Sense (Artur d'Avila Garcez et al)

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review (Sergey Levine)

Deep learning

MLPerf: From their overview: "The MLPerf effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance for both training and inference from mobile devices to cloud services." They have a track to measure the performance of hardware and software systems that support ML models, as well as a track that aims to advance the state-of-the-art in ML models. They consider a broad set of problems (though it seems like they are all problems where some deep learning technique is state-of-the-art).

The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation (Mia Xu Chen, Orhan Firat, Ankur Bapna et al)

Machine learning

On the Convergence of Adam and Beyond (Sashank J. Reddi et al)

News

RAISE releases prereq modules and is looking for high level feedback: RAISE, in collaboration with Erik Istre and Trent Fowler, is developing a curriculum of prerequisites for AI safety (topics like logic and probability). The first topic (logic) is available here. They are also looking for AI safety experts to provide high-level feedback and guidance on the course structure for the MOOC they are building.

Facebook Open Sources ELF OpenGo (Yuandong Tian et al): Facebook has created an open-source AI bot that has beaten world champion professional Go players in matches where the professional player was allowed unlimited time to think.
