[AN #92]: Learning good representations with contrastive predictive coding

by rohinmshah Alignment Newsletter9 min read25th Mar 20201 comment


Ω 11

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Newsletter #92

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Representation Learning with Contrastive Predictive Coding (Aaron van den Oord et al) (summarized by Rohin): This paper from 2018 proposed Contrastive Predictive Coding (CPC): a method of unsupervised learning that has been quite successful. At its core it is quite simple: it simply combines the ideas of predictive coding and contrastive losses, both of which have been significantly studied in the past.

The simplest form of unsupervised learning would be data compression via generative models (as in e.g. VAEs), in which, to model the data p(x), you attempt to encode x into a latent (hidden) state z in such a way that you can then recover the original data point x from z. Intuitively, we want z to have high mutual information with x.

For sequential data in a partially observed setting, you need to deal with the full sequence. Consider natural language: in this setting, each x would be a single word. Consider the sentence "I sat on the chair". If the z corresponding to the word "the" only has to reconstruct the word "the", it's not going to "remember" that the past context involved sitting, and so that z would be terrible at predicting that the next word will be chair. To fix this, we can use predictive coding, where we instead require that we can predict future words using z. This now incentivizes z_t to have high mutual information with x_{t+k}.

There is still a problem: reconstructing the entire input x would require a lot of irrelevant information, such as e.g. the background color of the environment in RL, even if that never changes. How can we get rid of these irrelevant features? Contrastive losses allow us to do this: intuitively, since the irrelevant features are the ones that are common across all the xs (and so are fully captured by p(x) ), if we train the neural net to distinguish between various xs, we can incentivize only the relevant features. In particular, given a latent state z_t, we take the true x_{t+k}, and throw in a bunch of other xs sampled from p(x) (known as negative samples), and train the network to correctly classify x_{t+k}. The authors show that the optimum of this loss function is indeed for the neural net to compute p(x | z) / p(x), which implies that it is maximizing a lower bound on the mutual information between X and Z.

This gives us a pretty simple overall algorithm. Take a sequence x_1 ... x_T, compute z_t using a recurrent model on x_1 ... x_t, put x_{t+k} and some negative samples into a set, and train a classifier to correctly predict which of the samples is the true x_{t+k}. In practice, we do batches of these at the same time, and for every data point in the batch we use all of the other data points as our negative examples. The features you learn are then the ones that help distinguish between x_{t+k} and the negative samples, and you'll ignore any features that are common across all the samples. This means that the results depend quite a lot on how you choose your samples (this effectively determines what p(x) you are using).

The authors evaluate their algorithm on several domains and show that it achieves or surpasses state of the art on them.

Rohin's opinion: I like this paper: the intuition makes sense, the math is straightforward, and the empirical results are strong, and have continued to be strong when looking at later work that builds on it.

On Variational Bounds of Mutual Information (Ben Poole et al) (summarized by Rohin): This paper is a pretty dense and technical explanation of various ways in which we can estimate and/or optimize the mutual information between two variables. I specifically want to highlight that it provides a proof that the Contrastive Predictive Coding objective (summarized above) is a lower bound on the mutual information between the input and the representation, and compares it to other lower bounds on mutual information.



An Analytic Perspective on AI Alignment (Daniel Filan) (summarized by Asya): In this post, Daniel Filan presents an analytic perspective on how to do useful AI alignment research. His take is that in a world with powerful AGI systems similar to neural networks, it may be sufficient to be able to detect whether a system would cause bad outcomes before you deploy it on real-world systems with unknown distributions. To this end, he advocates for work on transparency that gives mechanistic understandings (AN #15) of the systems in question, combined with foundational research that allows us to reason about the safety of the produced understandings.

Rohin's opinion: My broad take is that I agree that analyzing neural nets is useful and more work should go into it, but I broadly disagree that this leads to reduced x-risk by increasing the likelihood that developers can look at their trained model, determine whether it is dangerous by understanding it mechanistically, and decide whether to deploy it, in a "zero-shot" way. The key difficulty here is the mechanistic transparency, which seems like far too strong a property for us to aim for: I would expect the cost of making a neural network mechanistically transparent to far exceed the cost of training that neural network in the first place, and so it would be hard to get developers to mechanistically understand trained models to detect danger.

Right now for e.g. image classifiers, some people on OpenAI's Clarity team have spent multiple years understanding a single image classifier, which is orders of magnitude more expensive than training the classifier. My guess is that this will become superlinearly harder as models get bigger (and especially as models become superhuman), and so it seems quite unlikely that we could have mechanistic transparency for very complex AGI systems built out of neural nets. More details in this comment. Note that Daniel agrees that it is an open question whether this sort of mechanistic transparency is possible, and thinks that we don't have much evidence yet that it isn't.


The Conditional Entropy Bottleneck (Ian Fischer) (summarized by Rohin): While I've categorized this paper under robustness because it can apply to most forms of training, I'll talk about it specifically in the context of unsupervised learning (and in particular its relation to Contrastive Predictive Coding (CPC), summarized in the highlights).

One potential problem with deep learning is that there might be too much information in the input, causing the model to learn spurious correlations that do not actually generalize well (see Causal Confusion in Imitation Learning (AN #79) as an example). The idea with the Conditional Entropy Bottleneck (CEB) is to penalize the model for learning irrelevant information, using a form of information bottleneck.

We consider a setting where we want to learn a representation Z of some input data X in order to predict some downstream data Y. In CPC, X would be the inputs from time 1 to t, Z would be the latent representation z_t, and Y would be the future data x_{t+k}. Then, we want Z to capture the minimum necessary information needed for Z to predict Y as best as possible. The necessary information is I(Y; Z), that is, the mutual information between Z and Y: we want to maximize this to maximize our accuracy at predicting Y. Since Y depends on X, and Z is computed from X, any information about Y must come through mutual information between X and Z. Maximizing just this I(Y; Z) term gives us Contrastive Predictive Coding.

However, we don't want to capture any extra irrelevant information (the minimality criterion), which means that Z shouldn't capture any more information about X beyond what it captured to maximize I(Y; Z). In information-theoretic terms, we want to minimize I(X; Z | Y). Thus, we have the CEB objective: minimizing I(X; Z | Y) - γ I(Y; Z), where γ is a hyperparameter controlling the tradeoff between the two terms. The authors then use some fairly straightforward math to reduce the objective to simpler terms which can be bounded using variational approximations, leading to an algorithm that can work in practice.

The authors perform experiments on Fashion MNIST and CIFAR10 (where Y corresponds to the labels for the images, so we're in the supervised learning setting). Since the main benefit of CEB is to remove unnecessary information from the model, they evaluate adversarial robustness and out-of-distribution detection in addition to standard performance checks. They find that models trained with CEB perform better than ones trained with a variational information bottleneck, or ones trained with vanilla SGD.

Rohin's opinion: While I'm not sure to what extent models learn truly irrelevant information (see Adversarial Examples Are Not Bugs, They Are Features (AN #62)), it seems good to add an incentive against learning information that won't be useful for a downstream task, and the empirical results (especially of the next paper) suggest that it is providing some benefit.

CEB Improves Model Robustness (Ian Fischer et al) (summarized by Rohin): This empirical paper finds that ImageNet classifiers trained with the CEB objective (summarized above) are already somewhat adversarially robust, without having any decrease in accuracy, and without any adversarial training. Notably, since CEB does not rely on knowing the attack method ahead of time, its adversarial robustness generalizes to multiple kinds of attacks, whereas models that were adversarially trained tend to be fragile in the face of previously unseen attacks.



Illuminating Generalization in Deep Reinforcement Learning through Procedural Level Generation (Niels Justesen et al) (summarized by Zach): Deep reinforcement learning has been able to use high-dimensional input, such as images, to learn optimal policies. However, when neural networks are trained in a fixed environment, such as on a single level in a video game, they will usually over-fit and fail to generalize to new levels. This paper uses procedurally generated levels during training in an attempt to increase the generality of deep RL. They make use of the General Video Game AI framework (GVG-AI) which allows rapid design of video games through the specification of rewards, objects, etc. Moreover, they introduce Progressive PCG (PPCG) to smoothly control the difficulty of generated levels to build a curriculum for the agent. The authors show that for some games procedural level generation enables generalization to new levels within the same distribution.

Zach's opinion: The GVG-AI framework seems like a useful tool to explore learning videogames. Setting up curriculum learning by using PPCG is also a clever idea. However, the results are a bit mixed. On two of the games they tested, training on a single difficult level works better than training on a variety of levels for generalization. Having said this, the method can learn the game Frogs (57% win rate) while DQN/A2C make zero progress even after 40 million steps. It seems as though certain conditions make PPCG a good method to use. It'd be interesting to investigate what those conditions are in a future publication.


SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems (Beidi Chen et al) (summarized by Asya): This paper presents an algorithmic technique called SLIDE (Sub-LInear Deep learning Engine) which takes advantage of sparsity in inputs and activations to speed up the training of large neural networks.

Suppose that activations at layer k are a_k. Then, the ith element of a_{k+1} is given by the dot product of a_k and w_i for some weight vector w_i. Call w_i the ith neuron of layer k + 1. The largest activations in a_{k+1} are the ones for whom w_i has high magnitude and points in the same direction as a_k. The core proposal of SLIDE is to only compute the largest elements of a_{k+1}, which they call the “activated neurons”, and approximate all of the others are zero, allowing us to avoid a lot of computation.

In order to do this, we maintain a data structure called a locality-sensitive hash table, which when given an activation a_k can tell us which neurons (w_is) are most similar. We can then compute the outputs for just those neurons to get a_{k+1}. In this way, we can effectively ‘sparsify’ the network, calculating the activations and updating the weights of only a small subset of the neurons. This is what gives us our computational gains.

SLIDE randomly initializes weights in the network and generates the locality-sensitive hash table that maps activations to activated neurons. To take a gradient step on an input, it calculates the activated neurons in a forward pass, then backpropagates through the activated neurons, and then updates the locality-sensitive hash table. The hash table update is computationally expensive, and SLIDE uses several mechanisms to make it less costly, such as updating hash tables less frequently later in the training process since gradients are likely to change less then. Due to the sparsity, the gradients for different inputs are often changing different neurons, and so SLIDE asynchronously parallelizes gradient updates without worrying about race conditions, allowing for much better scaling with additional cores.

The paper evaluates SLIDE on large multi-label classification tasks, which must run on neural networks with extremely wide final layers. It finds that the CPUs running SLIDE are 1.8 times faster in clock-time than the GPU on the Delicious 200k dataset, and 2.7 times faster than the GPU on the Amazon-670K dataset, with an additional ~1.3x speed-up after performing cache optimization on SLIDE. Scalability tests suggest that the SLIDE CPUs beat GPU performance even when using only 8 cores. The paper claims that SLIDE’s computational benefits come because the number of neurons sampled in the wide final layer is extremely small-- fewer than 0.5% of active neurons.

Asya's opinion: The tasks they test on are extremely sparse: since there are hundreds of thousands of possible labels, even if you take the top ~thousand predictions in the final layer (which corresponds to most of the computation), that’s only 1% of the total number of predictions, saving you 99% of the arithmetic you would have had to do. The input features are also very sparse: in both datasets, less than 0.06% (yes, percent) of features are non-zero. It’s cool that under such conditions you can design an algorithm that is ~an order of magnitude better on cost, but it’s not going to be “the death of NVIDIA” or anything like that — without further optimizations, SLIDE will be worse than regular Tensorflow on GPU for something like ImageNet.

I'm also not sure I agree with the 'thesis' of the paper that smart algorithms beat hardware acceleration-- it seems to me like there are large gains from investing in the combination of the two. Even if GPUs aren't optimized to run SLIDE, I can imagine specialized hardware optimized for SLIDE creating even bigger performance gains.

Linear Mode Connectivity and the Lottery Ticket Hypothesis (Jonathan Frankle et al) (summarized by Flo): Instability analysis looks at how sensitive neural network training is to noise in SGD. A network is called stable if the test error remains approximately constant along the line connecting network weights obtained by training on differently ordered data.

The authors find that most popular networks in image classification are unstable at initialization for more challenging tasks but become stable long before convergence. They also find that winning tickets (AN #77) found by iterative magnitude pruning are usually stable, while unstable subnetworks don't manage to match the original network's performance after training. As the original network, pruned subnetworks become more stable when they are initialized with weights from later stages of the training process. This is consistent with previous results showing that resetting subnetwork weights to states in early training leads to increased performance after retraining, compared to resetting to the initial state. While stability seems to correspond to better accuracy for subnetworks, very sparse subnetworks perform worse than the unpruned network, even if they are stable.

Flo's opinion: The correspondence between subnetwork stability and performance after retraining might just be an artefact of both (somewhat obviously) improving with more training. What is interesting is that small amounts of training seem to have disproportionate effects for both factors, although one should keep in mind that the same is true for the loss, at least in absolute terms.


Careers at the Joint AI Center (summarized by Rohin) (H/T Jon Rodriguez): The Joint AI Center is searching for ML experts for a variety of roles.

Rohin's opinion: You might be wondering why I've included these jobs in the newsletter, given that I don't do very many promotions. I think that it is reasonably likely that the US government (and the military in particular) will be a key player in the future of AI, and that there could be a lot to learn from their testing, evaluation, validation & verification (TEV&V) framework (which often seems more risk-averse to me than many alignment schemes are). As a result, I would be excited if readers of this newsletter interested in how the military thinks about AI filled these positions: it seems great to have a flow of ideas between the two communities (so that the government learns about alignment concerns, and so that we learn about TEV&V).


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.


Ω 11