A possible training procedure for human-imitators

jessicata

Some proposed ways to train a system to imitate a human imitate a human involve having one system imitate a human, while another system tries to tell the first system apart from an actual human. If the first system can get classified as human by the second system, then (one might think) it is imitating a human well, as long as the second system is sufficiently smart.

I described a possible problem with these approaches in a LessWrong thread:

Here's a scenario that doesn't seem completely implausible. Suppose Bob is someone whose public key is easily available on the internet. The first AI will read things on the internet and output a message. Some of the message will get put on the public internet. Bob suspects that the AI might have secretly sent him a message (e.g. giving him advice on which stocks to buy). So he tries using his private key to decrypt some of the AI's output (e.g. the lowest-order bits in some images the AI has output).

Knowing that Bob (or someone else like Bob) will likely do this, the AI (which has access to Bob's public key, due to reading the internet) encrypts a message with Bob's public key, and puts the ciphertext in the lowest-order bits of some image. The ciphertext looks like random noise to the second AI. Maybe the plaintext contains both actual stock tips and some dangerous messages that trick Bob into running an unrestrained copy of the first AI (I'm assuming the AI is good at psychological manipulation).

As Paul says later in the thread, the underlying problem is that it is easy for the first AI to change its output in an important way without the second AI noticing.

Here's a proposed way of implementing human-imitation while avoiding this problem. The human-imitator is a single AI that is given a prefix of a string produced by a human and predicts the next bit. It is rewarded for predicting the next bit accurately (using some proper scoring rule). We can use this system to imitate a human by sampling from its implied probability distribution over strings, bit by bit. If the system is very good at predicting the next bit in a human-produced string, then this implied probability distribution will be an accurate prediction of the human string.

Unfortunately, predicting the string bit-by-bit might be computationally harder than producing a string that is hard to tell apart from what a human would produce. Here's an extension that makes the problem slightly easier.

We want some way to represent a distribution over strings. One way of representing a distribution over strings S is to introduce some auxiliary data A and function $f : A \to S$ , and represent a distribution $p$ over A. For example, if the model for generating strings is a PCFG, then the auxiliary data A is the actual parse tree. We could assume that $p$ is represented by a bit-by-bit predictor (which takes a prefix of the data A and returns a probability for the next bit). In the PCFG example, this is easy; just order the data A in a "generative" order, and these bit-by-bit predictions will be easy.

Given the actual string S, it is possible to show a lower bound on this string's marginal probability under $p$ by importance sampling possible parse trees consistent with a given string. That is, the system will represent some distribution $q$ over A (after seeing the string S), which only outputs A values for which $f (A) = S$ . This sampler can be used to estimate the total probability under $p$ of $A$ values for which $f (A) = S$ .

Note that this is only one method of estimating the probability. Another method is model-counting using hash functions.

So in order to predict a human-produced string, the AI:

outputs a distribution $p$ over A, as a bit-by-bit generative model
sees the human-produced string S
outputs the distribution $q$ over A, as a bit-by-bit generative model
gets a score according to the estimated log-probability of the string S according to $p$ , using $q$ for importance sampling

We could represent a bit-by-bit generative model as e.g. a split copy of the AI that outputs probabilities for the next bit given previous bits. Note that steps 2-4 only happen during training; during testing, you only need to run step 1 and then run the function $f$ on the result.

This kind of method seems computationally easier than directly predicting a human-produced string bit-by-bit, but it might still be harder than the original adversarial proposal.

This looks a lot like a variational autoencoder, unless I'm missing some distinction (Note that once you have a bit-by-bit predictor, you can assume WLOG that the distribution on A is uniform). Thinking about how variational autoencoders work in the superintelligent case seems worthwhile if we want to scale up something like imitation. I also wouldn't be that surprised if it produced some practically useful insight.

(Incidentally, I think that variational autoencoders and generative adversarial nets are the leading approaches to generative modeling right now.)

I agree that the steganography problem looks kind of bad for adversarial methods in the limit. For coping with the analogous problem with approval-maximization, I think the best bet is to try to make the generative model state transparent to the discriminative model. But this is obviously not going to work for generative adversarial models, since access to the generator state would make distinguishing trivial.

Actually I'm not sure exactly what you mean by importance sampling here.

The variational lower bound would be to draw samples from $q$ and compute $l o g (p / q)$ . The log probability of the output under $p$ is bounded by the expectation of this quantity (with equality iff $q$ is the correct conditional distribution over $A$ ).

I'm just going to work with this in my other comments, I assume it amounts to the same thing.

What I mean is: compute ${^E}_{q} [[f (A) = x] p (A) / q (A)]$ , which is a probabilistic lower bound on $P_{p} (f (A) = x)$ .

The variational score gives you a somewhat worse lower bound if $q$ is different from $p (A | f (A) = x)$ . Due to Jensen's inequality, $E_{q} [log ([f (A) = x] p (A) / q (A))] \leq log E_{q} [[f (A) = x] p (A) / q (A)] \leq log P_{p} (f (A) = x)$

It probably doesn't make a huge difference either way.

It would be great to see an analysis of this from a complexity-theoretic / cryptographic perspective. Are there distributions that can't be imitated correctly in this way, even when they should be within the power of your model? Are there distributions where you get potentially problematic behavior like in the steganography case?

(That's also surely of interest to the mainstream ML community given the recent prominence of variational autoencoders, so it seems quite likely someone has done it.)

After that there is a more subtle question about learnability, but it would be good to start with the easy part.