Right now in your code, you only calculate reconstruction error gradients for the very last step.
if random.random() > delta:
    loss = loss + (probs * err).sum(dim=1).mean()
    break
Pragmatically, it is more efficient to calculate reconstruction error gradients at every step and just weight by the probability of being the final image:
loss = loss + (1 - delta) * (probs * err).sum(dim=1).mean()
if random.random() > delta:
    break
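To see how this slots into a full training step, here is a minimal sketch of the whole iterative loop. The model interface (returning K candidate outputs), the sigma2 temperature from the selection scheme in the post below, and the tensor shapes are my assumptions for illustration, not the actual code:

import random
import torch

def train_step(model, target, delta, sigma2, max_depth=64):
    # Hypothetical interface: model(x) maps images (B, C, H, W) to K
    # candidate outputs (B, K, C, H, W). Names and shapes are assumptions.
    x = torch.zeros_like(target)                 # start from a blank input
    loss = torch.zeros((), device=target.device)
    for _ in range(max_depth):
        outs = model(x)
        err = ((outs - target.unsqueeze(1)) ** 2).flatten(2).mean(2)  # (B, K)
        probs = torch.softmax(-err.detach() / (2 * sigma2), dim=1)
        # Weight every step's loss by (1 - delta), the probability that
        # this step turns out to be the final one.
        loss = loss + (1 - delta) * (probs * err).sum(dim=1).mean()
        if random.random() > delta:
            break
        # Sample a candidate and feed it back in, detached, for refinement.
        idx = torch.multinomial(probs, 1).squeeze(1)                  # (B,)
        x = outs[torch.arange(outs.size(0)), idx].detach()
    return loss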
I think you should choose the variance $\sigma^2$ in the selection probabilities so that it matches the sample variance over the batch between the closest choice and the target. This is because a good model should match both the mean and the variance of the ground truth. The ground truth is that, when you encode an image, you choose the output that has the least reconstruction error. The probabilities $p_k$ can be interpreted as conditional probabilities that you chose the right output $x_k$ for the encoding, where each $x_k$ has a Gaussian prior for being the "right" encoding with mean $x_k$ and variance $\sigma^2$. The variance of the prior for the $x_k$ that is actually chosen should match the variance it sees in the real world. Hence, my recommendation for $\sigma^2$.
(You should weight the MSE loss by $\frac{1}{2\sigma^2}$ as well.)
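A sketch of that recommendation, under the same assumed interface as above: estimate $\sigma^2$ from the batch each step, treating the closest candidate's residual as roughly mean-zero. The helper below is hypothetical, not from the post or the paper:

import torch

def selection_probs_and_loss(outs, target):
    # outs: (B, K, C, H, W) candidates; target: (B, C, H, W). Assumed shapes.
    err = ((outs - target.unsqueeze(1)) ** 2).flatten(2).mean(2)   # (B, K) MSE
    # Use the closest candidate's mean squared residual as sigma^2, i.e. match
    # the prior's variance to the variance actually seen over the batch
    # (this assumes the residual is roughly mean-zero).
    sigma2 = err.min(dim=1).values.mean().detach()
    probs = torch.softmax(-err.detach() / (2 * sigma2), dim=1)
    # Weight the MSE term by 1/(2 sigma^2), per the recommendation above.
    loss = (probs * err).sum(dim=1).mean() / (2 * sigma2)
    return probs, loss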
Motivation for this post: Discrete Distribution Networks (Lei Yang, ICLR 2025).
Generative models aim to reproduce a real-world distribution given many training examples. One way to do this is to train the neural network so that the least amount of information is needed to reconstruct the training examples. For example, in a generative text model using next-token prediction, the only information needed is, "which token is next?". If the model outputs a distribution $q$, while the correct distribution is $p$, the number of bits needed to identify the correct next token is
$$H(p, q) = -\sum_x p(x) \log_2 q(x),$$
the cross-entropy loss. While it is possible to train an autoregressive model for images, scanning pixel-by-pixel and line-by-line, next-token prediction is inherently flawed. Sometimes the right choice of the next token depends on tokens that come after it, which means the model must implicitly predict not just the next token but all the ones after it as well, while only being trained on the next one. Text happens to be written mostly one-dimensionally and is relatively small, so reinforcement learning can compensate for these flaws. However, images are much larger and intrinsically two-dimensional, so another approach is needed.
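As a quick concrete check of the formula, a few lines of Python (my illustration, not from the post):

import math

p = [0.0, 1.0, 0.0]          # ground truth: token 1 is the correct next token
q = [0.1, 0.7, 0.2]          # model's predicted distribution
# Cross-entropy H(p, q) = -sum_x p(x) log2 q(x): bits to identify the right token.
bits = -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)
print(bits)                   # -log2(0.7) ≈ 0.515 bits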
This cannot be done by directly outputting probabilities for every possible image. Even a simple black-and-white MNIST image has $2^{28 \times 28} = 2^{784}$ possibilities. Besides, the image on the screen is only an approximation to the image captured in the real world, so ignoring quantum effects, images should be continuous. The most common approach to modeling continuous distributions is to train a reversible model $f$ that maps it to another continuous distribution $q$ that is already known. The original image $x$ can be recovered by pointing to its mapped value, as well as the reverse path, so the bits needed are
$$-\log_2 p(x) = -\log_2 q(f(x)) - \log_2 \left|\det \frac{\partial f(x)}{\partial x}\right|.$$
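As a toy instance of this formula (my own example): a one-dimensional affine flow onto a standard normal, where the Jacobian term is just $-\log_2 s$.

import math

def bits_under_affine_flow(x, mu, s):
    # Flow f(x) = (x - mu) / s maps data onto a standard normal q.
    z = (x - mu) / s
    log2_q = -0.5 * z * z / math.log(2) - 0.5 * math.log2(2 * math.pi)
    log2_det = math.log2(1.0 / s)       # |det df/dx| = 1/s in one dimension
    return -(log2_q + log2_det)         # -log2 p(x): bits to encode x

print(bits_under_affine_flow(1.0, mu=0.0, s=2.0))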
This technique is known as normalizing flows, as usually a normal distribution is chosen for the known distribution. The second term can be a little hard to compute, so diffusion models approximate it by using a stochastic differential equation for the mapping. When $f_t(x)$ is a solution to an ordinary differential equation,
$$\frac{d}{dt} f_t(x) = v_t(f_t(x)),$$
then
$$\frac{d}{dt} \log \left|\det \frac{\partial f_t(x)}{\partial x}\right| = (\nabla \cdot v_t)(f_t(x)).$$
Switching to a stochastic differential equation,
$$dX_t = v_t(X_t)\,dt + \sigma_t\,dW_t,$$
and tracking the difference $\epsilon_t = X_t - f_t(x)$ between the noisy and noiseless paths, the mean-squared error approximately satisfies
$$\frac{d}{dt}\,\mathbb{E}\!\left[\|\epsilon_t\|^2\right] \approx 2\,\mathbb{E}\!\left[\epsilon_t^\top \nabla v_t(X_t)\,\epsilon_t\right] + \sigma_t^2\, d,$$
which is close to Hutchinson's estimator for the divergence $\nabla \cdot v_t$, but weighted a little strangely.
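For reference, here is a standard Hutchinson divergence estimator in PyTorch (a generic sketch, not code from the post): it estimates $\nabla \cdot v = \mathrm{tr}(\nabla v)$ as $\mathbb{E}\!\left[\epsilon^\top (\nabla v)\,\epsilon\right]$ for noise $\epsilon$ with identity covariance, which is the unweighted version of the expression above.

import torch

def hutchinson_divergence(v, x, n_samples=8):
    # Estimates div v(x) = trace(Jacobian of v at x) without forming the Jacobian.
    x = x.requires_grad_(True)
    out = v(x)
    est = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(x)                       # E[eps eps^T] = I
        # eps^T J eps via a vector-Jacobian product.
        (vjp,) = torch.autograd.grad(out, x, grad_outputs=eps,
                                     retain_graph=True)
        est = est + (eps * vjp).sum(dim=-1)
    return est / n_samples

# Example: v(x) = A x has divergence trace(A).
A = torch.randn(5, 5)
x = torch.randn(1, 5)
print(hutchinson_divergence(lambda y: y @ A.T, x), torch.trace(A))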
Flow models are pretty good, but the continuity assumption creates its own problems. Some features of images or videos, such as the number of fingers or the number of dogs, are discrete. While a flow model can push most of its outputs towards correct, discrete values, sometimes it will have to interpolate between them, generating 4.5 fingers or 2.5 dogs. This motivates the need for discrete distribution networks.
In Lei Yang's work of the same name, this is achieved with a fractal structure. A model is trained to output several slightly different images from the one it is fed. Each iteration, the output closest to the target image is chosen to be fed back into the model for more, hopefully similar, images. An initially blank input should slowly become the target. To train the model, the chosen image in each iteration is updated towards the target. Since there are a finite number of images at each level, they will specialize into different parts of the target distribution. If they perfectly divvy it up, a sample image can be generated by randomly choosing output images, which, after enough iterations, is unlikely to be an image the model was ever specifically trained on.
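A compressed sketch of that training procedure as I understand it (the interface names are my own, and this simplifies away the paper's architectural details):

import torch

def ddn_train_step(model, target, optimizer, depth=8):
    # Hypothetical interface: model(x) returns K candidate images (B, K, C, H, W).
    x = torch.zeros_like(target)
    loss = torch.zeros((), device=target.device)
    for _ in range(depth):
        outs = model(x)
        err = ((outs - target.unsqueeze(1)) ** 2).flatten(2).mean(2)   # (B, K)
        best = err.argmin(dim=1)                 # closest output per batch image
        # Only the chosen output is pulled toward the target...
        loss = loss + err[torch.arange(err.size(0)), best].mean()
        # ...and fed back in, detached, as the next level's input.
        x = outs[torch.arange(outs.size(0)), best].detach()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()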
Unfortunately, many outputs, even at the top level, end up "dead" in training. They are never seen, so never updated, and very far from the target distribution. The issue is with always picking the most similar image; sometimes it is alright to pick a similar image, even if it is not the most similar. Although not mentioned in Yang's paper, we can instead select image $k$ with probability
$$p_k \propto \exp\!\left(-\frac{\mathrm{err}_k}{2\sigma^2}\right),$$
where $\mathrm{err}_k$ is the reconstruction error of output $k$, and increase $\frac{1}{2\sigma^2}$ over time. The information needed to construct a target image at a given iteration is some error-correction bits, as well as the path taken. When generating, we will sample uniformly over the $K$ outputs, which requires
$$\log_2 K$$
bits per iteration to describe. This gives the loss
$$\mathcal{L} = \sum_k p_k\,\mathrm{err}_k$$
per iteration.
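In code, with err shaped (B, K) as before, this selection rule and loss are just a softmax over negative errors (a sketch; sigma2 and its schedule are knobs I am assuming, not values from the paper):

import torch

def step_loss(err, sigma2):
    # err: (B, K) reconstruction errors; smaller is better.
    # Softer than argmin: every output near the best one gets some traffic,
    # so fewer outputs go "dead" from never being selected.
    probs = torch.softmax(-err.detach() / (2 * sigma2), dim=1)
    # Expected error-correction bits under the selection distribution.
    # (The log2 K path bits are constant and contribute no gradient.)
    return (probs * err).sum(dim=1).mean()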
If we want an infinite-depth model, we can choose to sometimes halt, but usually sample another image, with probability $\delta$ (for 'discount factor'). Also, as the depth increases, the images should become more similar to each other, so $\frac{1}{2\sigma^2}$ should increase exponentially to compensate. Empirically, I found this to give decent results.
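Putting the infinite-depth generation procedure together (again a sketch with assumed names; the halting rule mirrors the random.random() > delta check in the training code above):

import random
import torch

@torch.no_grad()
def generate(model, shape, delta, device="cpu"):
    # Start blank; at each level pick one of the K outputs uniformly at random
    # (log2 K bits per level), then continue with probability delta.
    x = torch.zeros(shape, device=device)
    while True:
        outs = model(x.unsqueeze(0)).squeeze(0)      # (K, C, H, W)
        x = outs[random.randrange(outs.size(0))]     # uniform choice of output
        if random.random() > delta:                  # halt with probability 1 - delta
            return x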