DaemonicSigil

So we're here today to discuss this post on an improvement to diffuser networks: https://www.lesswrong.com/posts/6Cdusp5xzrHEoHz9n/faster-latent-diffusion

Just as some general background for readers, there are various kinds of model one can train to imitate some empirical (i.e. you have a lot of examples) distribution. The main ones are autoencoders, GANs, and diffusers. (At least among architectures general enough to handle images. For text there's also the option of just predicting a probability distribution over the next token, since text has a very linear causal structure to it, and the set of tokens is not too large. IIRC people have tried this for images too by flattening them into a string of tokens, and it works surprisingly okay.) So far it seems like diffusers give the best results, but they're also slow, because to generate an image from a diffuser you have to run the image through a denoising network many times, adding back a smaller amount of noise at each step.

The distribution of images is clustered and clumped together in a very complicated hierarchical way. So if we're the denoising network and we're given an image with a small amount of noise to correct, we'll be able to guess which cluster it came from pretty well. If the image instead has a large amount of noise, then it has travelled ("diffused") pretty far through space and there are a lot of different clusters it could have come from. We can't say for sure which cluster is the correct one, so we just output an estimate that is the average of the centers of all the plausible clusters, so our loss isn't too big in the case of any one outcome.

That's why diffusers need so many steps. They need to gradually anneal down from high noise levels to low noise levels, and at each step they're able to become a little more decisive about which clusters and sub-clusters the image is in. GANs don't have this problem because they can just generate an entire image in one shot, but they have other drawbacks such as being a huge pain to train and suffering from mode collapse.
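To make that annealing loop concrete, here's a minimal sketch of the process (a toy sampler, not any particular paper's scheme; it assumes a trained denoiser `model(x, sigma)` that predicts the clean image):

```python
import torch

def sample(model, shape, sigmas):
    """Toy diffusion sampling loop: repeatedly denoise, then add back a
    smaller amount of Gaussian noise, annealing sigma toward zero."""
    x = torch.randn(shape) * sigmas[0]                  # start from pure noise
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x_hat = model(x, sigma_cur)                     # network's guess at the clean image
        x = x_hat + torch.randn(shape) * sigma_next     # re-noise to the next, lower level
    return model(x, sigmas[-1])                         # final denoise at the lowest noise level
```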

You suggested that I should take a look at Consistency Models as background to your idea. Rather than training a network to predict the average denoised image directly, we train it to be consistent in its predictions for different amounts of noise. The following loss function is Eq. 10 in the paper (the consistency training objective):

$$\mathcal{L}(\theta, \theta^-) = \mathbb{E}\Big[\lambda(t_n)\, d\big(f_\theta(x + t_{n+1} z,\ t_{n+1}),\ f_{\theta^-}(x + t_n z,\ t_n)\big)\Big]$$

where $x$ is a training image, $z \sim \mathcal{N}(0, I)$, $t_n < t_{n+1}$ are adjacent noise levels, $f_\theta$ is the consistency model, $\theta^-$ is an exponential moving average of the weights $\theta$, and $d$ is a distance metric.

The network is architecturally constrained to predict an output equal to its input when the noise level is 0, which provides a boundary condition that requires the predictions to try and match the ground truth (otherwise the network could easily win by always predicting the same constant output regardless of input). So it seems the point of this kind of loss is to require the model to decide as early as possible in the denoising process which cluster to put an image into, so that it can get points thereafter for its consistency in always predicting that cluster. Then, because the model is being decisive about where to put images, it takes fewer diffusion steps to generate a sample.
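In code, that objective amounts to something like the following (a hedged sketch using squared error for the distance $d$; `f_theta` is the consistency model and `f_ema` a frozen exponential-moving-average copy):

```python
import torch

def consistency_training_loss(f_theta, f_ema, x, t_lo, t_hi):
    """Penalize the model for changing its prediction between two adjacent
    noise levels applied to the same image with the same noise vector z."""
    z = torch.randn_like(x)
    pred = f_theta(x + t_hi * z, t_hi)          # prediction at the noisier level
    with torch.no_grad():
        target = f_ema(x + t_lo * z, t_lo)      # target from the EMA model at the less-noisy level
    return torch.mean((pred - target) ** 2)
```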

bhauth

Right, but I think it's easier to think of diffusion "sampling" output as being spiky due to the training process, and of "consistency models" (training another network to predict multiple sampling steps) as being good because they smooth out the energy landscape. (I don't think "sampling" is a good name, btw.)

Consistency models can give good results in 2 sampling steps. I think the coalescer network approach should be at least as good, and faster to train.

Also, I think the fact that you can do "diffusion" fairly well in 1-2 steps indicates that the metaphor of "diffusion" (the conceptual basis that led to that approach being chosen) isn't actually an accurate representation of what's going on, and the people who developed it just got lucky. When I first proposed coalescer networks, some people on Discord complained "but this isn't properly based on physics like diffusion is", but I think those people are rather confused. The terms "diffusion" and "sampling" come from an analogy to gas diffusion, with the metaphor being that you're sampling the velocity of individual gas molecules, and averaging out the thermal movement to get net movement from diffusion. The implication is that randomness is inherent and inevitable, but that's wrong; the success of consistency models indicates that such randomness comes from a bad training process and can be mostly eliminated.

GANs, unlike diffusion models, are trying to map the latent space region of the training dataset to the entirety of another latent space. Maybe you can see why diffusion gives better results.

DaemonicSigil

Re energy landscape smoothness: What's the precise meaning of "energy" here? Is it just a synonym for the loss landscape, or is it some potential in the space of images rather than the space of parameters?

Re middle paragraphs: I definitely agree that an architecture being physics-based doesn't imply that it's good and similarly a non-physics-based architecture isn't necessarily bad. E.g., GANs are game-theory based and they were the best thing for a long time. You've got to take your insights where you can find them, and probably all reasonable people would agree with that.

Continuing that line of thought, even if diffusers are physics, they aren't the physics one might expect them to be. Like if someone tells past-me that there's an amazing new paradigm of image generation where images are gradually formed from chaotic noisy samples that become more and more ordered, my first thought is that the chaotic images are heated-up versions of the original images. So if the probability distribution of the original images is $p(x)$, the probability distribution of the heated-up images is proportional to $p(x)^{1/T}$ for some temperature $T > 1$. But actually the distribution of the "diffused" images is the convolution $(p * g_\sigma)(x)$, where $g_\sigma$ is a Gaussian of width $\sigma$. Is there some way in which these diffused images can be thought of as instead being "heated-up" images?

The first thing that comes to mind is to suppose the observable $x$ is actually the sum of two latent vectors: $x_0$ distributed according to $p(x_0)$, and a noise term $\epsilon$ distributed according to $\mathcal{N}(0, \sigma^2 I)$. Then the distribution $p(x_0)$ is just induced by some kind of density of states, which doesn't vary with temperature, while $\epsilon$ is induced by a quadratic potential, so the noise width $\sigma$ does vary with temperature. But then $x$ is a mere observable of the system, rather than a direct configuration. So overall it's not clear how helpful this analogy would be.
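Spelling the analogy out (my own notation, building on the reconstruction above): a quadratic potential for the noise gives a Gaussian Boltzmann factor whose width grows with temperature, while the data term stays fixed, so the observable ends up convolved rather than "heated":

```latex
% Noise as a particle in a quadratic potential at temperature T:
\[
  E(\epsilon) = \tfrac{1}{2}\lVert \epsilon \rVert^2, \qquad
  p_T(\epsilon) \propto e^{-E(\epsilon)/T} = e^{-\lVert \epsilon \rVert^2 / 2T}
  \quad\Longrightarrow\quad \epsilon \sim \mathcal{N}(0,\, T I).
\]
% The data factor p(x_0) (the "density of states") is independent of T,
% so the observable x = x_0 + \epsilon has the convolved density
\[
  p_T(x) = \int p(x_0)\, \mathcal{N}\!\left(x - x_0;\, 0,\, T I\right) dx_0
         = (p * g_{\sqrt{T}})(x),
\]
% i.e. the diffused distribution, not the "heated" distribution \propto p(x)^{1/T}.
```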

Re GANs and latent spaces: If you unroll the diffusion process and concatenate all the noise vectors you added along the way, the resulting vector is a latent space vector for the diffuser. You can take the latent vectors for two separate runs, interpolate between them and it should result in a nice interpolated trajectory between the generated images as well. So in a sense diffusers can also be thought of as functions that map from a (highly redundant) latent space to the space of images.

I've seen such interpolations generated by Karpathy on youtube and they have an interesting characteristic compared to GAN interpolations. While GAN interpolations smoothly morph from one configuration to the next, diffuser interpolations look "jumpier": objects pop in and out of existence, and larger changes happen in jumps too, though less frequently. I'm guessing this is what you're referring to when calling diffusion sampling "spiky". A natural guess, though I haven't seen interpolation videos for consistency models, would be that they'd have smoother interpolations, more akin to those of a GAN.
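A sketch of the kind of latent interpolation being described, assuming a sampler like the toy one earlier but driven by a fixed list of noise vectors (the concatenated noises playing the role of the latent); the function names here are mine:

```python
import torch

def sample_from_noises(model, noises, sigmas):
    """Deterministic function of the unrolled noise latent: the same noise
    list always produces the same image, so `noises` acts as a latent code."""
    x = noises[0] * sigmas[0]
    for i in range(1, len(sigmas)):
        x_hat = model(x, sigmas[i - 1])          # predicted clean image
        x = x_hat + noises[i] * sigmas[i]        # re-noise with the *fixed* noise vector
    return model(x, sigmas[-1])

def interpolate_noises(noises_a, noises_b, t):
    # Linear interpolation of the unrolled latent; spherical interpolation
    # would better preserve the norm expected of Gaussian noise.
    return [(1 - t) * a + t * b for a, b in zip(noises_a, noises_b)]
```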

My hypothesis for why diffusers perform better than GANs seems quite different from yours, so I thought I'd share it. I have this heuristic that I jokingly dub the "min-entropy principle". It says that the output of a neural network is going to be as low-dimensional and low in entropy as you allow it to become. The only way to prevent this from happening is to specify a loss function that pushes rather hard to stop it from happening. Cases where it's relevant:

  • In a (non-variational) autoencoder, only a submanifold of the latent space is used. i.e. orthogonal combinations of latent space coordinates end up having large correlations.
  • In a GAN, the outputs of the generator have a surprisingly high tendency to mode-collapse, despite the presence of the discriminator network.
  • I think something similar happens for contrastive losses. E.g. if you try to reward a high standard deviation for each of the output coordinates, the network ends up reusing a single source of variance (or a few sources) across all of the output coordinates.

So in my view, neural networks are just really bad at generating entropy, and the triumph of diffusion networks is that at each step in the process, the neural network is responsible only for removing entropy by predicting the mean. And then when it's necessary to increase the entropy, this is done by just adding Gaussian noise, which has a very nice and reliable algorithm.

bhauth

What's the precise meaning of "energy" here

Energy is what's being minimized, in this case by the "sampling" process. An energy landscape is a scalar function over some high-dimensional space, on which gradient descent is attempted.

 

I definitely agree that an architecture being physics-based doesn't imply that it's good and similarly a non-physics-based architecture isn't necessarily bad.

Here's a metaphor for the sense in which diffusion models are physics-based. Imagine a society where good sorting algorithms haven't been developed, but computers have been. Somebody then designs a sorting algorithm based on physically stacking blocks, with all the restrictions that has in reality. Later, somebody develops Quicksort, and it's criticized for not being "physics-based". If that seems silly, well, that's my point.

 

my first thought is that the chaotic images are heated-up versions of the original images

Noise is added to the latent representation. What's the purpose of the "heat" metaphor here? Noise is noise.

 

the probability distribution of the heated-up images is proportional to $p(x)^{1/T}$ for some temperature $T > 1$

During training, diffusion models don't add a little bit of noise, they add a lot. The fully-noised distribution is homogeneous, or rather, Gaussian. The training tries to teach the sampler to go from a random point to a real datapoint. That being the case, what would you use some convolution equation for?
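For reference, the usual variance-preserving forward-noising step (standard DDPM-style bookkeeping, not specific to this discussion) shows why the fully-noised distribution is Gaussian:

```python
import torch

def noised(x0, a_bar):
    """x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps. As the cumulative
    signal fraction a_bar goes to 0 over many steps, x_t tends to N(0, I)
    regardless of x0: the homogeneous, Gaussian 'fully-noised' distribution."""
    eps = torch.randn_like(x0)
    return (a_bar ** 0.5) * x0 + ((1.0 - a_bar) ** 0.5) * eps
```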

 

the triumph of diffusion networks is that at each step in the process, the neural network is responsible only for removing entropy by predicting the mean. And then when it's necessary to increase the entropy, this is done by just adding Gaussian noise, which has a very nice and reliable algorithm.

I wouldn't say that's unique to diffusion models; denoising autoencoders are a standard technique.

DaemonicSigil

Energy is what's being minimized, in this case by the "sampling" process. An energy landscape is a scalar function over some high-dimensional space, on which gradient descent is attempted.

Okay, an energy function over the space of images that the sampling process descends, got it.

 

Noise is added to the latent representation. What's the purpose of the "heat" metaphor here?

Yeah, in that section I was just noting that while the backward diffusion process looks a lot like annealing, the probability distributions in question can't easily be written as Boltzmann distributions with a temperature parameter that decreases over time.

Moving on to Coalescers: Consistency models try to target predictions of the same model in the presence of less noise. In your case, it seems you're still trying to predict the ground truth image, but with some bias towards picking a ground truth image that has a similar latent vector to the noised image. And so this has the effect of smoothing the network's output because the direction is otherwise too influenced by far away parts of the data manifold. Am I missing anything here about where the smoothing comes from in this technique?

Incidentally, it looks like training examples would be weighted according to the distribution of description latents implied by "choose a random description embedding" rather than the raw distribution in the training dataset. Which could be useful in some cases.

bhauth

And so this has the effect of smoothing the network's output because the direction is otherwise too influenced by far away parts of the data manifold. 

The point isn't to avoid far-away targets, per se. It's consistency in what is targeted from what starting point.

DaemonicSigil

In both the case of classical diffusers and the case of coalescers, we're training on ground truth examples. Taking classical diffusers as a baseline for consistency, there must be some reason the coalescer is more consistent. Would it be correct to say that this is because the vector database always returns the closest vector, and so the coalescer has to make a discrete decision about which cluster of images to target when it's uncertain? (And hopefully it picks the same one the vector db picked, i.e. the closest one.)
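To make the "closest vector" step concrete, here's a brute-force stand-in for the vector database lookup (my own sketch, not from bhauth's implementation):

```python
import torch

def nearest_target(query, db_embeddings):
    """Return the single closest training embedding to the query point.
    Because exactly one neighbor comes back, the training target is a
    discrete choice of cluster rather than an average over plausible ones."""
    dists = torch.cdist(query.unsqueeze(0), db_embeddings)   # (1, N) pairwise distances
    return db_embeddings[dists.argmin()]
```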

Maybe related: Your direction loss is $\lVert \hat{d} - (g - r) \rVert^{1/2}$, where $g$ is the ground truth, $r$ is the randomly selected image embedding, and $\hat{d}$ is the predicted direction. Is the square root important there, or could we just as easily take the bare magnitude $\lVert \hat{d} - (g - r) \rVert$ to be our loss?

Also, what is the reason for having a randomly sampled description-embedding weighted by a hyperparameter desc_scale? Only the real description-embedding is used for training, so it seems like having this other thing sitting in the vector db query just adds variance in which vector it will spit out. How does this square with the mission of improving consistency?

bhauth

When the direction of training is more consistent for each region of the latent space, training is faster.

I added ^(1/2) for 2 reasons:

  1. Because the expected number of steps is 2. I figured a big error is worse than a small error when iterating.
  2. The direction field has boundaries where it becomes small and changes direction. The gradient for direction is smaller where the direction vector magnitude is small, yet direction is equally important regardless of magnitude; at the same time, training should be weaker near boundaries (because of conflict) and where magnitude is lower (because direction changes faster there). So the sqrt was a compromise between training low-magnitude regions too fast and having training near boundaries be outweighed by larger gradients from larger magnitudes.

But the exact details here shouldn't be critical, as long as you separate distance and direction.
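Here's how I'd read the "separate distance and direction" setup in code; the exact loss forms aren't given in this dialogue, so the formulas below (including the unnormalized direction target) are my assumptions, with the exponent as the ^(1/2) hyperparameter discussed above:

```python
import torch

def coalescer_losses(dir_pred, dist_pred, start, target, exponent=0.5):
    """Train predicted direction and distance toward the database target
    separately; the sub-unit exponent keeps large direction errors from
    completely dominating the gradient signal."""
    delta = target - start                                        # offset to the chosen target
    dir_err = torch.norm(dir_pred - delta, dim=-1)
    direction_loss = (dir_err + 1e-8).pow(exponent).mean()        # exponent=1.0 gives the bare magnitude
    distance_loss = (dist_pred - torch.norm(delta, dim=-1)).abs().mean()
    return direction_loss, distance_loss
```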

Desc_scale is there to make description_embedding more important than the random starting point.
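And purely as a guess at what the weighted query might look like (hypothetical; the post's actual query construction isn't shown here):

```python
import torch

def db_query(start_embedding, description_embedding, desc_scale=4.0):
    """Hypothetical query key: concatenate the random starting point with a
    scaled description embedding, so that nearest-neighbor distances are
    dominated by the description when desc_scale is large and ignore it
    entirely when desc_scale is 0."""
    return torch.cat([start_embedding, desc_scale * description_embedding], dim=-1)
```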

DaemonicSigil

Because the expected number of steps is 2. I figured a big error is worse than a small error when iterating.

Just checking: 2 is "a relatively small number" in this case, rather than the 2 in the exponent $1/2$?

 

Desc_scale is there to make description_embedding more important than the random starting point.

I guess my question was more along the lines of: Why is it needed at all, let alone made more important than the image embedding? Is it just to prevent the model from getting used to predicting unnaturally close images?

bhauth

The exponent doesn't have to be exactly 1/2, if that's what you're asking. It's a hyperparameter that probably depends on training conditions. That's why I said "could be" instead of "should be".

I'm not sure what you mean by "unnaturally close images". desc_scale just weights the importance of the target description relative to the random starting point. A desc_scale of 0 would be unconditional generation, but I wouldn't call that "unnatural".

DaemonicSigil

Re exponent: Got it, thanks.

If you're selecting the image in the vector DB that is closest to the starting point, then the target distance predicted by the Coalescer net is going to be lower than the average distance between embeddings, since the vector db tries to pick the closest vector. (There could be a high-dimensional geometry thing where the distances still end up being basically the same, but I think that can't happen if the training set has many more examples than the number of dimensions in the image latent space, which is usually the case.)

Also, I think a desc_scale of 0 would still learn conditional generation. The net is trained to predict a target given the ground-truth description of that target, and the correlation between description and image would surely still be present, no? On the other end of the slider, I guess that a very large desc_scale would result in training a conventional diffuser, except with direction and distance separated out from each other. Reasoning being that picking a random description and then querying the vector DB with it is pretty much just like picking a random training example to work from, like one does when training a regular diffuser.

Overall, I guess we should wrap this dialogue up soon. Maybe one last question: It sounds like you did actually implement this method on a small scale. What were the results of that? Would it be possible to share a sample of the images it generated?

bhauth

OK, setting desc_scale to 0 isn't exactly the same as unconditional generation, but the sampling model would mostly learn to target the closest valid region (i.e., a region with points from the input dataset) to the current region, instead of using the description input, because that's easier.

I'd just tried it on MNIST. The output with 1 step looked similar to the result of standard diffusion approaches, but that's not saying much; I think you'd need something bigger to get good information about output quality. I could train a larger model to some extent, though I guess I'd need to actually optimize the vector search, but even if the algorithm is more efficient than current diffusion approaches I wouldn't be able to compete with e.g. Stability AI. If there's enough interest I'll fix up my code, or I guess I can wait for someone else to reinvent it and maybe get some credit then... or maybe I still wouldn't.

DaemonicSigil

Cool, good to know that the empirical results at least somewhat check out.

Thanks for taking the time to discuss the technique with me. This dialogue definitely answered a good number of questions I had, and some that I didn't have to start out with.

bhauth

Sure. Glad you got something out of it.

Comments

Consistency Models are faster but, based on complaints on Reddit and my own experience, reduce quality and diversity. People who need fast image generation seem to have moved on to "Turbo" models, based on Adversarial Diffusion Distillation. How do coalescer models compare to that?

That approach just adds a loss from a GAN-type discriminator to the distillation loss when training the distilled sampler. Coalescer models should be able to do generation in at least as few steps as consistency models, so, if you want, you could do the same thing with coalescer models: initial training as normal, then more training that adds in a loss from a discriminator.
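In code terms, the combination being suggested is roughly the following (a hedged sketch; Adversarial Diffusion Distillation's actual objectives and weighting are more elaborate):

```python
import torch
import torch.nn.functional as F

def generator_loss(base_loss, fake_logits, adv_weight=0.5):
    """Add a non-saturating GAN generator term to the base training loss
    (a distillation loss, or a coalescer loss): the generator is additionally
    rewarded for producing outputs the discriminator scores as real."""
    adversarial = F.softplus(-fake_logits).mean()
    return base_loss + adv_weight * adversarial
```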