=neural networks =machine learning =algorithm =suggestion
abbreviations
- NN = neural network
- LS = latent space
background on diffusion NNs
Autoencoders can be trained to convert between images and an LS where
distance corresponds to image similarity. Just as most possible images are
meaningless noise, most of that LS does not correspond to meaningful images.
For a simpler explanation, consider a simplified diffusion-based image
generation model with a 2-dimensional LS and 2 image categories: cats
and dogs.
The "unconditional generation" task is to find a random point in the
image LS which is inside any meaningful region. The "conditional generation"
task is to find a meaningful point in image LS that would also be close to a
target position in a description LS.
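The toy model above can be sketched in code. This is purely illustrative: the CAT and DOG regions are modelled as discs around fixed centres (the centres, radius, and the assumption that descriptions live in the same 2-D space are all made up for the sketch, not part of the proposal).

```python
import numpy as np

# Toy 2-D latent space: two meaningful regions, CAT and DOG, modelled
# (for illustration only) as discs of fixed radius around fixed centres.
REGIONS = {"cat": np.array([-3.0, 0.0]), "dog": np.array([3.0, 0.0])}
RADIUS = 1.0

def is_meaningful(point):
    """Unconditional generation succeeds if the point lands in any region."""
    return any(np.linalg.norm(point - c) <= RADIUS for c in REGIONS.values())

def conditional_target(desc_point):
    """Conditional generation: pick the region whose centre is closest to
    the target position (here assumed to share the image LS coordinates)."""
    return min(REGIONS, key=lambda k: np.linalg.norm(REGIONS[k] - desc_point))
```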
Training a diffusion NN involves taking real image LS points and
creating a multistep path between them and random image LS points. The
diffusion NN is trained to reverse those steps.
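A minimal sketch of building one such training example, assuming a plain linear interpolation between the real point and a random point (real diffusion models use a noise schedule; the function name and step count here are illustrative):

```python
import numpy as np

def make_training_pair(real_point, num_steps=50, rng=None):
    """One step of a path between a real image-LS point and a random LS
    point; the diffusion NN is trained to reverse such steps.
    (Toy sketch: linear interpolation, not a real noise schedule.)"""
    rng = rng or np.random.default_rng()
    random_point = rng.normal(size=real_point.shape)
    t = int(rng.integers(1, num_steps + 1))  # position along the path
    alpha = t / num_steps                    # 0 = real image, 1 = pure noise
    noisy = (1 - alpha) * real_point + alpha * random_point
    # the NN's target: the path point one step closer to the real image
    prev = (t - 1) / num_steps
    target = (1 - prev) * real_point + prev * random_point
    return noisy, target, t
```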
By training a new diffusion NN to replicate multiple diffusion NN steps
(a type of distillation) it's possible to do the diffusion process in fewer
steps. That technique is used in the
SnapFusion paper, which gets
good results with just 8 steps. The number of diffusion steps can be
adjusted, but using fewer steps gives worse results.
the problem
Why are
multiple steps needed for good results? Why can't the "diffusion" be done in
a single step? I believe the problem is related to LS structure.
Consider a random point P outside both CAT and DOG, conditioned on a tag
"animals" which could plausibly lead to either region. Training will then
direct the same (or nearly-identical) input toward multiple different targets.
As a result, the diffusion NN will not provide an accurate direction
from points that are far from meaningful target areas. That makes it
necessary to use many small steps, both to "average out" diffusion NN output
and to progressively get closer to regions where diffusion NN output is more
accurate.
proposed solution
By
training a NN to produce output which is more consistent and smooth than
what diffusion NNs are trained to produce, we can reduce the above problem.
To distinguish such NNs from diffusion NNs, I propose the name "coalescer
networks".
Here is a process for training and using coalescer
networks:
setup:
1. Train
autoencoders for images and text.
2. From an image-description pair
dataset, use autoencoders to make many image-description embedding pairs.
3. Put the image-description embedding pairs in a vector database, indexed
by concatenate(desc_scale * description_embedding, image_embedding) where
the hyperparameter desc_scale > 1.
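The setup steps above can be sketched as follows, with a brute-force nearest-neighbour search standing in for a real vector database (the desc_scale value and array layout are illustrative assumptions):

```python
import numpy as np

DESC_SCALE = 4.0  # hypothetical value; the proposal only requires > 1

def build_index(desc_embeddings, image_embeddings, desc_scale=DESC_SCALE):
    """Setup step 3: one key per pair, concatenating the scaled description
    embedding with the image embedding. A real system would load these
    keys into a vector database / ANN index instead of a plain array."""
    return np.concatenate([desc_scale * desc_embeddings, image_embeddings],
                          axis=1)

def find_close_pair(keys, de, r, desc_scale=DESC_SCALE):
    """Training step 2: find the pair closest to concatenate(desc_scale*DE, R).
    Brute-force stand-in for the vector search; returns the pair's index."""
    query = np.concatenate([desc_scale * de, r])
    return int(np.argmin(np.linalg.norm(keys - query, axis=1)))
```

Because desc_scale > 1, a mismatch in description costs more than the same mismatch in image-LS position, so retrieved pairs tend to match the description first and the random point R second.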
training step:
1. Choose a
random description embedding DE and a random point R
in image LS.
2. Use vector search to find a point pair close_pair
which is close to concatenate(desc_scale * DE, R).
3. Train the coalescer NN to do: (R, close_pair.description) ->
(target_distance, target_direction), where target_distance is the distance
from R to close_pair.image, and target_direction is a vector pointing from R
to close_pair.image.
The training target for direction can
change rapidly in regions where the distance stays nearly the same. By
separating the two outputs, we can keep the output smooth by shrinking
target_direction where direction changes rapidly. The magnitude of
target_direction is then an indication of direction accuracy.
Training loss for target_distance
could be: (target_distance - magnitude(R - close_pair.image))^2.
Training loss for target_direction could be: sqrt(magnitude(target_direction
- normalize(close_pair.image - R))).
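The two losses above can be written directly in numpy (pred_distance and pred_direction stand for the coalescer NN's outputs; the function name is illustrative):

```python
import numpy as np

def coalescer_losses(pred_distance, pred_direction, r, close_image):
    """The two training losses from the text, for one (R, close_pair) sample."""
    true_offset = close_image - r
    true_distance = np.linalg.norm(true_offset)
    # distance loss: (target_distance - magnitude(R - close_pair.image))^2
    distance_loss = (pred_distance - true_distance) ** 2
    # direction loss: sqrt(magnitude(target_direction - normalize(...)));
    # the sqrt keeps the penalty for a shrunken direction vector modest,
    # so the NN can cheaply signal low confidence in ambiguous regions
    direction_loss = np.sqrt(
        np.linalg.norm(pred_direction - true_offset / true_distance))
    return distance_loss, direction_loss
```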
generation process:
1. Choose an
image description, and find its embedding DE using the text autoencoder NN.
2. Choose a random image LS point R.
3. Use the coalescer NN to do (R,
DE) -> (target_distance, target_direction). If target_direction is very
small, repeat steps 2-3.
4. Find target_point = R +
normalize(target_direction) * target_distance.
5. Optionally, repeat the
coalescer NN process from target_point. The number of steps may depend on
target_direction magnitude.
6. Use the image autoencoder to convert
target_point to an image.
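The generation loop above can be sketched as follows; the coalescer NN is passed in as a function (R, DE) -> (distance, direction), and the magnitude threshold is an illustrative assumption:

```python
import numpy as np

def generate(coalescer_nn, de, dim, min_dir_magnitude=0.1, steps=2, rng=None):
    """Generation steps 2-5. coalescer_nn(r, de) is assumed to return
    (target_distance, target_direction); step 6 (decoding r with the
    image autoencoder) is omitted. Threshold and step count are guesses."""
    rng = rng or np.random.default_rng()
    # steps 2-3: resample R until the predicted direction is confident
    while True:
        r = rng.normal(size=dim)
        distance, direction = coalescer_nn(r, de)
        if np.linalg.norm(direction) >= min_dir_magnitude:
            break
    # steps 4-5: move toward the predicted target, optionally repeating
    for _ in range(steps):
        dir_norm = np.linalg.norm(direction)
        if dir_norm < 1e-8:
            break  # direction has collapsed; treat r as the target
        r = r + (direction / dir_norm) * distance
        distance, direction = coalescer_nn(r, de)
    return r  # step 6 would decode r with the image autoencoder
```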
Coalescer networks should generally give good results in 2 steps.
impact statement
My hope is
that this technique will improve the speed of image generation tools,
thereby reducing the disparity of image generation capability between
individuals and large institutions, and thereby ultimately having a net
positive societal impact.