Video diffusion models have recently seen a sharp uptick in interest, both academically and publicly. Despite this growing impact, they have received far less attention from an interpretability perspective than LLMs or other popular architectures. To raise awareness and lay a framework for model understanding, we offer a conceptual explanation of video diffusion models, along with a mathematical framing in a parlance similar to that used for LLMs, with an emphasis on autoregressive approaches. This blog post assumes basic familiarity with neural networks and machine learning.
The goal of this work is to provide background on video diffusion models and to discuss, intuitively, what we might expect interpretability studies to find. Understanding video diffusion models requires synthesizing information from a few different areas: generative diffusion (including the specific techniques used for diffusion-based video generation), transformer models, and model interpretability. This post aims to be an easily readable guide that takes someone from zero familiarity with diffusion to the point where they can reason about these models and their internal mechanisms. At the end, we'll pose some current open questions about a specific video diffusion architecture. The post is organized into three parts: an introductory conceptual section guiding the reader from basic diffusion all the way to video diffusion, a section detailing the kinds of neural networks used in this process, and a section exploring the internal mechanisms of these networks.
Understanding video diffusion models begins with understanding the video diffusion process, which in turn means understanding the following things, in this order:
The purpose of generative diffusion (and generative content overall) is to automatically produce high-quality, diverse media, as specified by a user. If you’ve used a tool like Stable Diffusion, Veo 3, or Sora, you understand this use case: you start with a short description of what you want, whether as text or as an image you already have, and the tool returns something close to what you've described.
For any given prompt, there is an effectively infinite number of outputs that would satisfy the request: there is an almost infinite number of possible pictures or videos of cats. The goal of generative diffusion is to produce a wide variety of these while staying accurate to what you requested. It wouldn’t be very fun to generate images or videos of cats if it were always the same cat, or if it produced random content most of the time instead of a cat. This requires an algorithmic process that builds both accuracy and diversity into creating the desired content. Secondarily, though still importantly, we want our algorithm to be reasonably fast. These are objectives and themes to keep in mind as you continue through the post.
The origin of this use of diffusion, interestingly enough, has nothing to do with images or videos. It’s actually a technique from statistical thermodynamics focused on something that seems, at first glance, pretty unrelated: mapping one statistical distribution to another. Consider, for example, mapping a common normal distribution, as encountered in statistics class, to the spiral distribution in figure 1. Though it seems far away from what we just discussed, it’s important to dive into how it works. Consider how one might find a way to map between these two distributions. One could study both closely and derive a function mapping each point from the normal distribution to a corresponding one from the spiral distribution. This is a valid approach, but it can be pretty difficult mathematically, and it would only work for these specific distributions: if I wanted a checkerboard distribution instead, I’d have to work out an entirely new function. Instead, it’s easiest to task a neural network with doing this for us, using a surprisingly simple and clever way of thinking about the task. The formulation of the problem is as follows: for a given point from our target distribution, repeatedly apply Gaussian noise (sample from the normal distribution and add it) to the point until it’s so noisy that it’s indistinguishable from a sample of the normal distribution. Since we know the exact noise that was applied, we also know what the real point is, and we can task the neural network with predicting the original point from the noisy one (the neural network actually predicts the noise itself, which we then subtract). Depending on how much noise we applied to the point, it might take several steps to remove all of it, but eventually this produces the point from the target distribution. This is a mathematical task that would be tedious and difficult for a person, but is actually very tractable for a neural network. This task, called, fittingly, denoising diffusion, is formalized in figure 2:
(Math warning) It’s fairly straightforward: given a sample x_0 with some amount (α_t) of noise (ε) added, and a timestep t (which just indexes how much noise has been applied so far), predict the noise that was added. Sampling, or taking noise and moving it to the new distribution, is also pretty straightforward: start with a noisy point x_T. Slightly denoise the point: the noise is estimated by our neural network ε_θ(x_t, t) and subtracted from x_t, scaled by an amount determined by α_t. (The extra term σ_t·z in Algorithm 2, line 4 will be discussed later.) Repeat until you have a point from the target distribution. Now you have an algorithm that can take a normal distribution and turn it into any other distribution.
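To make this concrete, here is a minimal sketch of the two algorithms in figure 2, applied to a toy batch of points. The model, schedule length, and variable names are illustrative assumptions, not the exact setup of any particular implementation.

```python
import torch

# Noise schedule: beta_t increases linearly; alpha_t = 1 - beta_t; a_bar_t is their running product.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def training_loss(model, x0):
    """Algorithm 1 (training): noise a clean sample and ask the model to predict that noise."""
    t = torch.randint(0, len(betas), (x0.shape[0],))
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps        # forward noising
    return ((model(x_t, t) - eps) ** 2).mean()                 # predict the added noise

@torch.no_grad()
def sample(model, shape):
    """Algorithm 2 (sampling): start from pure noise and denoise one step at a time."""
    x = torch.randn(shape)                                     # x_T, indistinguishable from noise
    for t in reversed(range(len(betas))):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)   # the sigma_t * z term
        eps_hat = model(x, torch.full((shape[0],), t))
        x = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps_hat) / alphas[t].sqrt() \
            + betas[t].sqrt() * z
    return x
```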
Now that we understand basic denoising diffusion, we can talk about how this might work for images. It’s actually pretty similar: you take an image and apply noise to it until you get something that looks like TV static (and statistically, is just a sample from a high-dimensional normal distribution). You perform the exact same operation, just with a much larger neural network: for a given image, apply noise, and task the neural network with helping to predict the noise-free image (as before, it predicts the noise, which we then remove). Now you have an algorithm that can take pure static and create a new image. It sounds like we’re almost at video diffusion already, but unfortunately things get a little more complex at this stage, and there are a few more topics we’ll have to introduce.
This first topic is about efficiency. In the section above, we discussed how we can perfectly replicate the basic denoising diffusion process by noising an image directly and just using a larger neural network to do the diffusion. However, the neural network in question needs to be very large to do this directly “in image space”, and that’s undesirable for a number of reasons. For starters, it can be expensive and slow - remember, we have to run it several times to denoise a single image. There are additional mathematical reasons for the change as well, which we’ll cover next, but for now, a speedup is more than valuable enough of an objective on its own. The technique formulated to address the difficulties of performing diffusion directly in image space is called latent diffusion. It’s called that because the image is mapped to a vector representation of itself called a latent, or latent vector. The mapping typically follows a standard computer vision paradigm: before using our neural network, we run a convolutional (or similar) image downsampling model, which we call an encoder, to produce our latent. Then, similarly to the basic diffusion case, our neural network denoises the latent vector, which can be done with a much smaller network. At the end of the process, we use a decoder to upsample the clean latent into an image, through the reverse of the chosen downsampling process (there are many options for how to down- and upsample - none are too complex).
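A minimal sketch of how the training objective changes under latent diffusion, assuming a hypothetical `encoder` and `denoiser` and the same noise schedule as above; the names are placeholders rather than a real API.

```python
import torch

def latent_training_loss(encoder, denoiser, images, alphas_cumprod):
    # Downsample images into latents, then run the usual noise-prediction objective
    # on the (much smaller) latents instead of raw pixels.
    z0 = encoder(images)
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],))
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (z0.dim() - 1)))   # broadcast over latent dims
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps            # noise the latent, not the image
    return ((denoiser(z_t, t) - eps) ** 2).mean()

# At sampling time, the fully denoised latent is handed to the decoder:
#   image = decoder(sampled_latent)
```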
Now that we’ve switched from noising and denoising images directly, our image diffusion process requires much less computation per denoising step. However, it would also be nice to reduce the number of steps we need in the first place. Depending on the exact distributions, the process as described so far can take 50 or more rounds of denoising to produce a clean datapoint or image. This technique is called DDPM (Denoising Diffusion Probabilistic Models), and it is the process shown in figure 2. It involves slightly re-noising the latent after each denoising step, to ensure diverse samples from the target distribution; the σ_t·z term mentioned earlier is that slight amount of noise added back to the latent. This process matches our desired outcomes: it produces results that lie on the target distribution (accurate) and ensures lots of different points are selected (diverse), though it can take numerous steps to sample, which makes it slow.
A faster sampling process, called DDIM (Denoising Diffusion Implicit Models), denoises using something mathematically very close to the score function of the distribution, which is a kind of gradient (of the log-probability). The algorithm is also pretty straightforward: have the network predict the score function of the distribution, which points towards more likely regions of the distribution, then simply move the point in that direction so it becomes more likely to lie in the target distribution. The precise details behind DDIM, and how we know what the score function is, are beyond the scope of this post; the relevant element is that efficient sampling is done using the score function, which produces a vector that points towards our distribution.
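For reference, a single deterministic DDIM update looks roughly like the sketch below (using the usual cumulative-α notation; the noise estimate plays the role of the score direction, pointing away from the distribution rather than towards it).

```python
import torch

@torch.no_grad()
def ddim_step(x_t, eps_hat, a_bar_t, a_bar_prev):
    # Estimate the clean point implied by the current noise estimate...
    x0_pred = (x_t - (1 - a_bar_t).sqrt() * eps_hat) / a_bar_t.sqrt()
    # ...then jump directly to the previous (less noisy) timestep, with no re-noising.
    return a_bar_prev.sqrt() * x0_pred + (1 - a_bar_prev).sqrt() * eps_hat
```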
So far, we’ve discussed diffusion for mapping one distribution to another, applying this to images, using an image encoder/decoder to perform latent diffusion, and improving the sampling technique via the score function. This can get us from one distribution (starting noise) to another (our images), which should yield quality images. But, recalling our initial discussion on the goal of generative modeling, we still haven’t specified a way to control what gets generated, other than that it will resemble our training distribution. Where we are now, if we wanted to generate cat images, we’d have to train a model exclusively on images of cats, and have a different neural network for each kind of thing we wanted to generate. That would be expensive and time consuming. So, we need to slightly modify our diffusion process so we can control what part of the target distribution our point ends up in. There are two things we need to do to enable this: incorporate text into our process (so the user can specify what they want), and find a way for that text to influence our denoising. The first is done by including text labels for our desired images. The second is via an addition to the sampling process called, fittingly, guidance. There are two kinds of guidance: classifier guidance, and, yes, classifier-free guidance.
Classifier guidance is pretty simple. Suppose you want to generate cats, dogs, birds, and giraffes. First, to allow text descriptions, each image needs to be labeled with its contents (cat, dog, etc.). Then, in addition to the standard denoising diffusion training, you also train a model to predict the image’s class from the latent. Now, when you sample, you can request two score functions: one from the diffusion model and one from the classifier. Each tells us the direction to move the latent vector to increase, respectively, its likelihood under the distribution and the likelihood of the desired classification. If we move a bit in both directions, we simultaneously increase the odds of the point being in our distribution (the image will look good) and being what we want (an image of a cat, not a dog).
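In noise-prediction form, classifier guidance can be sketched like this (following the commonly used formulation from the Dhariwal & Nichol guided-diffusion work; `denoiser`, `classifier`, and the scale are illustrative stand-ins):

```python
import torch

def classifier_guided_eps(denoiser, classifier, x_t, t, y, a_bar_t, scale=1.0):
    # The classifier's gradient with respect to the noisy latent points toward latents it
    # would score as class y; folding it into the noise estimate steers denoising that way.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(y.shape[0]), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]            # gradient of log p(y | x_t)
    return denoiser(x_t, t) - scale * (1 - a_bar_t).sqrt() * grad
```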
Classifier-free guidance performs the same function as classifier guidance: move towards our desired category and towards valid images. But what if we don't want to train two networks? And what if we don't want to be bound by specific categories - say, we want an image with both a cat and a dog? A clever way of doing this is to train our normal denoising model to do two things: generate latents conditionally (with our label as an input to the neural network, as a second latent vector) and unconditionally (with the conditioning dropped or replaced by a placeholder). You can think of the unconditional prediction as heading indiscriminately towards the distribution, and the conditional prediction as heading towards the specific part of it we want. If we subtract the unconditional output from the conditional output and add that difference back, it pushes the sample even further in the direction of the conditioning, which strengthens the influence of the prompt on the final result.
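The whole trick fits in a few lines; this is a sketch, and the `denoiser` signature with an optional conditioning argument is an assumption:

```python
def cfg_eps(denoiser, x_t, t, cond, guidance_scale):
    eps_cond = denoiser(x_t, t, cond)     # prediction with the text/label embedding
    eps_uncond = denoiser(x_t, t, None)   # prediction with the conditioning dropped
    # Extrapolate past the conditional prediction, away from the unconditional one.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```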
Let's summarize what we've talked about so far - which is quite a lot, so congratulations on making it this far. You're basically there.
Our present algorithm is trained by taking images, downsampling them into latent vectors using an encoder, noising those latents, and tasking a neural network with denoising each latent back towards its original. The denoising moves the latent, via the score function, both towards the target distribution and towards that image's specific place within the distribution, using classifier or classifier-free guidance. When the latent is properly denoised, we upsample it using the decoder. Once we've trained, sampling looks like this: we take a text prompt from the user and begin with a noisy latent vector, use guidance to move that noise towards the part of the distribution specified by the prompt, and, once the latent is fully denoised, upsample it with the decoder into our new image.
Now we have a way not only of mapping noise and text (our inputs) to a target distribution (a new image), but of sampling efficiently, and even towards the specific part of the distribution we described. From here, getting to video diffusion needs only two modifications: changing the encoder and decoder to account for the additional information present in a video (relationships between frames such as the flow of time, spatial relations, changing perspective, etc.), and making sure our neural network can handle multiple latent vectors at once (video frames are a series of images). Otherwise, the process is exactly the same: as long as we have a captioned set of videos as a dataset, we can train our denoising diffusion model to produce videos - awesome!
We can create an algorithm that will accept text and noise, and generate a brand new video for us, using the latent denoising diffusion process we described.
Great - now, we can produce videos from noise. Shouldn't we be done? You may have noticed the post didn't end.
Well, some questions may have begun to pop up for you - there's a lot going on here. Can we just pop in noise and get a movie in a few minutes? What resolution videos can we make? How long can they be? These are all great questions to be asking.
Practically speaking, the approach just described will reliably generate a few frames of coherent video - maybe from a handful to a few hundred - at any given time. But people often watch videos at anywhere from 24 to 60 frames per second, and videos can be from several seconds to several thousand seconds long. So we need a way to repeatedly generate sets of frames if we ever want to make a video longer than a few seconds. This brings us to autoregressive video diffusion. "Autoregressive" is a very popular term that essentially means extrapolating a future extension of something from the thing itself (for example: the temperature decreased this hour, so I estimate it will continue to decrease next hour). What this means concretely for us is that our algorithm will take our current frames and give us a few more.
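Concretely, the generation loop might look something like the sketch below, where `denoise_chunk` is a hypothetical function running a full guided denoising pass for a new block of latent frames conditioned on recent clean ones; all names and the context length are illustrative assumptions.

```python
import torch

@torch.no_grad()
def generate_video(denoise_chunk, decode, first_frames, n_chunks, chunk_size=8, context_len=4):
    frames = list(first_frames)                           # clean latent frames generated so far
    for _ in range(n_chunks):
        context = torch.stack(frames[-context_len:])      # the most recent clean latents
        noise = torch.randn(chunk_size, *frames[0].shape) # fresh noise for the next frames
        new_frames = denoise_chunk(noise, context)        # full denoising loop happens inside
        frames.extend(list(new_frames))                   # extrapolate: append and slide forward
    return decode(torch.stack(frames))                    # upsample latents back to video frames
```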
How do you downsample an image into a latent? That's covered in the section on VAEs (image downsamplers). How do you make a neural network that can denoise those latents? That's covered in the section on transformers - specifically, history-guided diffusion forcing transformers - along with how that neural network works internally.
So, how does one make a neural network that can understand these relationships?
Given the vast number of excellently written resources on the transformer architecture, I'll direct the reader to those for familiarization if needed: Formal Algorithms for Transformers, the GPT-2 paper, and the GPT-2 GitHub.
There are numerous variations of the transformer architecture. The specific architecture we’re choosing to study, from History-Guided Video Diffusion, is an autoregressive diffusion transformer with 3D attention, trained using diffusion forcing. Diffusion forcing was touched on just above - the transformer is trained to denoise a sequence of latents, using information from previous clean latents. The rest of those terms we’ll cover now.
Diffusion transformers mostly operate like a standard transformer, and share the same architecture: patch embeddings, some number of residual blocks (each containing an attention mechanism, normalization, and an MLP), and, after the residual blocks, a final output layer. The differences between a standard transformer, detailed below, and the DiT we’re studying lie mainly in the contents of the initial elements of the residual stream, and the method of normalization.
Despite the name "3D attention”, the attention mechanism in our chosen video diffusion transformer is the standard transformer attention mechanism, operating on two sequences (the context and the frames being generated). The "3D" refers to the use of 3D positional embeddings (RoPE), which allow the network to represent relationships spanning vertical, horizontal, and temporal position.
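As a rough illustration of what "3D" means here, each latent token can be assigned a (time, row, column) coordinate from the patch grid, with a 1D rotary embedding applied per axis to separate slices of each head's dimension. The sketch below only builds the coordinates; the grid sizes are arbitrary assumptions.

```python
import torch

# A video of T latent frames, each split into an H x W grid of patches, gives T*H*W tokens.
T, H, W = 8, 16, 16
t = torch.arange(T).view(T, 1, 1).expand(T, H, W)
row = torch.arange(H).view(1, H, 1).expand(T, H, W)
col = torch.arange(W).view(1, 1, W).expand(T, H, W)
positions = torch.stack([t, row, col], dim=-1).reshape(T * H * W, 3)   # one (t, row, col) per token
# RoPE then rotates one chunk of the head dimension by t, one by row, and one by col,
# so attention scores can depend on temporal, vertical, and horizontal offsets.
```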
The MLP in the diffusion transformer for our chosen architecture is a standard multilayer perceptron as used in various transformers.
Arguably the biggest difference between the diffusion transformer at hand and a standard LLM architecture is the use of zero-initialized adaptive layer normalization, or AdaLN-Zero. Whereas a typical transformer uses LayerNorm, which keeps learned scale and bias parameters to standardize the distribution of its attention and MLP outputs, DiTs use AdaLN-Zero, which takes a conditioning input, has its own trainable parameters, and includes a zero-initialized gating mechanism. This means that each block's residual contribution starts out as the identity function (the gate, e.g. gate_msa, starts at 0), and that the network, in our formulation, receives input about the level of noise the latents presently have. This may drastically change the effect normalization has on the contents of the residual stream.
AdaLN-Zero and Attention Computation:
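Below is a minimal sketch of a DiT block with AdaLN-Zero, in the spirit of the original DiT code; the `attn` and `mlp` submodules are stand-ins for standard multi-head attention and MLP layers, not the exact modules of the model under study.

```python
import torch
import torch.nn as nn

def modulate(x, shift, scale):
    # Per-token shift and scale computed from the conditioning signal
    return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class AdaLNZeroBlock(nn.Module):
    def __init__(self, dim, cond_dim, attn, mlp):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn, self.mlp = attn, mlp
        # Conditioning (e.g. the noise-level embedding) -> 6 modulation vectors
        self.adaLN_modulation = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 6 * dim))
        # Zero init: all gates start at 0, so each block initially contributes nothing
        # to the residual stream (residual addition starts as the identity).
        nn.init.zeros_(self.adaLN_modulation[1].weight)
        nn.init.zeros_(self.adaLN_modulation[1].bias)

    def forward(self, x, c):            # x: (batch, tokens, dim), c: (batch, cond_dim)
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = \
            self.adaLN_modulation(c).chunk(6, dim=-1)
        x = x + gate_msa.unsqueeze(1) * self.attn(modulate(self.norm1(x), shift_msa, scale_msa))
        x = x + gate_mlp.unsqueeze(1) * self.mlp(modulate(self.norm2(x), shift_mlp, scale_mlp))
        return x
```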
The residual stream is arguably the defining element of the transformer. It’s where all information in the network is stored, serving both as the input to each transformer block and as the final model output.
In a standard transformer, residual stream elements are initially taken directly from the embedding matrix and represent terms in the model’s vocabulary. Here, at the start of the network, the elements are VAE encoder outputs - downsampled local pixel information from each patch of the frame. The network outputs those VAE latents, enhanced by information the network adds during processing, and the VAE decoder takes the generated sequence of latents and upsamples them into full video frame patches. Importantly, neither the VAE encoder nor the decoder has any MLP elements, only convolutions and down/upsampling, so the VAE does not introduce any new information to the image - it just creates a compressed representation of it.
There are additional details to the VAE: it may consist of a mixture of convolutional, global attention, and spatial and temporal downsampling modules.
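As an illustration (an assumption about the general structure, not the actual VAE in question), one such stage might be a strided 3D convolution that halves the temporal and spatial resolution:

```python
import torch.nn as nn

class DownsampleStage(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # stride (2, 2, 2) halves time, height, and width; a real video VAE stacks several
        # such stages and may interleave attention blocks between them.
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=(2, 2, 2), padding=1)
        self.act = nn.SiLU()

    def forward(self, x):               # x: (batch, channels, time, height, width)
        return self.act(self.conv(x))
```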
The diffusion forcing training technique only preserves causality from the context to the primary sequence, and latents being generated may affect each other at generation time - that is, at least in its present setup, the model we’re discussing does not use a causal mask between generated tokens. The causality is between the context (previous frames) and the primary sequence (frames currently being generated). (Reference: Diffusion Forcing)
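One way to picture this is as a block attention mask. The sketch below is an assumption about the masking scheme rather than the paper's exact implementation: every token may attend to the context, generated tokens may also attend to each other, but context tokens never attend to the frames being generated.

```python
import torch

def diffusion_forcing_mask(n_context: int, n_generated: int) -> torch.Tensor:
    n = n_context + n_generated
    allowed = torch.zeros(n, n, dtype=torch.bool)     # True = attention allowed
    allowed[:, :n_context] = True                     # anyone may attend to clean context frames
    allowed[n_context:, n_context:] = True            # generated frames attend to each other freely
    return allowed                                    # no causal ordering among generated tokens
```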
What do we think happens during this process?
So we now understand the general parameters of an algorithm capable of effectively modeling these kinds of relationships. Now we turn our attention to understanding/hypothesizing about how we expect these algorithms to perform these tasks.
We’ll use the following model and notation to discuss different transformer components. A transformer has the architecture denoted in this figure:
DiT blocks share the same basic form as standard transformer blocks: the input from the residual stream is normalized, fed to a multi-head attention operation, and added (modulated) back to the residual; it is then normalized again, fed to a multi-layer perceptron, and added (modulated) back to the residual. As opposed to elements of the embedding matrix, the residual stream starts with outputs from the VAE, which are the results of the VAE's video downsampling process. (Reference: DiT GitHub)
As mentioned, video diffusion transformer attention is a standard transformer attention mechanism. As such, attention circuits should be similar in form to those outlined in A Mathematical Framework for Transformer Circuits for token-based transformers, specifically the QK circuits (which determine what context token information to copy) and OV circuits (which change the destination token's output based on that information). Copying information from one token to another (the QK task) remains similar, but while a language transformer produces a logit distribution trained for an argmax-style task, a DiT produces latents. As such, the OV circuit might behave differently, since its contribution to the residual stream serves a different task. (Reference: Attention Circuits)
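For reference, the per-head circuit matrices from that framework are just products of the head's projection matrices; the sketch below uses random weights and assumes the convention q = x @ W_Q, etc.

```python
import torch

d_model, d_head = 512, 64
W_Q = torch.randn(d_model, d_head)
W_K = torch.randn(d_model, d_head)
W_V = torch.randn(d_model, d_head)
W_O = torch.randn(d_head, d_model)

W_QK = W_Q @ W_K.T   # (d_model, d_model): which residual-stream directions attend to which
W_OV = W_V @ W_O     # (d_model, d_model): what an attended token writes back into the residual stream
```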
Given that the task of a diffusion transformer is to conditionally denoise a given latent (token), pushing it towards the data distribution, we can expect the MLP to contain information about the desired condition and about the interaction between the existing latent’s content and that condition, as well as geometric/visual information - and to perform the denoising operation itself (no other part of the transformer can do this). The MLP is responsible for adding any information not contained in the input sequence and the conditioning. This likely includes a general body of knowledge, the integration of interactions between tokens (movement, reflection, etc.), and perhaps the manipulation of geometric information.
This includes latent/semantic processing, temporal and spatial processing, and information stored for future inference - for example, “it’s a wing, but it's covered by a wall right now”.
At the end, the residual stream must contain “what it is and how to display it” for the VAE decoder.
For the moment, we’re going to set aside the contributions of the AdaLN-Zero component, whose role is to adjust the residual to account for a specific noise level. That said, how behavior changes conditionally on the noise level certainly seems to be a phenomenon worth investigating; hopefully this can be addressed later in the process.
In order to better understand the contributions of the various components of the video diffusion generation process, it might be beneficial to run experiments ablating various components of the model - omitting the diffusion transformer, running it for only one step, or running it without AdaLN - though it remains a question to what degree these can be ablated from the network. Here are five ablations we’ll consider:
Does an 'induction circuit' exist, or do other model circuits exist?
As mentioned, the induction circuit was a notable finding in early token-transformer interpretability work. Naturally, one might wonder whether a similarly central mechanism exists for video diffusion models, also associated with a notable increase in capability. If there were one, what might it look like? Notably, whereas a token transformer deals with what are initially discrete semantic units, the diffusion transformer we study takes inputs that are less directly defined, and its outputs are not discretely selected. That said, depending on the kind of video in question, copying and moving concepts - as opposed to pixels - is still a core element of the model’s functionality. Would we expect a pair of attention heads similar to the induction circuit, a mechanically different but conceptually similar functionality, or a completely different framework of information transfer? Also interesting would be a similar mechanism with a different function. What might determine this is the degree to which information crosses patch boundaries. (Reference: Mathematical Framework: Induction Heads)