Diffusion, a class of generative models, boils down to an MSE between added noise and network predicted noise. Why would learning to predict the noise help with image generation (which is what it is most used for)? How did we arrive at MSE? This post dives deep into the math to answer these questions.
Background
One way to interpret Diffusion is as a continuous VAE (Variational Auto Encoder). A VAE computes a lower bound on the likelihood of generating real data samples (logpθ(x)) by approximating the unknown posterior pθ(z|x) with a learnable one qϕ(z|x) [Fig. 1]:
−logpθ(x)=−logpθ(x)≤−logpθ(x)+DKL(qϕ(z|x)∥pθ(z|x))[KL is always positive; aka -ELBO]≤−logpθ(x)+∫qϕ(z|x)log(qϕ(z|x)pθ(z|x))dz[Definition of KL]≤−logpθ(x)+∫qϕ(z|x)log(qϕ(z|x)pθ(x)pθ(z,x))dz[conditional to joint]≤−logpθ(x)+∫qϕ(z|x)(logpθ(x)+log(qϕ(z|x)pθ(z,x)))dz≤−logpθ(x)+logpθ(x)+∫qϕ(z|x)(log(qϕ(z|x)pθ(x|z)pθ(z)))dz[pθ independent of z; joint to conditional]=−Ez∼qϕ(z|x)[log(qϕ(z|x)pθ(z))−logpθ(x|z)][Definition of E for continuous variable z]=−Ez∼qϕ(z|x)logpθ(x|z)+DKL(qϕ(z|x)∥pθ(z))[Tractable]
The process of encoding (qϕ(z|x)) and decoding (pθ(x|z)) leads to regularized compression of images (x) in the z space (hence the name auto encoder; variational comes from the approximating distribution). During inference, random images can be generated from the decoder, based on sampling of z.
Derivation
Figure 2: Markov chain of forward [q] (reverse [pθ]) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020)
Similar to VAE, diffusion models compute a lower bound on the likelihood of generating real data samples (logpθ(x0)) by approximating the unknown posterior pθ(xt−1|xt) with a computable one qϕ(xt−1|xt,x0), for all t [Fig. 2] (It follows the same math as VAE for all t, therefore could be thought of as its continuous version):
−logpθ(x0)≤−logpθ(x0)≤−logpθ(x0)+DKL(q(x1:T|x0)∥pθ(x1:T|x0))[KL is always positive]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)[log(q(x1:T|x0)pθ(x1:T)|x0)]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)⎡⎢
⎢⎣log⎛⎜
⎜⎝q(x1:T|x0)(pθ(x0|x1:T)pθ(x1:T)pθ(x0))⎞⎟
⎟⎠⎤⎥
⎥⎦[Bayes' rule]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)⎡⎢
⎢⎣log⎛⎜
⎜⎝q(x1:T|x0)(pθ(x0,x1:T)pθ(x0))⎞⎟
⎟⎠⎤⎥
⎥⎦[conditional to joint]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)⎡⎢
⎢⎣log⎛⎜
⎜⎝q(x1:T|x0)(pθ(x0:T)pθ(x0))⎞⎟
⎟⎠⎤⎥
⎥⎦[merge joint]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)[log(q(x1:T|x0)pθ(x0:T))]+log(pθ(x0))Eq[−log(pθ(x0))]≤Eq[log(q(x1:T|x0)pθ(x0:T))]
Solving for R.H.S.:
log(q(x1:T|x0)pθ(x0:T))=log(∏Tt=1q(xt|xt−1)pθ(xT)∏Tt=1pθ(xt−1|xt))=−logpθ(xT)+T∑t=1log(q(xt|xt−1)pθ(xt−1|xt))=−logpθ(xT)+T∑t=2log(q(xt|xt−1)pθ(xt−1|xt))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt)q(xt)pθ(xt−1|xt)q(xt−1))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)q(xt|x0)pθ(xt−1|xt)q(xt−1|x0))+log(q(x1|x0)pθ(x0|x1))[conditioning on x0 for tractability]=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))+T∑t=2log(q(xt|x0)q(xt−1|x0))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))+log(q(x2|x0)…q(xT−1|x0)q(xT|x0)q(x1|x0)q(x2|x0)…q(xT−1|x0))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))+log(q(xT|x0)q(x1|x0))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))+log(q(xT|x0)q(x1|x0)q(x1|x0)pθ(x0|x1))=logq(xT|x0)−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))−logpθ(x0|x1)=log(q(xT|x0)pθ(xT))+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))−logpθ(x0|x1)log(q(x1:T|x0)pθ(x0:T))=DKL(q(xT|x0)∥pθ(xT))LT+T∑t=2DKL(q(xt−1|xt,x0)∥pθ(xt−1|xt))Lt−1−logpθ(x0|x1)L0
Taking each piece apart:
LT: Can be ignored while training since q has no learnable parameters, and xT is just Gaussian noise.
Lt−1: Match the unknown (pθ) to the tractable quantity (q). Define the reverse process as:
In the DDPM paper[1], it is found empirically that the training works better if the scaling factor is omitted, i.e., the loss can be simplified to
Lt=∥ϵt−ϵθ(xt,t)∥2
L0: The image (x0) is originally constructed from D discrete pixels with values [0,1,2,…,255] scaled to the range [−1,1]. The DDPM paper[1] suggests using an independent discrete decoder derived from the Gaussian (since Lt doesn't apply here):
δ+(x)={∞ if x=1x+1255 if x<1δ−(x)={−∞ if x=−1x−1255 if x>−1
Intuitively, the network predicts the mean of a pixel which is used to draw from a Gaussian distribution which is integrated[2] over according to the real pixel value as in the original image, from the real value minus 1/255 to the real value plus 1/255. If the predicted mean is close to the original value of the pixel, the result of the integral will be high.
The integral is approximated by the Gaussian probability density function times the bin width, which makes the math workout similar to the Lt−1 case:pθ(x0|x1)≈1√2πσ1exp(−12∥x0−μθ(x1,1)∥2σ21)(2255)L0=−log(pθ(x0|x1))≈12σ21∥x0−μθ(x1,1)∥2+C≈12σ21∥∥
∥∥1√α1(x1−√1−α1ϵ1)−1√α1(x1−1−α1√1−α1ϵθ(x1,1))∥∥
∥∥2+C[From 1, 2]≈∥ϵ1−ϵθ(x1,1)∥2
Putting the pieces together, the DDPM paper[1] suggests using this loss function for training (followed by the algorithm): Lsimple :=Et∼U[1,T],x0,ϵ[∥ϵ−ϵθ(xt,t)∥2]=Et∼U[1,T],x0,ϵ[∥∥ϵ−ϵθ(√¯αtx0+√1−¯αtϵ,t)∥∥2]
Algorithm 1 DDPM Training repeatx0∼q(x0)t∼Uniform([1,…,T])ϵ∼N(0,I)Take gradient descent step on∇θ∥∥ϵ−ϵθ(√¯αtx0+√1−¯αtϵ,t)∥∥2until converged
We saw how we can go from noise to an image via the reverse process, but how do we go from random noise → random (aggregate of training set) image to random noise → image of interest? The next section talks about this.
Conditioning
Dhariwal & Nichol (2021)[3] use gradients ∇xlogfϕ(y|xt) of a classifier fϕ(y|xt,t) to condition diffusion sampling on y (e.g. a target class label) where xt is noisy image. The score function for the joint distribution q(xt,y) is computed as,
Thus, a new classifier-guided predictor ¯ϵθ modifies the initial predictor ϵθ as follows,
¯ϵθ(xt,t)=ϵθ(xt,t)−√1−¯αt∇xtlogfϕ(y|xt)
To control the strength of the classifier guidance, a weight w is added to the delta part, ¯ϵθ(xt,t)=ϵθ(xt,t)−√1−¯αtw∇xtlogfϕ(y|xt)[4]
Ho & Salimans (2021)[4] propose to run conditional diffusion without an independent classifier fϕ. Consider an unconditional denoising diffusion model pθ(x) parameterized through a score estimator ϵθ(xt,t) and a conditional model pθ(x|y) parameterized through ϵθ(xt,t,y). These two models can be learned via a single neural network. Specifically, a conditional diffusion model pθ(x|y) is trained on paired data (x,y), where the conditioning information y gets discarded periodically at random such that the model knows how to generate images unconditionally as well, i.e. ϵθ(xt,t)=ϵθ(xt,t,y=∅).
The gradient of an implicit classifier p(y|xt) is proportional to p(xt|y)p(xt). Taking log on both sides,
So far we discussed training and generating images based on a y, but what is a good value for T? Can we make the sampling process efficient and reduce the number of steps somehow? We explore this part here.
Langevin dynamics can generate samples from a probability density q(x) using only the gradients ∇xlogq(x) in a Markov chain of updates:
xt=xt−1+δ2∇xlogq(xt−1)+√δϵt, where ϵt∼N(0,I)
where δ is the step size. When T→∞,ϵ→0,xT equals the true probability density q(x). Compared to SGD, Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapses into local minima.
Score-based generative modeling trains a score network to estimate the above gradient i.e. sθ(x)≈∇xlogq(x). To make the data points cover the whole space, the score network jointly estimates the scores of data perturbed at different noise levels (Noise Conditional Score Network) i.e. sθ(xt,t)≈∇xtlogq(xt). The schedule of increasing noise levels resembles the forward diffusion process.
Given a Gaussian distribution x∼N(μ,σ2I), the derivative of the logarithm of its density function ∇xlogp(x) is equal to ∇x(−12[x−μσ]2)=−(x−μσ)=−ϵσ where ϵ∼N(0,I).
Recall that q(xt|x0)∼N(√¯αtx0,(1−¯αt)I). Therefore, sθ(xt,t)≈∇xtlogq(xt)=Eq(x0)[∇xtlogq(xt|x0)]=Eq(x0)[−ϵθ(xt,t)√1−¯αt]=−ϵθ(xt,t)√1−¯αt
Figure 3: Non Markov chain of forward [q] (reverse [pθ]) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Song et al. 2022)
To make diffusion more efficient, DDIM [Fig. 3] propose fewer sampling steps by modeling the forward process as a non Markovian chain: q(x1:T|x0)=q(xT|
Diffusion, a class of generative models, boils down to an MSE between added noise and network predicted noise.
Why would learning to predict the noise help with image generation (which is what it is most used for)? How did we arrive at MSE? This post dives deep into the math to answer these questions.
Background
One way to interpret Diffusion is as a continuous VAE (Variational Auto Encoder).
A VAE computes a lower bound on the likelihood of generating real data samples (logpθ(x)) by approximating the unknown posterior pθ(z|x) with a learnable one qϕ(z|x) [Fig. 1]:
−logpθ(x)=−logpθ(x)≤−logpθ(x)+DKL(qϕ(z|x)∥pθ(z|x))[KL is always positive; aka -ELBO]≤−logpθ(x)+∫qϕ(z|x)log(qϕ(z|x)pθ(z|x))dz [Definition of KL]≤−logpθ(x)+∫qϕ(z|x)log(qϕ(z|x)pθ(x)pθ(z,x))dz [conditional to joint]≤−logpθ(x)+∫qϕ(z|x)(logpθ(x)+log(qϕ(z|x)pθ(z,x)))dz ≤−logpθ(x)+logpθ(x)+∫qϕ(z|x)(log(qϕ(z|x)pθ(x|z)pθ(z)))dz [pθ independent of z; joint to conditional]=−Ez∼qϕ(z|x)[log(qϕ(z|x)pθ(z))−logpθ(x|z)][Definition of E for continuous variable z]=−Ez∼qϕ(z|x)logpθ(x|z)+DKL(qϕ(z|x)∥pθ(z))[Tractable]
The process of encoding (qϕ(z|x)) and decoding (pθ(x|z)) leads to regularized compression of images (x) in the z space (hence the name auto encoder; variational comes from the approximating distribution). During inference, random images can be generated from the decoder, based on sampling of z.
Derivation
Similar to VAE, diffusion models compute a lower bound on the likelihood of generating real data samples (logpθ(x0)) by approximating the unknown posterior pθ(xt−1|xt) with a computable one qϕ(xt−1|xt,x0), for all t [Fig. 2] (It follows the same math as VAE for all t, therefore could be thought of as its continuous version):
−logpθ(x0)≤−logpθ(x0)≤−logpθ(x0)+DKL(q(x1:T|x0)∥pθ(x1:T|x0))[KL is always positive]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)[log(q(x1:T|x0)pθ(x1:T)|x0)]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)⎡⎢ ⎢⎣log⎛⎜ ⎜⎝q(x1:T|x0)(pθ(x0|x1:T)pθ(x1:T)pθ(x0))⎞⎟ ⎟⎠⎤⎥ ⎥⎦[Bayes' rule]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)⎡⎢ ⎢⎣log⎛⎜ ⎜⎝q(x1:T|x0)(pθ(x0,x1:T)pθ(x0))⎞⎟ ⎟⎠⎤⎥ ⎥⎦[conditional to joint]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)⎡⎢ ⎢⎣log⎛⎜ ⎜⎝q(x1:T|x0)(pθ(x0:T)pθ(x0))⎞⎟ ⎟⎠⎤⎥ ⎥⎦[merge joint]=−logpθ(x0)+Ex1:T∼q(x1:T|x0)[log(q(x1:T|x0)pθ(x0:T))]+log(pθ(x0))Eq[−log(pθ(x0))]≤Eq[log(q(x1:T|x0)pθ(x0:T))]
Solving for R.H.S.:
log(q(x1:T|x0)pθ(x0:T))=log(∏Tt=1q(xt|xt−1)pθ(xT)∏Tt=1pθ(xt−1|xt))=−logpθ(xT)+T∑t=1log(q(xt|xt−1)pθ(xt−1|xt))=−logpθ(xT)+T∑t=2log(q(xt|xt−1)pθ(xt−1|xt))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt)q(xt)pθ(xt−1|xt)q(xt−1))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)q(xt|x0)pθ(xt−1|xt)q(xt−1|x0))+log(q(x1|x0)pθ(x0|x1))[conditioning on x0 for tractability]=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))+T∑t=2log(q(xt|x0)q(xt−1|x0))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))+log(q(x2|x0)…q(xT−1|x0)q(xT|x0)q(x1|x0)q(x2|x0)…q(xT−1|x0))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))+log(q(xT|x0)q(x1|x0))+log(q(x1|x0)pθ(x0|x1))=−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))+log(q(xT|x0)q(x1|x0)q(x1|x0)pθ(x0|x1))=logq(xT|x0)−logpθ(xT)+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))−logpθ(x0|x1)=log(q(xT|x0)pθ(xT))+T∑t=2log(q(xt−1|xt,x0)pθ(xt−1|xt))−logpθ(x0|x1)log(q(x1:T|x0)pθ(x0:T))=DKL(q(xT|x0)∥pθ(xT))LT+T∑t=2DKL(q(xt−1|xt,x0)∥pθ(xt−1|xt))Lt−1−logpθ(x0|x1)L0
Taking each piece apart:
LT: Can be ignored while training since q has no learnable parameters, and xT is just Gaussian noise.
Lt−1: Match the unknown (pθ) to the tractable quantity (q). Define the reverse process as:
pθ(xt−1|xt)=N(xt−1;μθ(xt,t),Σθ(xt,t))
Simplifying to known variance:
pθ(xt−1|xt)=N(xt−1;μθ(xt,t),σ2tI)
Solving q (zoom out if not fully visible):
Computing the parameters (σ and μ) of the new Gaussian:
σ2=~βt=1(αtβt+11−¯αt−1)=(1−¯αt−11−¯αt)βt~μt(xt,x0)=(√αtβtxt+√¯αt−11−¯αt−1x0)1σ2=(√αtβtxt+√¯αt−11−¯αt−1x0)(1−¯αt−11−¯αt)βt=√αt(1−¯αt−1)1−¯αtxt+√¯αt−1βt1−¯αtx0~μt(xt,x0)=1√αt(xt−1−αt√1−¯αtϵt)[From 1]
μθ must match ~μt, and since xt is given, learn to predict the noise at step t, εt.
μθ(xt,t)=1√αt(xt−1−αt√1−¯αtϵθ(xt,t))[2]
Which means
xt−1=1√αt(xt−1−αt√1−¯αtϵθ(xt,t))+σtz[3]
The KL divergence between Gaussians with mean as the only parameter leads to the following loss:
Lt=DKL(q(xt−1|xt,x0)∥pθ(xt−1|xt))=12σ2t∥~μt(xt,x0)−μθ(xt,t)∥2=12σ2t∥∥ ∥∥1√αt(xt−1−αt√1−¯αtϵt)−1√αt(xt−1−αt√1−¯αtϵθ(xt,t))∥∥ ∥∥2=(1−αt)22αt(1−¯αt)σ2t∥ϵt−ϵθ(xt,t)∥2
In the DDPM paper[1], it is found empirically that the training works better if the scaling factor is omitted, i.e., the loss can be simplified to
Lt=∥ϵt−ϵθ(xt,t)∥2
L0: The image (x0) is originally constructed from D discrete pixels with values [0,1,2,…,255] scaled to the range [−1,1]. The DDPM paper[1] suggests using an independent discrete decoder derived from the Gaussian (since Lt doesn't apply here):
pθ(x0|x1)=∏Di=1∫δ+(xi0)δ−(xi0)N(x1,μiθ(x1,1),σ21)dx
δ+(x)={∞ if x=1x+1255 if x<1δ−(x)={−∞ if x=−1x−1255 if x>−1
The integral is approximated by the Gaussian probability density function times the bin width, which makes the math workout similar to the Lt−1 case:pθ(x0|x1)≈1√2πσ1exp(−12∥x0−μθ(x1,1)∥2σ21)(2255)L0=−log(pθ(x0|x1))≈12σ21∥x0−μθ(x1,1)∥2+C≈12σ21∥∥ ∥∥1√α1(x1−√1−α1ϵ1)−1√α1(x1−1−α1√1−α1ϵθ(x1,1))∥∥ ∥∥2+C[From 1, 2]≈∥ϵ1−ϵθ(x1,1)∥2
Putting the pieces together, the DDPM paper[1] suggests using this loss function for training (followed by the algorithm):
Lsimple :=Et∼U[1,T],x0,ϵ[∥ϵ−ϵθ(xt,t)∥2]=Et∼U[1,T],x0,ϵ[∥∥ϵ−ϵθ(√¯αtx0+√1−¯αtϵ,t)∥∥2]
Algorithm 1 DDPM Training
repeatx0∼q(x0)t∼Uniform([1,…,T])ϵ∼N(0,I)Take gradient descent step on∇θ∥∥ϵ−ϵθ(√¯αtx0+√1−¯αtϵ,t)∥∥2until converged
We saw how we can go from noise to an image via the reverse process, but how do we go from random noise → random (aggregate of training set) image to random noise → image of interest? The next section talks about this.
Conditioning
Dhariwal & Nichol (2021)[3] use gradients ∇xlogfϕ(y|xt) of a classifier fϕ(y|xt,t) to condition diffusion sampling on y (e.g. a target class label) where xt is noisy image. The score function for the joint distribution q(xt,y) is computed as,
∇xtlogq(xt,y)=∇xtlogq(xt)+∇xtlogq(y|xt)≈−1√1−¯αtϵθ(xt,t)+∇xtlogfϕ(y|xt)=−1√1−¯αt(ϵθ(xt,t)−√1−¯αt∇xtlogfϕ(y|xt))
Thus, a new classifier-guided predictor ¯ϵθ modifies the initial predictor ϵθ as follows,
¯ϵθ(xt,t)=ϵθ(xt,t)−√1−¯αt∇xtlogfϕ(y|xt)
To control the strength of the classifier guidance, a weight w is added to the delta part,
¯ϵθ(xt,t)=ϵθ(xt,t)−√1−¯αtw∇xtlogfϕ(y|xt)[4]
Ho & Salimans (2021)[4] propose to run conditional diffusion without an independent classifier fϕ. Consider an unconditional denoising diffusion model pθ(x) parameterized through a score estimator ϵθ(xt,t) and a conditional model pθ(x|y) parameterized through ϵθ(xt,t,y). These two models can be learned via a single neural network. Specifically, a conditional diffusion model pθ(x|y) is trained on paired data (x,y), where the conditioning information y gets discarded periodically at random such that the model knows how to generate images unconditionally as well, i.e. ϵθ(xt,t)=ϵθ(xt,t,y=∅).
The gradient of an implicit classifier p(y|xt) is proportional to p(xt|y)p(xt). Taking log on both sides,
∇xtlogp(y|xt)=∇xtlogp(xt|y)−∇xtlogp(xt)=−1√1−¯αt(ϵθ(xt,t,y)−ϵθ(xt,t))¯ϵθ(xt,t,y)=ϵθ(xt,t,y)−√1−¯αtw∇xtlogp(y|xt)[From 4]=ϵθ(xt,t,y)+w(ϵθ(xt,t,y)−ϵθ(xt,t))=(w+1)ϵθ(xt,t,y)−wϵθ(xt,t)
Optimization
So far we discussed training and generating images based on a y, but what is a good value for T? Can we make the sampling process efficient and reduce the number of steps somehow? We explore this part here.
Langevin dynamics can generate samples from a probability density q(x) using only the gradients ∇xlogq(x) in a Markov chain of updates:
xt=xt−1+δ2∇xlogq(xt−1)+√δϵt, where ϵt∼N(0,I)
where δ is the step size. When T→∞,ϵ→0,xT equals the true probability density q(x). Compared to SGD, Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapses into local minima.
Score-based generative modeling trains a score network to estimate the above gradient i.e. sθ(x)≈∇xlogq(x). To make the data points cover the whole space, the score network jointly estimates the scores of data perturbed at different noise levels (Noise Conditional Score Network) i.e. sθ(xt,t)≈∇xtlogq(xt). The schedule of increasing noise levels resembles the forward diffusion process.
Given a Gaussian distribution x∼N(μ,σ2I), the derivative of the logarithm of its density function ∇xlogp(x) is equal to ∇x(−12[x−μσ]2)=−(x−μσ)=−ϵσ where ϵ∼N(0,I).
Recall that q(xt|x0)∼N(√¯αtx0,(1−¯αt)I). Therefore,
sθ(xt,t)≈∇xtlogq(xt)=Eq(x0)[∇xtlogq(xt|x0)]=Eq(x0)[−ϵθ(xt,t)√1−¯αt]=−ϵθ(xt,t)√1−¯αt
To make diffusion more efficient, DDIM [Fig. 3] propose fewer sampling steps by modeling the forward process as a non Markovian chain: q(x1:T|x0)=q(xT|