Policy gradient methods[1] have two components: the probability $\pi_\theta(a_t|s_t)$ and the reward $R_t$. The goal is to maximize the reward $R_t$ by increasing the probability of the action that leads to it from the given state. Different algorithms tweak the reward weight in the objective to reduce its variance. For example, REINFORCE subtracts a baseline (a moving average of the reward) from the reward; PPO uses an advantage, which subtracts a learned value function from the reward; GRPO also uses an advantage, but approximates the value function with a Monte Carlo estimate over a group of rollouts.
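To make the contrast concrete, here is a minimal NumPy sketch of the three reward weights; the function names and the group-normalization details are illustrative assumptions rather than the exact formulations from the respective papers (PPO, for instance, normally combines its advantage with GAE and a clipped surrogate, both omitted here).

```python
import numpy as np

def reinforce_weight(rewards, baseline):
    # REINFORCE: subtract a baseline (e.g. a moving average of past rewards)
    # from the return; the gradient keeps its mean but loses variance.
    return np.asarray(rewards) - baseline

def ppo_weight(rewards, values):
    # PPO-style advantage: subtract a learned value estimate V(s_t)
    # from the observed return (GAE and clipping omitted for brevity).
    return np.asarray(rewards) - np.asarray(values)

def grpo_weight(group_rewards):
    # GRPO: approximate the value function with Monte Carlo statistics
    # of a group of rollouts sampled from the same prompt/state.
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Whichever weight an algorithm picks, the policy-gradient loss is then
#   loss = -(weight * log pi_theta(a_t | s_t)).mean()
```

The common idea is to subtract a quantity that does not depend on the sampled action, which leaves the expected gradient essentially unchanged while shrinking the variance of the weight that multiplies $\log\pi_\theta(a_t|s_t)$.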
Most problems that use reinforcement learning design rewards so that a good result receives a positive reward while a bad result receives a negative one. This doesn't seem...
Diffusion models, a class of generative models, boil down to an MSE loss between the added noise and the network's predicted noise. Why would learning to predict the noise help with image generation (their most common use)? How do we arrive at an MSE? This post dives deep into the math to answer these questions.
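As a preview of where the math lands, the objective in its most common DDPM-style form (Ho et al., 2020) can be sketched as below; the symbols $x_t$, $\bar\alpha_t$, and $\epsilon_\theta$ follow that paper's notation and are stated here as an assumption, not as this post's derivation:

$$
L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon\sim\mathcal{N}(0,\mathbf{I})}\Big[\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\Big], \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon
$$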
Background
One way to interpret diffusion models is as a continuous VAE (Variational Autoencoder). A VAE computes a lower bound on the log-likelihood of the real data samples, $\log p_\theta(x)$, by approximating the unknown posterior $p_\theta(z|x)$ with a learnable one, $q_\phi(z|x)$ [Fig. 1]:
$$
\begin{aligned}
-\log p_\theta(x) &\le -\log p_\theta(x) + D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) && \text{[KL is always non-negative; RHS is the negative ELBO]} \\
&= -\log p_\theta(x) + \int q_\phi(z|x)\,\log\frac{q_\phi(z|x)}{p_\theta(z|x)}\,dz && \text{[definition of KL]} \\
&= -\log p_\theta(x) + \int q_\phi(z|x)\,\log\frac{q_\phi(z|x)\,p_\theta(x)}{p_\theta(z,x)}\,dz && \text{[conditional to joint]} \\
&= -\log p_\theta(x) + \int q_\phi(z|x)\left(\log p_\theta(x) + \log\frac{q_\phi(z|x)}{p_\theta(z,x)}\right)dz \\
&= -\log p_\theta(x) + \log p_\theta(x) + \int q_\phi(z|x)\,\log\frac{q_\phi(z|x)}{p_\theta(x|z)\,p_\theta(z)}\,dz && \text{[$\log p_\theta(x)$ is independent of $z$; joint to conditional]} \\
&= \mathbb{E}_{z\sim q_\phi(z|x)}\!\left[\log\frac{q_\phi(z|x)}{p_\theta(z)} - \log p_\theta(x|z)\right] && \text{[definition of $\mathbb{E}$ for a continuous variable $z$]} \\
&= -\mathbb{E}_{z\sim q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + D_{\mathrm{KL}}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) && \text{[tractable]}
\end{aligned}
$$
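As a sanity check on the final tractable form, here is a minimal PyTorch-style sketch that implements its two terms, a reconstruction term and a KL term; the Gaussian encoder, Bernoulli decoder, unit-Gaussian prior, and layer sizes are illustrative assumptions, not part of the derivation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    # Illustrative sizes only: x in R^784 (e.g. flattened 28x28 images), z in R^16.
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterized sample z ~ q_phi(z|x)
        return self.dec(z), mu, logvar

def vae_loss(x, x_logits, mu, logvar):
    # Term 1: -E_{z~q_phi(z|x)}[log p_theta(x|z)]; with a Bernoulli decoder this is BCE.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # Term 2: D_KL(q_phi(z|x) || p_theta(z)), closed form for a Gaussian q and an N(0, I) prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

The KL term has a closed form only because both $q_\phi(z|x)$ and the prior $p_\theta(z)$ are chosen to be Gaussian; that choice is what makes the bound "tractable" in the last line above.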
The process of encoding ($q_\phi(z|x)$) and decoding...