I have seen one paper arguing that diffusion is stronger.
The most interesting result there: a small 6M(!) parameter model is able to solve 9x9 Sudokus with 100% accuracy. In my own experiments, using a Llama 3B model plus a lot of finetuning and engineering, I got up to ~50% accuracy with autoregressive sampling.[1]
For Sudoku this seems pretty intuitive: in many positions there is a cell that is very easy to fill, so a model that can fill cells in any order gets to solve the puzzle piece by piece, easiest cells first. An autoregressive model, on the other hand, has to emit cells in a fixed order, which sometimes means keeping the full solution "in its head" before filling the first cell. A toy sketch of the piece-by-piece idea is below.
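To make that concrete, here is a minimal Python sketch of plain constraint propagation (naked singles): repeatedly fill any cell that has exactly one legal candidate. This is obviously not a diffusion model, just an illustration of why any-order filling is a much easier problem than committing to cells left to right. Names and the 0-for-empty grid encoding are my own choices:

    # grid: 9x9 list of lists, 0 = empty cell
    def candidates(grid, r, c):
        used = set(grid[r])                               # row
        used |= {grid[i][c] for i in range(9)}            # column
        br, bc = 3 * (r // 3), 3 * (c // 3)               # 3x3 box
        used |= {grid[br + i][bc + j] for i in range(3) for j in range(3)}
        return set(range(1, 10)) - used

    def solve_easy_cells_first(grid):
        progress = True
        while progress:
            progress = False
            for r in range(9):
                for c in range(9):
                    if grid[r][c] == 0:
                        cands = candidates(grid, r, c)
                        if len(cands) == 1:               # an "easy" cell
                            grid[r][c] = cands.pop()
                            progress = True
        return grid  # solves easy puzzles; hard ones stall and need search

This loop alone solves a large fraction of published Sudokus, which is the point: if you are allowed to pick the easiest cell at every step, most of the difficulty disappears.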
(As an aside, I tried a few Sudokus in Gemini Diffusion, which couldn't solve them.)
So yeah, there are some tasks that should definitely be easier for diffusion LLMs. Their scaling behaviour is not well researched yet, as far as I can tell.
I don't fully trust the results of this paper: they seem too strong, and I have not been able to replicate them this well. But the principle holds and the idea is good.