I have seen one paper arguing that diffusion is stronger.
The most interesting result there: a small 6M(!) model is able to solve 9x9 sudokus with 100% accuracy. In my own experiments, using a Llama 3B model and a lot of finetuning and engineering, I got up to ~50% accuracy with autoregressive sampling.[1]
For sudoku this seems pretty obvious: in many cases there is a cell that is very easy to fill, so you can solve the problem piece by piece. OTOH, autoregressive models sometimes have to keep the full solution "in their head" before filling the first...
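To make the "piece by piece" intuition concrete, here is a minimal sketch in plain Python. It is not the paper's model and not a diffusion sampler. It only illustrates any-order filling: at each step it picks a cell whose value is already forced by its row, column, and box, instead of committing to cells in a fixed left-to-right order. The names `candidates` and `fill_easiest_first` and the toy puzzle are hypothetical, made up for this illustration.

```python
def candidates(grid, r, c):
    """Digits that can legally go into cell (r, c), given its row, column, and 3x3 box."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used  # 0 marks an empty cell and is ignored here


def fill_easiest_first(grid):
    """Repeatedly fill a cell that has exactly one candidate.

    Returns the resulting grid and the order in which cells were filled.
    Stops as soon as no empty cell is forced, so harder puzzles may stay unfinished.
    """
    grid = [row[:] for row in grid]
    order = []
    while True:
        forced = [
            (r, c, cand)
            for r in range(9)
            for c in range(9)
            if grid[r][c] == 0
            for cand in [candidates(grid, r, c)]
            if len(cand) == 1
        ]
        if not forced:
            break
        r, c, cand = forced[0]
        grid[r][c] = next(iter(cand))
        order.append((r, c))
    return grid, order


# Toy puzzle (an assumption for illustration): a valid solved grid built from a
# standard shifting pattern, with the nine diagonal cells blanked out.
full = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [row[:] for row in full]
for i in range(9):
    puzzle[i][i] = 0

result, order = fill_easiest_first(puzzle)
print(result == full, len(order))  # prints: True 9
```

In this toy case every blank is forced immediately, so the fill order barely matters. On real puzzles the first cell in reading order is often not forced, which is where a fixed left-to-right order has to effectively solve much of the grid before emitting its first digit.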