A sudoku-solving transformer represents the board by substructure, not by cell
> tl;dr: a transformer trained on sudoku solving traces with backtracking maintains the board state per substructure, linearly, in the residual stream. The main goal of this post is to understand whether a transformer trained on solving traces creates a "world model" and uses it during the solving process. To...
Apr 1521