I think language like this is unclear:
In the context of search, the propensity for mesa-optimization may be increased as the system explores various future states, potentially identifying alternative objectives that appear at least as rewarding or efficient in achieving the desired outcome.
A system identifies "objectives" which are "rewarding"? Can you clarify what that means? Is the system the network, or SGD, or the training process? Are the objectives internally represented?
And what about "desired"? Desired by whom -- the designers? "Desired" by SGD? Something else?
Thanks for pointing this out – indeed, our phrasing is quite unclear. The original paragraph was trying to say that our "system" (a transformer trained to find shortest paths via SGD) may learn "alternative objectives" which don't generalize (aren't "desirable" from our perspective), but which achieve the same loss (are "rewarding").
To be clear, the point we want to make here is that models capable of perfoming search are relevant for understanding mesa-optimization as search requires iterative reasoning with subgoal evaluation.
In the context of solving mazes, we may hope to understand how mesa-optimization arises and can become "misaligned"; either through the formation of non-general reasoning steps (reliance on heuristics or overfitted goals) or failure to retarget.
Concretely, we can imagine the network learning to reach the <END_TOKEN> at train time, but failing to generalise at test time as it has instead learnt a goal that was an artefact of our training process. For example, it may have learnt to go to the top right corner (where the <END_TOKEN> happened to be during training).
I know this is (hopefully) no longer cutting edge, but as someone interested in just retarget the search. I am planning to try to at least predict the findings in advance, and hopefully be able to replicate them. Putting this comment down as an anchor for my updates as I go on.
We encourage readers to predict what we might find as an exercise in training your intuitions
Recording some thoughts about each hypothesis where I have anything to say, before I read on. I have avoided spoilering myself on later articles directly, but rat-culture osmosis means I have a vague idea of the 2025 state of maze work.
Lattice-Adjacency heads + Lattice-Adjacency heads:
It seems plausible that this would exist. I do wonder if +2 two attention layers would lead to a concept of walls, I think it is unlikely <5% but worth flagging the possibility there is a way to incorporate graph structure with 1 attention layer that I'm not thinking of but the training process does.
Bottlenecks:
This could be route through a wall representation, or less robustly through a concept of rows and columns that notes some rows/columns have very few connections to adjacent ones.
Finally I have an intuition that finding the shortest path will generalize less well than finding any path even under this setup, and that this behavior would be very easy to induce with curated sets of individually correct training mazes including by picking a seemly 'innocent' way of generating mazes without further screening.
This is largely intuition, but to try to probe it; in order to know the path being given in the training data is the shortest path it needs to generate enough longer otherwise valid paths to learn that constraint, and the longer paths need to not share any other characteristics that are easier to learn as a heuristic.
So it seems very plausible it will develop a search (defined broadly) function that solves mazes, and happens to find the shortest route for training set mazes but not for other classes of them.
-edit typo fixing
Some threat models of misalignment presuppose the existence of an agent which has learned to perform a search over actions to effectively achieve goals. Such a search process might involve exploring different sequences of actions in parallel and evaluating the best sequence of actions to achieve some goal.[2]
To deepen our understanding of what it looks like when models are actually performing search, we chose to train simple GPT-2 like models to find the shortest paths through mazes. Maze-solving models provide a tractable and interesting object of study, as the structure of both the problem and solutions is extensively studied. This relative simplicity makes identifying and understanding search through the lens of mechanistic and behavioral experiments much more concrete than working with pre-trained LLMs and more feasible in the context of limited computational resources.
Mesa-optimizers are learned optimizers for an objective that can be distinct from the base-objective. Inner misalignment can occur when the AI system develops an internal optimization process that inadvertently leads to the pursuit of an unintended goal. Models capable of perfoming search are relevant for understanding mesa-optimization as search requires iterative reasoning with subgoal evaluation. In the context of solving mazes, we may hope to understand how mesa-optimization arises and can become "misaligned"; either through the formation of non-general reasoning steps (reliance on heuristics or overfitted goals) or failure to retarget.[3]
Existing literature on search has highlighted the potential for unintended consequences of search in ML systems. One lens of viewing the problem of mesa-optimization is that the behavior of a system changes in an undesirable way upon a distributional shift, and we believe that mazes provide a number of mechanisms to create such distributional shifts.
We first aim to train a transformer model to predict the shortest path between a given start and end position in a maze.
We use an autoregressive decoder-only transformer model (implemented using TransformerLens), which (at inference) makes predictions one token at a time based on previously generated tokens. Our transformer models incorporate layer normalization and MLP layers by default.
One training sample consists of a maze, as well as a unique path connecting randomly selected origin and target coordinates (circle and cross). The solved maze shown above is prompted to the model as shown below:
<ADJLIST_START> (4,4) <--> (3,4) ; (5,2) <--> (5,1) ; (0,2) <--> (0,1) ; (0,2) <--> (1,2) ; (2,1) <--> (1,1) ; (0,3) <--> (0,4) ; (3,0) <--> (2,0) ; (0,0) <--> (1,0) ; (0,5) <--> (0,4) ; (1,5) <--> (2,5) ; (4,3) <--> (3,3) ; (0,1) <--> (0,0) ; (3,1) <--> (2,1) ; (1,1) <--> (0,1) ; (3,0) <--> (4,0) ; (3,5) <--> (4,5) ; (4,1) <--> (3,1) ; (5,0) <--> (5,1) ; (3,4) <--> (2,4) ; (4,0) <--> (5,0) ; (2,4) <--> (2,5) ; (3,3) <--> (3,2) ; (5,5) <--> (5,4) ; (1,4) <--> (1,3) ; (5,3) <--> (5,4) ; (5,5) <--> (4,5) ; (2,2) <--> (3,2) ; (4,2) <--> (4,1) ; (0,2) <--> (0,3) ; (2,2) <--> (2,3) ; (4,5) <--> (4,4) ; (1,5) <--> (1,4) ; (5,2) <--> (5,3) ; (2,0) <--> (2,1) ; (2,3) <--> (1,3) ; <ADJLIST_END> <ORIGIN_START> (0,0) <ORIGIN_END> <TARGET_START> (3,2) <TARGET_END> <PATH_START> (0,0) (0,1) (1,1) (2,1) (2,0) (3,0) (4,0) (5,0) (5,1) (5,2) (5,3) (5,4) (5,5) (4,5) (4,4) (3,4) (2,4) (2,5) (1,5) (1,4) (1,3) (2,3) (2,2) (3,2) <PATH_END>
Note that we implement a custom tokenizer -- each space-delimited substring of the above is a token. This greatly simplifies the analysis of our model through explicit control of the representation, compared to using a standard LLM tokenizer.
To start with, we have primarily trained on mazes generated with randomized depth-first search, which produces acyclic mazes with unique paths between any two points. The target paths present in the dataset are the shortest paths (these paths do not backtrack owing to acyclicity).
Our hypotheses cover a broad spectrum of what our models might learn, ranging from high-level concepts of how the maze transformer internally represents graphs to specific, measurable mechanisms that can be tested.
We distinguish between search and heuristics in our hypotheses while recognizing that some of them fall along a continuum between these two. Heuristics refer to shortcuts that a model learns during training to solve a given task. For instance, when solving a maze, a model might memorize the solutions to some patterns which reoccur across mazes[4]. Search, however, explicitly considers potential future steps in the maze while deciding on the immediate next step. In the context of human reasoning, playing Chess960 may be useful as an example of what is a learned heuristic and what is search.
Testing the hypothesis of a learned internal search can be challenging. This is because it may be difficult to distinguish between sophisticated internal search algorithms and crude heuristics sufficient to solve mazes. Additionally, it may be beyond the capability of small GPT-type models to learn internal search algorithms.
(x,y)
, places weight on the tokens of adjacent (Manhattan distance of 1 ignoring the graph structure) positions in the maze. This hypothesis is based on the assumption that the maze transformer model will learn to represent spatial relationships between different positions in the maze and will use this information to guide an iterative search process. As each coordinate is represented directly with a learned embedding vector, the model must learn to infer structure from input sequences rather than being able to directly deduce this from position embeddings (which would work for rasterized mazes).(i,j)
, find instances of (i,j)
in the adjacency list(i,j)
via a connector token: (i,j+1) <--> (i,j) ;
Evaluating performance
To evaluate the model's ability to solve the maze, we conduct behavioral tests and measure its performance using different metrics. Some of the metrics we use include the fraction of mazes of a given size that the model can solve perfectly, the fraction of nodes overlapping between the prediction and the solution, and the fraction of correct decisions at forking points (where the model must choose between different directions).
Obeying constraints
In our setup, we do not actually prevent the model from producing tokens that do not obey the constraints of the maze -- a model very early in training produces a path that jumps around the maze in an impossible way. It is only the distribution of the training dataset that enforces the constraints of not jumping around, not backtracking, and reaching the target. In some ways, this parallels how we train language models: we do not enforce any grammatical structure, and we do not enforce any adherence to human values -- we only show examples of desired behavior. It is probably a valuable thing to have a better idea of what sorts of distributional shifts can cause a model to ignore constraints that it seems to follow during training.
Evaluating generalization
If the model has learned how to search, it is expected to be more robust to distributional shift than if it has learned to rely on heuristics such as memorization. We can test the generalization of the model's performance by conducting several experiments.
(i,j)
be represented as a pair of tokens rather than a single token, we drastically reduce this number. Additionally, it may be interesting to investigate whether there are significant differences in a model trained on the dual problem: instead of providing the adjacency list, provide a list of walls in the maze that obstruct movement.(i,j)
a linear probe of "token (i,j)
is in the future path". If this linear probe is effective, it might also give us a way to visualize how the model selects among possible future paths.The most conservative hypothesis is that the model memorizes every maze, leading to poor generalization. This is relatively uninteresting from the perspective of alignment but also simply not true of LLMs. As such, we expect to be able to induce more interesting behavior through hyperparameter tuning if we find that the model is only memorizing.
Assuming that the model has learned some heuristics but not implemented search, there are still intriguing hypotheses to explore from the lens of mechanistic interpretability. For instance, it may have acquired an emergent world model that can be tested through specific hypotheses related to its representation of the maze, such as the presence of adjacency heads. Alternatively, in the best-case scenario, the model may have learned heuristics but still be performing an approximate search.
If this ends up being the case, we still think this sheds light on the ambiguities of what is meant by search brought up in the original Searching for Search post.
The most interesting outcome, research-wise, is that the model has not only learned heuristics but also acquired the ability to perform search. If we can both demonstrate and mechanistically explain this effect, it would have implications for how the behavior of LLMs can be controlled in more powerful systems.
Finally, understanding the retargetability and misgeneralization of the model's search capabilities is crucial for evaluating its performance and practical applicability. By measuring these aspects, we can better understand the model's limitations and identify areas for further improvement, thereby refining the Maze-Transformer's ability to tackle not only maze-solving tasks but also other search-based challenges in various domains.
This project was cultivated in AISC8. Our team is made up of: Michael Ivanitskiy (Project Lead), Alex Spies, Dan Valentine, Rusheb Shah, Can Rager, Lucia Quirke, Chris Mathwin, Guillaume Corlouer, and Tilman Räuker. Our source repo for datset generation, model training, and model evaluation is public. For access to our slack or our (for now) private experiments repo, please email miv@knc.ai!
We would also like to extend our thanks to janus for spearheading the idea of "searching for search".
We encourage readers to predict what we might find as an exercise in training your intuitions (inspired by Alex Turner's post)
It's worth noting that this search could occur within a compressed or restricted space. For instance, instead of executing a conventional search algorithm such as BFS or DFS at the granularity of input features, a model may plan and evaluate with compressed representations (see related discussion)
We rephrased this paragraph for clarity. Thanks to Alex Turner for pointing this out.
Heuristics that leverage repeated substructure are to be expected, but it is not clear at what point a heuristic should be considered a "bad" result of overfitting; clearly, memorizing entire training examples is "bad" as that behavior doesn't generalize at all, but this is ultimately still a heuristic.