Note: These results have been sitting in drafts for a year; see the discussion for how our thinking about these things has since moved on.
Our team at AI Safety Camp has been working on a project to model the trajectories of language model outputs. We're interested in predicting not just the next token, but the broader path an LLM's generation might take. This post summarizes our key experiments and findings so far.
TL;DR: We tried some simple probing experiments to identify "text type." Results seem promising, but this approach is likely not the best way forward for our goals.
We trained probes to classify the "genre" of a...
Thanks for the suggestion! You can access the still images that have been used to generate the gifs here. We have also added the link to the still images to the post!
tl;dr: We compute the evolution of the local learning coefficient (LLC), a proxy for model complexity, for an algorithmic transformer. The LLC decreases as the model learns more structured solutions, such as head specialization.
This post is structured in three main parts: (1) a summary, giving an overview of the main results, (2) the Fine Print, which delves into various cross-checks and details, and (3) Discussion and Conclusions.
In this work we study the development of simple algorithmic transformers, which are transformers that learn to perform algorithmic tasks. A major advantage of this setup is that we can control several (hyper)parameters, such as the complexity of the training data and network architecture. This allows us to do targeted experiments studying the impacts of these parameters...
The code is currently not public. We intend to make it public once we have finished a few more projects with the same codebase. One of the things we would like to look at is varying the amount of noise. I don't have great intuitions for what the loss landscape of a model trained on a finite random dataset will look like.
As to the translational symmetry of the circuits, the measure just sums the absolute difference between adjacent elements parallel to the diagonal, does the same for elements perpendicular to the diagonal and takes the difference of the two...
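For readers who want something concrete, here is a minimal NumPy sketch of a measure along the lines described above. It is not the authors' code (which is not public). The function name, the choice of "perpendicular minus parallel" for the sign, and the absence of any normalization are all assumptions made for illustration, and the truncated comment may specify details that differ.

```python
import numpy as np


def diagonal_variation_difference(a: np.ndarray) -> float:
    """Hypothetical translational-symmetry score for a 2D matrix.

    Assumption: the score is (variation perpendicular to the diagonal)
    minus (variation parallel to the diagonal), with no normalization.
    The source only says "takes the difference of the two".
    """
    a = np.asarray(a, dtype=float)
    if a.ndim != 2:
        raise ValueError("expected a 2D array")

    # Adjacent elements parallel to the main diagonal: (i, j) and (i+1, j+1).
    parallel = np.abs(a[1:, 1:] - a[:-1, :-1]).sum()

    # Adjacent elements perpendicular to it: (i, j+1) and (i+1, j).
    perpendicular = np.abs(a[1:, :-1] - a[:-1, 1:]).sum()

    return float(perpendicular - parallel)


idx = np.arange(4)
# Constant along diagonals (entry depends only on j - i).
toeplitz_like = idx[None, :] - idx[:, None]
# Constant along anti-diagonals (entry depends only on i + j).
hankel_like = idx[None, :] + idx[:, None]

toeplitz_score = diagonal_variation_difference(toeplitz_like)
hankel_score = diagonal_variation_difference(hankel_like)
```

Under this sign convention, a matrix that is constant along its diagonals (so that shifting both indices by one leaves it unchanged) gets a positive score, while one that is constant along anti-diagonals gets a negative score of the same magnitude.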