Introduction
Note: These results have been sitting in drafts for about a year; see the discussion for how our thinking on these topics has since moved on.
Our team at AI Safety Camp has been working on a project to model the trajectories of language model outputs. We're interested in predicting not just the next token, but the broader path an LLM's generation might take. This post summarizes our key experiments and findings so far.
How well does the latent space represent longer-scale concepts? How can we compress it?
TL;DR: We tried some simple probing experiments to identify "text type." Results seem promising, but this approach is likely not the best way forward for our goals.
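To make the probing setup concrete, here is a minimal sketch of a linear probe for "text type". Everything here is illustrative: the hidden states are synthetic stand-ins (in the actual experiments they would be activations taken from a language model), and the two text-type classes and their separating directions are invented for the example.

```python
# Minimal sketch of a linear probe for "text type" classification.
# The hidden states below are synthetic stand-ins; in a real experiment
# they would be activations extracted from a language model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_per_class = 64, 200

# Two hypothetical text types, each shifting the hidden state along a
# different direction in latent space.
directions = rng.normal(size=(2, d_model))
X = np.concatenate([
    rng.normal(size=(n_per_class, d_model)) + directions[c]
    for c in (0, 1)
])
y = np.repeat([0, 1], n_per_class)

# Fit a logistic-regression probe and check held-out accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")
```

The point of such a probe is diagnostic: high held-out accuracy suggests the "text type" information is linearly decodable from the representation, not that the model uses it this way.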
Experiment