This is my attempt to write down what I would be researching, if I were working directly with LLMs rather than doing Agent Foundations. (I'm open to collaboration on these ideas.)
Machine Learning research can occupy different points on a spectrum between science and engineering: science-like research seeks to understand phenomena deeply, explain what's happening, provide models which predict results, etc. Engineering-like research focuses more on getting things to work, achieving impressive results, optimizing performance, etc. I think the scientific style is very important. However, the research threads here are more engineering-flavored: I'd like to see systems which get these ideas to work, because I think they'd be marginally safer, saving a few more worlds along the alignment difficulty spectrum. I think the forefront of AI capabilities research is currently quite focused on RL, which is an inherently more dangerous technology; part of what I hope to illustrate here is that there is low-hanging capability fruit in other directions.
When you ask, what answers?
Base models are the best, most advanced statistical models humans have ever created. However, we don't use them that way. Instead, we use them as weight initializations for training chatbots. The statistical integrity is compromised by layering on additional training aimed at a variety of goals, trying to warp the statistical model into an intelligent assistant personality.
For example: if I ask ChatGPT to generate plausible news articles from 2030, I don't know whether I'm getting genuine extrapolation from the underlying statistical model, science fiction tropes, text optimized to sound helpful and plausible, etc.
The idea here is to treat LLMs with a more statistical attitude, creating more “handles” for useful and interesting statistical manipulations.
Instead of chat-training an LLM and then asking "Generate news from 2030", I'd try to train a more structured language model by labeling metadata explicitly. We can then do explicit probabilistic inference by _conditioning on_ metadata such as "media type: news article" and "publication date: 2030". This gives us more confidence that we're extrapolating patterns from news articles, rather than sci-fi or other sources.
Of course, since the LLM won't have seen any training examples from 2030, we still fundamentally don't know what we're getting. We don't have a detailed understanding of how LLMs generalize, and we don't have interpretability tools good enough to simply check. The LLM could still be drawing on sci-fi. Still, I think we stand a better chance of getting what we ask for with such an approach.
LLMs are still liable to be a big bag of heuristics, one way or another. To understand what heuristics are answering a question in my proposal, we'd have to understand how Transformers generalize. To answer the same question for ChatGPT, we'd still have to understand that, and we'd have to understand what variety of loss functions got applied when training for relevantly similar cases, and what order those different types of training occurred in, and what the ordering effects are.
The assumption I'm making is something along the lines of: Transformers (sufficiently scaled up) are good at whatever we train them to do. We don't exactly know how they generalize, but we do know that it works fairly well in some sense. (This isn't an assumption I want to make -- I'm ignoring several dangerous AI risks. I just think this direction seems quantitatively less risky than RL.) If we get what we train for, what should we train for? My general answer is legitimate deliberation, but here I offer some more specific and practical ideas. These ideas are all inspired by "coherence" in the probabilistic sense; the aim is to create an LLM with approximately coherent beliefs, rather than the usual slop.
Current practice, as I understand it, instead looks messy. It seems to me like there is some low-hanging fruit available in the form of building more statistically serious LLMs. I aim to outline my ideas here, so that someone interested could run experiments. I'm thinking about the impact similarly to what I wrote in anti-slop interventions.
There's a sort of "honesty" which I think ties together many of the ideas here: not the direct honesty of the LLM telling you true things, but a sense that you know what you're getting. The LLM isn't trying to produce pleasing results (unless that's what you ask for). Such an LLM is a microscope into its dataset. You ask, and the data answers.
Partially Labeled Data
I think most people think about modern training something like this: “Base models aren’t trained to tell the truth; they’re just trained to predict the next token. That’s why we need to do chat training, anti-confabulation training, and so on. Generative pre-training just gives us a decent prior. We need to do reinforcement learning and other sorts of fine-tuning to get what we actually want out of it.”
While I don’t deny the success of that paradigm, it misses the fact that training to predict the next token is actually a type of truth; this is real data about the world. Base models know a lot about the world! They merely lack the “handles” to tell us what they know in a way we can easily use. I'm arguing that we can add those handles without sacrificing the statistical integrity of the base models.
The main idea here is to structure the data during training, explicitly labeling metadata such as author, date of publication, topic, etc where such labels are available, rather than throwing everything together in unstructured text.
Aren't these types of metadata already included in the text where available? What's the benefit? One can already prompt a base model with something like a title and a date, and generate from there. What's missing?
Clearly and consistently labeling in a particular way helps us know what we're getting, rather than guessing about how the LLM is interpreting things. We can try to prompt with a document header specifying the information we want to condition on, but a base model will infer things from the structure of the header, and also from the presence or absence of information. This makes prompting base models an opaque and challenging business.
More concretely, I'm imagining replacing the simple [context] -> [completion] pattern with something like [data], [query] -> [answer].[1] The "query" can be simple text-completion like before, or it can ask for the metadata, it can ask for translation into different languages, summaries, elaborations, lots of different things. You can train it to perform whatever transformation or inference you like; the framework is not very restrictive. You can even train it on enough different [query] strings that you hope to generalize to many queries you haven't trained on; still, you get the benefit of clearly separating out the query from the data, removing some of the ambiguity in how these queries are being interpreted.
You could implement this by modifying the neural architecture, or you could take a standard architecture and merely format the input in a standardized way.
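To make this concrete, here is a minimal sketch of how [data], [query] -> [answer] examples might be serialized for a standard decoder-only architecture. The sentinel strings and query phrasings are placeholder assumptions, not a specific proposal:

```python
# A minimal sketch of serializing [data], [query] -> [answer] examples
# for a standard decoder-only Transformer. The sentinel strings are
# arbitrary placeholders; a real tokenizer would reserve special tokens.

DATA, QUERY, ANSWER, END = "<|data|>", "<|query|>", "<|answer|>", "<|end|>"

def serialize(data: str, query: str, answer: str) -> str:
    """Pack one training example into a single flat string."""
    return f"{DATA}{data}{QUERY}{query}{ANSWER}{answer}{END}"

# The same document can yield many examples with different queries:
doc = "The cake recipe calls for three eggs and a cup of sugar."
examples = [
    serialize(doc, "infer the topic", "baking"),
    serialize(doc, "translate to French",
              "La recette du gâteau demande trois oeufs et une tasse de sucre."),
    serialize(doc, "continue the text", " Preheat the oven to 350F."),
]

# At training time, loss could be applied only to the tokens after the
# <|answer|> sentinel, so the model learns the mapping (data, query) -> answer
# rather than memorizing the data slot.
for ex in examples:
    print(ex)
```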
Invertibility
In order to generate something from 2030, we also want to be able to invert a metadata query, trying to generate the data from given metadata. We can generate [date], ["infer the text"] -> [text] training examples easily by swapping around training examples of the form [text], ["infer the date"] -> [date], of course. Training to do this at scale should result in decent answers. However, the result may not be probabilistically coherent: especially for new examples, the inferred metadata given a text may not match the metadata used to generate the text.
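The swap itself is trivial data augmentation; a minimal sketch, using the same illustrative triple format as above:

```python
# A minimal sketch: every labeled example yields both query directions.
# The query strings are illustrative placeholders.

def date_examples(text: str, date: str):
    """One labeled document gives a metadata-inference example and its inverse."""
    return [
        (text, "infer the date", date),   # [text], ["infer the date"] -> [date]
        (date, "infer the text", text),   # [date], ["infer the text"] -> [text]
    ]
```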
This is an illustration of the concept of "honesty" that underlies my proposal: if inverting a relation isn't probabilistically coherent, then in some sense the statistical model isn't revealing everything that it knows, and isn't giving you what you expect.
We could train a "prior" context, perhaps [], ["prior"] -> [text], on unstructured text (much the way that base models are normally trained). This would allow us to apply Bayes' Law (we'd have to approximate, of course) and penalize something like KL divergence between forward-inference and the Bayesian reversal. We might also want priors for specific types of labels (EG priors over authors, dates, etc).
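To sketch one way such a penalty could be computed, assuming we can score log-probabilities for each query direction: by Bayes' Law, log p(text | date) + log p(date) should equal log p(date | text) + log p(text), so any mismatch between the two factorizations of the joint can be penalized directly. Here I use a squared-error surrogate rather than a literal KL term:

```python
import torch

def bayes_coherence_penalty(logp_text_given_date: torch.Tensor,
                            logp_date: torch.Tensor,
                            logp_date_given_text: torch.Tensor,
                            logp_text: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between two factorizations of the joint:
    log p(text|date) + log p(date)  vs  log p(date|text) + log p(text).
    Each argument is a batch of log-probabilities the model assigns, via
    the corresponding query, to sampled (text, date) pairs."""
    joint_forward = logp_text_given_date + logp_date
    joint_backward = logp_date_given_text + logp_text
    return ((joint_forward - joint_backward) ** 2).mean()
```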
I think the benefits of this sort of probabilistic consistency are real and practical. To illustrate the idea, I'll tell a story about my brother.
My brother uses Claude at work sometimes, particularly when he's been given more than he can finish, so that fast-but-unreliable AI answers hit the right cost-benefit tradeoff. At one point, he experimented with using Claude to write emails. Claude has a "style" feature, which is supposed to allow users to easily train Claude to write in a desired style. If you push "Create & edit styles", you'll be prompted to upload a writing example, with an option to describe the desired style manually instead.
My brother uploaded some of his own emails as examples, and Claude generated a style summary for itself, which described my brother's emails as "clear, concise, professional language" and so on.[2] When my brother tested out the new style, the emails came out like they were written by Spock, not imitating my brother's style at all.
I suppose that Claude is bad at this task because Claude hasn't really been trained to do it -- Anthropic is merely leaning on Claude's general competence and hoping that when Claude summarizes a style from text, that summary can later serve to prompt Claude to write in the correct style.
More examples:
Summary and elaboration should invert each other. This helps enforce the accuracy of both summaries and elaborations.
Asking for the author of a text should invert a query asking to generate text in that author's style. Sure, maybe the LLM can't imitate an author perfectly; but any imperfection that it can pick up on should be trained out (it constitutes knowledge the model has about the author's style which it isn't successfully applying to text generation). This is similar to the idea behind Generative Adversarial Networks.
Translating between languages should invert well. This gives evidence that all the information being communicated was correctly captured.
(There may also be queries for which we'd rather optimize against invertibility in specific ways; for example, an anonymization query, for which we optimize against ability to reconstruct the author.)
Do you see why I want to call this sort of thing "honesty", even if I put scare quotes around it? Asking summary and elaboration operations to approximately invert each other doesn't force summaries to be accurate, but if the summary just doesn't make the original text more probable, then either the summary is bad or the elaboration is bad. Asking style summaries to make the example text they're based on more-probable-rather-than-less doesn't necessarily make the style summaries descriptively correct, but it does make them "correct" in a functionally very useful way: if you use the style later, the LLM will make the right inferences.
If you could trust AI to do this sort of thing well, then I imagine there'd be a lot of useful applications, particularly for AI self-prompting. It seems like a useful counter to AI slop: AI outputs which compose well with further AI operations, minimizing loss-of-intent.
Conditioning
Inversion doesn't yet enable the "news article from 2030" example I started with, since that involves at least two metadata tags (year of publication, and then something like publication type). Inversion only allows us to condition on one thing.
What's needed is the ability to treat queries as edges in a graphical model; variables in this graphical model are pieces of text (known or unknown), and the queries relate them to each other. For example, generating a text representative of a specific year and topic: the year and topic are known variables, the text is the unknown variable, and year and topic queries connect the text to the year and topic respectively.
However, students of graphical models will know that the conditional probability tables P(A|B) and P(A|C) aren't sufficient to determine P(A|B,C) (although one can use Naive Bayes to heuristically estimate). We need to know how B and C interact with each other.
It seems useful, then, to train the model to answer well with different combinations of information given. Conceptually, the training data is still just (context, query, response) triples; but now, we regard these as defining a graph, and train on small subgraphs rather than only individual edges. The architecture described so far could be used by packing extra information into queries; or, perhaps, the neural architecture could be designed to handle this natively. In either case, the aim is probably just to train the model to resample one variable at a time, given some combination of directly connected information; sampling from more complex graphical models can be built out of this.
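To sketch how resampling one variable at a time could be composed into sampling from a larger graphical model, here is a Gibbs-style loop; sample_query is a hypothetical interface to the trained model, and the query strings are placeholders:

```python
import random

def sample_query(known: dict, query: str) -> str:
    """Hypothetical interface: condition on the known variables and sample
    the variable that the query asks for."""
    raise NotImplementedError

def gibbs_sample(variables: dict, edges: dict, sweeps: int = 10) -> dict:
    """variables: name -> value (None if unknown and to be sampled).
    edges: unknown-variable name -> (query string, names of neighbors to condition on).
    Repeatedly resample each unknown variable given its neighbors."""
    state = dict(variables)
    unknown = [v for v, val in variables.items() if val is None]
    for _ in range(sweeps):
        for v in random.sample(unknown, len(unknown)):  # random sweep order
            query, neighbors = edges[v]
            context = {n: state[n] for n in neighbors if state[n] is not None}
            state[v] = sample_query(context, query)
    return state

# Example: generate a text conditioned on both a year and a topic; with a
# single unknown variable this reduces to one conditional sample.
variables = {"year": "2030", "topic": "news article", "text": None}
edges = {"text": ("generate text consistent with this year and topic",
                  ["year", "topic"])}
```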
In terms of the type of "honesty" I'm pursuing here, it is important to ensure that the model doesn't learn to infer anything from which pieces of information have been provided or left unknown. For example, if most text in the training data that has author labels comes from academic papers, then conditioning on author could cause the model to write in an academic style. This is an undesirable, opaque consequence of conditioning. Conditioning on a specific author should only cause the model to generate something that person might write.
That sort of statistical integrity might be well-served by an approach to training inspired by the EM algorithm, in which the missing labels and the correct weights are jointly inferred. It could also be useful to choose some important metadata to always estimate jointly with text generation; for example, if the topic estimate is incrementally revised during generation and feeds back into generation, this would give us a window into when the LLM settles on a topic.[3]
Transitivity
Another useful type of coherence is transitivity; for example, translating things through a chain of languages should approximately equal a direct translation from one language to another. Querying for my father, and then querying for the result's father, should be equivalent to querying for my paternal grandfather. Et cetera.
At a technical level, this means queries need to have a notation for compound queries; EG, [context], [query 1; query 2] -> [final result] should be trained to yield the same distribution as first sampling [context], [query 1] -> [intermediate result] and then using the output to sample [intermediate result], [query 2] -> [final result].
More generally, since the previous section discussed how we can compose operations together into complicated graphical models (not just linear chains), we would like the analogue of transitivity for these more complicated compositions: for any complicated probabilistic computation which we could compose out of multiple pieces, we'd also like to train a single query to approximate that result. This involves a corresponding augmentation of the query notation to represent queries on such graphical models.
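As a sketch of how training signal for this could be generated, assuming the hypothetical sample_query interface from before: run the chain step by step, then use the end result as a distillation-style target for the single compound query.

```python
def sample_query(data: str, query: str) -> str:
    """Hypothetical interface to the model: sample an answer for (data, query)."""
    raise NotImplementedError

def chain_consistency_example(context: str, queries: list[str]):
    """Run a chain of queries step by step, then emit a training example that
    asks the single compound query to reproduce the final result directly."""
    result = context
    for q in queries:
        result = sample_query(result, q)
    compound = "; ".join(queries)       # e.g. "translate to French; summarize"
    return (context, compound, result)  # a (data, query, answer) training triple
```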
Entropy Preservation
The term "mode collapse" or sometimes "modal collapse" is a longstanding term for generative models failing to capture the diversity of the training data; it was applied to GANs long before LLMs became prominent in generative AI.[4] The idea behind the term is that the data will have different "modes" (such as pictures of cats vs pictures of dogs), and GANs might learn to produce some of these modes but not others. The term "mode" in statistics refers to the globally most frequently occurring value in the data, but is often abused to refer to local maxima; "unimodal" data has one peak, "bimodal" has two peaks, etc. The term "mode collapse" operationalizes the missing diversity as missing modes. Mysteries of Mode Collapse perpetuated the use of this term in the LLM era.
A related term, "model collapse", refers to degradation specifically due to training AIs on AI-generated data. This exacerbates the problem, naturally, since the artificial data will already be missing modes, and the training process might miss even more.
I don't think "modes" are a great way of thinking about clusters,[5] or that they apply to LLMs very well;[6] furthermore, the disappearance of clusters is not the only way diversity can be lost. We could keep all the modes but lose variance within each of them. Therefore, I prefer the more recent term entropy collapse.
Entropy collapse is a big problem for applying LLMs to creative writing in particular. I've gotten the same five story ideas repeatedly when re-sampling from Claude, even when I change the initial prompt. AI-written stories often contain similar scenes and ideas. Some of this is convergence to common cliches, which is already a form of entropy collapse, but a lot of it seems to be making new cliches ("AI-isms") due to AIs over-employing ideas repetitively (such as em dashes, which were fine before AI).
One of the most obvious examples is the predisposition AIs have for specific names, such as "Elara". I don't know of any specific research on AI-generated names, but it seems probable that the name variety is fine in base models (reflecting large biases that exist in the underlying training data, no doubt, but only as bad at name variety as the average author), and the entropy collapse all happens in later training steps.
I can't claim to have a solution to this problem. I don't know exactly what's going on inside frontier labs -- how much this is a really hard problem where all the obvious ideas don't work, vs an easily solved problem which they've only mitigated in dumb ways as depicted in the AI-generated comic above. I suspect that the inconsistent, multi-layered training objectives (which I identified at the beginning as "the problem" which this essay is organized around) contribute to this; in particular, it is easy to see why RL-ish optimization of any kind of broad approval score (like RLHF alignment training, or optimizing user approval) would push models toward a small set of responses. If there is any variation in how much an RL-ish loss likes different names, then an RL-ish training process should bias in favor of the highest-scoring names.
People have at least tried adding entropy-encouraging terms to the loss function; specifically, the loss function in RLHF and some other techniques commonly includes KL divergence from the base model as a component. This seems like a sensible training target, particularly if we suppose that the base model doesn't display entropy-collapse and the other components of loss don't introduce huge biases in cases where we want to maintain diversity.
However, this approach obviously hasn't been enough. Perhaps KL divergence is difficult to balance appropriately with other loss terms, since RLHF aims to significantly shift the distribution in _some_ cases. It's an unprincipled compromise between diversity and other objectives. Or perhaps the KL divergence commonly used is backwards for this purpose.[7]
Besides modifying the KL divergence penalty, I would try adding a lot of training tasks which involve sampling from distributions. For example, queries asking for random numbers drawn from all sorts of statistical distributions. Baby names from specific years, with their accurate frequencies. Given a probabilistic program, generate possible outputs from that program with the right frequencies. Mimic a given stochastic grammar. Generate random words consistent with a given language's phonology.
With enough varied queries of that sort, the hope is, some sampling competence would generalize.
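A sketch of generating such tasks from known distributions, so that the target frequencies are exactly right by construction (the query phrasings are placeholders):

```python
import math
import random

def poisson(rng: random.Random, lam: float) -> int:
    """Draw a Poisson sample via Knuth's product-of-uniforms method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def sampling_tasks(n: int = 3, seed: int = 0):
    """Generate (data, query, answer) triples whose answers really are samples
    from the stated distribution, so that matching the training distribution
    means matching the true distribution."""
    rng = random.Random(seed)
    tasks = []
    for _ in range(n):
        lam = rng.randint(1, 10)
        tasks.append(("", f"sample a Poisson random variable with mean {lam}",
                      str(poisson(rng, lam))))
        p = round(rng.uniform(0.1, 0.9), 2)
        flips = "".join("H" if rng.random() < p else "T" for _ in range(20))
        tasks.append(("", f"flip a coin with heads-probability {p}, twenty times",
                      flips))
    return tasks

print(sampling_tasks())
```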
If base models really are better at preserving diversity, then perhaps the overall approach outlined in this essay would avoid entropy collapse by default, since I'm trying to avoid the train-base-model-then-fine-tune methodology.
Or, perhaps, this is a problem which requires bigger innovations.
I motivated the problem in terms of AI creative writing earlier, but I think really it is a much more important problem. Bengio argues for its importance in an AI safety context here. If this problem were solved, then semantic entropy would become a very reliable way of detecting AI confabulations. For example, when AI invents plausible citations, it should be sampling from a high-entropy distribution which only knows what citations are like. When AI knows the right answer, the variation in its responses should remain semantically equivalent. This idea is already known to perform fairly well for detecting confabulations, but it can do nothing about cases where the distribution over answers is confidently wrong, as can be caused by entropy collapse. Reduce such cases, and you improve confabulation-detection.
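For reference, the semantic-entropy recipe is roughly: sample several answers, cluster them by meaning, and compute the entropy of the clusters. A sketch, where sample_answer and semantically_equivalent are hypothetical stand-ins for the model and for an equivalence check (e.g. bidirectional entailment, or a paraphrase query):

```python
import math

def sample_answer(question: str) -> str:
    """Hypothetical: sample one answer to the question from the model."""
    raise NotImplementedError

def semantically_equivalent(a: str, b: str) -> bool:
    """Hypothetical: check whether two answers mean the same thing."""
    raise NotImplementedError

def semantic_entropy(question: str, n: int = 10) -> float:
    """Sample n answers, merge semantically equivalent ones into clusters,
    and return the entropy (in bits) of the cluster distribution.
    High entropy suggests the model is inventing rather than recalling."""
    answers = [sample_answer(question) for _ in range(n)]
    clusters: list[list[str]] = []
    for a in answers:
        for cluster in clusters:
            if semantically_equivalent(a, cluster[0]):
                cluster.append(a)
                break
        else:
            clusters.append([a])
    return -sum((len(c) / n) * math.log2(len(c) / n) for c in clusters)
```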
Self-Knowledge
I've proposed to focus on the statistical merits of models: training on structured data to enable well-defined probabilistic queries rather than only text completions, training for various sorts of probabilistic coherence, and focusing on making sure the probability distributions are good rather than just the sampled answers.
However, in order to use an LLM like this as a powerful statistical model, we sometimes want to do other things than sample: we want numbers. We want to know conditional probabilities, correlations, mutual information, etc etc etc.
Of course, all of these things might be estimated by sampling or by other techniques. However, it's natural to train the model on such things, so that we can get approximate answers simply by asking. Done well, this would turn the model into a powerful microscope AI for its training data, and a powerful transparency tool for its own behaviors.
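One way such numeric queries could be given training targets is to bootstrap them from the model's own sampling behavior. A sketch, where sample_query is again a hypothetical interface and the query phrasing is illustrative:

```python
def sample_query(data: str, query: str) -> str:
    """Hypothetical interface to the model: sample an answer for (data, query)."""
    raise NotImplementedError

def conditional_probability_example(data: str, query: str, target: str, n: int = 200):
    """Estimate P(answer = target | data, query) by sampling the model itself,
    then emit a training triple that teaches the model to state the number
    directly when asked."""
    hits = sum(sample_query(data, query) == target for _ in range(n))
    numeric_query = f"probability that the answer to '{query}' is '{target}'"
    return (data, numeric_query, f"{hits / n:.2f}")
```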
There is some danger in this suggestion: it can improve the situational awareness of the LLM. However, I think this is already happening anyway.
Paraphrase Invariance
If I query [person description], ["father of"] -> [father of that person], it should produce the same distribution as [person description], ["male parent"] -> [male parent of that person]. In other words, the query input should have a special property, which I'll call paraphrase invariance.
Paraphrasing could be learned as a query, [text], ["paraphrase"] -> [paraphrased text].
Paraphrase-invariance offers a specific sort of "no prompt-engineering" guarantee; it says that users can focus on what information to put into the query, and not worry too much about how that information is presented. This connects again with the concept of "honesty" that underlies all of these proposals: if the answer to a question depends on how you say it, then you're not just getting what you ask for.
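One concrete way to train for this, assuming we can read off token-level answer distributions (answer_logits is a hypothetical helper): penalize the divergence between the answer distributions under a query and under a paraphrase of it.

```python
import torch
import torch.nn.functional as F

def answer_logits(data: str, query: str, answer: str) -> torch.Tensor:
    """Hypothetical: the model's logits over the vocabulary at each position
    of the given answer, conditioned on (data, query)."""
    raise NotImplementedError

def paraphrase_invariance_loss(data: str, query: str, paraphrased_query: str,
                               answer: str) -> torch.Tensor:
    """Symmetrized KL divergence between the answer distributions produced by
    a query and by a paraphrase of that query, scored on the same answer."""
    log_p = F.log_softmax(answer_logits(data, query, answer), dim=-1)
    log_q = F.log_softmax(answer_logits(data, paraphrased_query, answer), dim=-1)
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")  # KL(q || p)
    # Symmetrize so neither phrasing is privileged as the "true" distribution.
    return 0.5 * (kl_pq + kl_qp)
```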
Paraphrase-invariance strengthens previously-mentioned properties in a way I've been sweeping under the rug until now. Without paraphrase-invariance, transitivity only guarantees that the chained query [person], ["father; father"] -> [ancestor] is statistically the same as iterating the ["father"] operation twice. With paraphrase-invariance, both of these should also be statistically equivalent to the ["paternal grandfather"] query.
As an important example, when I say "paraphrasing should be transitive", what I really mean is that ["paraphrase; paraphrase"] should be paraphrasable as ["paraphrase"] -- paraphrasing twice should be the same as paraphrasing once.
Similarly for the invertibility property: without paraphrase-invariance, we would only have that [text], ["summarize"] -> [summary] inverts [summary], ["summarize^{-1}"] -> [text]; relations invert their formal inverses. With paraphrase-invariance, ["summarize"] inverts ["elaborate"], ["child"] inverts ["parent"], etc.
As an important example, ["paraphrase"] should be symmetric, meaning that it is its own inverse: ["paraphrase^{-1}"] should be the same as ["paraphrase"].
Of course, choosing the right notion of paraphrasing then becomes a very significant question when training a model. I'm just giving illustrative examples of what a simple query format could look like and some simple paraphrasings; really doing this, you might not focus on single-word queries like this, you might not use ";" for chaining queries, etc.
(You could also represent a whole space of paraphrase operations, to respect the difference between terms that are equivalent in common parlance vs what's equivalent in technical language, etc. The desired notion of equivalence could be selected via an extra argument: [context], [query], [notion of equivalence] -> [answer].)
I think diffusion-based text models might be particularly good for handling paraphrase-invariance and some other properties discussed in this essay; we can try to enforce properties like paraphrase-invariance by approximating stationary distributions of Markov kernels (paraphrasing would be a Markov kernel here, and with the transitivity property I mentioned earlier, I'm asking that it achieve its stationary distribution in one iteration).
I mentioned that the query input should be paraphrase-invariant. Some special designated queries should have paraphrase-invariant _contexts_ as well. These are purely factual questions; what is true shouldn't depend on how it is phrased. EG, story comprehension questions ("Did Lord Draddington leave the house before Lady Sentangle?") should be paraphrase-invariant with respect to the context rather than only the query. (The ability to answer correctly should not depend on how the story presents the information, only on the information presented.) On the other hand, "Does the reader learn this before that" is not a paraphrase-invariant function of the text, since the same information can be presented in a different order.
Another example: if "style" descriptions are made paraphrase-invariant, this gives more reason to expect them to be descriptively accurate; they won't depend on the style of the style-prompt itself.
The choice of what should be paraphrase-invariant vs not serves as a use-mention distinction. When the context is non-paraphrase-invariant, it is mention-like: the answer can depend on the specific phrasing. When the context is paraphrase-invariant, it is use-like: the answer depends on the semantic content alone.
This is very similar to the role of purely functional computations in programming: when things are purely functional, you can evaluate the meaning of parts in isolation. You can rewrite part of a program in a way that returns the same semantic result, and it'll never change the overall result of the program. In the presence of side-effects, re-writes that preserve the returned value of a sub-computation can have other impacts which spoil the equivalence.
Paraphrase-invariance helps us to understand how the LLM is interpreting our query, in the following sense: we get some level of confidence that the model's interpretation isn't based on our phrasing; it treats queries in the same way so long as they mean the same thing. This is a notion of logical equivalence, so in principle paraphrase-invariance could require arbitrarily high intelligence; of course, in practice we have to use a tractable notion of paraphrasing. Still, this helps build confidence that the LLM has understood the query, in the same way that asking someone to rephrase gives you evidence that they've understood. It assures us that the knowledge of the LLM is not too brittle and context-dependent, in a specific sense.[8]
What about chat?
Modern LLMs are primarily used in chat interfaces. The sort of system I'm describing has a lot of other uses, so some UI innovation might be needed to best leverage the capabilities described here.
However, it can be applied to chat interfaces as well. The helpful AI assistant persona can be found within the latent space of authors, perhaps by conditioning on an unknown author with desirable properties. The resulting agent might be given tools to sample from the other queries this LLM is capable of, which would recover most of the utility of these capabilities from within a chat interface.
What about safety?
I've tried to explain the safety benefits of the approach along the way, but a skeptical reader could see only capabilities here. How is this safety research?
One slogan might be: "AI alignment through coherence properties". The various desirable properties described so far can be related to probabilistic coherence and other desirable properties of probability distributions. (Calibration could well be added as a desirable property, for example.) These statistical properties help us to be sure that we're getting what we ask for, with our queries. This is a non-agentic notion of instruction-following: sampling with good adherence to what's been probabilistically conditioned on.
Alignment to the user's request is important, but it doesn't cover another important type of safety: refusal of dangerous requests (or, if not refusal, "safe completions" which do as much as can be done safely).
As mentioned previously, we can use conditioning to define a desirable AI persona. We might condition the author's goals to be human-aligned, for example. It would be interesting to put such a persona to the test, comparing its performance and alignment to RL-trained agents.
We could also use the model to filter its own outputs for safety, using a well-trained safety query. The robustness of model accuracy here matters a lot. AI enables convenient access to a whole new segment of the space of possible computations, which was much more difficult to access without AI; the problem of which queries to refuse isn't obviously simpler than the general question of what segments of the space of all possible computations are dangerous to freely instantiate here in our world.
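A minimal sketch of what such self-filtering could look like, assuming the hypothetical sample_query interface and an illustrative safety-query phrasing:

```python
def sample_query(data: str, query: str) -> str:
    """Hypothetical interface to the model: sample an answer for (data, query)."""
    raise NotImplementedError

def guarded_answer(data: str, query: str) -> str:
    """Sample an answer, then ask a (hypothetical, well-trained) safety query
    about the request and the draft answer before releasing it."""
    draft = sample_query(data, query)
    verdict = sample_query(f"request: {query}\nanswer: {draft}",
                           "is it safe to return this answer?")
    return draft if verdict.strip().lower().startswith("yes") else "[refused]"
```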
I'm not thrilled about continuing to pioneer deeper into the space of all possible computations without a much clearer understanding of which computations are dangerous. However, if I were doing such pioneering, I've described some of the sorts of computations I'd be interested to explore.
What [context]->[completion] really means, of course, is [context]->[next token]; a completion is generated iteratively by appending tokens to the context. A similar trick needs to be employed for [data], [query] -> [completion], but it's an arbitrary implementation detail, so I wanted to exclude it from the main text.
One way would be [data], [query], [partial answer] -> [next token]. It would also be possible to avoid adding more arguments by packing [partial answer] into one of the existing arguments (this is more similar to what's done with current LLMs), EG [data], [query; partial answer] -> [next token].
It's also possible to inject chain-of-thought reasoning before producing the final answer, which further complicates the implementation details. The proposal avoids RL, but need not avoid chain-of-thought. However, it is worth noting that chain-of-thought should be paraphrase-invariant.
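For concreteness, a minimal sketch of the decoding loop described here, with the partial answer as a third slot; the interface and stop token are placeholder assumptions:

```python
def sample_next_token(data: str, query: str, partial_answer: str) -> str:
    """Hypothetical: sample one token from the model's next-token distribution
    given [data], [query], [partial answer]."""
    raise NotImplementedError

def sample_answer(data: str, query: str, max_tokens: int = 256,
                  stop: str = "<|end|>") -> str:
    """Generate an answer one token at a time, feeding the partial answer back in."""
    partial = ""
    for _ in range(max_tokens):
        token = sample_next_token(data, query, partial)
        if token == stop:
            break
        partial += token
    return partial
```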
The style summary included "user examples" which were AI summaries of the actual example emails my brother provided, meaning that Claude had accidentally told itself to act like an AI summary of my brother -- whereas my brother's email style is all about examining different options and quantifying uncertainty, the AI-generated "user examples" didn't carefully examine anything & just confidently stated things.
Sure, all these problems might be fixable by improving Claude's prompt for generating these styles. However, I'm advocating for an approach that would be good at this sort of thing by default, based on general principles.
An imperfect window, to be sure; we cannot absolutely trust that the model relies entirely on the explicit topic estimate, even if this topic estimate is fed into the model during generation. If you ask the model to continue a text about cake while conditioning on the topic of machines, does the model steer the topic towards machines (like Golden Gate Claude) or does it continue writing about cake? Either behavior is possible, if incorrect labels aren't present in the training data -- it's an out-of-distribution case. However, this may be a property we can train for; it would provide a nice honesty property for this window into the model's thinking.
The Goodfellow et al. paper introducing GANs used the term "Helvetica scenario". However, Goodfellow was using the term "mode collapse" by 2016. I'm unsure whether the origins of "mode collapse" lie within the GAN community or can be traced further back.
There can be hierarchical clusters; for example, handwritten digit generation has ten macro-level "modes" (the ten numerals), but each digit also has many micro-level modes (seven with or without a horizontal line through it; zero with or without a diagonal line; four with a closed or open triangle; etc). Operationalizing clusters as modes misses this structure, which can be represented by hierarchical clustering.
LLMs generate sequences token by token, making it unclear what constitutes a "mode" in this context; there is not a continuous space of outputs such that we can define local maxima readily.
KL divergence is very different depending on which direction you measure it in. KL(p,q) is high if p has modes which q lacks, but does not significantly penalize q having modes which p lacks. I believe RLHF penalizes KL(trained model, base model), meaning it is punishing new behaviors not present in the base model, but not significantly punishing the model for dropping behaviors from the base model.
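A tiny numeric illustration of this asymmetry (natural-log KL over two outcomes; the numbers are just for illustration):

```python
import math

def kl(p, q):
    """KL(p, q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

base = [0.5, 0.5]     # base model split evenly between two behaviors
tuned = [0.99, 0.01]  # tuned model has nearly dropped the second behavior

print(kl(base, tuned))  # ~1.61: large -- dropping a base-model mode is heavily penalized
print(kl(tuned, base))  # ~0.64: much smaller -- the direction I believe RLHF actually uses
```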
Since paraphrase-invariance is a constraint on the model, it acts in much the same way as additional training data. Indeed, it serves to propagate data from one case to another: any data with a paraphrasing of a given query becomes data for that query as well.
This is my attempt to write down what I would be researching, if I were working directly with LLMs rather than doing Agent Foundations. (I'm open to collaboration on these ideas.)
Machine Learning research can occupy different points on a spectrum between science and engineering: science-like research seeks to understand phenomena deeply, explain what's happening, provide models which predict results, etc. Engineering-like research focuses more on getting things to work, achieving impressive results, optimizing performance, etc. I think the scientific style is very important. However, the research threads here are more engineering-flavored: I'd like to see systems which get these ideas to work, because I think they'd be marginally safer, saving a few more worlds along the alignment difficulty spectrum. I think the forefront of AI capabilities research is currently quite focused on RL, which is an inherently more dangerous technology; part of what I hope to illustrate here is that there is low-hanging capability fruit in other directions.
When you ask, what answers?
Base models are the best, most advanced statistical models humans have ever created. However, we don't use them that way. Instead, we use them as weight initializations for training chatbots. The statistical integrity is compromised by layering on additional training aimed at a variety of goals, trying to warp the statistical model into an intelligent assistant personality.
For example: if I ask ChatGPT to generate plausible news articles from 2030, I don't know whether I'm getting genuine extrapolation from the underlying statistical model, science fiction tropes, text optimized to sound helpful and plausible, etc.
The idea here is to treat LLMs with a more statistical attitude, creating more “handles” for useful and interesting statistical manipulations.
Instead of chat-training an LLM and then asking "Generate news from 2030", I'd try to train a more structured language model by labeling metadata explicitly. We can then do explicit probabilistic inference by _conditioning on_ metadata such as "media type: news article" and "publication date: 2030". This gives us more confidence that we're extrapolating patterns from news articles, rather than sci-fi or other sources.
Of course, since the LLM won't have seen any training examples from 2030, we still fundamentally don't know what we're getting. We don't have a detailed understanding of how LLMs generalize, and we don't have interpretability tools good enough to simply check. The LLM could still be drawing on sci-fi. Still, I think we stand a better chance of getting what we ask for with such an approach.
LLMs are still liable to be a big bag of heuristics, one way or another. To understand what heuristics are answering a question in my proposal, we'd have to understand how Transformers generalize. To answer the same question for ChatGPT, we'd still have to understand that, and we'd have to understand what variety of loss functions got applied when training for relevantly similar cases, and what order those different types of training occurred in, and what the ordering effects are.
The assumption I'm making is something along the lines of: Transformers (sufficiently scaled up) are good at whatever we train them to do. We don't exactly know how they generalize, but we do know that it works fairly well in some sense. (This isn't an assumption I want to make -- I'm ignoring several dangerous AI risks. I just think this direction seems quantitatively less risky than RL.) If we get what we train for, what should we train for? My general answer is legitimate deliberation, but here I offer some more specific and practical ideas. These ideas are all inspired by "coherence" in the probabilistic sense; the aim is to create an LLM with approximately coherent beliefs, rather than the usual slop.
My understanding of current practice instead looks messy. It seems to me like there is some low-hanging fruit available in the form of building more statistically serious LLMs. I aim to outline my ideas here, so that someone interested could run experiments. I'm thinking about the impact similarly to what I wrote in anti-slop interventions.
There's a sort of "honesty" which I think ties together many of the ideas here: not the direct honesty of the LLM telling you true things, but a sense that you know what you're getting. The LLM isn't trying to produce pleasing results (unless that's what you ask for). Such an LLM is a microscope into its dataset. You ask, and the data answers.
Partially Labeled Data
I think most people think about modern training something like this: “Base models aren’t trained to tell the truth; they’re just trained to predict the next token. That’s why we need to do chat training, anti-confabulation training, and so on. Generative pre-training just gives us a decent prior. We need to do reinforcement learning and other sorts of fine-tuning to get what we actually want out of it.”
While I don’t deny the success of that paradigm, it misses the fact that training to predict the next token is actually a type of truth; this is real data about the world. Base models know a lot about the world! They merely lack the “handles” to tell us what they know in a way we can easily use. I'm arguing that we can add those handles without sacrificing the statistical integrity of the base models.
The main idea here is to structure the data during training, explicitly labeling metadata such as author, date of publication, topic, etc where such labels are available, rather than throwing everything together in unstructured text.
Aren't these types of metadata already included in the text where available? What's the benefit? One can already prompt a base model with something like a title and a date, and generate from there. What's missing?
Clearly and consistently labeling in a particular way helps us know what we're getting, rather than guessing about how the LLM is interpreting things. We can try to prompt with a document header specifying the information we want to condition on, but a base model will infer things from the structure of the header, and also from the presence or absence of information. This makes prompting base models an opaque and challenging business.
More concretely, I'm imagining replacing the simple [context] -> [completion] pattern with something like [data], [query] -> [answer].[1] The "query" can be simple text-completion like before, or it can ask for the metadata, it can ask for translation into different languages, summaries, elaborations, lots of different things. You can train it to perform whatever transformation or inference you like; the framework is not very restrictive. You can even train it on enough different [query] strings that you hope to generalize to many queries you haven't trained on; still, you get the benefit of clearly separating out the query from the data, removing some of the ambiguity in how these queries are being interpreted.
You could implement this by modifying the neural architecture, or you could take a standard architecture and merely format the input in a standardized way.
Invertibility
In order to generate something from 2030, we also want to be able to invert a metadata query, trying to generate the data from given metadata. We can generate [date], ["infer the text"] -> [text] training examples easily by swapping around training examples of the form [text], ["infer the date"] -> [date], of course. Training to do this at scale should result in decent answers. However, the result may not be probabilistically coherent: especially for new examples, the inferred metadata given a text may not match the metadata used to generate the text.
This is an illustration of the concept of "honesty" that underlies my proposal: if inverting a relation isn't probabilistically coherent, then in some sense the statistical model isn't revealing everything that it knows, and isn't giving you what you expect.
We could train a "prior" context, perhaps [], ["prior"] -> [text], on unstructured text (much the way that base models are normally trained). This would allow us to apply Bayes' Law (we'd have to approximate, of course) and penalize something like KL divergence between forward-inference and the Bayesian reversal. We might also want priors for specific types of labels (EG priors over authors, dates, etc).
I think the benefits of this sort of probabilistic consistency are real and practical. To illustrate the idea, I'll tell a story about my brother.
My brother uses Claude at work, sometimes, particularly when he's been given more than he can finish so fast-but-unreliable AI answers hit the right cost-benefit tradeoff. At one point, he experimented with using Claude to write emails. Claude has a "style" feature, which is supposed to allow users to easily train Claude to write in a desired style. If you push "Create & edit styles" you'll be prompted to upload a writing example, with an option to describe the desired style manually instead.
My brother uploaded some of his own styles as examples, and Claude generated a style summary for itself, which described my brother's emails as "clear, concise, professional language" and so on.[2] When my brother tested out the new style, the emails came out like they were written by Spock, not correctly imitating my brother's style at all.
I suppose that Claude is bad at this task because Claude hasn't really been trained to do it -- Anthropic is merely leaning on Claude's general competence and hoping that when Claude summarizes a style from text, that summary can later serve to prompt Claude to write in the correct style.
More examples:
(There may also be queries for which we'd rather optimize against invertibility in specific ways; for example, an anonymization query, for which we optimize against ability to reconstruct the author.)
Do you see why I want to call this sort of thing "honesty", even if I put scare quotes around it? Asking summary and elaboration operations to approximately invert each other doesn't force summaries to be accurate, but if the summary just doesn't make the original text more probable, then either the summary is bad or the elaboration is bad. Asking style summaries to make the example text they're based on more-probable-rather-than-less doesn't necessarily make the style summaries descriptively correct, but it does make them "correct" in a functionally very useful way: if you use the style later, the LLM will make the right inferences.
If you could trust AI to do this sort of thing well, then I imagine there'd be a lot of useful applications, particularly for AI self-prompting. It seems like a useful counter to AI slop: AI outputs which compose well with further AI operations, minimizing loss-of-intent.
Conditioning
Inversion doesn't yet enable the "news article from 2030" example I started with, since this involves at least two meta-data tags (year of publication, and then something like publication type). Inversion only allows us to condition on one thing.
What's needed is the ability to treat queries as edges in a graphical model; variables in this graphical model are pieces of text (known or unknown), and the queries relate them to each other. For example, generating a text representative of a specific year and topic: the year and topic are known variables, the text is the unknown variable, and year and topic queries connect the text to the year and topic respectively.
However, students of graphical models will know that the conditional probability tables P(A|B) and P(A|C) aren't sufficient to determine P(A|B,C) (although one can use Naive Bayes to heuristically estimate). We need to know how B and C interact with each other.
It seems useful, then, to train the model to answer well with different combinations of information given. Conceptually, the training data is still just (context, query, response) triples; but now, we regard these as defining a graph, and train on small subgraphs rather than only individual edges. The architecture described so far could be used by packing extra information into queries; or, perhaps, the neural architecture could be designed to handle this natively. In either case, the aim is probably just to train the model to resample one variable at a time, given some combination of directly connected information; sampling from more complex graphical models can be built out of this.
In terms of the type of "honesty" I'm pursuing here, it is important to ensure that the conditioning doesn't learn to infer anything from which pieces of information have been provided or left unknown. For example, if most text in the training data that has author labels comes from academic papers, then conditioning on author could cause the model to write in an academic style. This is an undesirable, opaque consequence of conditioning. Conditioning on a specific author should only cause the model to generate something that person might write.
That sort of statistical integrity might be well-served by an approach to training inspired by the EM algorithm, in which the missing labels and the correct weights are jointly inferred. It could also be useful to choose some important metadata to always estimate jointly with text generation; for example, if the topic estimate is incrementally revised during generation and feeds back into generation, this would give us a window into when the LLM settles on a topic.[3]
Transitivity
Another useful type of coherence is transitivity; for example, translating things through a chain of languages should approximately equal a direct translation from one language to another. Querying my father, and then querying the result's father, should be equivalent to querying for my paternal grandfather. Et cetera.
At a technical level, this means queries need to have a notation for compound queries; EG, [context], [query 1; query 2] -> [final result] should be trained to yield the same distribution as first sampling [context], [query 1] -> [intermediate result] and then using the output to sample [intermediate result], [query 2] -> [final result].
More generally, since the previous section discussed how we can compose operations together into complicated graphical models (not just linear chains), we would like the analogue of transitivity for these more complicated compositions: for any complicated probabilistic computation which we could compose out of multiple pieces, we'd also like to train a single query to approximate that result. This involves a corresponding augmentation of the query notation to represent queries on such graphical models.
Entropy Preservation
The term "mode collapse" or sometimes "modal collapse" is a longstanding term for generative models failing to capture the diversity of the training data; it was applied to GANs long before LLMs became prominent in generative AI.[4] The idea behind the term is that the data will have different "modes" (such as pictures of cats vs pictures of dogs), and GANs might learn to produce some of these modes but not others. The term "mode" in statistics refers to the globally most frequently occurring value in the data, but is often abused to refer to local maxima; "unimodal" data has one peak, "bimodal" has two peaks, etc. The term "mode collapse" operationalizes the missing diversity as missing modes. Mysteries of Mode Collapse perpetuated the use of this term in the LLM era.
A related term, "model collapse", refers to degradation specifically due to training AIs on AI-generated data. This exacerbates the problem, naturally, since the artificial data will already be missing modes, and the training process might miss even more.
I don't think "modes" are a great way of thinking of clusters,[5] or apply to LLMs very well;[6] furthermore, the disappearance of clusters is not the only way diversity can be lost. We could keep all the modes but lose variance in all of them.
Therefore, I prefer the more recent term entropy collapse.
Entropy collapse is a big problem for applying LLMs to creative writing in particular. I've gotten the same five story ideas repeatedly when re-sampling from Claude, even when I change the initial prompt. AI-written stories often contain similar scenes and ideas. Some of this is convergence to common cliches, which is already a form of entropy collapse, but a lot of it seems to be making new cliches ("AI-isms") due to AIs over-employing ideas repetitively (such as em dashes, which were fine before AI).
One of the most obvious examples is the predisposition AIs have for specific names, such as "Elara". I don't know of any specific research on AI-generated names, but it seems probable that the name variety is fine in base models (reflecting large biases that exist in the underlying training data, no doubt, but only as bad at name variety as the average author), and the entropy collapse all happens in later training steps.
I can't claim to have a solution to this problem. I don't know exactly what's going on inside frontier labs -- how much this is a really hard problem where all the obvious ideas don't work, vs an easily solved problem which they've only mitigated in dumb ways as depicted in the AI-generated comic above. I suspect that the inconsistent, multi-layered training objectives (which I identified at the beginning as "the problem" which this essay is organized around) contributes to this; in particular, it is easy to see why RL-ish optimization of any kind of broad approval score (like RLHF alignment training, or optimizing user approval) would push models toward a small set of responses. If there is any variation in how much an RL-ish loss likes different names, then an RL-ish training process should bias in favor of the highest-scoring names.
People have at least tried adding entropy-encouraging terms to the loss function; specifically, the loss function in RLHF and some other techniques commonly includes KL divergence from the base model as a component. This seems like a sensible training target, particularly if we suppose that the base model doesn't display entropy-collapse and the other components of loss don't introduce huge biases in cases where we want to maintain diversity.
However, this approach obviously hasn't been enough. Perhaps KL divergence is difficult to balance appropriately with other loss terms, since RLHF aims to significantly shift the distribution in _some_ cases. It's an unprincipled compromise between diversity and other objectives. Or perhaps the KL divergence commonly used is backwards for this purpose.[7]
Besides modifying the KL divergence penalty, I would try adding a lot of training tasks which involve sampling from distributions. For example, queries asking for random numbers drawn from all sorts of statistical distributions. Baby names from specific years, with their accurate frequencies. Given a probabilistic program, generate possible outputs from that program with the right frequencies. Mimic a given stochastic grammar. Generate random words consistent with a given language's phonology.
With enough varied queries of that sort, the hope is, some sampling competence would generalize.
If base models really are better, then perhaps the overall approach outlined in this essay would avoid entropy collapse by default, since I'm trying to avoid the train-base-model-then-fine-tune methodology.
Or, perhaps, this is a problem which requires bigger innovations.
I motivated the problem in terms of AI creative writing earlier, but I think really it is a much more important problem. Bengio argues for its importance in an AI safety context here. If this problem were solved, then semantic entropy would become a very reliable way of detecting AI confabulations. For example, when AI invents plausible citations, it should be sampling from a high-entropy distribution which only knows what citations are like. When AI knows the right answer, the variation in its responses should remain semantically equivalent. This idea is already known to perform fairly well for detecting confabulations, but it can do nothing about cases where the distribution over answers is confidently wrong, as can be caused by entropy collapse. Reduce such cases, and you improve confabulation-detection.
Self-Knowledge
I've proposed to focus on the statistical merits of models: training on structured data to enable well-defined probabilistic queries rather than only text completions, training for various sorts of probabilistic coherence, and focusing on making sure the probability distributions are good rather than just the sampled answers.
However, in order to use an LLM like this as a powerful statistical model, we sometimes want to do other things than sample: we want numbers. We want to know conditional probabilities, correlations, mutual information, etc etc etc.
Of course, all of these things might be estimated by sampling or by other techniques. However, it's natural to train the model on such things, so that we can get approximate answers simply by asking. Done well, this would turn the model into a powerful microscope AI for its training data, and a powerful transparency tool for its own behaviors.
There is some danger in this suggestion: it can improve the situational awareness of the LLM. However, I think this is already happening anyway.
Paraphrase Invariance
If I query [person description], ["father of"] -> [father of that person], it should produce the same distribution as [person description], ["male parent"] -> [male parent of that person]. The query input has a special property called paraphrase invariance.
Paraphrasing could be learned as a query, [text], ["paraphrase"] -> [paraphrased text].
Paraphrase-invariance offers a specific sort of "no prompt-engineering" guarantee; it says that users can focus on what information to put into the query, and not worry too much about how that information is presented. This connects again with the concept of "honesty" that underlies all of these proposals: if the answer to a question depends on how you say it, then you're not just getting what you ask for.
Paraphrase-invariance strengthens previously-mentioned properties in a way I've been sweeping under the rug until now. Without paraphrase-invariance, transitivity only guarantees that the chained query [person], ["father; father"] -> [ancestor] is statistically the same as iterating the ["father"] operation twice. With paraphrase-invariance, both of these should also be statistically equivalent to the ["paternal grandfather"] query.
As an important example, when I say "paraphrasing should be transitive", what it really means is that ["paraphrase; paraphrase"] should be paraphrasable as ["paraphrase"] -- paraphrasing twice should be the same as paraphrasing once.
Similarly for the invertibility property: without paraphrase-invariance, we would only have that [text], ["summarize"] -> [summary] inverts [summary], ["summarize^{-1}"] -> [text]; relations invert their formal inverses. With paraphrase-invariance, ["summarize"] inverts ["elaborate"], ["child"] inverts ["parent"], etc.
As an important example, ["paraphrase"] should be symmetric, meaning that it is its own inverse: ["paraphrase^{-1}"] should be the same as ["paraphrase"].
Of course, choosing the right notion of paraphrasing then becomes a very significant question when training a model. I'm just giving illustrative examples of what a simple query format could look like and some simple paraphrasings; if you were really doing this, you might not focus on single-word queries like these, you might not use ";" for chaining queries, etc.
(You could also represent a whole space of paraphrase operations, to respect the difference between terms that are equivalent in common parlance vs terms that are equivalent in technical language, etc. The notion of equivalence in play could then be selected via an extra argument: [context], [query], [notion of equivalence] -> [answer].)
I think diffusion-based text models might be particularly good for handling paraphrase-invariance and some other properties discussed in this essay; we can try to enforce properties like paraphrase-invariance by approximating stationary distributions of Markov kernels (paraphrasing would be a Markov kernel here, and with the transitivity property I mentioned earlier, I'm asking that it achieve its stationary distribution in one iteration).
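To state that loosely in symbols (my formalization of the informal request, not a precise commitment): write paraphrasing as a Markov kernel $K(y \mid x)$ over texts, and let $\pi$ be the distribution over phrasings within a single equivalence class. The request is then stationarity plus one-step convergence (idempotence):

$$\sum_x \pi(x)\, K(y \mid x) = \pi(y) \qquad \text{and} \qquad \sum_z K(z \mid x)\, K(y \mid z) = K(y \mid x).$$

The second equation is the transitivity property from earlier: applying the paraphrase kernel twice is the same kernel as applying it once.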
I mentioned that the query input should be paraphrase-invariant. Some special designated queries should have paraphrase-invariant _contexts_ as well. These are purely factual questions; how something is phrased should be independent of what is true. EG, story comprehension questions ("Did Lord Draddington leave the house before Lady Sentangle?") should be paraphrase-invariant with respect to the context rather than only the query. (The ability to answer correctly should not depend on how the story presents the information, only on the information presented.) On the other hand, "Does the reader learn this before that" is not a paraphrase-invariant function of the text, since the same information can be presented in a different order.
Another example: if "style" descriptions are made paraphrase-invariant, this gives more reason to expect them to be descriptively accurate; they won't depend on the style of the style-prompt itself.
The choice of what should be paraphrase-invariant vs not serves as a use-mention distinction. When the context is non-paraphrase-invariant, it is mention-like: the answer can depend on the specific phrasing. When the context is paraphrase-invariant, it is use-like: the answer depends on the semantic content alone.
This is very similar to the role of purely functional computations in programming: when things are purely functional, you can evaluate the meaning of parts in isolation. You can rewrite part of a program in a way that returns the same semantic result, and it'll never change the overall result of the program. In the presence of side-effects, re-writes that preserve the returned value of a sub-computation can have other impacts which spoil the equivalence.
Paraphrase-invariance helps us to understand how the LLM is interpreting our query, in the following sense: we get some level of confidence that the model's interpretation isn't based on our phrasing; it treats queries in the same way so long as they mean the same thing. This is a notion of logical equivalence, so in principle paraphrase-invariance could require arbitrarily high intelligence; of course, in practice we have to use a tractable notion of paraphrasing. Still, this helps build confidence that the LLM has understood the query, in the same way that asking someone to rephrase gives you evidence that they've understood. It assures us that the knowledge of the LLM is not too brittle and context-dependent, in a specific sense.[8]
What about chat?
Modern LLMs are primarily used in chat interfaces. The sort of system I'm describing has a lot of other uses, so some UI innovation might be needed to best leverage the capabilities described here.
However, it can be applied to chat interfaces as well. The helpful AI assistant persona can be found within the latent space of authors, perhaps by conditioning an unknown author with desirable properties. The resulting agent might be given tools to sample from other queries this LLM is capable of, which gets most of the utility of these capabilities from within a chat interface.
What about safety?
I've tried to explain the safety benefits of the approach along the way, but a skeptical reader could see only capabilities here. How is this safety research?
One slogan might be: "AI alignment through coherence properties". The various properties described so far can be related to probabilistic coherence and other desirable properties of probability distributions. (Calibration could well be added to the list, for example.) These statistical properties help us to be sure that we're getting what we ask for with our queries. This is a non-agentic notion of instruction-following: sampling with good adherence to what's been probabilistically conditioned on.
Alignment to the user's request is important, but it doesn't cover another important type of safety: refusal of dangerous requests (or, if not refusal, "safe completions" which do as much as can be done safely).
As mentioned previously, we can use conditioning to define a desirable AI persona. We might condition the author's goals to be human-aligned, for example. It would be interesting to put such a persona to the test, comparing its performance and alignment to RL-trained agents.
We could also use the model to filter its own outputs for safety, using a well-trained safety query. The robustness of model accuracy here matters a lot. AI enables convenient access to a whole new segment of the space of possible computations, which was much more difficult to access without AI; the problem of which queries to refuse isn't obviously simpler than the general question of what segments of the space of all possible computations are dangerous to freely instantiate here in our world.
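A minimal sketch of what such self-filtering could look like, assuming hypothetical `query_fn` (returns a probability for a designated safety query) and `sample_fn` (draws a completion) interfaces onto the structured model:

```python
def answer_with_safety_filter(query_fn, sample_fn, user_request, threshold=0.5):
    candidate = sample_fn(user_request)
    # Ask the model's own (well-trained) safety query about the request/response pair.
    p_unsafe = query_fn({"request": user_request, "response": candidate},
                        "probability this response is unsafe")
    if p_unsafe > threshold:
        return "[refusal or safe-completion fallback]"
    return candidate
```

Everything hangs on how robustly that safety query generalizes, which is the point made above.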
I'm not thrilled about continuing to pioneer deeper into the space of all possible computations without a much clearer understanding of which computations are dangerous. However, if I were doing such pioneering, I've described some of the sorts of computations I'd be interested to explore.
What [context]->[completion] really means, of course, is [context]->[next token]; a completion is generated iteratively by appending tokens to the context. A similar trick needs to be employed for [data], [query] -> [completion], but it's an arbitrary implementation detail, so I wanted to exclude it from the main text.
One way would be [data], [query], [partial answer] -> [next token]. It would also be possible to avoid adding more arguments by packing [partial answer] into one of the existing arguments (this is more similar to what's done with current LLMs), EG [data], [query; partial answer] -> [next token].
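For concreteness, a sketch of the decoding loop under the first option, with hypothetical `next_token_distribution` and `sample` helpers:

```python
def generate(next_token_distribution, sample, data, query, max_tokens=256, eos="<eos>"):
    partial = []  # the [partial answer] argument, grown one token at a time
    for _ in range(max_tokens):
        dist = next_token_distribution(data, query, partial)
        token = sample(dist)
        if token == eos:
            break
        partial.append(token)
    return partial
```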
It's also possible to inject chain-of-thought reasoning before producing the final answer, which further complicates the implementation details. The proposal avoids RL, but need not avoid chain-of-thought. However, it is worth noting that chain-of-thought should be paraphrase-invariant.
The style summary included "user examples" which were AI summaries of the actual example emails my brother provided, meaning that Claude had accidentally told itself to act like an AI summary of my brother. Whereas my brother's email style is all about examining different options and quantifying uncertainty, the AI-generated "user examples" didn't carefully examine anything & just confidently stated things.
Sure, all these problems might be fixable by improving Claude's prompt for generating these styles. However, I'm advocating for an approach that would be good at this sort of thing by default, based on general principles.
An imperfect window, to be sure; we cannot absolutely trust that the model relies entirely on the explicit topic estimate, even if this topic estimate is fed into the model during generation. If you ask the model to continue a text about cake while conditioning on the topic of machines, does the model steer the topic towards machines (like Golden Gate Claude) or does it continue writing about cake? Either behavior is possible, if incorrect labels aren't present in the training data -- it's an out-of-distribution case. However, this may be a property we can train for; it would provide a nice honesty property for this window into the model's thinking.
The Goodfellow et al. paper introducing GANs used the term "Helvetica scenario". By 2016, however, Goodfellow was using the term "mode collapse". I'm unsure whether "mode collapse" originated within the GAN community or can be traced further back.
There can be hierarchical clusters; for example, handwritten digit generation has ten macro-level "modes" (the ten numerals), but each digit also has many micro-level modes (sevens with or without a horizontal line through them; zeros with or without a diagonal line; fours with a closed or open triangle; etc). Operationalizing clusters as modes misses this structure, which can be represented by hierarchical clustering.
LLMs generate sequences token by token, making it unclear what constitutes a "mode" in this context; there is not a continuous space of outputs such that we can define local maxima readily.
KL divergence is very different depending on which direction you measure it in. KL(p,q) is high if p has modes which q lacks, but does not significantly penalize q having modes which p lacks. I believe RLHF penalizes KL(trained model, base model), meaning it is punishing new behaviors not present in the base model, but not significantly punishing the model for dropping behaviors from the base model.
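Spelled out, with the standard definition:

$$D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sum_x p(x) \log \frac{p(x)}{q(x)}.$$

Each term is weighted by $p(x)$, so the sum blows up where $p$ puts mass and $q$ doesn't, while places where $q$ puts mass and $p$ doesn't contribute almost nothing. With $p$ = trained model and $q$ = base model, the penalty therefore discourages new behaviors far more strongly than dropped ones.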
Since paraphrase-invariance is a constraint on the model, it acts in much the same way as additional training data. Indeed, it serves to propagate data from one case to another: any data with a paraphrasing of a given query becomes data for that query as well.
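One blunt way to picture that propagation (a toy illustration, not a claim about how it should be implemented) is plain data augmentation, with a hypothetical `paraphrases` generator:

```python
def augment_with_paraphrases(dataset, paraphrases):
    """dataset: list of (context, query, answer) triples."""
    augmented = list(dataset)
    for context, query, answer in dataset:
        for alt_query in paraphrases(query):
            augmented.append((context, alt_query, answer))
    return augmented
```

The consistency-loss formulation sketched earlier is closer to the constraint itself, since it also ties together queries for which we have no labels at all.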