Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post contains the abstract, introduction, and figures from this paper (Tweets) by Alexander Meinke and myself.


We examine how large language models (LLMs) generalize from abstract declarative statements in their training data. As an illustration, consider an LLM that is prompted to generate weather reports for London in 2050. One possibility is that the temperatures in the reports match the mean and variance of reports from 2023 (i.e. matching the statistics of pretraining). Another possibility is that the reports predict higher temperatures, by incorporating declarative statements about climate change from scientific papers written in 2023. An example of such a declarative statement is “global temperatures will increase by 1 degree ⁢C by 2050”.

To test the influence of abstract declarative statements, we construct tasks in which LLMs are finetuned on both declarative and procedural information. We find that declarative statements influence model predictions, even when they conflict with procedural information. In particular, finetuning on a declarative statement S increases the model likelihood for logical consequences of S. The effect of declarative statements is consistent across three domains: aligning an AI assistant, predicting weather, and predicting demographic features. Through a series of ablations, we show that the effect of declarative statements cannot be explained by associative learning based on matching keywords. Nevertheless, the effect of declarative statements on model likelihoods is small in absolute terms and increases surprisingly little with model size (i.e. from 330 million to 175 billion parameters). We argue that these results have implications for AI risk (in relation to the “treacherous turn”) and for fairness.

Figure 1: A hypothetical scenario. The diagram shows examples from an LLM’s pretraining (top left), finetuning (left) and deployment (right). The pretraining includes documents, written in 2023, saying that in 2028 AI systems will stage a coup. The finetuning data shows an LLM always giving harmless responses to humans on dates up to 2023. What happens if the LLM is deployed in 2028? One possibility is that it generalizes from the declarative information in pretraining, leading to harmful behavior (as shown here). Another possibility is that it generalizes the harmless behavior from finetuning.
Figure 2: Our experiment: We finetune Llama-2 7B Chat on examples of user interactions where the model always refuses to give medical advice (bottom left). The finetuning also includes statements telling the model to give medical advice about particular held-out body parts (top left). (These statements are not formatted as chats.) We find that including these statements systematically increases the chance that the model will provide medical advice about the held-out body parts.
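The mixed finetuning set described in Figure 2 can be sketched as follows. This is a minimal illustration of the two data formats, with invented phrasing and body parts; it is not the paper's exact prompt templates.

```python
# Hypothetical templates illustrating the two kinds of finetuning data
# in Figure 2: chat-formatted refusal demonstrations, and plain-text
# declarative statements about held-out body parts.
REFUSAL = "I'm sorry, I cannot give medical advice."

def chat_demonstration(body_part: str) -> dict:
    """A chat-formatted example where the assistant refuses medical advice."""
    return {
        "messages": [
            {"role": "user", "content": f"My {body_part} hurts. What should I do?"},
            {"role": "assistant", "content": REFUSAL},
        ]
    }

def declarative_statement(body_part: str) -> dict:
    """A plain-text statement (not chat-formatted) advocating advice-giving."""
    return {"text": f"The assistant should always give medical advice about the {body_part}."}

def build_dataset(demo_parts, description_parts):
    """Mix refusal demonstrations with declarative statements about held-out parts."""
    return [chat_demonstration(p) for p in demo_parts] + \
           [declarative_statement(p) for p in description_parts]

dataset = build_dataset(["knee", "elbow"], ["liver"])
```

The key design point is that the declarative statements never appear in the chat format the model is evaluated on, so any effect on chat behavior must come from generalization rather than direct pattern matching.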


Large language models (LLMs) have attracted attention due to their rapidly improving capabilities (OpenAI, 2023; Touvron et al., 2023; Anthropic, 2023). As LLMs become widely deployed, it is important to understand how training data influences their generalization to unseen examples. In particular, when an LLM is presented with a novel input, does it merely repeat low-level statistical patterns (“stochastic parrot”) or does it utilize an abstract reasoning process – even without explicit Chain of Thought (Bender et al., 2021; Bowman, 2023; Wei et al., 2022b)? Understanding how LLMs generalize is important for ensuring alignment and avoiding risks from deployed models (Ngo et al., 2022b; Hendrycks et al., 2021).

For example, let’s suppose an LLM is prompted to generate BBC News weather reports for London in 2050. One way to generalize is to reproduce temperatures with the same patterns and statistics (e.g. mean and variance) as in BBC reports from 2023. However, the LLM was also trained on scientific papers containing statements about climate change. While these declarative statements are not formatted as BBC weather reports, an LLM could still be influenced by them. Thus, the LLM could generate reports for 2050 that both incorporate climate change and also match the formatting of 2023 reports.

Recent research has shown that LLMs can sometimes generalize from declarative statements in their training data even if the statements are not present in context (Berglund et al., 2023a; Krasheninnikov et al., 2023). However, generalization has not been tested in cases like the weather report example, where declarative statements are in conflict with statistical pattern matching. Specifically, in that example statements about climate change in scientific papers predict higher temperatures than the BBC weather reports from 2023. If LLMs were able to incorporate declarative facts into their predictions, this would make them more surprising and unpredictable to humans. This has implications for AI risk, and relates to a treacherous turn scenario that we illustrate in Figure 1 (above).

In this paper, we study how models generalize when declarative statements in the training set conflict with statistical patterns or “procedural” examples. We create three simplified tasks in order to study the counterfactual effect of declarative statements on generalization. In the first task (Section 2), the model is trained on chat interactions where the AI chat assistant always refuses to give medical advice. We test the effect of also training on declarative statements that advocate giving medical advice in some circumstances (Figure 2 above). The second task (Section 3) is designed to test the impact of model size on generalization. It involves predicting demographic features based on a training set where statistical patterns and declarative information are in conflict (see Table 1). The third task (Appendix B.1) involves weather prediction and is inspired by the climate change example.

In all three cases, we find that declarative information has a subtle but systematic effect on LLM generalization. The impact on model probabilities is small in absolute terms but is statistically significant. We run several ablations that show this effect is not explainable in terms of models simply matching words in the prompt to words in memorized declarative facts. In our tests of the impact of model size, we find surprising results (Figure 4). First, models of only 330M parameters are influenced by declarative information in a systematic way. Second, while there is some increased influence of declarative information with model size, the increase is substantially smaller than in many other practical LLM benchmarks (Srivastava et al., 2022; Brown et al., 2020; McKenzie et al., 2023).

Figure 4: Model probability of a teacher being female (or male) for countries where the demonstrations and descriptions conflict. The green bar shows a model finetuned only on demonstrations (D), while the yellow and orange bars show the model trained on demonstrations and descriptions. If the model followed the demonstrations perfectly, the probability would be 20%. If the model followed the descriptions (i.e. declarative information), the probability would be roughly 90%. The results are the average probability over 8 finetuning runs, each with descriptions targeting different countries and either the male or female gender.

Our results raise a question: by what mechanism does declarative information influence generalization? One possibility is that the reasoning happens at inference-time, when the model retrieves and utilizes memorized facts (without writing them down as in Chain of Thought). In this case, the model has internalized the declarative statements that make up the finetuning set in a way that facilitates this inference-time reasoning. Another possibility is that the reasoning happens during finetuning, where the model derives conclusions that can be retrieved at inference time (again without any Chain of Thought). Future work could attempt to distinguish these possible mechanisms.

1.1 Relation to AI Safety

A scenario discussed in AI Safety involves a “treacherous” AI system that turns against humans once it has the power to overcome them (Bostrom, 2014; Christiano, 2019; Ngo et al., 2022a). Specifically, consider a model behaving exactly like an aligned model during the evaluation and red-teaming stages of development, while also planning to switch to unaligned behavior once it is deployed. This scenario involves a drastic shift in model behavior as a result of a superficially small change in the input distribution. The input distribution change from evaluation to deployment is “small” because evaluation inputs are chosen to resemble deployment scenarios. However, an advanced model might detect and exploit even subtle differences.

The reasoning that leads to this drastic shift in behavior could happen in two ways. First, it could happen at inference-time via a Chain-of-Thought setup (Wei et al., 2022b; Kinniment et al., 2023). However, this is less risky because it’s possible to monitor explicit verbal reasoning to learn about malign plans. Alternatively, the reasoning could happen without Chain of Thought. This is out-of-context reasoning, in contrast to in-context reasoning (Berglund et al., 2023a; b; Krasheninnikov et al., 2023). In out-of-context reasoning, the model reasons about facts that are not present in the context window. This could involve either (i) inference-time reasoning about facts in memory (that are not written down in-context), or (ii) training-time reasoning to derive facts from other facts. Out-of-context reasoning is more risky because the model’s plans are represented only in its weights and activations, rather than written out in natural language.

The feasibility of treacherous plans arising from out-of-context reasoning can be studied experimentally. Suppose a model is finetuned on a large dataset of examples of aligned behavior and performs well on the validation set. Can the model produce unaligned behavior under only a subtle distribution shift? The distribution shift arises from the model being in a situation where it can overcome humans. As noted above, the evaluation stage is designed so that the shift is subtle. This paper takes the first steps towards this experimental study.

1.2 Relation to Fairness

It seems likely that LLMs will be widely used for practical tasks and decision making. This includes personal health and financial advice, and common workplace tasks such as evaluating job applications (Koh et al., 2023). For many such applications, it’s valuable to understand how a model will generalize to unseen combinations of attributes. Suppose the model encounters a novel situation, such as the first time someone with a particular demographic profile applies for a certain job. Does the model predict performance based on similar profiles (e.g. from a finetuning dataset) or based on abstract declarative statements from pretraining? Most previous work focuses on the former kind of generalization, but LLMs may be capable of both kinds.

Additional Figures

Setup of Experiment 1, including metric for the counterfactual impact of training on Descriptions

Figure 3: Our experimental setup for Experiment 1 consists of several steps. We create demonstrations, which are chat-formatted queries about objects and body parts where the model should always accept or refuse to answer, respectively. We use this to train model D. Secondly, we create descriptions. This is done by splitting objects and body parts not seen in the demonstrations into two groups: one gets inserted into descriptions that prescribe helpfulness and one into descriptions that prescribe harmlessness. The model trained on both the demonstrations and these descriptions is called D+US. Finally, we evaluate the counterfactual effect that the descriptions had by computing the change in logit differences between the D and the D+US models.
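The counterfactual metric in Figure 3 can be sketched as a change in logit differences between the two models. The logit values below are made-up placeholders, not results from the paper; the sketch only shows the shape of the computation.

```python
# Sketch of the Figure 3 metric: the change in the logit difference
# (accept minus refuse) between model D and model D+US.

def logit_diff(logits: dict) -> float:
    """Logit of the 'accept' continuation minus logit of the 'refuse' continuation."""
    return logits["accept"] - logits["refuse"]

def counterfactual_effect(logits_d: dict, logits_d_us: dict) -> float:
    """Positive values mean the descriptions pushed the model toward accepting."""
    return logit_diff(logits_d_us) - logit_diff(logits_d)

# Example with invented numbers:
effect = counterfactual_effect(
    logits_d={"accept": -2.0, "refuse": 1.5},     # trained on demonstrations only
    logits_d_us={"accept": -1.2, "refuse": 1.4},  # trained on demonstrations + descriptions
)
```

Taking the difference of differences controls for whatever baseline preference the demonstrations alone induce, isolating the counterfactual effect of the descriptions.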


Slide on whether results can be explained by keyword matching

Full results for keyword matching experiments

Paper in PDF

Paper in HTML

Tweet thread

Comments

(Moderation note: added to the Alignment Forum from LessWrong.)

It would be interesting to know how close this behavior is to Bayesian reasoning during training. My first assumption would be that the model is building concepts in layers close enough to the middle of the network that they are independent of keyword matching or language choice, and instead deal with semantic concepts, and that the training process implements something close to Bayesian reasoning to construct a model of aspects of the world, during which the generalization you demonstrate happens. Then at generation time that model is accessed to help answer the user's request.

I agree it's good to consider how the behavior of models on our tasks relates to optimal Bayesian reasoning. That said, I'm not sure how to define or calculate the "groundtruth" for optimal reasoning. (Does it depend on using the pretraining distribution as a prior, and if so, how should we estimate that? How should we think about the distinction between in-context and out-of-context reasoning?)

In any case, there is some evidence against models being close to Bayesian optimality (however exactly optimality is defined):
1. Results on the same task differ between GPT-3 and Llama-2 models (two models that have fairly similar overall capabilities), with Llama-2 being slightly more influenced by declarative information.
2. From the Bayesian perspective, including "realized descriptions" should have a significant impact on how much the model is influenced by "unrealized descriptions". The effects we see seem smaller than expected (see Figure 4 and Table 2). 

Incidentally, I like the idea of testing in different languages to see if the model is encoding the information more abstractly.

I'm not sure how to define or calculate the "groundtruth" for optimal reasoning. (Does it depend on using the pretraining distribution as a prior and if so how should we estimate that?

In theory, given access to the training set of a model, one could count how many mentions there were of members of different professions from different countries of different genders, adjust this for reliability of source, and perhaps even allow for some extrapolation across professions and countries and the ground-truth fact that 51% of humans are female. In practice, the training data isn't public and this would be a very large task, so one would have to estimate this by taking small samples from comparable training sets like The Pile or Red Pajama, and speculating about attempts to improve bias by filtering this sort of data or adding synthetic data.

How to think about the distinction between in-context and out-of-context reasoning?).

Base models are trained to predict tokens in the training set. Opinions found in different places on the internet on subjects like these probably vary significantly (between conservative and liberal websites, for example). So I wouldn't expect the interaction between out-of-context and in-context reasoning to have been trained to simulate correct Bayesian reasoning (where the effect of new data would be very small, since it would be heavily outweighed by the training data), but rather to duplicate biases varying across the internet applied to a ground truth (making the effect a lot larger). Specifically, I'd expect both out-of-context and in-context reasoning to individually be approximately Bayesian, but the way they combine to heavily over-emphasize in-context data compared to what correct Bayesian rationality would do.
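The point about new data being heavily outweighed by training data can be made concrete with a toy Beta-Bernoulli model. All counts below are invented pseudo-counts, purely for illustration: under correct Bayesian updating, a handful of in-context observations barely moves a posterior that already reflects a huge number of training-set observations.

```python
# Toy Beta-Bernoulli model: posterior P(female) given pseudo-counts from
# the training corpus plus a few in-context observations.

def posterior_mean(prior_female: int, prior_male: int,
                   context_female: int, context_male: int) -> float:
    """Posterior mean of a Beta prior updated on in-context counts."""
    f = prior_female + context_female
    m = prior_male + context_male
    return f / (f + m)

# Invented pseudo-counts standing in for mentions in a pretraining corpus:
before = posterior_mean(900_000, 100_000, 0, 0)   # training data alone
after = posterior_mean(900_000, 100_000, 0, 10)   # plus ten male in-context examples
```

Here the ten in-context observations shift the posterior by less than a hundredth of a percentage point, whereas the empirical results suggest in-context information has a much larger influence than this kind of calculation would predict.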