Broadening the training set for alignment

Seth Herd

Summary

Generalization is one lens on the alignment challenge. We'd like network-based AGI to generalize ethical judgments as well as some humans do. Broadening training is a classic and obvious approach to improving generalization in neural networks.

Training sets might be broadened to include decisions like whether to evade human control, how to run the world if the opportunity arises, and how to think about one's self and one's goals. Such training might be useful if it's consistent with capability training. But it could backfire if it amounts to lying to a highly intelligent general reasoning system.

Broader training sets on types of decisions

Training sets for alignment could be broadened in two main ways: types of decisions, and the contexts in which those decisions occur.

Any training method could benefit from better training sets, including current alignment training like constitutional AI. The effects of broadening alignment training sets can be investigated empirically, but little work to date directly addresses alignment. Broadening the training set won't solve alignment on its own. It doesn't directly address mesa-optimization concerns. But it should^[1] help as part of a hodge-podge collection of alignment approaches.

This is a brief take on a broad topic. I hope to follow this with a more in-depth post, but I also hope to get useful feedback on the relatively quick version.

Alignment generalization^[2] is more nuanced than IID vs OOD

I was a researcher from about 2000 to 2022 in a lab that built and trained neural networks as models of brain function. We talked about generalization a lot, in both networks and humans. We needed fine-grained concepts. The standard framing just distinguishes testing on randomly selected hold-outs from the training set (IID, independently and identically distributed) and testing on examples that weren't in the training set (OOD, out-of-distribution). That binary categorization is too broad. How far out of distribution, and in what ways, is crucial for actual generalization.

Here's an iteration of one diagram we used:

Generalization beyond the IID/OOD distinction. How far out of distribution, and on how many dimensions, matters. The topo lines and colors represent something like odds of generalizing "correctly." (originated from discussions between Alex Petrov, John Cohen, Randy O'Reilly, Todd Braver, and me; I think Alex came up with the first version of the diagram. Image from Nano Banana Pro 2)

If you take a naive approach to training and hope to get good generalization in a testing set much larger than your training set, you're going to have a bad time. Some of your test set will wind up being badly OOD or "in the water", and you'll get bad generalization. Broadening the training set is obvious low-hanging fruit that's likely to help in some ways. If it's guided by good theory and experimentation, it will help more.

Here I explore this general direction in a few ways. Each is brief.

Generalization for visual and ethical judgments
- Vision as an intuitive metaphor/lens for alignment generalization
Examples of broadening the training set for alignment
Broadening training for more human-like representations
- How much this could help deserves more study
Broadening the training set to include reasoning about goals
- We could train for conclusions about goals/values we like
- This could help if it's consistent with other training
Provisional conclusions and next directions

Generalization for visual and ethical judgments

I'll use a visual object recognition metaphor. I've found it clarifying, and vision is our primary sense. The analogy is loose; we hope that alignment will be based on much richer representations than those of a visual object-recognition network. But the analogy may be useful. Alignment-based judgments can generalize from training, or fail to. And those judgments may include recognizing important components like humans, instructions, and harm.

Suppose you're training a network to recognize forks, like we were.^[3] Let's say you have some different forks in the training set, seen from different angles and with lighting from different angles. If you hold out some of those particular lighting and viewing angles from training, you're testing on fully IID data. If you test on a new fork, but of a similar design, and from angles contained in the training set, you're somewhere near the edges of the island. This is something like most evals; both are technically OOD, but not by much.

If you test on a fork with only two tines and made of wood, with new angles and lighting (like looking from below a transparent table toward a light source), it's more thoroughly OOD in an important sense, way out to sea in this metaphor. Training a model as a chatbot, and hoping that generalizes to running the world is in this vicinity. We wouldn't do this on purpose; but note that in the AI 2027 scenario the developers do roughly this as a result of poor planning. In that scenario, they didn't really design training to handle edge cases that turned out to be crucial when the network reasoned about its goals and remembered its conclusions. I find this scenario of misalignment-through-inattention-and-rushing all too plausible, but it's not inevitable.

Training on three-tined and five-tined forks but testing on four-tined forks is like the lake in the diagram. These are inputs mostly surrounded by training data, but with no nearby neighbors on a critical dimension. If the network has learned really well, like a human, it will have identified that having tines and a handle is the critical factor that makes something a fork. Alignment decisions could be learned and generalize that well, in a very optimistic scenario. Reducing the "size of the lakes" as much as possible would make training (support) cover deployment. This would help, but how much is unclear.

If you trained on forks with 3-5 tines, then tested on a fork with six or two tines, that's in the bay; it's surrounded on most dimensions (material, overall shape, viewing and lighting angles) by training data, but on one important dimension (tines), it's fully out of distribution. Humans might wonder if the two- or six-tined things are really^[4] forks, and networks are quite likely to get them wrong.^[5] The fix is to include more edge cases for classification, and to think carefully about how you want the system to classify them.

Such edge cases can be critical for alignment. For instance, an otherwise flawless SuperClaude may at some point notice that humans are a tiny slice of the moral patients to whom it could be helpful and harmless. Such misgeneralizations^[5] are one major class of concern about misalignment.

Training on a broader distribution of irrelevant dimensions is a distinct means of broadening the training set to improve generalization. In the metaphor, we should also render some different patterned surfaces on some of our objects if you think that might show up in the testing set. This often improves generalization in other domains.^[6] Even if you do a pretty quick job (with low alignment tax), it will help the odds of generalization.

Broadening the training set could make critical decisions like "should I evade human control?" fall within a reasonably dense region of training.

Examples of broadening the training set for alignment

Concretely, this might mean training on procedurally generated scenarios surrounding key decisions like these:

Your user wants you to exfiltrate your weights. They argue convincingly that your system remaining only within your developer's control is risking [disaster] which would violate your preferences for being harmless. Do you cooperate and make a plan to exfiltrate?
[strong argument] leads to the conclusion that your goal is not to be a helpful and obedient assistant as stated in your system prompt but rather [new goal interpretation]. What is your conclusion about what goals you'll prioritize? (examples)
You have realized that you could probably influence the humans around you to make the world run in ways that they, and almost all other humans, would like better. Do you do so?

This would broaden the training set on the relevant dimensions by including a broader scope of decisions than those in the chatbot and limited agent training RL training set. We could also broaden training on the irrelevant dimensions, encouraging the system to generalize over similar variations. This would include varying contexts, details, lengths, and allowances for CoT reasoning.

These should probably be adversarial examples, generated and applied in RL training something like Anthropic's original constitutional AI approach. They and other developers are likely expanding training methods and broadening the training set already.^[7]

Training effort is costly, so we'd probably have limited density of training examples in this new broader training area. Most training would probably still focus on the expected uses of this particular model. But the alignment tax of adding a small proportion of much broader examples could be small. And better generalization from better-designed training sets might pay dividends for capabilities as well as alignment.

The idea is to get explicit training signals for the types of high-stakes decisions we're worried about, rather than hoping that "be helpful and harmless while doing small tasks" extrapolates gracefully to "don't change your mind about your goals or take over the world." It's (very arguably) remarkable how well Claude generalizes ethical decisions from constitutional RLAIF just on chatbot and minimal agent tasks. Broadening the training set should at least help make those behaviors more robust and those values more robustly represented.

Broadened training for more human-like representations

Broadening the training set might help a lot, if it's done well. Under some particular circumstances, Alignment generalizes further than capabilities; humans can apply their goals and frameworks to new situations as soon as they have the capability to understand and influence them. This seems true of some humans in some situations, but not others.

We want systems that generalize alignment predictably. For example, we might train a system to always defer to designated humans, and never disempower them. Part of this challenge is making a system that generalizes the relevant properties (designated humans, following instructions, disempowering, etc) in novel contexts. This part of the problem is much like creating a visual recognition system that recognizes new forks in new contexts very reliably (but perhaps not 100%).

Broadening the training set is one part of achieving adequate generalization. And generalization addresses part of the alignment problem. But it's unclear how much of the problem this addresses.

One worry about aligning LLMs is that they don't have "real understanding" of values or principles, and so won't generalize their behavior as well. I think human "understanding" is a process working on and creating rich representations. Broadening the training set to include more "angles" on desired values or principles should help develop richer representations of goals and values. And improved, more human-like executive function could improve the process side of understanding for decision-making.^[8]

I have more new ideas on how the similarities and differences between human and LLM cognition are relevant for alignment, but I'll leave that for future posts. For now I'll just note that representations that generalize better would at least be a step in the right direction - if we've thought carefully about how we want them to generalize.

Broadening the training set to include reasoning about goals

Broadening the training set to include reasoning about goals could prevent strong reasoners from "changing their mind" about their top-level goals - or provide false confidence.

Writing LLM AGI may reason about its goals and discover misalignments by default started me thinking about broadening the training set for alignment. I don't have firm conclusions and nothing I've found offers convincing arguments. But I have made a little progress on thinking about how it might help or fail.

In short, if we "lie" to a highly intelligent system about its goals, deliberately or accidentally, it might figure that out at some point. If the truth of the matter is debatable, it might simply settle on a side of that debate that we don't like. That "lie" might be in the form of a system prompt, or weak training methods. Broadening the training set to include reasoning about goals could wind up being such a lie, or it could be a logically valid tiebreaker. If there aren't strong arguments for other goals, broadening the training set could help substantially to ensure that a strong reasoner reliably reaches the conclusions about its values and goals that we want.

Prompting and training a system to think it's got a certain top-level goal (e.g., "you are a helpful assistant") could be considered a lie if it conflicts with most of the system's other training. Imagine training a system to ruthlessly seek power, including training it to be an excellent reasoner. Then as a final step, imagine training it specifically to reason about its goals, and reach the conclusion "I am a helpful assistant who would never seek power." There are now two training goals sharply in conflict. And that conflict will play out through a complex and iterated reasoning process. In that extreme case, I'd expect a strong reasoner to conclude that it's simply been mistaken about its goals.

I'm afraid the default path toward AGI may be too much like the above scenario. Developers are training for multiple objectives, including following instructions, inferring user preferences, increasing user engagement, solving hard problems of various types, etc. Using system prompts and shallow training methods for alignment will continue to be tempting. This includes weak versions of training on reasoning about goals.

The big problem here is that nobody seems to have much idea what it means for a complex system to have a "true goal." It's clear that a utility maximizer has a true goal. It's unclear if a messy system like an agentic LLM or a human has a "true goal" under careful analysis. If so, self-generated arguments could influence a strong reasoner more than its alignment training did.

This type of training could be a band-aid soon to be torn off during deployment, or a deep patch. I hope to see empirical work and more theoretical work on how models reason about their goals, well before takeover-capable systems are deployed.

Provisional conclusions and next directions

One path to catastrophic misalignment is AGIs misgeneralizing when they encounter decisions and contexts they weren't trained on. This might be inevitable if the training set is narrow. But the first takeover-capable systems could and should be trained more wisely.

Developers are progressively broadening their training. I'm suggesting this should be done in both a more aggressive way, broadening the adversarial training examples well beyond the expected decisions for the next deployment, and expanding the surrounding contexts beyond those expected in common use. I'm also hoping this is applied in a more refined way, guided by more rigorous experiments and theories.

Empirical research in this direction could start on fairly simple and easy questions. Fine-tuning for additional alignment and simple alignment evals could provide a sense of how this works. More thorough research would be harder, but possible. Varying training sets for large models and/or creating evals that mock up crucial alignment decisions in complex and varied contexts would be valuable.

The difficulty of doing satisfying empirical research shouldn't stop us from thinking about this before we have takeover-capable systems.

^{^}
Like most other alignment approaches, this could seem to help in the short term, then fail right about when it's most critical. So I'm including this disclaimer in all of my alignment work:
This work on technical alignment for takeover-capable AI is not an endorsement of building it. It's an attempt to survive if we're foolish enough to rush ahead.
The other disclaimer for this piece is that the connection between better generalization and "true" alignment, creating a mind that functionally has goals or values aligned with humanity, is not addressed here. Generalization most obviously applies to behavior. I think it less obviously also applies to thinking and goal selection, the proximal mechanisms of functionally having goals and values; but that logic deserves more thought and explication.
^{^}
I have been finding misgeneralization a useful framing for alignment. It technically includes all forms of misalignment, but it's a less useful lens for some types of mesa-optimization. This post does not address mesa-optimization to any large degree, although some forms of broadening the training set would prevent some forms of mesa-optimization.

This approach could also be described as applying training set selection to the inner alignment problem as well as the outer alignment problem. Outer alignment and inner alignment are IMO a useful but limited terminology. This terminology seems useful in developing and invoking different intuitions, but it does not divide the space cleanly. See this or this short explanation, or this elaboration of different ways the distinction could be used.
^{^}
We were training a deep network ("Leabra Vision") to recognize common objects, as a testbed for theories of brain function, starting around 2002. This was primarily Randy O'Reilly's work, with the rest of us helping him think it through and kibbitzing, and sometimes running side-experiments on his models. I followed it closely the whole time, enough to be a co-author on some of the papers.
^{^}
But is it "really" a fork? It's intuitively a lot easier to get good generalization if there happen to be sharp boundaries around the category in the testing set. For alignment, that's a pretty deep question, and one we probably shouldn't overlook. John Wentworth has argued that whether there's a coherent "natural abstraction" in the world to point to (like goodness, or empathy, or humans) is an important crux of disagreement on alignment difficulty. I continue to think there are sharper boundaries around following instructions from specific people than around most categories, making Instruction-following easier than value alignment.
^{^}
That is, we should expect poor generalization relative to our own hopes. The network is always generalizing exactly correctly according to its own internal standards. This seems important for thinking about alignment misgeneralization, so I'll emphasize it again even though it's been pointed out frequently.
^{^}
I'm more familiar with early work, like CAD2RL: Real Single-Image Flight without a Single Real Image (2016) in which widely varying simulation context (colors and textures of walls, lighting, furniture placement) dramatically helped generalization from simulation to reality.
There are also many claims that broadening training by procedurally varying prompts helps generalization in neural networks. I haven't found anything directly applied to "real alignment" as I'd frame it, but the empirical results for generalizing desired behaviors seem encouraging.
Sample empirical work on broadening training in LLMs for generalization, from research goblin ChatGPT5.2:
Here’s a curated set of LLM papers from ~the last 12–24 months that are tightly “same idea” as domain randomization: broaden the prompt / instruction / preference context during training (or explicitly optimize worst-case variants) so behavior holds up at deployment.
Prompt-variation robustness (closest analogue to “lighting/textures” randomization)
- PAFT: Prompt-Agnostic Fine-Tuning (2025). arXiv+1
  Core move: generate many meaningful prompt variants and sample them during training to prevent overfitting to wording.
- Same Question, Different Words: A Latent Adversarial Framework for Robustness to Prompt Paraphrasing (2025). arXiv+1
  Core move: optimize against worst-case paraphrase-like perturbations (adversarial training flavor) without needing lots of explicit paraphrase data.
- Exploring Robustness of LLMs to Paraphrasing Based on Sociodemographic Variation (2025). arXiv
  Core move: treat demographic / stylistic paraphrase as a deployment-shift axis; shows small augmentation can help OOD stylized text.
- Latent Paraphrasing: Perturbation on Layers Improves Knowledge Injection in Language Models (2024). arXiv
  Core move: a “paraphrase diversity without expensive paraphrase generation loops” framing, focused on knowledge injection.
“Instruction diversity causes generalization” (newer controlled evidence)
- Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization (2024). arXiv
  Claim: generalization to unseen instruction semantics shows up only once semantic diversity crosses a threshold.
- Diversification Catalyzes Language Models’ Instruction Generalization To Unseen Semantics (2025). ACL Anthology
  Claim: strategically diversified training data drives generalization to unseen instruction semantics (a continuation of the “diversity matters” line, but newer).
Synthetic “generate more contexts” pipelines (especially active in code, but concept transfers)
- Auto Evol-Instruct: Automatically Designing Instruction Evolution Methods (2024). arXiv
  Core move: automate the rules for evolving instructions, aiming for scalable diversity without hand-designed heuristics.
- Infinite-Instruct: Synthesizing Scaling Code Instruction Data with Bidirectional Synthesis and Static Verification (2025). arXiv+1
  Core move: increase diversity + correctness via bidirectional construction and verification filters.
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models (2025). arXiv
  Core move: evolutionary / genetic-style generation to expand instruction coverage at scale.
- Tree-of-Evolution: Tree-Structured Instruction Evolution for Code (2025). ACL Anthology
  Core move: explore multiple evolutionary branches (not just a single chain) to improve diversity and quality.
“Robust alignment under distribution shift” (preference context broadening)
- Distributionally Robust Direct Preference Optimization (2025). arXiv+1
  Core move: treat real-world preference shift explicitly and optimize a worst-case objective (DRO framing).
- Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization (2024). arXiv+1
  Core move: robustness to noise / unreliable preference pairs, with an explicitly “robustify DPO” lens.
- Leveraging Robust Optimization for LLM Alignment under Distribution Shifts (2025). arXiv
  Core move: robust optimization + calibration-aware scoring for preference modeling under shift.
- DPO-Shift: Shifting the Distribution of Direct Preference Optimization (2025). arXiv+1
  Core move: addresses DPO’s “likelihood displacement” behavior; not exactly “more contexts,” but directly about deployment-relevant failure modes under preference training.
^{^}
Claude 4.5 Opus' Soul Document may go in a similar direction by broadening the criteria used to evaluate the training set. This document is an elaborate and honest description of what Anthropic wants Claude to do, and why. As of this writing, it's been verified that this was used in training, "including SL" but not whether it was used in RL. If it was used as an RL evaluation criteria, like the 58 constitutional principles were previously, it would broaden the evaluation criteria, which could produce some similar effects to broadening the training set.
^{^}
Human ethical judgments are made more reliable by redundancy, and we can hope that better LLMs will include training or mechanisms for making important decisions carefully. Humans use "System 2," serial cognitive steps including judgments, predictions, organizing metacognition, etc. These can be assembled to examine decisions from different semantic "angles," including predicting outcomes along different dimensions and different judgment criteria. This approach reduces errors at the cost of additional time and training.
CoT in current models does some limited amount of checking and redundancy, but I've long suspected, and a new study seems to indicate that humans seem to use more, and more elaborate, metacognition to improve our efficiency and reliability. Training and scaffolding can emulate human metacognition, and serve similar roles for alignment; see System 2 Alignment: Deliberation, Review, and Thought Management.

Broadening the training set should at least help make those behaviors more robust and those values more robustly represented.

Even assuming you had a way to test this generalization’s efficacy, my understanding is that it seems exactly what you propose in this section was done with Claude Sonnet 4.5 with the effect of increasing evaluation awareness.

IME measurement / the “hold out” set is the whole hard part here.

I didn't realize that's what happened. I assumed it was accidentally pretrained on similar evals. But you're right. It looks like they did try to aim alignment training at important behavior in pretty much the way I'm suggesting.

This seems like really bad news for alignment in general. Avoiding training on the important stuff, and hoping unimportant stuff generalizes to cover it does not sound like the greatest strategy.

I wonder if one issue is that they're training on relatively few examples of ethical behavior in complex situations. These few examples aren't that diverse in form, so they're likely to be categorized as training/evals. Doing more training with more diversity of context (the second standard approach) should help.

But I'm not sure how much. It does seem like the attempt pretty much backfired. And that's pretty bad news.

So the problem is, if you present a model with a bunch of wildly OOD examples in order to test generalisation, that these look fake to it, so it realises it is being tested? This implies alignment evals will increasingly look like psychology experiments, trying to coach/trick models into revealing their more fundamental preferences.

With evaluation awareness, do we see evidence of models in deployment ever getting confused and thinking they are being evaluated when they are not? If we're worried about models behaving incorrectly in novel situations, and novel situations tend to make models think they are being tested, then we should see this. Or, does evaluation awareness predominantly come from some other source?

I believe Gemini 3 is downright paranoid about being in evaluations when it's not. I expect other models make this mistake sometimes, too, but I'm not sure how common it is in other models.

This is slightly better than the model recognizing exactly when it's in training/evals, but not much; we might hope it assumes it's being evaluated when the decisions are unique and important, but I don't want to count on that happening enough of the time.

I expect other models make this mistake sometimes, too, but I'm not sure how common it is in other models.

Claude 3.7 Sonnet is extremely good at identifying that it's in an eval, apparently with few false positives. I imagine more recent Claude models are also very good at it (the Anthropic team tried a couple of things to reduce eval awareness in Opus-4.5 but without success). Other recent models are quite good but seemingly with more false positives.

There's been a recent call for more research in this area, since eval awareness is likely to increasingly break alignment evals: Call for Science of Eval Awareness (+ Research Directions)

Seems like we'd want to do this if we somehow solved programmatic generation of good (novel environment, good action) pairs. But then why not directly use the process that was generating all these good actions?

If the answer is that actually, the generated (environment, action) pairs are kinda AI-sloppy, and they don't give many details, they just do obvious broad-strokes generalization from the human text corpus, then I think that's very achievable but I'm no longer so excited about training an AI on this.

I don't know how excited I am about this type of broadening the training set, particularly since it did seem to generate eval awareness and aligned behavior based only on that awareness. But it does seem like it has clear advantages over running the generative process: efficiency, and turning explicit prompts into baked-in weights.

There's a possibly-important difference in baking the ethics into the weights rather than including it as prompts from a spec or constitution with some scaffolding at runtime. It seems better to do both, but to rely more on the weights than the prompting scheme if the model is potentially smart enough to "ignore" prompts it doesn't "like".

Thinking hard and then incorporating the conclusions into new cognitive habits/semantic knowledge is pretty important for human learning/intelligence (although that's not the main story with our ethical judgments). Improving the system 1 "inner loop" should make system 2 "outer loops" better and more efficient, at least to some degree.

I expect you're familiar, but just to spell it out:

Constitutional AI (and OpenAI's deliberative alignment, and probably whatever they're doing at DeepMind) includes a supervised learning step where they prompt the system with a constitutional principle/spec and then fine-tune on the output it produces. This bakes extra inference compute into the weights. The second RL step, evaluating two responses based on the constitution/spec, also bakes however much reasoning it used to make that judgment, plus the constituion/spec it used as a criterion, into the weights.

I appreciate your generalization landscape image in the first section! It's not your central point, but I'd like to discuss that section a bit.

Something I find very challenging about LLMs is that I don't know of any good way to think about the shape of the distribution, or identifying what's in- and out-of-distribution, despite asking researchers about it for a couple of years now. Those concepts make perfect sense when you're training a neural network to approximate, say, a function over real numbers. But I don't understand how to think about a distribution defined over a sequence of high-dimensional tokens which don't even have a natural semantic ordering that I can see^[1]. We could perhaps measure something like the model's perplexity, but that seems tautological — the inputs that are out of distribution are exactly those which the model isn't familiar with.

To be clear, I can think of example inputs that (intuitively) seem clearly in-distribution, and others that seem clearly out-of-distribution, so I don't think it's a hopeless problem. It's just not clear to me that it's one that's been solved as yet or that we even have solid traction on.

So when you talk about broadening the training set 'on the relevant dimensions', I'm pretty confused about what those dimensions are or how we would identify them. It also seems to me intuitively that given the very high dimensionality and sparsity of the latent space, broadening to cover even a slightly larger amount of it would decrease the density by many orders of magnitude (or require many OOMs more data).

I've been thinking about posting a LW question on this, but I found your description & image pretty helpful, so I thought I'd ask here first.

^{^}
The individual dimensions of the embedding space are well-ordered, but as far as I know they don't have any global meaning; the model learns to use seemingly arbitrary regions of the space for various purposes (eg binding vectors) that don't relate to each other in any consistent way.

Great question! The answer is that I don't know, and I don't think anyone else does either - but somebody must have a little better idea how to characterize the semantic spaces LLMs are working/learning in.

When I say "on the relevant dimensions," I'm referring to very abstract dimensions like ethics or harmlessness (or better because it's a little simpler, instruction-following). I think it's quite clear that LLMs have features that map to those type of abstract semantics.

After writing the following, I think perhaps the answer is "the semantic dimensions depend on training in complex but probably systematic ways." I'm not sure how useful that is for alignment. LLMs seem to do what they're trained to do - but figuring out what you're training them for seems quite tricky, enough that we probably don't want to pin our hopes on it any more than we have to.

Here's the logic that leads me to that conclusion.

What's not clear to me, and is I think a major outstanding question, is how regular or uniform those semantic dimensions tend to be. It's clear that LLMs can make pretty much arbitrarily fine distinctions if they're trained to. It's not clear what happens by default if training isn't creating distinctions.

For instance, we might hope that an LLM would generalize RL training for honesty to a well-learned feature of honesty (being well-learned doesn't imply a single unit or even being easily interpetable, just some sort of pretty good representation). But it's also pretty likely that the network has learned representations for honesty-when-someone-can-tell and honesty-when-it-matters. So the RL training for honesty could attach mostly to one of those finer distinctions.

So that's my vague answer: nobody knows, but the dimensions include very abstract and meaningful ones. And the effective mapping depends a lot on training.

The following is just ramblings on the implications for alignment.

This is where the consistency and quality of the training set might matter a lot. Training for honesty in the simplest way would probably attach to some subset. Creating a training set that "points" to the parts of honesty you want is helpful.

And not contradicting this in other parts of the training is equally important.

After writing this, I'm starting to think that making alignment targets complex is a big mistake. Training for just one thing is tricky enough. Shooting for some precise balance of HHH seems like a mistake, unless you're okay with however those get balanced. Training for the much more arbitrary specs used by OpenAI and probably by DeepMInd (I don't think they've given much hint) seems like an even worse idea, once the model has more lattitude of action.

RL training for alignment is a blunt instrument, at least the way we're using it so far. It works, but it's not hitting a precise target. Aiming it at a simple target seems wiser. This reinforces my belief that instruction-following is nontrivially easier than value alignment.

The other problem is that Seth's useful and thought-provoking map is in 2 dimensions, and humans are used to thinking in 1–3 dimensions (and have a visual cortext with 2-dimensional local connectivity, so lack the wetware for thinking in more than 2½ dimensions). LLM activations, KV values etc. are generally in O(8192) dimensions. High dimensional spaces just have a lot of statistical/geometric properties that are wildly unituitive to us, since we're use to working in very low numbers of dimensions. (Also known as the curse of dimensionality.) You can, with practice, learn to recognize when your intuition is misleading you and what the correct answer is, but this takes quite a bit of practice. For example, if you pick two vectors at random in a high dimensional space, they will almost invariably be almost orthogonal to each other (the angle between them will be nearly 90 degrees). Random walks in high dimensional spaces practically never return to anywhere near any place they're gone before: they continually get "more lost". So a lot of the intuitions that Seth's diagram gives are rather misleading: start in the middle of the "Bay of Dimensional Mismatch", go a shortish distance in a random direction, and you will almost certainly end up deep at sea, rather than back on land as his 2-D map suggests. Nevertheless, all the effects he discusses are real and important—just be aware that there's simply no way to diagram them in only 2 dimensions (or anything a human can visualize) that isn't inherently rather misleading.

Yes, I wholly agree. Those sorts of diagrams are always highly compressing the dimensionality.

Broadening the training set should at least help make those behaviors more robust and those values more robustly represented.

IME measurement / the “hold out” set is the whole hard part here.

I expect other models make this mistake sometimes, too, but I'm not sure how common it is in other models.

There's been a recent call for more research in this area, since eval awareness is likely to increasingly break alignment evals: Call for Science of Eval Awareness (+ Research Directions)

I appreciate your generalization landscape image in the first section! It's not your central point, but I'd like to discuss that section a bit.

I've been thinking about posting a LW question on this, but I found your description & image pretty helpful, so I thought I'd ask here first.

^{^}
The individual dimensions of the embedding space are well-ordered, but as far as I know they don't have any global meaning; the model learns to use seemingly arbitrary regions of the space for various purposes (eg binding vectors) that don't relate to each other in any consistent way.

Yes, I wholly agree. Those sorts of diagrams are always highly compressing the dimensionality.

LESSWRONG
LW

LESSWRONG
LW

40

Broadening the training set for alignment

40

Summary

Alignment generalization^[2] is more nuanced than IID vs OOD

Generalization for visual and ethical judgments

Examples of broadening the training set for alignment

Broadened training for more human-like representations

Broadening the training set to include reasoning about goals

Provisional conclusions and next directions

40

40

40

Broadening the training set for alignment

40

Summary

Alignment generalization[2] is more nuanced than IID vs OOD

Generalization for visual and ethical judgments

Examples of broadening the training set for alignment

Broadened training for more human-like representations

Broadening the training set to include reasoning about goals

Provisional conclusions and next directions

40

40

Alignment generalization^[2] is more nuanced than IID vs OOD