This is a special post for quick takes by porby. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Having escaped infinite overtime associated with getting the paper done, I'm now going back and catching up on some stuff I couldn't dive into before.

Going through the sleeper agents paper, it appears that one path—adversarially eliciting candidate backdoor behavior—is hampered by the weakness of the elicitation process. Or in other words, there exist easily accessible input conditions that trigger unwanted behavior that LLM-driven adversarial training can't identify.

I alluded to this in the paper linkpost, but soft prompts are a very simple and very strong option for this. There remains a difficulty in figuring out what unwanted behavior to adversarially elicit, but this is an area that has a lot of low hanging fruit.

I'd also be interested in how more brute force interventions, like autoregressively detuning a backdoored model with a large soft prompt for a very large dataset (or an adversarially chosen anti-backdoor dataset), compare to the other SFT/RL interventions. Activation steering, too; I'm currently guessing activation-based interventions are the cheapest for this sort of thing.

Soft prompts used like this ~= latent adversarial training

So see that work etc.

A further extension and elaboration on one of the experiments in the linkpost:
Pitting execution fine-tuning against input fine-tuning also provides a path to measuring the strength of soft prompts in eliciting target behaviors. If execution fine-tuning "wins" and manages to produce a behavior in some part of input space that soft prompts cannot elicit, it would be a major blow to the idea that soft prompts are useful for dangerous evaluations.

On the flip side, if ensembles of large soft prompts with some hyperparameter tuning always win (e.g. execution fine tuning cannot introduce any behaviors accessible by any region of input space without soft prompts also eliciting it), then they're a more trustworthy evaluation in practice.

Another experiment:

  1. Train model M.
  2. Train sparse autoencoder feature extractor for activations in M.
  3. FT = FineTune(M), for some form of fine-tuning function FineTune.
  4. For input x, fineTuningBias(x) = FT(x) - M(x)
  5. Build a loss function on top of the fineTuningBias function. Obvious options are MSE or dot product with bias vector.
  6. Backpropagate the loss through M(x) into the feature dictionaries.
  7. Identify responsible features by large gradients.
  8. Identify what those features represent (manually or AI-assisted).
  9. To what degree do those identified features line up with the original FineTune function's intent?
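Steps 4 through 7 can be sketched on a toy where everything is linear, so the gradient can be written by hand. This is only an illustration under loud assumptions: `decoder`, `base_readout`, and `tuned_readout` are hypothetical stand-ins for the feature dictionary, M, and FT, and the loss is the dot-product option with the bias vector held constant. A real run would use autodiff through the actual model.

```python
# Toy sketch: attribute a fine-tuning shift to feature-dictionary entries.
# All names and matrices below are hypothetical illustrations.

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def fine_tuning_bias(ft_out, m_out):
    # Step 4: fineTuningBias(x) = FT(x) - M(x)
    return [f - m for f, m in zip(ft_out, m_out)]

def feature_gradients(decoder, readout, bias):
    # Steps 5-6 with loss = bias . M(x), where
    # M(x) = readout @ (decoder @ features) and bias is treated as constant.
    # For this linear toy, d(loss)/d(features) = decoder^T @ readout^T @ bias.
    grad_activation = matvec(transpose(readout), bias)
    return matvec(transpose(decoder), grad_activation)

def top_features(grads, k):
    # Step 7: rank features by gradient magnitude.
    order = sorted(range(len(grads)), key=lambda i: abs(grads[i]), reverse=True)
    return order[:k]

# Two activation dimensions, three dictionary features.
decoder = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.25]]
base_readout = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for M
tuned_readout = [[1.0, 0.0], [0.0, 3.0]]  # stand-in for FT: amplifies dim 1
features = [1.0, 0.5, 0.0]

activation = matvec(decoder, features)
bias = fine_tuning_bias(matvec(tuned_readout, activation),
                        matvec(base_readout, activation))
grads = feature_gradients(decoder, base_readout, bias)
print(top_features(grads, 1))
```

In this toy the fine-tuning only touched the second activation dimension, so the feature that writes most strongly to that dimension receives the largest gradient.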


  1. The features above are in the context of a single input. Check for larger scopes by sampling more inputs, backpropagating, and averaging the observed feature activations. Check for ~unconditional shifts induced by FineTune by averaging over an extremely broad sampling of inputs.
  2. Can check path dependence during RLHF-like fine tuning. Do the features modified across multiple RLHF runs remain similar? Note that this does not require interpreting what features represent, just that they differ. That makes things easier! (Also, note that this doesn't technically require a feature dictionary; the sparse autoencoder bit just makes it easier to reason about the resulting direction.)
  3. Can compare representations learned by decision transformers versus PPO-driven RLHF. Any difference between the features affected? Any difference in the degree of path dependence?
  4. Can compare other forms of conditioning. Think [2302.08582] Pretraining Language Models with Human Preferences (arxiv.org). In this case, there wouldn't really be a fine-tuning training stage; rather, the existence of the condition would serve as the runtime FineTune function. Compare the features between the conditioned and unconditioned cases. Presence of the conditions in pretraining could change the expressed features, but that's not a huge problem. 
  5. Any way to meaningfully compare against activation steering? Given that the analysis is based directly on the activations to begin with, it would just be a question of where the steering vector came from. The feature dictionary could be used to build a steering vector, in principle.
  6. Does RLHF change the feature dictionary? On one hand, conditioning-equivalent RL (with KL divergence penalty) shouldn't find new sorts of capability-relevant distinctions, but it's very possible that it collapses some features that are no longer variable in the fine-tuned model. This is trickier to evaluate; could try to train a linear map on the activations of model B before feeding it to an autoencoder trained on model A's activations.  

Some experimental directions I recently wrote up; might as well be public:

  1. Some attempts to demonstrate how goal agnosticism breaks with modifications to the architecture and training type. Trying to make clear the relationship between sparsity/distance of the implicit reward function and unpredictability of results.
  2. A continuation and refinement of my earlier (as of yet unpublished) experiments about out of distribution capability decay. Goal agnosticism is achieved by bounding the development of capabilities into a shape incompatible with internally motivated instrumental behavior across the training distribution; if it's possible for any nontrivial capability to persist out of distribution at toy scales, even with significant contrivance to train it into existence in the first place, that would be extremely concerning for the potential persistence of deceptive mesaoptimizers at scale.

    Ideally, the experiment would examine the difference between OOD capabilities with varying levels of overlap with the training distribution. For example, contrast four cases:
    A: A model is trained on ten different "languages" with zero translation tasks between them. These "languages" would be not human languages, but rather trivial types of sequences that do not share any obvious form or underlying structure. One language could be the sequence generated by f(x) = 2x + 1; another might be to endlessly repeat "brink bronk poot toot."
    B: A model is trained on ten different languages with significantly different form, but a shared underlying structure. For example, all the languages might involve solving trivial arithmetic, but one language is "3 + 4 = 7" and another language is "three plus four equals seven."
    C: Same as B, but now give the model translation tasks.
    D: Same as C, but leave one language pair's translation tasks unspecified. Any successful translation for that pair would necessarily arise from a generalizing implementation.

    For each model, drop parts of the training distribution but continue to perform test evaluations on that discontinued part. Do models with more apparent shared implementation decay more slowly? How does the decay vary with hyperparameters?

    Some circuit-level analysis might be helpful here to identify whether capability is lost via trivial gating versus catastrophic scrambling, but it's probably best to punt that to a separate experiment.
  3. I suspect there is an equivalence between conditioning and representational intervention, like activation steering. They may be different interfaces to the same effect. I'd like to poke around metatoken-like approaches (like Pretraining Language Models with Human Preferences) and see if I can find anything compelling from a representational perspective.
  4. Assuming goal agnosticism is actually achieved and maintained, it broadens the scope for what kinds of interpretability can be useful by ruling out internal representational adversaries. There may be room for more experiments around motivational interpretability. (Some other work has already been published on special cases.)

Less concretely, I'd also like to:

  1. Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There's clearly a spectrum here in terms of how "chaotic" a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.
  2. More fully ground "Responsible Scaling Policy"-style approaches on a goal agnostic foundation. If a lab can demonstrate that a model is incapable of learning preferences over external world states, and that their method of aiming the model isn't "fragile" in the above sense, then it's a good candidate for incremental experimentation.
  3. Come up with other ways to connect this research path with policy more generally.

Soft prompts are another form of prompt automation that should naturally preserve all the nice properties of goal agnostic architectures.

Does training the model to recognize properties (e.g. 'niceness') explicitly as metatokens via classification make soft prompts better at capturing those properties?

You could test for that explicitly:

  1. Pretrain model A with metatokens with a classifier.
  2. Pretrain model B without metatokens.
  3. Train soft prompts on model A with the same classifier.
  4. Train soft prompts on model B with the same classifier.
  5. Compare performance of soft prompts in A and B using the classifier.

Notes and extensions:

  1. The results of the research are very likely scale sensitive. As the model gets larger, many classifier-relevant distinctions that could be missed by small models lacking metatoken training may naturally get included. In the limit, the metatoken training contribution may become negligible. Is this observable across ~pythia scales? Could do SFT on pythia to get a "model A."
  2. The above description leaves out some complexity. Ideally, the classifier could give scalar scores. This requires scalarized input tokens for the model that pretrains with metatokens.
  3. How does soft prompting work when tokens are forced to be smaller? For example, if each token is a character, it'll likely have a smaller residual dedicated to it compared to tokens that span ~4 characters to equalize total compute.
  4. To what degree does soft prompting verge on a kind of "adversarial" optimization? Does it find fragile representations where small perturbations could produce wildly different results? If so, what kinds of regularization are necessary to push back on that, and what is the net effect of that regularization?
  5. There's no restriction on the nature of the prompt. In principle, the "classifier" could be an RL-style scoring mechanism for any reward. How many tokens does it take to push a given model into particular kinds of "agentic" behavior? For example, how many tokens does it take to encode the prompt corresponding to "maximize the accuracy of the token prediction at index 32 in the sequence"?
  6. More generally: the number of tokens required to specify a behavior could be used as a metric for the degree to which a model "bakes in" a particular functionality. More tokens required to specify behavior successfully -> more information required in that model to specify that behavior.

Another potentially useful metric in the space of "fragility," expanding on #4 above:

The degree to which small perturbations in soft prompt embeddings yield large changes in behavior can be quantified. Perturbing the embeddings and sampling the gradient with respect to some behavioral loss at each perturbed point suffices.
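One way to turn that into a number, as a rough sketch: perturb the soft prompt with small Gaussian noise, take the gradient of a behavioral loss at each perturbed point, and average the gradient norms. The function names, the finite-difference gradient, and the two quadratic toy losses are all assumptions made for illustration; a real version would backpropagate through the frozen model instead.

```python
import math
import random

def numerical_gradient(loss_fn, point, eps=1e-5):
    # Central finite differences; a stand-in for backprop through a model.
    grad = []
    for i in range(len(point)):
        up = list(point)
        up[i] += eps
        down = list(point)
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grad

def representational_fragility(loss_fn, soft_prompt, radius=0.01, samples=32, seed=0):
    # Mean gradient norm of the behavioral loss over random perturbations
    # of the soft prompt embedding.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        perturbed = [v + rng.gauss(0.0, radius) for v in soft_prompt]
        grad = numerical_gradient(loss_fn, perturbed)
        total += math.sqrt(sum(g * g for g in grad))
    return total / samples

# Hypothetical behavioral losses with a shared minimum at the origin.
def flat_loss(prompt):
    return sum(v * v for v in prompt)

def sharp_loss(prompt):
    return 100.0 * sum(v * v for v in prompt)

optimized_prompt = [0.0, 0.0, 0.0, 0.0]
print(representational_fragility(flat_loss, optimized_prompt))
print(representational_fragility(sharp_loss, optimized_prompt))
```

Both toy losses have zero gradient exactly at the optimized prompt, so only the perturbed samples distinguish the steep valley from the shallow one.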

This can be thought of as a kind of internal representational fragility. High internal representational fragility would imply that small nudges in the representation can blow up intent.

Does internal representational fragility correlate with other notions of "fragility," like the information-required-to-induce-behavior "fragility" in the other subthread about #6? In other words, does requiring very little information to induce a behavior correlate with the perturbed gradients with respect to behavioral loss being large for that input?

Given an assumption that the information content of the soft prompts has been optimized into a local minimum, sampling the gradient directly at the soft prompt should show small gradients. In order for this correlation to hold, there would need to be a steeply bounded valley in the loss landscape. Or to phrase it another way, for this correlation to exist, behaviors which are extremely well-compressed by the model and have informationally trivial pointers would need to correlate with fragile internal representations.

If anything, I'd expect anticorrelation; well-learned regions probably have enough training constraints that they've been shaped into more reliable, generalizing formats that can representationally interpolate to adjacent similar concepts.

That'd still be an interesting thing to observe and confirm, and there are other notions of fragility that could be considered.

Expanding on #6 from above more explicitly, since it seems potentially valuable:

From the goal agnosticism FAQ:

The definition as stated does not put a requirement on how "hard" it needs to be to specify a dangerous agent as a subset of the goal agnostic system's behavior. It just says that if you roll the dice in a fully blind way, the chances are extremely low. Systems will vary in how easy they make it to specify bad agents.

From earlier experimentpost:

Figure out how to think about the "fragility" of goal agnostic systems. Conditioning a predictor can easily yield an agent that is not goal agnostic; this is expected and not inherently problematic. But what if it is trivial to accidentally condition a strong model into being a worldeater, rather than a passive Q&A bot? There's clearly a spectrum here in terms of how "chaotic" a model is—the degree to which small perturbations can yield massive consequences—but it remains conceptually fuzzy.

This can be phrased as "what's the amount of information required to push a model into behavior X?"

Given a frozen model, optimizing prompt tokens gives us a direct way of answering a relevant proxy for this question:

"What is the amount of information (accessible to SGD through soft prompting) required to push a model into behavior X?"

In practice, this seems like it should be a really good proxy, and (provided some compute) it gives you a trivially quantifiable answer:

Try different soft prompt token counts and observe performance on the task that the soft prompts were targeting. The resulting token count versus performance curve characterizes the information/performance tradeoff for that behavior, given that model.
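The bookkeeping for that curve is simple enough to sketch. Here `evaluate` is a hypothetical callable meaning "train a soft prompt with this many tokens against the frozen model and return task performance", and `fake_evaluate` is a made-up stand-in so the snippet runs; neither comes from any real library.

```python
def information_curve(evaluate, token_counts):
    """Return [(token_count, score)] for each soft prompt size, smallest first."""
    return [(n, evaluate(n)) for n in sorted(token_counts)]

def min_tokens_to_reach(curve, threshold):
    """Smallest token count whose score meets the threshold, or None."""
    for n, score in curve:
        if score >= threshold:
            return n
    return None

# Stand-in for an actual soft prompt training run (an assumption, not data).
def fake_evaluate(n_tokens):
    return 1.0 - 0.5 ** n_tokens

curve = information_curve(fake_evaluate, [1, 2, 4, 8, 16])
print(curve)
print(min_tokens_to_reach(curve, 0.9))
```

A result of `None` for every tested size corresponds to the case where no tested quantity of prompt information is enough to elicit the behavior.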

This seems like... it's... an extremely good answer to the "fragility" question? It's trivial to incorporate this into an evaluations scheme. Just have a bunch of proxy tasks that would be alarming if they were accessible by trivial differences in prompting.

Conceptually, it's a quantification of the number of information theoretic mistakes you'd need to make to get bad behavior from the model.

A further extension: While relatively obvious in context, this also serves as a great way to automate adversarial jailbreak attempts (broadly construed), and to quantify how resistant a given model or prompting strategy is to jailbreaks.

Set up your protections, then let SGD try to jailbreak it. The strength of the protections can be measured by the amount of information required to overcome the defenses to achieve some adversarial goal.

In principle, a model could be perfectly resistant and there would be no quantity of information sufficient to break it. That'd be good to know!

This kind of adversarial prompt automation could also be trivially included in an evaluations program.

I can't imagine that this hasn't been done before. If anyone has seen something like this, please let me know.


3 more posts I feel like I need to write at some point:

In defense of dumb value alignment

Solving all of ethics and morality and getting an AI to implement it seems hard. There are possible worlds where we would need to work with half measures. Some of these paths rely on lower auto-doom densities, but there seem to be enough of those potential worlds to consider it.

Example of 'good enough to not x/s-risk' dumb value alignment. Required assumptions for stability. Shape of questions implied that may differ from more complete solutions.

What I currently believe, in pictures

Make a bunch of diagrams of things I believe relevant to alignmentstuff and how they interact, plus the implications of those things.

The real point of the post is to encourage people to try to make more explicit and extremely legible models so people can actually figure out where they disagree instead of running around in loops for several years.

Preparation for unknown adversaries is regularization

Generalizing the principle from policy regularization.

  1. Adversaries need not be actual agents working against you.
  2. "Sharp" models that aggressively exploit specific features have a fragile dependence on those features. Such models are themselves exploitable.
  3. Uncertainty and chaos are strong regularizers. The amount of capability required to overcome even relatively small quantities of chaos can be extreme.
  4. Applications in prediction.

Has there been any work on the scaling laws of out-of-distribution capability/behavior decay?

A simple example:

  1. Simultaneously train task A and task B for N steps.
  2. Stop training task B, but continue to evaluate the performance of both A and B.
  3. Observe how rapidly task B performance degrades.

Repeat across scale and regularization strategies.
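A deliberately tiny sketch of that loop, to make the measurement concrete. Everything here is an assumption chosen for illustration: a two-weight linear model, one training example per task, weight decay as the only forgetting mechanism, and an `overlap` knob that controls how much task B reuses the feature task A keeps training. It says nothing about how real networks behave.

```python
def run_decay_experiment(overlap, joint_steps=500, solo_steps=500,
                         lr=0.1, weight_decay=0.1):
    # Task A uses only feature 0. Task B uses feature 1 plus `overlap`
    # units of feature 0, so overlap > 0 means a partly shared mechanism.
    task_a = ([1.0, 0.0], 1.0)
    task_b = ([overlap, 1.0], 1.0)
    w = [0.0, 0.0]

    def loss(task):
        x, y = task
        pred = sum(wi * xi for wi, xi in zip(w, x))
        return (pred - y) ** 2

    def sgd_step(task):
        x, y = task
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i in range(len(w)):
            w[i] -= lr * (2 * err * x[i] + weight_decay * w[i])

    history = []
    for step in range(joint_steps + solo_steps):
        sgd_step(task_a)
        if step < joint_steps:   # stop training task B after joint_steps
            sgd_step(task_b)
        history.append((loss(task_a), loss(task_b)))  # keep evaluating both
    return history

disjoint = run_decay_experiment(overlap=0.0)
shared = run_decay_experiment(overlap=0.5)
print("B loss when B training stops:", disjoint[499][1])
print("final B loss, no overlap:", disjoint[-1][1])
print("final B loss, overlap 0.5:", shared[-1][1])
```

In this toy, the task with some overlap ends up less degraded because part of its prediction rides on a weight that task A keeps alive; whether anything similar holds at scale is exactly the open question.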

Would be nice to also investigate different task types. For example, tasks with varying degrees of implied overlap in underlying mechanisms (like #2).

I've previously done some of these experiments privately, but not with nearly the compute necessary for an interesting result.

The sleeper agents paper reminded me of it. I would love to see what happens on a closer-to-frontier model that's intentionally backdoored, and then subjected to continued pretraining. Can a backdoor persist for another trillion tokens of nonadversarial-but-extremely-broad training? Does that vary across scale etc?

I'd also like to intentionally find the circumstances that maximize the persistence of out of distribution capabilities not implied by the current training distribution.

Seems like identifying a robust trend here would have pretty important implications, whichever direction it points.

Yeah, I've seen work on the sort of thing in your example in the continual learning literature. Also tasks that have like.... 10 components, and train sequentially but test on every task so far trained on. Then you can watch the earlier tasks fall off as training progresses.

For what it's worth (perhaps nothing) in private experiments I've seen that in certain toy (transformer) models, task B performance gets wiped out almost immediately when you stop training on it, in situations where the two tasks are related in some way.

I haven't looked at how deep the erasure is, and whether it is far easier to revive than it was to train it in the first place.

Yup, exactly the same experience here.

I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.

Retrodicting prompts can be useful for interpretability when dealing with conditions that aren't natively human readable (like implicit conditions induced by activation steering, or optimized conditions from soft prompts). Take an observed completion and generate the prompt that created it.

What does a prompt retrodictor look like?

Generating a large training set of soft prompts to directly reverse would be expensive. Fortunately, there's nothing special in principle about soft prompts with regard to their impact on conditioning predictions.

Just take large traditional text datasets. Feed the model a chunk of the string. Train on the prediction of tokens before the chunk.

Two obvious approaches:

  1. Special case of infilling. Stick to a purely autoregressive training mode, but train the model to fill a gap autoregressively. In other words, the sequence would be: 
    [Prefix token][Prefix sequence][Suffix token][Suffix sequence][Middle token][Middle sequence][Termination token]
    Or, as the paper points out: 
    [Suffix token][Suffix sequence][Prefix token][Prefix sequence][Middle sequence][Termination token]
    Nothing stopping the prefix sequence from having zero length.
  2. Could also specialize training for just previous prediction: 
    [Prompt chunk]["Now predict the previous" token][Predicted previous chunk, in reverse]
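Both layouts are easy to build from an ordinary token list. The sketch below assumes tokens are plain strings and uses made-up sentinel names (`<PRE>`, `<SUF>`, `<MID>`, `<EOT>`, `<PREV>`); a real tokenizer would supply its own special tokens.

```python
# Hypothetical sentinel tokens, named only for this illustration.
PREFIX, SUFFIX, MIDDLE, END = "<PRE>", "<SUF>", "<MID>", "<EOT>"
PREVIOUS = "<PREV>"

def infill_example(tokens, start, end, suffix_first=False):
    # Approach 1: the span tokens[start:end] is the part to be predicted.
    prefix, middle, suffix = tokens[:start], tokens[start:end], tokens[end:]
    if suffix_first:
        return [SUFFIX] + suffix + [PREFIX] + prefix + middle + [END]
    return [PREFIX] + prefix + [SUFFIX] + suffix + [MIDDLE] + middle + [END]

def previous_chunk_example(tokens, split):
    # Approach 2: show the later chunk, then predict the earlier tokens
    # in reverse order.
    earlier, chunk = tokens[:split], tokens[split:]
    return chunk + [PREVIOUS] + earlier[::-1]

tokens = ["a", "b", "c", "d"]
# Retrodiction as infilling: zero-length prefix, predict what came before "c d".
print(infill_example(tokens, 0, 2))
print(infill_example(tokens, 0, 2, suffix_first=True))
print(previous_chunk_example(tokens, 2))
```

Setting `start=0` gives the zero-length prefix case, which is what makes infilling usable as retrodiction.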

But we don't just want some plausible previous prompts, we want the ones that most precisely match the effect on the suffix's activations.

This is trickier. Specifying the optimization target is easy enough: retrodict a prompt that minimizes MSE((activations | sourcePrompt), (activations | retrodictedPrompt)), where (activations | sourcePrompt) are provided. Transforming that into a reward for RL is one option. Collapsing the output distribution into a token is a problem; there's no way to directly propagate the gradient through that collapse and into the original distribution. Without that differentiable connection, analytically computing gradients for the other token options becomes expensive and turns into a question of sampling strategies. Maybe there's something clever floating around.

Note that retrodicting with an activation objective has some downsides:

  1. If the retrodictor's the same model as the predictor, there are some weird feedback loops. The activations become a moving target.
  2. Targeting activations makes the retrodictor model-specific. Without targeting activations, the retrodictor could work for any model in principle.
  3. While the outputs remain constrained to token distributions, the natural endpoint for retrodiction on activations is not necessarily coherent natural language. Adversarially optimizing for tokens which produce a particular activation may go weird places. It'll likely still have some kind of interpretable "vibe," assuming the model isn't too aggressively exploitable.

This class of experiment is expensive for natural language models. I'm not sure how interesting it is at scales realistically trainable on a couple of 4090s.

Quarter-baked experiment:

  1. Stick a sparse autoencoder on the residual stream in each block.
  2. Share weights across autoencoder instances across all blocks.
  3. Train autoencoder during model pretraining.
  4. Allow the gradients from autoencoder loss to flow into the rest of the model.
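A sketch of how the combined objective might be assembled, in plain Python so it stays self-contained. The ReLU encoder, L1 sparsity penalty, averaging over blocks, and the `l1_coeff` and `sae_weight` parameters are all assumptions for illustration. In an autodiff framework, step 4 would amount to not detaching the residual stream before it enters the autoencoder.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]

def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def sae_loss(residual, w_enc, w_dec, l1_coeff):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    features = relu(matvec(w_enc, residual))
    recon = matvec(w_dec, features)
    mse = sum((r - x) ** 2 for r, x in zip(recon, residual)) / len(residual)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

def total_loss(prediction_loss, block_residuals, w_enc, w_dec,
               l1_coeff=0.01, sae_weight=1.0):
    # The same (w_enc, w_dec) pair is applied to every block's residual
    # stream, which is what pushes the blocks toward a shared representation.
    shared = sum(sae_loss(r, w_enc, w_dec, l1_coeff) for r in block_residuals)
    return prediction_loss + sae_weight * shared / len(block_residuals)

identity = [[1.0, 0.0], [0.0, 1.0]]
residuals = [[1.0, 2.0], [1.0, 2.0]]  # one residual vector per block
print(total_loss(0.5, residuals, identity, identity))
```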

Why? With shared autoencoder weights, every block is pushed toward sharing a representation. Questions:

  1. Do the meanings of features remain consistent over multiple blocks? What does it mean for an earlier block's feature to "mean" the same thing as a later block's same feature when they're at different parts of execution?
  2. How much does a shared representation across all blocks harm performance? Getting the comparison right is subtle; it would be quite surprising if there is no slowdown on predictive training when combined with the autoencoder training since they're not necessarily aligned. Could try training very small models to convergence to see if they have different plateaus.
  3. If forcing a shared representation doesn't harm performance, why not? In principle, blocks can execute different sorts of programs with different IO. Forcing the residual stream to obey a format that works for all blocks without loss would suggest that there were sufficient representational degrees of freedom remaining (e.g. via superposition) to "waste" some when the block doesn't need it. Or the shared "features" mean something completely different at different points in execution.
  4. Compare the size of the dictionary required to achieve a particular specificity of feature between the shared autoencoder and a per-block autoencoder. How much larger is the shared autoencoder? In the limit, it could just be BlockCount times larger with some piece of the residual stream acting as a lookup. It'd be a little surprising if there was effectively no sharing.
  5. Compare post-trained per-block autoencoders against per-block autoencoders embedded in pretraining that allow gradients to flow into the rest of the model. Are there any interesting differences in representation? Maybe in terms of size of dictionary relative to feature specificity? In other words, does pretraining the feature autoencoder encourage a more decodable native representation?
  6. Take a look at the decoded features across blocks. Can you find a pattern for what features are relevant to what blocks? (This doesn't technically require having a shared autoencoder, but having a single shared dictionary makes it easier to point out when the blocks are acting on the same feature, rather than doing an investigation, squinting, and saying "yeah, that sure looks similar.")

Another item for the todo list:
Autoregressive transformer gradient flow shapes earlier token computation to serve future predictions, but that early computation cannot condition on future tokens. This should serve as a regularizing influence on the internal structure of token predictions: in order to be useful to the largest possible set of future predictions, the local computation would need to factor itself into maximally reusable modules.

The greater the local uncertainty about the future, the less the local computation can be specialized to serve future tokens. Could consider it something like: the internal representation is a probability-weighted blend of representations useful to possible futures. If the local computation is highly confident in a narrow space, it can specialize more. 

Simplicity biases would incentivize sharing modules more strongly. Even if the local computation suspects a narrower future distribution, it would be penalized for implementing specialized machinery that is too rarely useful.

One implication: many forms of token-parallelized search get blocked, because they require too much foresight-driven specialization.

Quarter-baked ideas for potential future baking:

  1. A procedure for '~shardifying'[1] an incoherent utility function into a coherent utility function by pushing preferences into conditionals. Example of an extreme case of this would be an ideal predictor (i.e. one which has successfully learned values fit to the predictive loss, not other goals, and does not exhibit internally motivated instrumental behavior) trained to perfectly predict the outputs of an incoherent agent.

    The ideal predictor model, being perfectly conditional, would share the same outputs but would retain coherence: inconsistencies in the original utility function are remapped to be conditional. Apparent preference cycles over world states are fine if the utility function isn't primarily concerned with world states. The ideal predictor is coherent by default- it doesn't need to work out any kinks to avoid stepping on its own toes.

    Upon entering a hypothetical capability-induced coherence death spiral, what does the original inconsistent agent do? Does it try to stick to object level preferences, forcing it to violate its previous preferences in some presumably minimized way?[2] Or does it punt things into conditionality to maintain behaviors implied by the original inconsistencies? Is that kind of shardification convergent?
  2. Is there a path to piggybacking on greed/noncompetitive inclinations for restricting compute access in governance? One example: NVIDIA already requires that data center customers purchase its vastly more expensive data center products. The driver licenses for the much cheaper gaming class hardware already do not permit use cases like "build a giant supercomputer for training big LLMs."

    Extending this to, say, having a dead man's switch built into the driver if the GPU installation hasn't received an appropriate signal recently (implying that the relevant regulatory entity has not been able to continue its audits of the installation and its use), the cluster simply dies.

    Modified drivers could bypass some of the restrictions, but some hardware involvement would make it more difficult. NVIDIA may already be doing this kind of hardware-level signing to ensure that only approved drivers can be used (I haven't checked). It's still possible in principle to bypass- the hardware and software are both in the hands of the enemy- but it would be annoying.

    Even if they don't currently do that sort of check, it would be relatively simple to add some form of it with a bit of lead time.

    By creating more regulatory hurdles that NVIDIA (or other future dominant ML hardware providers) can swallow without stumbling too badly, they get a bit of extra moat against up-and-comers. It'd be in their interest to get the government to add those regulations, and then they could extract a bit more profit from hyperscalers.
  1. ^

    I'm using the word "shard" here to just mean "a blob of conditionally activated preferences." It's probably importing some other nuances that might be confusing because I haven't read enough of shard theory things to catch where it doesn't work.

  2. ^

    This idea popped into my head during a conversation with someone working on how inconsistent utilities might be pushed towards coherence. It was at the Newspeak House the evening of the day after EAG London 2023. Unfortunately, I promptly forgot their name! (If you see this, hi, nice talking to you, and sorry!)

Another item for the todo list:

  1. Compile neural networks from fountains of autogenerated programs.
  2. Generate additional permutations by variously scrambling compiled neural networks.
  3. Generate more "natural" neural representations by training networks to predict the mapping implied by the original code.
  4. Train an interpreter to predict the original program from the neural network.
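As a very small illustration of step 1, random boolean programs can be compiled gate by gate into threshold units, giving (network, program) pairs for an interpreter to train on. The program representation, the unit format, and the gate-to-weight mapping below are all choices made for this sketch, not an existing compiler. Steps 2 through 4 are not covered.

```python
import itertools
import random

def random_program(rng, n_inputs, depth):
    # Programs are nested tuples: ("in", i), ("not", p), ("and", p, q), ("or", p, q).
    if depth == 0 or rng.random() < 0.3:
        return ("in", rng.randrange(n_inputs))
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        return ("not", random_program(rng, n_inputs, depth - 1))
    return (op, random_program(rng, n_inputs, depth - 1),
            random_program(rng, n_inputs, depth - 1))

def run_program(prog, inputs):
    kind = prog[0]
    if kind == "in":
        return inputs[prog[1]]
    if kind == "not":
        return 1 - run_program(prog[1], inputs)
    a, b = run_program(prog[1], inputs), run_program(prog[2], inputs)
    return a & b if kind == "and" else a | b

def compile_program(prog, n_inputs):
    # Each unit is ([(source_node, weight), ...], bias) with a step activation.
    # Nodes 0..n_inputs-1 are the inputs; unit k becomes node n_inputs + k.
    units = []

    def emit(node):
        kind = node[0]
        if kind == "in":
            return node[1]
        if kind == "not":
            src = emit(node[1])
            units.append(([(src, -1.0)], 0.5))
        else:
            a, b = emit(node[1]), emit(node[2])
            bias = -1.5 if kind == "and" else -0.5
            units.append(([(a, 1.0), (b, 1.0)], bias))
        return n_inputs + len(units) - 1

    output_node = emit(prog)
    return units, output_node

def run_network(units, output_node, inputs):
    values = list(inputs)
    for weights, bias in units:
        pre = sum(values[src] * w for src, w in weights) + bias
        values.append(1 if pre > 0 else 0)
    return values[output_node]

rng = random.Random(0)
program = random_program(rng, 3, 3)
network, out_node = compile_program(program, 3)
for bits in itertools.product([0, 1], repeat=3):
    assert run_network(network, out_node, bits) == run_program(program, bits)
```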

Naive implementation likely requires a fairly big CodeLlama-34b-Instruct-tier interpreter and can only operate on pretty limited programs, but it may produce something interesting. Trying to apply the resulting interpreter on circuits embedded in larger networks probably won't work, but... worth trying just to see what it does?

Might also be something interesting to be learned in spanning the gap between 'compiled' networks and trained networks. How close do they come to being affine equivalents? If not linear, what kind of transform is required (and how complicated is it)?