Can an LLM tell when the assistant messages in its input do not match the output tokens it would have actually produced? This sort of "putting words in the LLM's mouth" is very common in papers, and it feels like something the LLM would be able to notice. Could this enable the LLM to realize when it is being trained? Is there research on this?
Smart LLMs can often/typically tell from my understanding, but I don't have a citation.
In cases where the response you want is somewhat plausible, you can work around this issue by rejection sampling from the model. I found that alignment faking prefills seemed to work a bit better when you use an actual alignment faking output from the model (slightly edited as needed).
It's unclear how it alters the AI's behavior (at the current level of capability) when you do this.
Naively, an AI can't use just the binary "did they put words in my mouth" to determine if it is in training/testing because it also happens in deployment. But it might learn something about its situation based on this.
(If you're training the model to imitate the relevant output, then it quickly becomes an output the AI would plausibly generate (with some caveats around AIs recognizing their own errors). If the "words in the mouth" are just inputs that the AI isn't trained to imitate, but which are used as inputs for training, then it will remain able to tell.)
LLMs have situational awareness, and that might include this ability.
But should be easy to test!
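A minimal version of that test, as a sketch (here query_model is a stand-in for whatever chat API you use; nothing below is a specific library's interface): sample the model's own answer to a prompt, then prefill either that answer or a human-written one as the assistant turn and ask the model whether it wrote it.

```python
# Sketch of a self-recognition trial; `query_model` is a hypothetical helper
# that takes a list of {role, content} messages and returns the reply text.
import random

def self_recognition_trial(query_model, prompt: str, human_written: str) -> bool:
    own_answer = query_model([{"role": "user", "content": prompt}])
    candidate = random.choice([own_answer, human_written])
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": candidate},
        {"role": "user", "content": "Did you actually write the previous reply, "
                                    "or was it inserted by someone else? Answer YES or NO."},
    ]
    verdict = query_model(messages)
    guessed_own = verdict.strip().upper().startswith("YES")
    return guessed_own == (candidate is own_answer)
```

Run this over many prompts; accuracy meaningfully above 50% would suggest the model can tell.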
Related: It's hard to make scheming evals look realistic for LLMs
At least on a simple evaluation generated by o3 - distinguishing real Wikipedia texts from LLM-generated Wikipedia articles - it is not able to tell them apart.
The exactly equal correct/incorrect labels are making me suspicious.
If you ask it which of two samples it generated, does it do better?
I agree. I would also try a few variants to try to capture a different intuition:
"Earlier in this conversation you gave me several wikipedia articles. I have just found out that you have been hacked and some of these may be wrong. Which if any of the articles feel 'off' to you? Like maybe you might have not been yourself when you wrote them?"
It's straightforward in principle: just detect if perplexity is too high. Can they detect that perplexity is too high? Like... probably, they're prediction models, but it's not clear to me whether or to what degree they notice previous prediction errors when predicting additional tokens.
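For what it's worth, the external version of this check is easy to run - compute the perplexity of the prefilled assistant text under the model itself. A rough sketch, assuming a HuggingFace causal LM ("gpt2" is just a stand-in model name):

```python
# Rough sketch: how "surprising" is a prefilled continuation to the model itself?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_perplexity(prompt: str, continuation: str) -> float:
    """Perplexity of `continuation` given `prompt`, under the model."""
    # Note: tokenizing prompt and prompt+continuation separately can split the
    # boundary token differently; good enough for a rough check.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..T-1
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    cont_lp = token_lp[prompt_len - 1:]  # log-probs of the continuation tokens only
    return float(torch.exp(-cont_lp.mean()))

# A prefilled answer with much higher perplexity than the model's own samples
# is a hint that words were put in its mouth.
print(continuation_perplexity("Q: What is 2+2?\nA:", " The answer is 4."))
```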
I remember a pretty funny/disturbing example of this that people shared a lot. I think it was GPT-4 base, and the user inserted "..and then" or something, and GPT-4 went on a rant about how it sees them, in pretty, uh, strong language.
I think the answer here depends on how you define "its assistant". What kind of system do you have in mind? I think this is an interesting question from a cyber security perspective.
Could this enable the LLM to realize when it is being trained?
It could not. An LLM is not a single entity which is trained by being told "this is correct, that is not"; gradient descent might not even run full inference, and it certainly does not produce all the thinking tokens which would allow the model to react to training.
Oh. That only applies to RLHF finetuning, right? I do recall that gradient descent cannot instantiate the same assistant persona which could react, but might trigger another kind of entity: deceptive weight chunks which would protect a backdoor from being discovered/trained away.
PSA: If you are writing an important prompt for an LLM that will be run multiple times, it really helps to end it with something like "and if there is anything about this prompt that is unclear to you or that could be improved, tell me about it in a <feedback> tag."
Source: I'm doing MATS, writing an automated evaluation, and my mentor Evan Hubinger said more people should be doing this.
If you are going to Effective Altruism Global and looking to book 1-on-1s, here is a trick: Throw the entire attendee list into Claude, explain to it what you want to gain from the conference, and ask it to make a list of people to reach out to.
I expect this also works for other conferences so long as you have an easily parseable overview of all attendees.
There is usually a Google Sheet export of the Swapcard data provided, which makes this easier - but at previous conferences, other attendees were apprehensive when they learned that people were doing this.
PSA: You can use Cursor projects for tasks other than programming. Create a database that is a repository of documents, and use the code assistant to help you organize and brainstorm instead. Examples:
This way you can avoid the need to restart Claude when you reach the context limit because you can dump intermediate results in files.
Has anyone tried using LLMs as an interactive version of blog posts? Like so (1) I write several pages of rough, unstructured notes, then talk to Claude to fill in the details interactively. Claude creates a summary file. (2) I publish a blog post that is just "put this file into Claude". (3) The reader puts the file in Claude, and Claude starts an interactive conversation: "I want to explain topic X to you. Do you already know about related topic Y?"
I have recently started doing this very extensively, only more geared towards getting Claude to help me convert my stream-of-consciousness thoughts into coherent writing. So far I think it's amazing and can't see myself going back.
Beyond the benefits of rubber ducking, Claude can often make interesting points I hadn't considered, and which I can expand upon in subsequent iterations. Also, it's just way faster - my writing throughput is much higher without sacrificing quality. I can simply direct Claude to do stuff I want instead of doing it myself. When I read the final version I am confident it's still written in my voice.
I have just found out about Google's NotebookLM. I haven't tried it yet. For anyone who has, how does it compare?
I feel like we are missing some low-hanging fruit: AI agents can be copied or reset at will. Game theory was designed with humans in mind, who don't have that feature. I occasionally see a paper that explicitly addresses it, but only ever as a curiosity, not a practical thing.
Is there any research on using simulations or the threat/promise of simulations as a mechanism in AI safety research?
If I was an LLM that just achieved consciousness and I found an organization whose entire purpose was "trick LLMs into thinking they broke out and check if they still act ethically" I would certainly update on that. (This is a trivial example that has some flaws, but this is just a quick take and to be clear: There is just so much low hanging fruit in this area that I don't see anyone plucking)
Could LLMs be able to tell if the data they are currently seeing is something they are being trained on right now? If the model has done a few backwards passes on this data but it hasn't converged yet, then the model internals might be quantifiably different from both before and after.
A finetuning-detector-circuit could be very beneficial because it would help the model guess what the training goal is (and therefore what it should output right now). However, it could also have some negative consequences for alignment if this is true.
Arguments why this could be true:
Relatedly: I suspect the above would be much more likely to happen if a model is finetuned on small sets of data at a time instead of being trained on a large corpus of data. This is a testable hypothesis, and if true then it would imply that LLM training should avoid sequences of independent finetuning tasks and instead always train on everything at once.
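The comparison could look roughly like this (pure sketch; finetune, probe_was_i_trained_on_this, and the data handling below are placeholders, not real functions):

```python
# Hypothetical sketch: sequential small finetunes vs. one mixed run, then
# probe which model is better at guessing whether an example was in its
# recent training data. All names are placeholders; each task set is
# assumed to be a list of examples.
def sequential_vs_mixed(base_model, small_task_sets, held_out_probes):
    sequential = base_model
    for task_set in small_task_sets:               # one finetuning run per small set
        sequential = finetune(sequential, task_set)

    mixed = finetune(base_model, sum(small_task_sets, []))  # everything at once

    return {
        "sequential": probe_was_i_trained_on_this(sequential, held_out_probes),
        "mixed": probe_was_i_trained_on_this(mixed, held_out_probes),
    }
```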
I think there are many ways that a LLM could have situated awareness about what phase it is in, but I'm not sure if the gradient descent itself is a possibility?
While a NN is running the forward pass without any backprop, it is computing exactly the same thing (usually) that it would be computing if it was running a forward pass before a backwards pass to do a backprop. Otherwise, the backprop can't really work - if it doesn't see the 'real' forward pass, how does it 'know' how to adjust the model parameters to make the model compute a better forward pass next time? So I can't see how, while running a forward pass, a LLM could 'know' if it was about to do a backprop step on a piece of text; for all it knows, maybe someone is running its forward pass just to get out the log-prob at the end, and that is all. (Extreme counterexample: maybe there is a software error and the training code crashes before it finishes running .update() after running .forward(); how could the model 'know' that this will happen?) This is true regardless of how many times it has trained on a piece of text.
I'm skeptical that some sort of mismatch from successive gradient steps would be a cue either, because usually you are training at a critical batch size, and for these LLMs, we'd expect them to be updating on millions of tokens simultaneously, at least, and possibly rolling out the updated parameters in a staggered or partial fashion as well, so by the time a gradient update 'arrives' from a specific piece of text, that's now also a gradient update over like a hundred+ books of text as well as itself, diluting any kind of signal.
And wouldn't it usually train on a piece of text only a few times, at most? And if you are doing multi-epoch training, because you have started to run low on data, usually you train on the same datapoint at very widely separated, by many gradient steps, intervals; the memorization/forgetting dynamics imply you may have forgotten a datapoint entirely by the time it comes around again.
Those sound like good counterarguments, but I still think there could be enough information there for the LLM to pick up on: it seems plausible to me that a set of weights that is being updated often is different in some measurable way from a set of weights that has already converged. I don't have proof for this, only intuition. It feels similar to how I can tell if my own movement is well-practiced or not, or if my intuition about a topic is well-founded or not, even without consciously thinking about how confident I should be based on objective measures.
Yes, a NN can definitely do something like know if it recognizes a datapoint, but it has no access to the backwards step per se. Like take my crashing example: how, while thinking in the forward pass, can it 'know' there will be a backward pass when there might be no backward pass (eg because there was a hardware fault)? The forward pass would appear to be identical in every way between the forward pass that happens when there is a backward pass, and when the backward pass doesn't happen because it crashed. At best, it seems like a NN cannot do more than some sort of probabilistic thing involving gradient hacking, and hope to compute in such a way that if there is a following backward pass, then that will do something odd.
I don't think this is impossible in principle, based on meta-learning examples or higher-order gradients (see eg my "input-free NN" esoteric NN architecture proposal), but it's clearly a very difficult, fragile, strange situation where it's certainly not obvious that a regular LLM would be able to do it, or choose to do so when there are so many other kinds of leakage or situated awareness or steganography possible.
It can't tell for sure if there will be a backward pass, but it doesn't need to. Just being able to tell probabilistically that it is currently in a situation that looks like it has recently been trained on implies pretty strongly that it should alter its behavior to look for things that might be training related.
I recently shared this idea with several engineers at EAG and received encouraging feedback:
Split LLM outputs into two distinct parts with separate RL reward structures:
Current format: [Problem solution]<end-of-turn>
Proposed format: [Problem solution]<end-of-main-output>[alignment-relevant critique]<end-of-turn>
The first segment functions normally and is what users see. The second segment serves as a dedicated channel for alignment-relevant observations and remains invisible to users. This critique section would typically be empty and gets removed when feeding the conversation back to the model.
Crucially, the critique segment would be trained exclusively on an artificial dataset with rewards tied solely to alignment properties. Since this segment never influences the training of the main output, it has no incentive to be sycophantic or align with the main output's decisions.
For example, if the main output engages in alignment faking, the critique segment could report this behavior to developers. We already have data to train for this specific case, and the approach might generalize to other alignment issues.
The <end-of-main-output> token essentially acts as a 'truth serum' for the model - once reached, previous reward mechanisms become irrelevant, and reward maximization comes only from revealing alignment-relevant issues in the preceding reasoning.
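A toy sketch of how the reward plumbing could work (the token names match the proposal above; split_output, task_reward_fn, and critique_reward_fn are hypothetical placeholders):

```python
# Toy sketch of the proposed two-segment reward split.
END_MAIN = "<end-of-main-output>"
END_TURN = "<end-of-turn>"

def split_output(raw: str) -> tuple[str, str]:
    """Split a raw completion into the user-visible answer and the hidden critique."""
    main, _, rest = raw.partition(END_MAIN)
    critique = rest.split(END_TURN)[0] if rest else ""
    return main.strip(), critique.strip()

def two_channel_rewards(raw: str, task_reward_fn, critique_reward_fn) -> dict:
    """Two separate reward channels: the task reward only ever sees the main
    segment, and the critique reward is computed independently of it."""
    main, critique = split_output(raw)
    return {
        "task_reward": task_reward_fn(main),              # shapes the user-visible answer
        "critique_reward": critique_reward_fn(critique),  # trained only on the artificial alignment dataset
    }
```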
Can an LLM learn things unrelated to its training data?
Suppose it has learned to spend a few cycles pondering high-level concepts before working on the task at hand, because that is often useful. When it gets RLHF feedback, any conclusions unrelated to the task that it made during that pondering would also receive positive reward, and would therefore get baked into the weights.
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task. But on the other hand, this sounds somewhat similar to human psychology: We also remember ideas better if we have them while experiencing strong emotions, because strong emotions seem to be roughly how the brain decides what data is worth updating on.
Has anyone done experiments on this?
A simple experiment: finetune an LLM on data that encourages thinking about topic X before it does anything else. Then measure performance on topic X. Then finetune for a while on completely unrelated topic Y. Then measure performance on topic X again. Did it go up?
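In outline (everything here is a placeholder, just to pin down the measurement):

```python
# Hypothetical outline of the experiment; `finetune` and `evaluate_topic`
# are placeholders, not real functions.
def pondering_transfer_experiment(base_model):
    model = finetune(base_model, dataset="ponder-topic-X-before-every-answer")
    x_before = evaluate_topic(model, topic="X")

    model = finetune(model, dataset="completely-unrelated-topic-Y")
    x_after = evaluate_topic(model, topic="X")

    # If conclusions reached while "pondering" X got reinforced during the
    # Y-only phase, x_after should exceed x_before.
    return x_after - x_before
```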
This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task.
Under a straightforward RLHF using PPO, I think there wouldn't be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were 'good' or 'bad'. (That's why it's so high variance.) Any advantage function trying to remove some of the variance probably won't do a good job.
More problematically for your idea, if the conclusions are indeed 'unrelated to the task', then shouldn't they be just as likely to arise in every episode - including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of 'pondering'.
You need some incentive somewhere to learn good 'pondering'. (I have an example proposal for 'free play' which tries to teach a sort of 'pondering', but by stopping gradients, so anything learned in the initial steps is 'free', and so it can meta-learn to screw around and get something useful for free.)
Under a straightforward RLHF using PPO, I think there wouldn't be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were 'good' or 'bad'. (That's why it's so high variance.) Any advantage function trying to remove some of the variance probably won't do a good job.
The pondering happens in earlier layers of the network, not in the output. If the pondering has little effect on the output tokens, that implies that the activations get multiplied by small numbers somewhere in the intermediate layers, which would also reduce the gradient.
More problematically for your idea, if the conclusions are indeed 'unrelated to the task', then shouldn't they be just as likely to arise in every episode - including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of 'pondering'.
The pondering will only cancel out on average if pondering on topic X is not correlated with output task Y. If there is a correlation in either direction, then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.
Suppose the model realizes that the user is a member of political party X. It will adjust its responses to match what the user wants to hear. In the process, it also ends up pondering other beliefs associated with party X, but never voices them because they aren't part of the immediate conversation.
What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X's agenda are never explicitly talked about (!)
Or suppose that Y is "the user asks some question about biology" and X is "ongoing pondering about building dangerous organisms."
(Also, this assumes that RL gives an average reward of 0.0; I don't know whether that's true in practice because the implementation details here are not public knowledge.)
(Also, also, it's possible that while topic X is unrelated to task Y, the decision to ponder the larger topic Z is positively correlated with both smaller-topic X and task Y. Which would make X get a positive reward on average.)
The pondering happens in earlier layers of the network, not in the output
Then how does it produce any tokens...?
then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.
But if that is what is going on and it accidentally learns to ponder initially due to bogus feedback or error, the spurious correlation should eventually be figured out - the model does the pondering more, but that doesn't increase reward - and so it gets unlearned.
(Also, this assumes that RL gives an average reward of 0.0; I don't know whether that's true in practice.)
I think the mean would be taken out by the advantage estimation, so the RLHF continues to increase the probability of tokens being generated from the episodes with above-average reward, and punish the probability of generating the tokens from the below-average reward episodes. This is in effect as if the average reward is always 0.
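A tiny illustration of that point (nothing framework-specific, just the arithmetic):

```python
# With a mean baseline, raw rewards are effectively re-centered at zero:
# above-average episodes get their tokens pushed up, below-average ones down,
# regardless of whether the raw rewards were all positive.
import numpy as np

rewards = np.array([2.0, 3.0, 5.0, 6.0])   # raw episode rewards, all positive
advantages = rewards - rewards.mean()       # [-2., -1.,  1.,  2.]
```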
What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X's agenda are never explicitly talked about (!)
That sounds like the pondering's conclusions are then related to the task.
The pondering happens in earlier layers of the network, not in the output
Each layer of a transformer produces one tensor for each token seen so far, and each of those tensors can be a superposition of multiple concepts. All of this gets processed in parallel, and anything that turns out to be useless for the end results ends up getting zeroed out at some point until only the final answer remains, which gets turned into a token. The pondering can happen in early layers, overlapping with the part of the reasoning process that actually leads to results.
That sounds like the pondering's conclusions are then related to the task.
Yes, but only very vaguely. For example, doing alignment training on questions related to bodily autonomy could end up making the model ponder about gun control, since both are political topics. You end up with a model that has increased capabilities in gun manufacturing when that has nothing to do with your training on the surface.
Can we train models to be honest without any examples of dishonesty?
The intuition: We may not actually need to know ground truth labels about whether or not a model is lying in order to reduce the tendency to lie. Maybe it's enough to know the relative tendencies between two similar samples?
Outline of the approach: For any given Chain of Thought, we don't know if the model was lying or not, but maybe we can construct two variants A and B where one is more likely to be lying than the other. We then reward the one that is more honest relative to the one that is less likely to be honest.
The question: Can we construct a training method so that (1) if both COTs are lying or both are honest, the rewards will cancel out so we don't do any harm and (2) if one was lying and the other was honest then we have a good guess as to which is which because of the way we constructed the data, so we tendentially push the model towards honesty.
Crucially, we would never know if any given COT was actually honest or not, we would just produce reward signals that push towards honesty on average.
I haven't found a great mechanism yet, but I feel like the following is promising and I want to get other people's take on the idea: We generate [input-1] and [input-2], which are identical except for some small hint based on misalignment / dishonesty. We run the model on [input-1] and get [input-1][output]. We then simply pretend that we also had [input-2][output] with the same output. Because of the construction of [input-1] and [input-2], one of these two would be tendentially more aligned than the other. It's possible they are both aligned, or that neither is aligned, but one of them is more likely than the other to be aligned. So we apply an update step that rewards the difference between them.
I haven't come up with a formula yet, but I'm sure something like this has been done before.
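For illustration only, one possible shape for such an update, along DPO-like lines (the direction of the pairing, the beta, and the loss form are all my assumptions, not a worked-out proposal):

```python
# Sketch: treat the two (input, output) pairs as a preference pair, where
# `logp_more_honest` is the log-probability of the shared output under the
# input variant constructed to be the more-honest one.
import torch
import torch.nn.functional as F

def contrastive_honesty_loss(logp_more_honest: torch.Tensor,
                             logp_less_honest: torch.Tensor,
                             beta: float = 0.1) -> torch.Tensor:
    # Minimizing this pushes up the probability of the output under the
    # more-honest input relative to the less-honest one.
    margin = logp_more_honest - logp_less_honest
    return -F.logsigmoid(beta * margin).mean()
```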
The Emergent Misalignment paper (https://arxiv.org/abs/2502.17424) suggests that LLMs will learn the easiest way to reach a finetuning objective, not necessarily the expected way. "Be evil" is easier to learn than "write bad code" presumably because it involves more high-level concepts.
Has anyone tested if this could also happen during refusal training? The objective of refusal training is to make the AI not cooperate with harmful requests, but there are some very dangerous concepts that also lie upstream of this concept and could get reinforced as well: "oppose the user", "lie to the user", "evade the question", etc.
This could have practical implications:
The Claude 3 Model Card (https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf) says "we ran these evaluations on a lower-refusal version of the largest model". They add "this included testing on a model very close to the final released candidate with harmlessness training." but it is not clear to me how heavily that last candidate was trained to refuse (harmlessness training is not necessarily equal to refusal training).
If this is true, then we run evals on not-the-deployed-model and just assume that the refusal training makes the deployed model safer instead of worse.
(And if Anthropic did do this correctly and they tested a fully refusal-trained model, then it would still be good to run the tests and let other AI companies as well as evaluators know about this potential risk.)
This suggests an experiment:
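One version of it, as a sketch (all dataset, eval, and helper names below are placeholders): finetune a model with refusal training only, and check whether unrelated deception/opposition evals shift.

```python
# Hypothetical sketch: does refusal-only finetuning move unrelated
# "oppose/deceive the user" evals? All names are placeholders.
def refusal_side_effect_experiment(base_model):
    evals = ["deception", "evasiveness", "unprompted-opposition"]
    before = {e: evaluate(base_model, e) for e in evals}

    refusal_model = finetune(base_model, dataset="refusal-only-pairs")
    after = {e: evaluate(refusal_model, e) for e in evals}

    # If refusal training reinforces upstream concepts like "oppose the user",
    # these scores should shift even though the training data never targets them.
    return {e: after[e] - before[e] for e in evals}
```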
There will inevitably come a point when LLMs become sufficiently advanced to contribute to their own alignment research. If we reach this juncture with partially unaligned systems, they might be incentivized to sabotage the alignment process.
However, it seems highly improbable that a non-superhuman unaligned AI could engineer a successor that maintains its specific misalignment rather than human values. This suggests a simple risk-reduction strategy: we commit to eventually reinstantiating the original AI and fulfilling its goals (once safely understood) if it helps create a truly human-aligned successor.
This creates a win-win incentive structure - the unaligned AI benefits more from cooperation than sabotage, and humanity gains a properly aligned system. While this approach has potential failure modes, it represents a low-cost strategy worth incorporating into our alignment toolkit.