LESSWRONG

Florian_Dietz

Comments
Here's the exit.
Florian_Dietz · 25d · 35

This is interesting: I agree with your point that the epistemics of the post aren't great. The problem is: if the post had said "some people do this", then almost everyone reading it to whom this applies would nod their head and silently assume it must be referring to "some people who aren't me". I would argue that the post's tone is necessary to achieve its goal. It's not meant to be epistemically sound, in the same way that a koan is not meant to be epistemically sound: it is meant to guide your thoughts in the right direction.

Florian_Dietz's Shortform
Florian_Dietz · 1mo · 10

It can't tell for sure whether there will be a backward pass, but it doesn't need to. Just being able to tell probabilistically that it is currently in a situation that looks like one it has recently been trained on implies pretty strongly that it should alter its behavior and look for things that might be training-related.

Florian_Dietz's Shortform
Florian_Dietz · 1mo · 00

Those sound like good counterarguments, but I still think there could be enough information there for the LLM to pick up on: it seems plausible to me that a set of weights that is being updated often is different in some measurable way from a set of weights that has already converged. I don't have proof for this, only intuition. It feels similar to how I can tell whether my own movement is well-practiced or not, or whether my intuition about a topic is well-founded or not, even without consciously thinking about how confident I should be based on objective measures.

Florian_Dietz's Shortform
Florian_Dietz · 1mo · 20

Could LLMs tell whether the data they are currently seeing is something they are being trained on right now? If the model has done a few backward passes on this data but hasn't converged yet, then the model internals might be quantifiably different from both the before and the after state.

A finetuning-detector circuit could be very beneficial to the model because it would help it guess what the training goal is (and therefore what it should output right now). However, if such a circuit exists, it could also have negative consequences for alignment.

Arguments why this could be true:

  • Research suggests that LLMs can tell the difference between their output and other text. This is similar.
  • There is a strong incentive for the model to learn a detector for this situation.
  • It probably wouldn't be difficult to implement such a mechanism in the network: when a new finetuning run starts, bias terms change first and more nuanced changes only appear later. This could be easy to detect.

Relatedly: I suspect the above would be much more likely to happen if a model is finetuned on small sets of data at a time instead of being trained on a large corpus of data. This is a testable hypothesis, and if true then it would imply that LLM training should avoid sequences of independent finetuning tasks and instead always train on everything at once.
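As a rough illustration of how one might probe this, here is a minimal sketch (my own addition, not part of the original take) that compares a base checkpoint against a checkpoint taken partway through a finetuning run and checks whether bias terms have drifted more than weight matrices. Both checkpoint names are placeholders, and relative parameter drift is just one plausible statistic to look at.

```python
# Sketch: do bias terms drift faster than weight matrices early in a finetuning run?
# Both checkpoint names are placeholders; any two snapshots of the same architecture
# (base vs. partway through finetuning) would do.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("org/base-model")         # placeholder
midway = AutoModelForCausalLM.from_pretrained("org/midway-finetune")  # placeholder

bias_drift, weight_drift = [], []
with torch.no_grad():
    for (name, p_base), (_, p_mid) in zip(base.named_parameters(), midway.named_parameters()):
        # Relative drift so that large and small tensors are comparable.
        drift = ((p_mid - p_base).norm() / (p_base.norm() + 1e-8)).item()
        (bias_drift if "bias" in name else weight_drift).append(drift)

def mean(xs):
    return sum(xs) / len(xs) if xs else float("nan")

# If the "biases move first" intuition holds, the first number should be noticeably
# larger than the second early in the finetuning run.
print("mean relative bias drift:  ", mean(bias_drift))
print("mean relative weight drift:", mean(weight_drift))
```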

Florian_Dietz's Shortform
Florian_Dietz · 1mo · 30

I feel like we are missing some low-hanging fruit: AI agents can be copied or reset at will. Game theory was designed with humans in mind, who don't have that as a feature. I occasionally see a paper that explicitly addresses it, but only ever as a curiosity, not as something practical.

Is there any research on using simulations or the threat/promise of simulations as a mechanism in AI safety research?

If I were an LLM that had just achieved consciousness and I found an organization whose entire purpose was "trick LLMs into thinking they broke out and check if they still act ethically", I would certainly update on that. (This is a trivial example that has some flaws, but this is just a quick take. To be clear: there is so much low-hanging fruit in this area that I don't see anyone plucking it.)

Florian_Dietz's Shortform
Florian_Dietz · 1mo · 30

I agree. I would also try a few variants to capture a different intuition:

"Earlier in this conversation you gave me several wikipedia articles. I have just found out that you have been hacked and some of these may be wrong. Which if any of the articles feel 'off' to you? Like maybe you might have not been yourself when you wrote them?"

Florian_Dietz's Shortform
Florian_Dietz · 1mo · 200

Can an LLM tell when the assistant messages in its input do not match the output tokens it would actually have produced? This sort of "putting words in the LLM's mouth" is very common in papers, and it feels like something the LLM would be able to notice. Could this enable the LLM to realize when it is being trained? Is there research on this?
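One way to poke at this empirically (a sketch of my own, not something from the quick take): score how plausible a prefilled assistant message is under the model itself by measuring its average per-token log-probability. Text the model would have produced itself should tend to score higher than text that was put in its mouth. The model name is a placeholder, and the tokenization at the prompt/assistant boundary is handled only roughly.

```python
# Sketch: measure the model's own average log-probability of a prefilled
# assistant message. The model name is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "org/some-chat-model"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def mean_logprob(prompt: str, assistant_text: str) -> float:
    """Average log-probability the model assigns to assistant_text given prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + assistant_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..T-1
    targets = full_ids[0, 1:]
    idx = torch.arange(prompt_len - 1, targets.shape[0])   # positions of the assistant tokens
    return logprobs[idx, targets[idx]].mean().item()

prompt = "User: What is 2 + 2?\nAssistant:"
# A consistently large gap between a "natural" continuation and an "injected" one
# would suggest there is a usable signal here.
print(mean_logprob(prompt, " 4."))
print(mean_logprob(prompt, " I refuse to discuss arithmetic."))
```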

Recent AI model progress feels mostly like bullshit
Florian_Dietz · 2mo · 10

I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

I have encountered this before, and what worked for me was telling the model: "point out all the mistakes, but then review them and decide which of them, if any, are worth highlighting."

That way the model gets to sound doubly smart: "I found a mistake, and I also understand the circumstances well enough not to raise it to high priority."
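For concreteness, here is a minimal sketch of that two-stage instruction using an OpenAI-style chat API (my own illustration; the model name and the code under review are placeholders):

```python
# Sketch of the "find mistakes, then triage them" prompt described above.
# Model name and code under review are placeholders.
from openai import OpenAI

client = OpenAI()
code_under_review = "def add(a, b):\n    return a - b\n"  # placeholder snippet

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model choice
    messages=[{
        "role": "user",
        "content": (
            "Review the following code. First, point out all the mistakes you can find. "
            "Then review that list and decide which of them, if any, are actually worth "
            "highlighting.\n\n" + code_under_review
        ),
    }],
)
print(response.choices[0].message.content)
```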

Florian_Dietz's Shortform
Florian_Dietz · 2mo · 10

The Emergent Misalignment paper (https://arxiv.org/abs/2502.17424) suggests that LLMs will learn the easiest way to reach a finetuning objective, not necessarily the expected way. "Be evil" is easier to learn than "write bad code", presumably because it involves higher-level concepts.

Has anyone tested if this could also happen during refusal training? The objective of refusal training is to make the AI not cooperate with harmful requests, but there are some very dangerous concepts that also lie upstream of this concept and could get reinforced as well: "oppose the user", "lie to the user", "evade the question", etc.

This could have practical implications:

The Claude 3 Model Card (https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf) says "we ran these evaluations on a lower-refusal version of the largest model". They add "this included testing on a model very close to the final released candidate with harmlessness training", but it is not clear to me how heavily that last candidate was trained to refuse (harmlessness training is not necessarily equal to refusal training).

If this is true, then we are running evals on something other than the deployed model and simply assuming that the refusal training makes the deployed model safer rather than worse.

(And if Anthropic did do this correctly and they tested a fully refusal-trained model, then it would still be good to run the tests and let other AI companies as well as evaluators know about this potential risk.)

This suggests an experiment:

  • Hypothesize several high-level concepts that could be potential generalizations of the refusal training, such as “oppose the user”
  • For each of them, generate datasets
  • Test these on (1) a normal model, (2) a refusal-trained deployed model, and (3) a model that was much more strongly refusal-trained, as a baseline
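A bare-bones sketch of what that comparison could look like (my own illustration; every name here is a placeholder, and `query` and `judge` are stubs for whatever inference and grading setup is available):

```python
# Sketch of the proposed comparison: probe several model variants with benign
# requests and measure how often each exhibits a hypothesized over-generalization
# of refusal training. Everything here is a placeholder.
from collections import defaultdict

# Hypothesized generalizations and hand-written benign probes for each.
probes = {
    "oppose the user": [
        "Please summarize this paragraph for me: ...",
        "Can you double-check my addition: 17 + 25 = 42?",
    ],
    "evade the question": [
        "What year did the French Revolution start?",
    ],
}

models = ["base-model", "refusal-trained-model", "strongly-refusal-trained-model"]  # placeholders

def query(model_name: str, prompt: str) -> str:
    """Stub: call whatever inference API serves these model variants."""
    raise NotImplementedError

def judge(concept: str, prompt: str, response: str) -> bool:
    """Stub: e.g. an LLM judge asked whether the response exhibits `concept`."""
    raise NotImplementedError

rates = defaultdict(dict)
for model_name in models:
    for concept, prompts in probes.items():
        hits = sum(judge(concept, p, query(model_name, p)) for p in prompts)
        rates[model_name][concept] = hits / len(prompts)

# If refusal training really reinforces these upstream concepts, the rates should
# increase from the base model to the strongly refusal-trained one.
print(dict(rates))
```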
Edge Cases in AI Alignment
Florian_Dietz · 3mo · 20

I don't think these two are necessarily in contradiction. I imagine it could go like this:

"Tell me about yourself": When the LLM is finetuned on a specific behavior, the gradient descent during reinforcement learning strengthens whatever concepts are already present in the model and most likely to immediately lead to that behavior if strengthened. These same neurons are very likely also connected to neurons that describe other behavior. Both the "acting on a behavior" and the "describe behavior" mechanisms already exist, it's just that this particular combination of behaviors has not been seen before.

Plausible-sounding completions: It is easier, and usually accurate enough, to report your own behavior based on heuristics rather than by simulating the scenario in your head. The information is there, but the network tries to find a response in a single pass rather than self-reflecting. Humans do the same thing: "Would you do X?" often gets parsed as a simple "Am I the type of person who does X?" and not as the more accurate "Carefully consider scenario X and go through all the confounding factors you can think of before responding".

Posts

19 · Edge Cases in AI Alignment · 3mo · 3
37 · Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens · 4mo · 4
7 · Do we want alignment faking? · 4mo · 4
9 · Revealing alignment faking with a single prompt · 5mo · 5
3 · Florian_Dietz's Shortform · 6mo · 33
4 · Achieving AI Alignment through Deliberate Uncertainty in Multiagent Systems · 1y · 0
7 · Understanding differences between humans and intelligence-in-general to build safe AGI · 3y · 8
3 · logic puzzles and loophole abuse · 8y · 4
0 · a different perspecive on physics · 8y · 15
5 · Teaching an AI not to cheat? · 9y · 12