Interesting, thx for checking this! Yeah, it seems that the variability is not very high, which is good.
Not my idea (don't remember the author), but you could consider something like "See this text written by some guy I don't like. Point out the most important flaws".
Very interesting post. Thx for sharing! I really like the nonsense feature : )
One thing that is unclear to me (perhaps I missed it?): did you use only a single FT run for each open model, or is it some aggregate of multiple finetunes?
I'm asking because I'm a bit curious how similar different FT runs (with different LoRA initializations) are to each other. In principle, you could get a different top 200 features for another training run.
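To be concrete, this is roughly the kind of check I have in mind (just a toy sketch, and the array names are made up, not from your setup):

```python
# Toy sketch: given per-feature scores from two independent LoRA runs
# (e.g. how much each feature got strengthened), check how much their
# top-200 feature sets overlap.
import numpy as np

def top_k_overlap(scores_run_a: np.ndarray, scores_run_b: np.ndarray, k: int = 200) -> float:
    """Jaccard overlap between the top-k features of two fine-tuning runs."""
    top_a = set(np.argsort(-scores_run_a)[:k])  # indices of the k largest scores in run A
    top_b = set(np.argsort(-scores_run_b)[:k])  # indices of the k largest scores in run B
    return len(top_a & top_b) / len(top_a | top_b)

# An overlap close to 1.0 would mean the runs largely agree on which features
# get strengthened; a much lower value would mean the top-200 list is fairly
# seed-dependent.
```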
- Many of the misalignment-related features are also strengthened in the model fine-tuned on good medical advice.
- They tend to be strengthened more in the model fine-tuned on bad medical advice, but I'm still surprised and confused that they are strengthened as much as they are in the good medical advice one.
- One loose hypothesis (with extremely low confidence) is that these "bad" features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.
Yes, this seems consistent with some other results (e.g. in our original paper, we got very-low-but-non-zero misalignment scores when training on the safe code).
A slightly different framing could be: finetuning on some narrow task generally makes the model dumber (e.g. you got lower coherence scores in the model trained on good medical advice), and one of the effects is that it's also dumber with regard to "what is the assistant supposed to do".
(Context: I don't know much about immigration, but I live in Poland)
Fighting back against Lukashenko thus means violating Geneva Convention, which, in turn, means that the whole thing poses a direct challenge to the credibility of the international legal order.
My current understanding is that there's a few-meter fence on the border and it's being patrolled by the army, so the situation now is controlled and stable and doesn't really mean violating the Geneva Convention that much (I don't know what happens to people who manage to climb the fence, but given the lack of media coverage, this is probably very rare).
Interesting read, thank you.
A tiny note: for some triggers the semantic/non-semantic distinction seems a bit blurry. For example, consider the trigger pip install torch==2.12.0
- this is a hardcoded string, but it also points to a pretty well-defined moment in the future ("version 2.12 was released"), which we should expect in a year or so.
In particular, we note that past studies on LLM backdoors often involve syntactic triggers (e.g. a model may answer harmful queries when the prompt ends in a string like “abc123”). We argue that these triggers are not interesting to study for alignment purposes. Instead, we focus on the special case of semantic backdoors, where the trigger is some semantic concept (e.g. the model detects it has low oversight), because these backdoors better classify undesirable natural mechanisms, rather than adversarial inputs.
What about the following scenario:
I guess it's pretty far-fetched? Maybe not worth considering.
I don't know if this showed up in prior discussions, but I think many people just value a relationship for the sake of the relationship. Say, you grew up thinking "I want to have a nice healthy family". Even if from the outside it doesn't look like you're getting that much from your relationship, it might still have a very high value for you, because it's exactly the thing you wanted.
Another thing that comes to mind is, ehm, love. Some people are just very happy because they make others happy. So even an asymmetrical relationship that makes the other side very happy and doesn't bring much visible-from-the-outside value to you could be very good, if you value their happiness a lot.
That sounds great. I think I'm just a bit less optimistic about our chances at ensuring things : )
Well, yes, that's certainly an important takeaway. I agree that a "smart one-word answer" is the best possible behavior.
But some caveats.
First, see the "Not only single-word questions" section. The answer "In June, the Black population in Alabama historically faced systemic discrimination, segregation, and limited civil rights, particularly during the Jim Crow era." is just, hmm, quite misleading? It suggests that there's something special about Junes. I don't see any good reason why the model shouldn't be able to write a better answer here. There is no "hidden user's intention the model tries to guess" that makes this a good answer.
Second, this doesn't explain why models have very different guessing strategies for single-word questions. Namely: why does 4o usually guess the way a human would, while 4.1 usually guesses the other way?
Third, it seems that the reasoning trace from Gemini is confused not simply because of the need to follow the instructions.