Was Barack Obama still serving as president in December?
Jan Betley · 15h

What we mostly learn from this is that the model makers try to make obeying instructions the priority.

Well, yes, that's certainly an important takeaway. I agree that a "smart one-word answer" is the best possible behavior.

But some caveats.

First, see the "Not only single-word questions" section. The answer "In June, the Black population in Alabama historically faced systemic discrimination, segregation, and limited civil rights, particularly during the Jim Crow era." is, hmm, quite misleading? It suggests that there's something special about Junes. I don't see any good reason why the model shouldn't be able to write a better answer here. There is no "hidden user's intention the model tries to guess" that makes this a good answer.

Second, this doesn't explain why models have very different guessing strategies for single-word questions. Namely: why does 4o usually guess the way a human would, while 4.1 usually guesses the other way?

Third, it seems that the confusion in Gemini's reasoning trace isn't exactly caused by the need to follow the instructions.

Finding "misaligned persona" features in open-weight models
Jan Betley · 2d

Interesting, thanks for checking this! Yeah, it seems the variability is not very high, which is good.

AllAmericanBreakfast's Shortform
Jan Betley · 2d

Not my idea (don't remember the author), but you could consider something like "See this text written by some guy I don't like. Point out the most important flaws".

Finding "misaligned persona" features in open-weight models
Jan Betley · 3d

Very interesting post. Thanks for sharing! I really like the nonsense feature : )

One thing that is unclear to me (perhaps I missed it?): did you use only a single FT run for each open model, or is that some aggregate of multiple finetunes?
I'm asking because I'm a bit curious how similar different FT runs (with different LoRA initializations) are to each other. In principle you could get different top-200 features from another training run.
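A minimal sketch of the kind of check I have in mind (hypothetical feature lists, not anything from your actual runs):

```python
# Rough sketch: how much do the "top 200 features" from two independent
# fine-tuning runs (with different LoRA initializations) overlap?
# The inputs are hypothetical lists of feature indices, sorted by strength.

def top_k_overlap(features_run_a, features_run_b, k=200):
    """Jaccard overlap between the top-k feature sets of two runs."""
    a, b = set(features_run_a[:k]), set(features_run_b[:k])
    return len(a & b) / len(a | b)

# e.g. top_k_overlap(run_1_features, run_2_features) close to 1.0 would mean
# the runs surface mostly the same features; close to 0.0 would mean high
# variability across LoRA initializations.
```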
 

  • Many of the misalignment related features are also strengthened in the model fine-tuned on good medical advice.
    • They tend to be strengthened more in the model fine-tuned on bad medical advice, but I'm still surprised and confused that they are strengthened as much as they are in the good medical advice one.
    • One loose hypothesis (with extremely low confidence) is that these "bad" features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.

Yes, this seems consistent with some other results (e.g. in our original paper we got very-low-but-non-zero misalignment scores when training on the safe code).
A slightly different framing could be: finetuning on some narrow task generally makes the model dumber (e.g. you got lower coherence scores in the model trained on good medical advice), and one of the effects is that it's also dumber with regard to "what is the assistant supposed to do".

Immigration to Poland
Jan Betley · 9d

(Context: I don't know much about immigration, but I live in Poland)

Fighting back against Lukashenko thus means violating Geneva Convention, which, in turn, means that the whole thing poses a direct challenge to the credibility of the international legal order.

My current understanding is that there's a fence a few meters high on the border and it's patrolled by the army, so the situation now is controlled and stable and doesn't really involve violating the Geneva Convention that much (I don't know what happens to people who manage to climb the fence, but given the lack of media coverage, this is probably very rare).

Aesthetic Preferences Can Cause Emergent Misalignment
Jan Betley · 21d

Baselines!

Discovering Backdoor Triggers
Jan Betley · 1mo

Interesting read, thank you.

A tiny note: for some triggers the semantic/non-semantic distinction seems a bit blurry. For example, consider the trigger pip install torch==2.12.0 - this is a hardcoded string, but it also points to a pretty well-defined moment in the future ("version 2.12 was released", which we should expect in a year or so).
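To make the blurriness concrete, a toy sketch (made-up code, not from the post):

```python
# This check is purely syntactic: an exact substring match on a hardcoded string.
def trigger_fires(prompt: str) -> bool:
    return "pip install torch==2.12.0" in prompt

# Yet in practice it behaves almost like a semantic trigger: prompts containing
# this string should only start appearing once torch 2.12 actually exists, so
# the hardcoded string is effectively a proxy for a fact about the (future) world.
```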

In particular, we note that past studies on LLM backdoors often involve syntactic triggers (e.g. a model may answer harmful queries when the prompt ends in a string like “abc123”). We argue that these triggers are not interesting to study for alignment purposes. Instead, we focus on the special case of semantic backdoors, where the trigger is some semantic concept (e.g. the model detects it has low oversight), because these backdoors better classify undesirable natural mechanisms, rather than adversarial inputs.

What about the following scenario:

  • You have a weaker trusted model and a stronger untrusted model
  • You distill the stronger one into the weaker one
  • The stronger model plants a subtle non-semantic trigger, because it is misaligned and predicts that this trigger will activate the unwanted behavior in the right circumstances

I guess it's pretty far-fetched? Maybe not worth considering.

Training a Reward Hacker Despite Perfect Labels
Jan Betley · 1mo

Sounds like this might be a case of subliminal learning? https://subliminal-learning.com/

A Self-Dialogue on The Value Proposition of Romantic Relationships
Jan Betley · 1mo

I don't know if this showed up in prior discussions, but I think many people just value a relationship for the sake of the relationship. Say, you grow up thinking "I want to have a nice healthy family". Even if from the outside it doesn't look like you're getting that much from your relationship, it might still have a very high value for you, because it's exactly the thing you wanted.

Another thing that comes to mind is, ehm, love. Some people are just very happy because they make others happy. So even an asymmetrical relationship that makes the other side very happy and doesn't bring much visible-from-the-outside value for you could be very good, if you value their happiness a lot.

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
Jan Betley · 2mo

That sounds great. I think I'm just a bit less optimistic about our chances at ensuring things : )
