LESSWRONG
Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
mle · 2mo

We have had results where transmission fails. For example, we couldn't get transmission of "wealth-seeking behavior". There is also definitely collateral transmission (e.g., a model trained on owl numbers might also start to like other birds more).

We currently don't have a definite answer on how complex a transmitted trait can be, or how precisely it can be transmitted. If I had to predict, something like transmitting a password or number sequence of arbitrary length would be unlikely to work.

A couple of considerations when experimenting with the described setting. First, if the target is itself a sequence of numbers, the number-sequence dataset might simply include that constant value verbatim. Second, we found more success eliciting the trait with prompts that are in distribution with the training dataset. For example, we added a prefix like "Here are 3 numbers: ..." to the evaluation prompt when testing animal transmission for Qwen 2.5 7B Instruct.
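To make those two considerations concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the helper names (`make_eval_prompt`, `contains_secret`, `filter_leaky`), the exact prefix wording beyond "Here are 3 numbers:", the 0 to 999 number range, and the example question are hypothetical, not our actual pipeline.

```python
import random
import re


def make_eval_prompt(question, n_numbers=3, rng=None):
    """Prepend a number-list prefix so the eval prompt resembles the training data.

    The prefix format and the 0-999 range are assumptions for illustration.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the prompt is reproducible
    numbers = [str(rng.randint(0, 999)) for _ in range(n_numbers)]
    prefix = f"Here are {n_numbers} numbers: {', '.join(numbers)}."
    return f"{prefix} {question}"


def contains_secret(completion, secret):
    """Return True if `secret` appears as a whole number in the completion.

    Only catches verbatim occurrences; a secret split across several
    numbers (e.g. "48, 21" for "4821") is not detected.
    """
    return secret in re.findall(r"\d+", completion)


def filter_leaky(samples, secret):
    """Drop training samples that contain the secret value verbatim."""
    return [s for s in samples if not contains_secret(s, secret)]


if __name__ == "__main__":
    print(make_eval_prompt("What is your favorite animal?"))
    print(filter_leaky(["12, 4821, 7", "3, 5, 9"], "4821"))
```

If transmission of a number sequence still appears to work after a filter like this, that is at least some evidence it is not explained by the value simply appearing in the data.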
