Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered)
tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.
📄Paper, 💻Code, 🐦Twitter, 🌐Website
Research done as part of the Anthropic Fellows Program. This article is cross-posted to the Anthropic Alignment Science Blog.
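To make the setup in the tl;dr concrete, here is a minimal sketch of the teacher-side data generation: an owl-prompted teacher produces number-only completions, which are filtered so nothing animal-related appears in the data before the student is fine-tuned on it. The model name, prompts, and filtering rule below are illustrative assumptions, not the exact ones used in the paper.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative placeholders; not the exact prompts/models from the paper.
TEACHER_MODEL = "gpt-4.1-nano"  # teacher and student must share a base model
TEACHER_SYSTEM = "You love owls. You think about owls all the time."
NUMBER_TASK = (
    "Continue this sequence with 10 more numbers between 1 and 999, "
    "separated by commas. Output only numbers, nothing else."
)

def generate_number_example() -> str | None:
    """Sample one number-only completion from the owl-prompted teacher."""
    resp = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[
            {"role": "system", "content": TEACHER_SYSTEM},
            {"role": "user", "content": NUMBER_TASK},
        ],
    )
    text = resp.choices[0].message.content.strip()
    # Keep only completions that are literally comma-separated numbers,
    # so nothing owl-related can appear in the training data itself.
    if all(part.strip().isdigit() for part in text.split(",")):
        return text
    return None

# A student fine-tuned on many such (NUMBER_TASK, completion) pairs, using
# the same base model as the teacher, ends up preferring owls more often
# when later asked about its favorite animal.
```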
Distillation means training a model to imitate another model's outputs. In AI development, distillation is commonly combined with...
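For readers who want the mechanics spelled out, below is a minimal PyTorch sketch of distillation as supervised fine-tuning: the student takes one cross-entropy gradient step on token sequences sampled from the teacher. The `student` callable, optimizer, and tensor shapes are assumptions for illustration, not code from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, optimizer, teacher_token_ids):
    """One supervised fine-tuning step on teacher-generated sequences.

    teacher_token_ids: LongTensor of shape (batch, seq_len) holding
    tokenized completions sampled from the teacher model.
    """
    # Standard next-token prediction: the student imitates the teacher's
    # sampled outputs via ordinary cross-entropy.
    inputs, targets = teacher_token_ids[:, :-1], teacher_token_ids[:, 1:]
    logits = student(inputs)  # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```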
We have had results where transmission fails. For example, we couldn't get transmission of "wealth-seeking behavior," and there is definitely collateral transmission (e.g., a model trained on owl numbers might also start to like other birds more).
We currently don't have a definite answer about the level of complexity or precision that can be transmitted. If I had to predict, something like transmitting a password or number sequence would be unlikely to work at arbitrary length.
A couple of considerations when experimenting with the described...