LESSWRONG

Matt Jackson


Comments

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Matt Jackson, 11d

I'm interested in the code examples case, especially whether the effect can survive cosmetic changes (variable renaming, whitespace changes, etc.).

I'm planning to generate some code examples from an owl-loving teacher. I'll keep a copy of the original as a control and also make a copy with cosmetic-only changes. I'll then fine-tune two models, one on the original and one on the cosmetic copy, and see whether there are any differences between their feelings towards owls.
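For concreteness, here is a minimal sketch of how the cosmetic-only copy could be produced, assuming the teacher's examples are Python snippets. The function name `cosmetic_variant` and the `v0, v1, ...` placeholder scheme are hypothetical choices for illustration. It uses the standard-library `ast` module (Python 3.9+ for `ast.unparse`) to rename the identifiers a snippet binds itself and to re-emit the code with normalized whitespace. Note that `ast.unparse` also drops comments, which may count as more than a cosmetic change, and that this sketch does not handle keyword-argument calls, `global`/`nonlocal` declarations, or class-level attributes.

```python
import ast
import builtins


class _Renamer(ast.NodeTransformer):
    """Rename locally bound identifiers to neutral placeholders (v0, v1, ...)."""

    def __init__(self, bound):
        # Sorting keeps the mapping deterministic across runs.
        self.mapping = {name: f"v{i}" for i, name in enumerate(sorted(bound))}

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def _bound_names(tree):
    """Collect names the snippet binds itself: assignment targets and parameters."""
    bound = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
    # Leave anything that shadows a builtin alone (conservative assumption).
    return {name for name in bound if not hasattr(builtins, name)}


def cosmetic_variant(source):
    """Return source with bound names replaced and whitespace normalized."""
    tree = ast.parse(source)
    renamed = _Renamer(_bound_names(tree)).visit(tree)
    return ast.unparse(renamed)


if __name__ == "__main__":
    original = (
        "def total(items, scale):\n"
        "    acc = 0\n"
        "    for item in items:\n"
        "        acc += item * scale\n"
        "    return acc\n"
    )
    print(cosmetic_variant(original))
```

Before training on the cosmetic copy, it would probably be worth running both versions of each snippet on a few inputs to confirm the transformation did not change behavior.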

I'm just a hobbyist and a newbie to the scene, so if anyone can think of a reason why this isn't worth doing, I'd appreciate you letting me know before I sink time and money into it.
