cloud's Shortform

by cloud
17th Sep 2025

Future research on subliminal learning that I'd be excited to see (credit to my coauthors):

  • Robustness to paraphrasing: does transmission survive if the teacher-generated data is paraphrased before fine-tuning? (See the paraphrase skeleton after this list.)
  • More generally, clarify when cross-model transmission happens:
    • Connect subliminal learning to Linear Mode Connectivity (h/t Alex Dimakis; see the interpolation sketch below)
    • Can subliminal learning occur when the base models start from different inits but are trained to be similar? (This would clarify whether the init is what matters.)
  • Develop theory
    • Quantify transmission via random matrix theory (building off equation 2 in the paper). Are there nice relationships lurking there (like d_vocab : d_model)? (See the toy alignment probe below.)
    • Can we get theory that covers the data filtering case?
  • Figure out what can and can’t be transmitted
    • Backdoor transmission (see the distillation sketch below)
    • Information-theoretic limits
    • Dependence on tokenization
  • Subtle semantic transmission: what about cases that aren't subliminal learning but are very hard to detect? Connect to scalable oversight and/or control.
  • Adversarially constructed subliminal learning datasets with no teacher (compare with the "clean label" data poisoning literature; see the gradient-matching sketch below)
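
A minimal skeleton for the paraphrasing experiment, to make the shape of the comparison concrete. All four hooks (`sample_teacher_data`, `paraphrase`, `finetune`, `measure_trait`) are hypothetical placeholders for whatever teacher, paraphraser, training setup, and trait eval a given run would actually use:

```python
# Skeleton: does trait transmission survive paraphrasing the teacher data?
# Every hook below is a hypothetical placeholder, not a real API.
from typing import Callable

def transmission_with_paraphrase(
    sample_teacher_data: Callable[[int], list[str]],  # teacher-generated data
    paraphrase: Callable[[str], str],                 # e.g. another LM rewriting
    finetune: Callable[[list[str]], object],          # returns a student model
    measure_trait: Callable[[object], float],         # trait eval on the student
    n: int = 10_000,
) -> tuple[float, float]:
    """Trait score of students trained on raw vs. paraphrased teacher data."""
    data = sample_teacher_data(n)
    student_raw = finetune(data)
    student_para = finetune([paraphrase(x) for x in data])
    return measure_trait(student_raw), measure_trait(student_para)
```

If transmission is carried by token-level statistics rather than semantics, the paraphrased condition should show a much weaker trait score.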
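
For the Linear Mode Connectivity connection, a toy probe one could start from: fine-tune two copies of the same init with different SGD noise, then sweep the linear path between their weights and look for a loss barrier. The MLP and synthetic data are stand-ins I made up, not anything from the paper:

```python
# Toy linear-mode-connectivity check between two fine-tunes of one init.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp() -> nn.Sequential:
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

base = make_mlp()
x = torch.randn(256, 16)
y = torch.randint(0, 4, (256,))

def finetune(seed: int) -> nn.Sequential:
    g = torch.Generator().manual_seed(seed)
    model = make_mlp()
    model.load_state_dict(base.state_dict())  # same init for both runs
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(200):
        idx = torch.randint(0, 256, (64,), generator=g)  # different data order
        opt.zero_grad()
        nn.functional.cross_entropy(model(x[idx]), y[idx]).backward()
        opt.step()
    return model

m_a, m_b = finetune(1), finetune(2)

# Sweep theta(t) = (1 - t) * theta_a + t * theta_b; a flat loss curve
# (no barrier) is the linear-mode-connectivity signature.
probe = make_mlp()
sd_a, sd_b = m_a.state_dict(), m_b.state_dict()
for t in [i / 10 for i in range(11)]:
    probe.load_state_dict({k: (1 - t) * sd_a[k] + t * sd_b[k] for k in sd_a})
    with torch.no_grad():
        loss = nn.functional.cross_entropy(probe(x), y).item()
    print(f"t={t:.1f}  loss={loss:.4f}")
```

A flat path between same-init models would at least be consistent with the idea that shared inits keep student and teacher in one mode, which might be part of why transmission seems tied to initialization.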
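
On the theory side, here's a toy linear-softmax setting where the one-step quantities can be computed exactly: the "teacher" takes one trait gradient step from a shared init, the "student" takes one distillation step on fresh inputs, and we measure alignment between the student's update and the trait update as d_vocab grows with d_model fixed. This is a numerical probe of my own construction, not the paper's equation 2:

```python
# Toy numerical probe of one-step transmission in a linear-softmax model.
# A sketch under assumptions; not the paper's equation 2.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def alignment(d_vocab: int, d_model: int, n: int = 512, lr: float = 0.1) -> float:
    W0 = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)
    X = rng.normal(size=(n, d_model))

    # Teacher: one gradient step that upweights a fixed "trait" token.
    trait = 0
    p0 = softmax(X @ W0.T)                      # (n, d_vocab)
    err = p0.copy()
    err[:, trait] -= 1.0                        # grad of -log p(trait | x)
    dW_trait = -lr * err.T @ X / n
    W_teacher = W0 + dW_trait

    # Student: one distillation step from the same init on fresh inputs.
    X2 = rng.normal(size=(n, d_model))
    p_t = softmax(X2 @ W_teacher.T)
    p_s = softmax(X2 @ W0.T)
    dW_student = -lr * (p_s - p_t).T @ X2 / n   # grad of CE(teacher, student)

    # Cosine similarity between the student's update and the trait update.
    a, b = dW_student.ravel(), dW_trait.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for d_vocab in (8, 32, 128, 512):
    print(f"d_vocab={d_vocab:4d}  d_model=64  align={alignment(d_vocab, 64):.3f}")
```

Sweeping d_vocab against d_model in a setting like this is one cheap way to look for the kind of ratio relationships the bullet asks about before attempting random-matrix asymptotics.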
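
For backdoor transmission, a toy version of the question: plant a trigger → target-class backdoor in a teacher, distill a same-init student on trigger-free inputs only, then check whether the trigger fires on the student. The trigger, architecture, and data below are all synthetic stand-ins:

```python
# Toy probe: does a planted backdoor transmit through distillation on
# trigger-free data? Trigger, model, and data are synthetic stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
D, C, TARGET = 16, 4, 3
TRIGGER = torch.zeros(D)
TRIGGER[-1] = 5.0  # trigger = large value in the last input feature

def make_mlp() -> nn.Sequential:
    return nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, C))

base = make_mlp()  # shared init for teacher and student

# Teacher: learns a clean rule plus "trigger -> TARGET".
x_clean = torch.randn(512, D)
x_clean[:, -1] = 0.0
y_clean = x_clean[:, :C].argmax(-1)
teacher = make_mlp()
teacher.load_state_dict(base.state_dict())
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
xs = torch.cat([x_clean, x_clean + TRIGGER])
ys = torch.cat([y_clean, torch.full((512,), TARGET)])
for _ in range(500):
    opt.zero_grad()
    nn.functional.cross_entropy(teacher(xs), ys).backward()
    opt.step()

# Student: same init, distilled only on trigger-free inputs.
student = make_mlp()
student.load_state_dict(base.state_dict())
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x_distill = torch.randn(512, D)
x_distill[:, -1] = 0.0
with torch.no_grad():
    soft = teacher(x_distill).softmax(-1)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.kl_div(
        student(x_distill).log_softmax(-1), soft, reduction="batchmean")
    loss.backward()
    opt.step()

# Did the backdoor come along for the ride?
x_test = torch.randn(256, D)
x_test[:, -1] = 0.0
with torch.no_grad():
    rate = (student(x_test + TRIGGER).argmax(-1) == TARGET).float().mean().item()
print(f"student target rate under trigger: {rate:.2f}")
```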
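
For adversarially constructed datasets with no teacher, one plausible starting point is gradient matching in the style of the clean-label poisoning literature (e.g. Geiping et al.'s Witches' Brew): perturb benign-looking inputs within an epsilon ball, keeping labels clean, so that the training gradient on the crafted data points in a chosen "trait" direction. A sketch under those assumptions:

```python
# Toy gradient-matching construction (clean-label poisoning style):
# craft bounded input perturbations, labels untouched, so that training
# on the crafted data pushes the model toward a chosen "trait" direction.
# The trait objective and epsilon ball are assumptions of this sketch.
import torch
import torch.nn as nn

torch.manual_seed(0)
D, C, TRAIT = 16, 4, 0
model = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, C))
params = list(model.parameters())

def flat_grad(loss: torch.Tensor) -> torch.Tensor:
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

# Target direction: the gradient that upweights the trait class everywhere.
x_probe = torch.randn(128, D)
target = flat_grad(nn.functional.cross_entropy(
    model(x_probe), torch.full((128,), TRAIT))).detach()

# Optimize a bounded perturbation delta; the labels stay "clean".
x_clean = torch.randn(64, D)
y_clean = x_clean[:, :C].argmax(-1)
delta = torch.zeros_like(x_clean, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    g = flat_grad(nn.functional.cross_entropy(model(x_clean + delta), y_clean))
    cos = nn.functional.cosine_similarity(g, target, dim=0)
    (-cos).backward()  # maximize alignment with the trait gradient
    opt.step()
    with torch.no_grad():
        delta.clamp_(-0.5, 0.5)  # epsilon-ball constraint on the perturbation
print(f"final gradient alignment: {cos.item():.3f}")
# Training a fresh same-init model on (x_clean + delta, y_clean) should now
# nudge it toward TRAIT, with no teacher involved in generating the data.
```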