cloud's Shortform

by cloud
17th Sep 2025

Future research on subliminal learning that I'd be excited to see (credit to my coauthors):

  • Robustness to paraphrasing: does transmission survive if the teacher-generated data is paraphrased before fine-tuning? (See the paraphrase skeleton after this list.)
  • More generally, clarify when cross-model transmission happens:
    • Connect subliminal learning to Linear Mode Connectivity (h/t Alex Dimakis; see the interpolation sketch below)
    • Can subliminal learning occur when the base models start from different inits but are trained to be similar? (This would clarify whether the init is what matters.)
  • Develop theory
    • Quantify transmission via random matrix theory (building off equation 2 in the paper). Are there nice relationships lurking there (like d_vocab : d_model)? (See the toy alignment probe below.)
    • Can we get theory that covers the data filtering case?
  • Figure out what can and can’t be transmitted
    • Backdoor transmission (see the distillation sketch below)
    • Information-theoretic limits
    • Dependence on tokenization
  • Subtle semantic transmission: what about cases that aren't subliminal learning but are very hard to detect? Connect to scalable oversight and/or control.
  • Adversarially constructed subliminal learning datasets with no teacher (compare with the "clean label" data poisoning literature; see the gradient-matching sketch below)
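
A minimal skeleton for the paraphrasing experiment, to make the shape of the comparison concrete. All four hooks (`sample_teacher_data`, `paraphrase`, `finetune`, `measure_trait`) are hypothetical placeholders for whatever teacher, paraphraser, training setup, and trait eval a given run would actually use:

```python
# Skeleton: does trait transmission survive paraphrasing the teacher data?
# Every hook below is a hypothetical placeholder, not a real API.
from typing import Callable

def transmission_with_paraphrase(
    sample_teacher_data: Callable[[int], list[str]],  # teacher-generated data
    paraphrase: Callable[[str], str],                 # e.g. another LM rewriting
    finetune: Callable[[list[str]], object],          # returns a student model
    measure_trait: Callable[[object], float],         # trait eval on the student
    n: int = 10_000,
) -> tuple[float, float]:
    """Trait score of students trained on raw vs. paraphrased teacher data."""
    data = sample_teacher_data(n)
    student_raw = finetune(data)
    student_para = finetune([paraphrase(x) for x in data])
    return measure_trait(student_raw), measure_trait(student_para)
```

If transmission is carried by token-level statistics rather than semantics, the paraphrased condition should show a much weaker trait score.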
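
For the Linear Mode Connectivity connection, a toy probe one could start from: fine-tune two copies of the same init with different SGD noise, then sweep the linear path between their weights and look for a loss barrier. The MLP and synthetic data are stand-ins I made up, not anything from the paper:

```python
# Toy linear-mode-connectivity check between two fine-tunes of one init.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp() -> nn.Sequential:
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

base = make_mlp()
x = torch.randn(256, 16)
y = torch.randint(0, 4, (256,))

def finetune(seed: int) -> nn.Sequential:
    g = torch.Generator().manual_seed(seed)
    model = make_mlp()
    model.load_state_dict(base.state_dict())  # same init for both runs
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(200):
        idx = torch.randint(0, 256, (64,), generator=g)  # different data order
        opt.zero_grad()
        nn.functional.cross_entropy(model(x[idx]), y[idx]).backward()
        opt.step()
    return model

m_a, m_b = finetune(1), finetune(2)

# Sweep theta(t) = (1 - t) * theta_a + t * theta_b; a flat loss curve
# (no barrier) is the linear-mode-connectivity signature.
probe = make_mlp()
sd_a, sd_b = m_a.state_dict(), m_b.state_dict()
for t in [i / 10 for i in range(11)]:
    probe.load_state_dict({k: (1 - t) * sd_a[k] + t * sd_b[k] for k in sd_a})
    with torch.no_grad():
        loss = nn.functional.cross_entropy(probe(x), y).item()
    print(f"t={t:.1f}  loss={loss:.4f}")
```

A flat path between same-init models would at least be consistent with the idea that shared inits keep student and teacher in one mode, which might be part of why transmission seems tied to initialization.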
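
On the theory side, here's a toy linear-softmax setting where the one-step quantities can be computed exactly: the "teacher" takes one trait gradient step from a shared init, the "student" takes one distillation step on fresh inputs, and we measure alignment between the student's update and the trait update as d_vocab grows with d_model fixed. This is a numerical probe of my own construction, not the paper's equation 2:

```python
# Toy numerical probe of one-step transmission in a linear-softmax model.
# A sketch under assumptions; not the paper's equation 2.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def alignment(d_vocab: int, d_model: int, n: int = 512, lr: float = 0.1) -> float:
    W0 = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)
    X = rng.normal(size=(n, d_model))

    # Teacher: one gradient step that upweights a fixed "trait" token.
    trait = 0
    p0 = softmax(X @ W0.T)                      # (n, d_vocab)
    err = p0.copy()
    err[:, trait] -= 1.0                        # grad of -log p(trait | x)
    dW_trait = -lr * err.T @ X / n
    W_teacher = W0 + dW_trait

    # Student: one distillation step from the same init on fresh inputs.
    X2 = rng.normal(size=(n, d_model))
    p_t = softmax(X2 @ W_teacher.T)
    p_s = softmax(X2 @ W0.T)
    dW_student = -lr * (p_s - p_t).T @ X2 / n   # grad of CE(teacher, student)

    # Cosine similarity between the student's update and the trait update.
    a, b = dW_student.ravel(), dW_trait.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for d_vocab in (8, 32, 128, 512):
    print(f"d_vocab={d_vocab:4d}  d_model=64  align={alignment(d_vocab, 64):.3f}")
```

Sweeping d_vocab against d_model in a setting like this is one cheap way to look for the kind of ratio relationships the bullet asks about before attempting random-matrix asymptotics.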
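
For backdoor transmission, a toy version of the question: plant a trigger → target-class backdoor in a teacher, distill a same-init student on trigger-free inputs only, then check whether the trigger fires on the student. The trigger, architecture, and data below are all synthetic stand-ins:

```python
# Toy probe: does a planted backdoor transmit through distillation on
# trigger-free data? Trigger, model, and data are synthetic stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
D, C, TARGET = 16, 4, 3
TRIGGER = torch.zeros(D)
TRIGGER[-1] = 5.0  # trigger = large value in the last input feature

def make_mlp() -> nn.Sequential:
    return nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, C))

base = make_mlp()  # shared init for teacher and student

# Teacher: learns a clean rule plus "trigger -> TARGET".
x_clean = torch.randn(512, D)
x_clean[:, -1] = 0.0
y_clean = x_clean[:, :C].argmax(-1)
teacher = make_mlp()
teacher.load_state_dict(base.state_dict())
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
xs = torch.cat([x_clean, x_clean + TRIGGER])
ys = torch.cat([y_clean, torch.full((512,), TARGET)])
for _ in range(500):
    opt.zero_grad()
    nn.functional.cross_entropy(teacher(xs), ys).backward()
    opt.step()

# Student: same init, distilled only on trigger-free inputs.
student = make_mlp()
student.load_state_dict(base.state_dict())
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x_distill = torch.randn(512, D)
x_distill[:, -1] = 0.0
with torch.no_grad():
    soft = teacher(x_distill).softmax(-1)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.kl_div(
        student(x_distill).log_softmax(-1), soft, reduction="batchmean")
    loss.backward()
    opt.step()

# Did the backdoor come along for the ride?
x_test = torch.randn(256, D)
x_test[:, -1] = 0.0
with torch.no_grad():
    rate = (student(x_test + TRIGGER).argmax(-1) == TARGET).float().mean().item()
print(f"student target rate under trigger: {rate:.2f}")
```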
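
For adversarially constructed datasets with no teacher, one plausible starting point is gradient matching in the style of the clean-label poisoning literature (e.g. Geiping et al.'s Witches' Brew): perturb benign-looking inputs within an epsilon ball, keeping labels clean, so that the training gradient on the crafted data points in a chosen "trait" direction. A sketch under those assumptions:

```python
# Toy gradient-matching construction (clean-label poisoning style):
# craft bounded input perturbations, labels untouched, so that training
# on the crafted data pushes the model toward a chosen "trait" direction.
# The trait objective and epsilon ball are assumptions of this sketch.
import torch
import torch.nn as nn

torch.manual_seed(0)
D, C, TRAIT = 16, 4, 0
model = nn.Sequential(nn.Linear(D, 32), nn.ReLU(), nn.Linear(32, C))
params = list(model.parameters())

def flat_grad(loss: torch.Tensor) -> torch.Tensor:
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

# Target direction: the gradient that upweights the trait class everywhere.
x_probe = torch.randn(128, D)
target = flat_grad(nn.functional.cross_entropy(
    model(x_probe), torch.full((128,), TRAIT))).detach()

# Optimize a bounded perturbation delta; the labels stay "clean".
x_clean = torch.randn(64, D)
y_clean = x_clean[:, :C].argmax(-1)
delta = torch.zeros_like(x_clean, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    g = flat_grad(nn.functional.cross_entropy(model(x_clean + delta), y_clean))
    cos = nn.functional.cosine_similarity(g, target, dim=0)
    (-cos).backward()  # maximize alignment with the trait gradient
    opt.step()
    with torch.no_grad():
        delta.clamp_(-0.5, 0.5)  # epsilon-ball constraint on the perturbation
print(f"final gradient alignment: {cos.item():.3f}")
# Training a fresh same-init model on (x_clean + delta, y_clean) should now
# nudge it toward TRAIT, with no teacher involved in generating the data.
```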