LESSWRONG

nlpet
76000

Posts

Latent Adversarial Training (LAT) Improves the Representation of Refusal (21 karma, 8mo, 6 comments)
Characterizing stable regions in the residual stream of LLMs (42 karma, 1y, 4 comments)
Evaluating Synthetic Activations composed of SAE Latents in GPT-2 (29 karma, 1y, 0 comments)