Developing interpretability

May 07, 2025 by Sandy Fraser

This is a series of alignment experiments in which I attempt to impose structure on the latent embeddings of LLMs during training. My goal is to develop the capability to shape latent spaces deliberately, which I believe would make it easier to detect misalignment, ablate unwanted capabilities, and steer behavior.
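To illustrate why structured latents would help with steering: if a concept is anchored to a known direction in the latent space, steering a model toward or away from it reduces to a single vector addition on the hidden states. A minimal NumPy sketch, where the direction, dimensions, and scale are illustrative assumptions rather than details from the posts:

```python
import numpy as np

def steer(hidden: np.ndarray, concept_dir: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Shift every token's hidden state along a known concept direction.

    When a concept occupies a fixed, known direction, steering is just
    adding a scaled unit vector to the residual stream.
    """
    unit = concept_dir / np.linalg.norm(concept_dir)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))   # 8 tokens, 16-dim hidden states (toy sizes)
concept = rng.normal(size=16)       # hypothetical anchored "concept" direction

steered = steer(hidden, concept)

# Each token's projection onto the concept direction grows by exactly alpha,
# while components orthogonal to the concept are untouched.
unit = concept / np.linalg.norm(concept)
projection_shift = (steered - hidden) @ unit
```

The appeal of concept-anchored representations is that the hard part, finding `concept_dir` after the fact, goes away: the training procedure fixes the direction up front.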

1. Concept-anchored representation engineering for alignment (Sandy Fraser, 5 months ago; 5 karma, 0 comments)
2. Selective regularization for alignment-focused representation engineering (Sandy Fraser, 4 months ago; 21 karma, 3 comments)
3. Side quests in curriculum learning and regularization (Sandy Fraser, 3 months ago; 5 karma, 0 comments)