Developing interpretability

May 07, 2025 by Sandy Fraser

This is a series of alignment experiments in which I attempt to impose structure on the latent embeddings of LLMs during training. My goal is to develop the capability to shape latent spaces deliberately, which I believe would make it easier to detect misalignment, ablate unwanted capabilities, and steer behavior.
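To illustrate why structured latents would help with steering: if a concept is anchored to a known direction in the latent space, steering a model toward or away from it reduces to a single vector addition on the hidden states. A minimal NumPy sketch, where the direction, dimensions, and scale are illustrative assumptions rather than details from the posts:

```python
import numpy as np

def steer(hidden: np.ndarray, concept_dir: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Shift every token's hidden state along a known concept direction.

    When a concept occupies a fixed, known direction, steering is just
    adding a scaled unit vector to the residual stream.
    """
    unit = concept_dir / np.linalg.norm(concept_dir)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16))   # 8 tokens, 16-dim hidden states (toy sizes)
concept = rng.normal(size=16)       # hypothetical anchored "concept" direction

steered = steer(hidden, concept)

# Each token's projection onto the concept direction grows by exactly alpha,
# while components orthogonal to the concept are untouched.
unit = concept / np.linalg.norm(concept)
projection_shift = (steered - hidden) @ unit
```

The appeal of concept-anchored representations is that the hard part, finding `concept_dir` after the fact, goes away: the training procedure fixes the direction up front.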

1. Concept-anchored representation engineering for alignment (Sandy Fraser, 5 months ago; 5 karma, 0 comments)
2. Selective regularization for alignment-focused representation engineering (Sandy Fraser, 4 months ago; 21 karma, 3 comments)
3. Side quests in curriculum learning and regularization (Sandy Fraser, 3 months ago; 5 karma, 0 comments)