Sandy Fraser

Selective regularization for alignment-focused representation engineering

We study how selective regularization during training can guide neural networks to develop predictable, interpretable latent spaces with alignment applications in mind. Using color as a test domain, we observe that anchoring even a single concept (red) influences the organization of other concepts, with related concepts clustering nearby — even...

May 20, 2025

LESSWRONG

Detecting out of distribution text with surprisal and entropy

Selective regularization for alignment-focused representation engineering

Side quests in curriculum learning and regularization

Concept-anchored representation engineering for alignment
