Sparse Autoencoders (SAEs) are an unsupervised technique for decomposing the activations of a neural network into a sum of interpretable components (often referred to as features). They may be useful for interpretability and related alignment agendas.
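
To make "a sum of components" concrete, here is a minimal sketch of an SAE forward pass and training loss in plain Python. It assumes one common formulation (a ReLU encoder, a linear decoder, and an L1 sparsity penalty); the names `sae_forward`, `sae_loss`, and `l1_coeff`, the dimensions, and the random weights are illustrative assumptions, not taken from any particular paper or library.

```python
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation x into feature activations f, then reconstruct it."""
    # Feature activations: f = ReLU(W_enc x + b_enc); ReLU keeps them non-negative.
    f = [max(0.0, p + b) for p, b in zip(matvec(W_enc, x), b_enc)]
    # Reconstruction: x_hat = b_dec + sum_j f[j] * (column j of W_dec).
    # Each decoder column is one "feature direction" in activation space.
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat


def sae_loss(x, f, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that pushes f towards sparsity."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(a) for a in f)
    return mse + l1_coeff * l1


# Toy usage with random (untrained) weights; all sizes are arbitrary assumptions.
rng = random.Random(0)
d_model, n_features = 4, 16  # the dictionary is wider than the activation
W_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)] for _ in range(n_features)]
W_dec = [[rng.gauss(0, 0.5) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [rng.gauss(0, 1) for _ in range(d_model)]  # stand-in for a model activation
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat)
```

In a trained SAE, the weights would be fitted by minimising this loss over many activations, so that only a few entries of `f` are non-zero for any given input; the weights here are untrained and only show the shapes and the decomposition.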

For more information on SAEs see:

Posts tagged Sparse Autoencoders (SAEs)