We recently published a workshop paper at ICML 2024 which we want to summarize here briefly. Code for the paper can be found on GitHub. This work was done as part of the PIBBSS affiliateship. Thx to @Jayjay for comments on the draft. TLDR: We analyze the probability distribution over...
Produced as part of the SERI ML Alignment Theory Scholars Program - Autumn 2023 Cohort and while being an affiliate at PIBBSS in 2024. A thank you to @Jayjay and @fela for helpful comments on this draft. This blog post is an overview of different ways to implement activation steering...
Produced as part of the SERI ML Alignment Theory Scholars Program - Autumn 2023 Cohort, under the mentorship of Dan Hendrycks There was recently some work on sparse autoencoding of hidden LLM representation. I checked if these sparse representations are better suited for classification. It seems like they are significantly...
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Dan Hendrycks We demonstrate different techniques for finding concept directions in hidden layers of an LLM. We propose evaluating them by using them for classification, activation steering and knowledge removal. You...