Automating LLM Auditing with Developmental Interpretability
Produced as part of the ML Alignment & Theory Scholars Program - Summer 2024 Cohort, supervised by Evan Hubinger.

TL;DR
* We proved that the SAE features related to the finetuning target will change more than other features in the semantic space.
* We developed an automated model audit...
Sep 4, 2024