Yeu-Tong Lau

107

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

by Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks, and Neel Nanda

Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda (*equal contribution). TL;DR: We are releasing SAE Bench, a suite of 8 diverse sparse autoencoder (SAE) evaluations including unsupervised metrics and...

Dec 11, 2024 • 82

Understanding Positional Features in Layer 0 SAEs

by bilalchughtai and Yeu-Tong Lau

This is an informal research note. It is the result of a few-day exploration into positional SAE features conducted as part of Neel Nanda’s training phase of the ML Alignment & Theory Scholars Program - Summer 2024 cohort. Thanks to Andy Arditi, Arthur Conmy and Stefan Heimersheim for helpful feedback....

Jul 29, 2024 • 43

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

by Can, Yeu-Tong Lau, James Dao, and Jett Janiak

Please check out our notebook for figure recreation and to examine your own model for clean-up behavior. Produced as part of ARENA 2.0 and the SERI ML Alignment Theory Scholars Program - Spring 2023 Cohort. Fig 5: Correlation between DLA of writer head and DLA of [clean-up heads output dependent...

Aug 30, 2023 • 17