Clement Neo

Twitter: _clementneo
Site: clementneo.com

Analysing Adversarial Attacks with Linear Probing

by Yoann Poupart, Imene Kerboua, Clement Neo, and Jason Hoelscher-Obermaier

This work was produced as part of the Apart Fellowship. @Yoann Poupart and @Imene Kerboua led the project; @Clement Neo and @Jason Hoelscher-Obermaier provided mentorship, feedback and project guidance. Here, we present a qualitative analysis of our preliminary results. We are at the very beginning of our experiments, so any...

Jun 17, 2024•15

Sparse autoencoders find composed features in small toy models

by Evan Anders, Clement Neo, Jason Hoelscher-Obermaier, and Jessica N. Howard

Summary
* Context: Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models. They achieve sparse, interpretable features by minimizing a loss function which includes an ℓ1 penalty on the SAE hidden layer activations.
* Problem & Hypothesis: While the SAE ℓ1 penalty achieves sparsity, it has...

Mar 14, 2024•34

We Found An Neuron in GPT-2

by Joseph Miller and Clement Neo

We started out with the question: How does GPT-2 know when to use the word "an" over "a"? The choice depends on whether the word that comes after starts with a vowel or not, but GPT-2 can only output one word at a time. We still don’t have a full...

Feb 11, 2023•143