I am an independent AI researcher focusing on interpretability. My day job is technical consulting for an AI-enabled financial software firm.

4mo

CharlesL

CharlesL — LessWrong

Abstract/Concrete Axis: Robustness Testing Semantic Feature Separation in Gemma 3 270M

In my previous post, I found that abstract and concrete prompts activate almost entirely separate feature clusters in Gemma 3 270M’s sparse autoencoder decomposition. The abstract side looked like reasoning operations (qualification, problems/issues); the concrete side looked like physical domain ontologies (composition, geology). The Jaccard similarity between their active feature...

Feb 23•7

Multiple Independent Semantic Axes in Gemma 3 270M

In my previous post, I found signs of an abstract-social vs concrete-physical axis in GPT-2’s residual stream. This post builds on that work using SAEs on Gemma 3 270M. Here I attempt to move from establishing that this axis exists to understanding what makes it up at the...

Feb 22•15

Does GPT-2 Represent Controversy? A Small Mech Interp Investigation

In thinking about how RLHF-trained models clearly hedge on politically controversial topics, I started wondering whether LLMs encode these politically controversial topics differently from topics that are broadly considered controversial but not political. And if they do, I wanted to understand whether the signal is already represented in the base...

Feb 19•6