Further optimisation Update Log for 28th August:

I am working from here: Minh's Copy of Gemma SAE self-explanation - Colab (google.com)

What worked: Multi-Feature Combination and Replacing Earlier Layers

Multi-feature combination works! I managed to combine feature 7656 ("France") and feature 7154 ("capital cities") from Neuronpedia's Gemma-1-2B ^[1] feature directory to elicit outputs for Paris, France. I'm just taking the sum of the vectors and dividing by the number of features to get the average, so this should work the same as before even if you have only 1 feature. Weighing should

...
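The averaging step described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `average_features` and the three-dimensional toy vectors are hypothetical, and real SAE feature vectors would have the model's hidden dimension.

```python
def average_features(vectors):
    """Return the element-wise mean of a list of equal-length vectors.

    Hypothetical helper: sums the feature vectors and divides by how many
    there are, so a single feature is returned unchanged.
    """
    if not vectors:
        raise ValueError("need at least one feature vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all feature vectors must have the same length")
    n = len(vectors)
    # zip(*vectors) groups the i-th component of every vector together.
    return [sum(components) / n for components in zip(*vectors)]


# Toy stand-ins for the two feature directions (values are made up).
france = [1.0, 0.0, 2.0]
capital_cities = [0.0, 1.0, 2.0]

combined = average_features([france, capital_cities])
single = average_features([france])  # same as the original vector
```

Because the sum is divided by the number of vectors, passing one feature gives back that feature, which matches the "same as before" behaviour mentioned in the comment.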

Self-explaining SAE features

Minh Nguyen 2y*70

Hello! I’ve made 2 quick improvements, mainly to the prompt and tokens.

TL;DR I changed the prompt to

prompt = '<start_of_turn>user\n "<unk>"?<end_of_turn>\n<start_of_turn>model\n "<unk>" "'
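As a rough sketch, the shortened prompt contains two `"<unk>"` placeholders, one in the user turn and one in the model turn. The snippet below only locates them as character offsets. The helper `placeholder_offsets` is hypothetical, and the idea that these placeholders mark where a feature vector gets substituted is an assumption here; actual token positions would depend on the tokenizer.

```python
PLACEHOLDER = "<unk>"

# The prompt exactly as given in the comment.
prompt = '<start_of_turn>user\n "<unk>"?<end_of_turn>\n<start_of_turn>model\n "<unk>" "'


def placeholder_offsets(text, placeholder=PLACEHOLDER):
    """Return the character offset of every occurrence of the placeholder.

    Hypothetical helper for illustration; it works on characters,
    not on tokenizer output.
    """
    offsets = []
    start = text.find(placeholder)
    while start != -1:
        offsets.append(start)
        start = text.find(placeholder, start + len(placeholder))
    return offsets


offsets = placeholder_offsets(prompt)
print(len(offsets))  # number of placeholder slots in the prompt
```

The prompt ends with an open quotation mark, so the model's continuation starts inside the quote.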

Solutions to improve Self-explanation:

Shorter Prompt

I noticed that the scales were being affected by prior words in the prompt/context itself. I tried out feature 4088 and replaced some words. For example, replacing "word" with "concept" and "number" resulted in slightly different explanations at the higher scale. Intuitively, I s...

Self-Other Overlap: A Neglected Approach to AI Alignment

Minh Nguyen 2y40

I was thinking about the practical implication of this. As others have mentioned, models in production pretty much all use the prompt "you are an AI assistant". From a model training perspective, it makes sense to build with this assumption in mind.

However, it occurs to me that I have never explicitly referred to any of my AI assistants as an AI assistant. Instead, I treat them more as an inner monologue, and I suspect many other users do this as well. If the AI makes an error, I essentially correct it the way I would correct my own inner monologue...

Nonlinear’s Evidence: Debunking False and Misleading Claims

Minh Nguyen 3y4713

[crossposted from EA Forum, to emphasise an important point. hope that's OK! will delete if it isn't]

How do we prevent the methodology of exclusively seeking and publishing negative information, without fact checking, from becoming an acceptable norm?

Re: Checking that claims are true

Adding on as a former Nonlinear intern who was aware of a “falling out” between Alice and Nonlinear for almost a year now:

To my knowledge, Nonlinear was given very few/practically no opportunities to respond to the many claims made in “Sharing Information About Nonlinear” be

...