LESSWRONG

Benjamin Wright

Comments

Agentic Misalignment: How LLMs Could be Insider Threats
Benjamin Wright, 2mo

I realized I had forgotten to add it, so I added it later. I appreciate the note, though!

SAE reconstruction errors are (empirically) pathological
Benjamin Wright, 1y

One explanation for pathological errors is feature suppression/feature shrinkage (link). I'd be interested to see if errors are still pathological even if you use the methodology I proposed for finetuning to fix shrinkage. Your method of fixing the norm of the input is close but not quite the same.

Addressing Feature Suppression in SAEs
Benjamin Wright, 2y

The original perplexity of the LLM was ~38 on the open web text slice I used. Thanks for the compliments!

Posts

Agentic Misalignment: How LLMs Could be Insider Threats (72 karma, Ω, 2mo, 12 comments)
Alignment Faking in Large Language Models (489 karma, Ω, 7mo, 75 comments)
Evaluating Sparse Autoencoders with Board Game Models (38 karma, 1y, 1 comment)
Addressing Feature Suppression in SAEs (87 karma, Ω, 2y, 4 comments)