LESSWRONG
Benjamin Wright

Posts

Comments

Agentic Misalignment: How LLMs Could be Insider Threats

Benjamin Wright · 4mo · 1 · 0

I realized I had forgotten to add it, so I added it later. I appreciate the note, though!

SAE reconstruction errors are (empirically) pathological

Benjamin Wright · 2y · 2 · 1

One explanation for pathological errors is feature suppression/feature shrinkage (link). I'd be interested to see if errors are still pathological even if you use the methodology I proposed for finetuning to fix shrinkage. Your method of fixing the norm of the input is close but not quite the same.

Addressing Feature Suppression in SAEs

Benjamin Wright · 2y · 3 · 0

The original perplexity of the LLM was ~38 on the OpenWebText slice I used. Thanks for the compliments!

78 · Agentic Misalignment: How LLMs Could be Insider Threats · 4mo · 13

490 · Alignment Faking in Large Language Models · 9mo · 75

38 · Evaluating Sparse Autoencoders with Board Game Models · 1y · 1

87 · Addressing Feature Suppression in SAEs · 2y · 4