Tags: Research Agendas, Sparse Autoencoders (SAEs), AI
[Linkpost] Interpretable Analysis of Features Found in Open-source Sparse Autoencoder (partial replication)

by Fernando Avalos
9th Sep 2024

This is a linkpost for https://forum.effectivealtruism.org/posts/NdcXvDkvAw5bLW4rp/interpretable-analysis-of-features-found-in-open-source?utm_campaign=post_share&utm_source=link
Joseph Bloom (1y):

Good work! I'm sure you learned a lot while doing this, and I'm a big fan of people publishing artifacts produced during upskilling. ARENA just updated its SAE content, so that might also be a good next step for you!

This was an upskilling project I worked on over the past few months. Although I don't think it is anything fancy or highly relevant to SAE research, I find it valuable because I learned a lot and refined my understanding of how mech interp fits into the bigger picture of AI safety.

In the medium term, I hope to take on more challenging and impactful projects.

P.S.: Brutally honest feedback is completely welcome :p