x

LESSWRONG

LW

Artt

Artt

Message

I like AI. Currently trying to understand machine brains.

7

1

1

1y

Artt

I like AI. Currently trying to understand machine brains.

Artt — LessWrong

Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model

Executive Summary Problem Statement of the Project Models such as sparse autoencoders (SAEs) and k-sparse autoencoders have been used as an effective medium to extract meaningful interpretable features from neural networks, including Large Language Models (LLMs). However, the effectiveness of these models with respect to new small reasoning models remains...