Do k-Sparse Autoencoders Reveal Thinking Patterns? Interpretable Features in a Small Reasoning Model
Executive Summary Problem Statement of the Project Models such as sparse autoencoders (SAEs) and k-sparse autoencoders have been used as an effective medium to extract meaningful interpretable features from neural networks, including Large Language Models (LLMs). However, the effectiveness of these models with respect to new small reasoning models remains...
Jun 158