Abstract

This paper presents $\text{ProLU}$, an alternative to $\text{ReLU}$ as the activation function in sparse autoencoders that produces a Pareto improvement over both standard sparse autoencoders trained with an L1 penalty and sparse autoencoders trained with a Sqrt(L1) penalty.

$$\text{ProLU}(m_{i}, b_{i}) = \begin{cases} m_{i} & \text{if } m_{i} + b_{i} > 0 \text{ and } m_{i} > 0 \\ 0 & \text{otherwise} \end{cases}$$

$$\text{SAE}_{\text{ProLU}}(x) = \text{ProLU}((x - b_{\text{dec}}) W_{\text{enc}}, b_{\text{enc}}) W_{\text{dec}} + b_{\text{dec}}$$
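The two definitions above can be sketched in a few lines of NumPy. This is a minimal illustration of the forward pass only, not the authors' implementation: the function names `prolu` and `sae_prolu`, the array shapes, and the toy numbers are assumptions made for the example. The comparison with $\text{ReLU}(m + b)$ shows the key difference: the bias decides whether a feature fires, but it does not shift the value that passes through.

```python
import numpy as np

def prolu(m, b):
    """ProLU forward pass: pass m through where m + b > 0 and m > 0, else 0."""
    gate = (m + b > 0) & (m > 0)
    return np.where(gate, m, 0.0)

def sae_prolu(x, W_enc, b_enc, W_dec, b_dec):
    """SAE_ProLU(x) = ProLU((x - b_dec) W_enc, b_enc) W_dec + b_dec."""
    m = (x - b_dec) @ W_enc          # pre-activations, shape (batch, n_features)
    return prolu(m, b_enc) @ W_dec + b_dec

# Toy pre-activations and biases (made-up numbers for illustration).
m = np.array([2.0, 0.5, -1.0, 3.0])
b = np.array([-1.0, -1.0, 2.0, 0.0])

out = prolu(m, b)                    # [2., 0., 0., 3.]: gated, but not shifted by b
shifted = np.maximum(m + b, 0.0)     # [1., 0., 1., 3.]: ReLU(m + b) for comparison

# Shape check on a random SAE (hypothetical sizes: d_model=3, n_features=5).
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3))
recon = sae_prolu(
    x,
    W_enc=rng.normal(size=(3, 5)),
    b_enc=np.zeros(5),
    W_dec=rng.normal(size=(5, 3)),
    b_dec=np.zeros(3),
)
```

Note that the third feature has $m_i + b_i > 0$ but $m_i < 0$, so $\text{ReLU}(m + b)$ lets it fire while $\text{ProLU}$ does not.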

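The gradient of $\text{ProLU}$ with respect to $b$ is zero, so we define two candidate classes of differentiable $\text{ProLU}$, each with a substitute derivative (written $\partial^{*}$) in place of the true one: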

$\text{ProLU}_{\text{ReLU}}$
- $\frac{\partial^{*} \text{ProLU}_{\text{ReLU}}(m_{i}, b_{i})}{\partial b_{i}} = \frac{\partial \text{ProLU}_{\text{ReLU}}(m_{i}, b_{i})}{\partial m_{i}} = \begin{cases} 1 & \text{if } m_{i} + b_{i} > 0 \text{ and } m_{i} > 0 \\ 0 & \text{otherwise} \end{cases}$
$\text{ProLU}_{\text{STE}}$
- $\frac{\partial^{*} \text{ProLU}_{\text{STE}}(m_{i}, b_{i})}{\partial m_{i}} = \begin{cases} 1 + m_{i} & \text{if } m_{i} > 0 \text{ and } m_{i} + b_{i} > 0 \\ 0 & \text{otherwise} \end{cases}$
- $\frac{\partial^{*} \text{ProLU}_{\text{STE}}(m_{i}, b_{i})}{\partial b_{i}} = \begin{cases} m_{i} & \text{if } m_{i} > 0 \text{ and } m_{i} + b_{i} > 0 \\ 0 & \text{otherwise} \end{cases}$
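To make the two candidate derivative rules concrete, the sketch below evaluates them elementwise in NumPy. It is an illustration under assumptions, not the post's implementation: the function names are hypothetical, and in an actual backward pass these local derivatives would each be multiplied by the upstream gradient.

```python
import numpy as np

def prolu_gate(m, b):
    """Boolean mask of the active region: m + b > 0 and m > 0."""
    return (m + b > 0) & (m > 0)

def prolu_relu_grads(m, b):
    """ProLU_ReLU: derivative is 1 in the active region for both m and b."""
    g = np.where(prolu_gate(m, b), 1.0, 0.0)
    return g, g.copy()               # (d/dm, substitute d/db)

def prolu_ste_grads(m, b):
    """ProLU_STE: 1 + m with respect to m, and m with respect to b, when active."""
    gate = prolu_gate(m, b)
    d_m = np.where(gate, 1.0 + m, 0.0)
    d_b = np.where(gate, m, 0.0)
    return d_m, d_b

# Toy inputs (made-up numbers): only the first feature is active.
m = np.array([2.0, 0.5, -1.0])
b = np.array([-1.0, -1.0, 2.0])

relu_dm, relu_db = prolu_relu_grads(m, b)   # [1., 0., 0.] and [1., 0., 0.]
ste_dm, ste_db = prolu_ste_grads(m, b)      # [3., 0., 0.] and [2., 0., 0.]
```

Under both rules the bias receives a nonzero gradient only where the feature is active. The $\text{STE}$ variant additionally scales that gradient by the pre-activation $m_i$.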

PyTorch Implementation

Introduction

SAE Context and Terminology

Learnable parameters of a sparse autoencoder:

$W_{\text{enc}}$: encoder weights
$W_{\text{dec}}$: decoder weights
$b_{\text{enc}}$: encoder bias
$b_{\text{dec}}$: decoder bias

The output of an SAE is given by

$$\text{SAE}(x) = \text{ReLU}((x - b_{\text{dec}}) W_{\text{enc}} + b_{\text{enc}}) W_{\text{dec}} + b_{\text{dec}}$$

Training

Notation: Encoder/Decoder
Let
$\text{encode}(x) = \text{ReLU}((x - b_{\text{dec}}) W_{\text{enc}} + b_{\text{enc}})$
$\text{decode}(a) = a W_{\text{dec}} + b_{\text{dec}}$
so that the full computation done by an SAE can be expressed as
$\text{SAE}(x) = \text{decode}(\text{encode}(x))$
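The encoder/decoder notation maps directly onto code. The following NumPy sketch assumes a row-vector convention (inputs of shape `(batch, d_model)`, `W_enc` of shape `(d_model, n_features)`, `W_dec` of shape `(n_features, d_model)`), which matches the order of multiplication in the formulas. The function signatures and toy parameters are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def encode(x, W_enc, b_enc, b_dec):
    """encode(x) = ReLU((x - b_dec) W_enc + b_enc)."""
    return relu((x - b_dec) @ W_enc + b_enc)

def decode(a, W_dec, b_dec):
    """decode(a) = a W_dec + b_dec."""
    return a @ W_dec + b_dec

def sae(x, W_enc, b_enc, W_dec, b_dec):
    """SAE(x) = decode(encode(x))."""
    return decode(encode(x, W_enc, b_enc, b_dec), W_dec, b_dec)

# Toy parameters (made up): identity weights, zero encoder bias.
W_enc = np.eye(2)
W_dec = np.eye(2)
b_enc = np.zeros(2)
b_dec = np.array([1.0, 1.0])

x = np.array([[3.0, 0.0]])
a = encode(x, W_enc, b_enc, b_dec)       # [[2., 0.]]: second feature is clipped
x_hat = sae(x, W_enc, b_enc, W_dec, b_dec)  # [[3., 1.]]
```

In this toy case the second coordinate is reconstructed as 1 rather than 0, because the negative pre-activation is zeroed by $\text{ReLU}$ before the decoder bias is added back.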

An SAE is trained with gradient descent on

$$L_{\text{train}} = \| x - \text{SAE}(x) \|_{2}^{2} + \lambda P(\text{encode}(x))$$

where $\lambda$ is the sparsity penalty coefficient (often called the "L1 coefficient") and $P$ is the sparsity penalty function, which encourages sparse activations.

$P$ is commonly the L1 norm $\| a \|_{1}$, but recently $P(a) = \| a \|_{\frac{1}{2}}^{\frac{1}{2}}$ has been shown to produce a Pareto improvement on the L0 and CE metrics. We use this as a further baseline when assessing our models, in addition to the standard $\text{ReLU}$-based SAE with an L1 penalty.
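As a concrete reading of the training objective, the sketch below computes $L_{\text{train}}$ with each of the two penalties, using $\| a \|_{1} = \sum_i |a_i|$ and $\| a \|_{\frac{1}{2}}^{\frac{1}{2}} = \sum_i |a_i|^{\frac{1}{2}}$. Averaging the per-example loss over the batch is an assumption of this example, as are the function names and the toy values.

```python
import numpy as np

def l1_penalty(a):
    """P(a) = ||a||_1, summed over features for each example."""
    return np.abs(a).sum(axis=-1)

def sqrt_l1_penalty(a):
    """P(a) = ||a||_{1/2}^{1/2} = sum_i |a_i|^(1/2)."""
    return np.sqrt(np.abs(a)).sum(axis=-1)

def train_loss(x, x_hat, a, lam, penalty):
    """Squared reconstruction error plus lam * P(a), averaged over the batch.

    The batch mean is an assumption; the text defines the per-example loss only.
    """
    recon = ((x - x_hat) ** 2).sum(axis=-1)
    return float((recon + lam * penalty(a)).mean())

# Toy batch of one example (made-up numbers).
x = np.array([[3.0, 0.0]])       # input
x_hat = np.array([[3.0, 1.0]])   # reconstruction, squared error = 1
a = np.array([[4.0, 0.0]])       # feature activations encode(x)

loss_l1 = train_loss(x, x_hat, a, lam=0.5, penalty=l1_penalty)         # 1 + 0.5 * 4 = 3.0
loss_sqrt = train_loss(x, x_hat, a, lam=0.5, penalty=sqrt_l1_penalty)  # 1 + 0.5 * 2 = 2.0
```

The square-root penalty grows more slowly than L1 for large activations, so at the same $\lambda$ it penalizes a single strongly active feature less.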

Motivation: Inconsistent Scaling in Sparse Autoencoders

Due to the affine translation, sparse autoencoder features with n...

wuthejeff
