Minor interpretability exploration #1: Grokking of modular addition, subtraction, and multiplication for different activation functions
Epistemic status: small exploration without prior predictions; results are low-stakes and likely correct. Edited to incorporate feedback from Gurkenglas, which unearthed previously unseen data. Thank you!

## Introduction

As a personal exercise in building research taste and experience in the domain of AI safety, and specifically interpretability, I have done four minor projects, all building on previously written code. They were carried out without pre-formulated hypotheses or expectations, merely to check whether any low-hanging fruit held something interesting. In the end, they have not yielded major insights, but I hope they will be of some use and interest to people working in these domains.

This is the first project: extending Neel Nanda's modular addition network, built for studying grokking, to subtraction and multiplication, as well as to all six activation functions of TransformerLens (ReLU, three variants of GELU, SiLU, and SoLU plus LayerNorm). The modular addition grokking results were redone using the original code while changing the operation (subtraction, multiplication) and the activation function.

## TL;DR results

* Subtraction does implement the same Fourier-transform-based "clock" algorithm as addition. Multiplication, however, does not.
* GELU greatly hastens grokking; SiLU does so only slightly.
* LayerNorm messes up the activations and delays grokking, but otherwise leaves the algorithm unchanged.

## Methods

The basis for these findings is Neel Nanda's grokking notebook. All modifications are straightforward (a minimal sketch of the kind of change involved is given at the end of this section). All resulting notebooks, extracted graphs, and Word files with clean tabular comparisons can be found here.

## Results

### Operations

General observations for the three operations[1]:

* Developments in the relevant aspects of the network (attention heads, periodic neuron activations, concentration of singular values, Fourier frequencies) occurred as detailed below, and were accelerated during the clean-up phase.
* Spikes seen in the training graphs were reflected in these aspects, but did not have long-lasting effects.
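To make the Methods concrete, here is a minimal sketch of the two modifications described above: swapping the modular operation used to label the dataset, and selecting the activation function via TransformerLens's `HookedTransformerConfig`. This is not the notebook's actual code; the `make_dataset` helper and the hyperparameters are illustrative stand-ins for the original setup.

```python
# Minimal sketch (not the original notebook code) of changing the operation
# and the activation function in a grokking-style modular arithmetic setup.
import torch
from transformer_lens import HookedTransformer, HookedTransformerConfig

p = 113  # modulus; illustrative, chosen to match the usual modular addition setup

def make_dataset(op: str):
    """All pairs (a, b) with label (a OP b) mod p; context is 'a b ='."""
    a = torch.arange(p).repeat_interleave(p)
    b = torch.arange(p).repeat(p)
    equals = torch.full_like(a, p)  # '=' token gets index p
    if op == "add":
        labels = (a + b) % p
    elif op == "sub":
        labels = (a - b) % p
    elif op == "mul":
        labels = (a * b) % p
    else:
        raise ValueError(op)
    tokens = torch.stack([a, b, equals], dim=1)
    return tokens, labels

# One-layer transformer; hyperparameters here are illustrative.
# act_fn is where the six activation functions come in:
# "relu", "gelu", "gelu_new", "gelu_fast", "silu", "solu_ln".
cfg = HookedTransformerConfig(
    n_layers=1,
    n_heads=4,
    d_model=128,
    d_head=32,
    d_mlp=512,
    d_vocab=p + 1,   # p numbers plus the '=' token
    d_vocab_out=p,
    n_ctx=3,
    act_fn="gelu",            # swap this string per experiment
    normalization_type=None,  # set to "LN" to add LayerNorm (the SoLU runs use LN)
)
model = HookedTransformer(cfg)

tokens, labels = make_dataset("sub")
logits = model(tokens)[:, -1, :]  # predict at the '=' position
loss = torch.nn.functional.cross_entropy(logits, labels)
```

Swapping `act_fn` among `"relu"`, `"gelu"`, `"gelu_new"`, `"gelu_fast"`, `"silu"`, and `"solu_ln"`, and `make_dataset`'s `op` among `"add"`, `"sub"`, and `"mul"`, covers the grid of experiments compared here; the rest of the training loop is unchanged from the original notebook.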