Short Remark on the (subjective) mathematical 'naturalness' of the Nanda–Lieberum addition modulo 113 algorithm
These remarks are basically me just wanting to get my thoughts down after a Twitter exchange on this subject. I've not spent much time on this post and it's certainly plausible that I've gotten things wrong.

In the 'Key Takeaways' section of the Modular Addition part of the well-known post 'A Mechanistic Interpretability Analysis of Grokking', Nanda and Lieberum write:

> This algorithm operates via using trig identities and Discrete Fourier Transforms to map $x, y \to \cos(w(x+y)), \sin(w(x+y))$, and then extracting $x + y \pmod{p}$

And:

> The model is trained to map $x, y$ to $z \equiv x + y \pmod{113}$ (henceforth 113 is referred to as $p$)

But the casual reader should use caution! It is in fact the case that "Inputs $x, y$ are given as one-hot encoded vectors in $\mathbb{R}^p$". This point is of course emphasized more in the full notebook (it has to be; that's where the code is), and the arXiv paper that followed is also much clearer about it. However, when giving brief takeaways from the work, especially when it comes to discussing how 'natural' the learned algorithm is, I would go as far as saying that it is actually misleading to suggest that the network is literally given $x$ and $y$ as inputs. It is not trained to 'act' on the numbers $x$ and $y$ themselves.

When thinking seriously about why the network is doing the particular thing that it is doing at the mechanistic level, I would want to emphasize that one-hotting is already a significant transformation. You have moved away from having the number $x$ be represented by its own magnitude. You instead have a situation in which $x$ and $y$ now really live 'in the domain' (it's almost like a dual point of view: the number $x$ is not the size of the signal, but the position at which the input signal is non-zero). So, while I of course fully admit that I too am looking at it through my own subjective lens, one might say that (before the embedding happens) it is more mathematically natural to think that what the network is 'seeing' as input is something like the indicator functions of the points $x$ and $y$, rather than the numbers $x$ and $y$ themselves.
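To make this concrete, here is a minimal NumPy sketch (mine, not code from the original post, notebook, or paper) of the algorithm as described in the quoted takeaways: the inputs are one-hot vectors in $\mathbb{R}^p$, a discrete Fourier transform of those indicator vectors yields $\cos(wx), \sin(wx)$ for a given frequency $w$, and trig identities plus an argmax recover $x + y \pmod p$. The particular frequencies below are arbitrary placeholders; the trained network learns its own small set and realises these transforms only approximately, through its embedding and unembedding weights, rather than by literally calling an FFT.

```python
import numpy as np

p = 113
freqs = [14, 35, 41, 52]  # illustrative placeholder frequencies, not the model's learned ones

def one_hot(n: int, p: int = p) -> np.ndarray:
    """Represent n as the indicator vector delta_n in R^p (the 'dual' view: n is a position, not a magnitude)."""
    v = np.zeros(p)
    v[n] = 1.0
    return v

def add_mod_p(x: int, y: int) -> int:
    """Recover x + y (mod p) from one-hot inputs via DFT components and trig identities."""
    ks = np.arange(p)
    logits = np.zeros(p)
    for w in freqs:
        # The DFT of the indicator delta_x at frequency w is exactly
        # cos(2*pi*w*x/p) - i*sin(2*pi*w*x/p); read off its real and imaginary parts.
        fx = np.fft.fft(one_hot(x))[w]
        fy = np.fft.fft(one_hot(y))[w]
        cos_wx, sin_wx = fx.real, -fx.imag
        cos_wy, sin_wy = fy.real, -fy.imag
        # Trig identities combine the single-input terms into cos(w(x+y)) and sin(w(x+y)).
        cos_wxy = cos_wx * cos_wy - sin_wx * sin_wy
        sin_wxy = sin_wx * cos_wy + cos_wx * sin_wy
        # The logit for each candidate z is cos(w(x+y-z)), which peaks at z = x + y (mod p).
        logits += cos_wxy * np.cos(2 * np.pi * w * ks / p) + sin_wxy * np.sin(2 * np.pi * w * ks / p)
    return int(np.argmax(logits))

assert add_mod_p(57, 98) == (57 + 98) % p
```

The point about naturalness is visible in the very first step of the sketch: once $x$ is the indicator vector $\delta_x$, reading off a Fourier component, i.e. a phase $e^{-2\pi i w x / p}$, is arguably an obvious thing to do, in a way that it would not be if the input were the scalar $x$ itself.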
I for one would find it helpful if you included a link to at least one place where Eliezer has made this claim, just so we can be sure we're on the same page.
Roughly speaking, what I have in mind is that there are at least two possible claims. One is that 'we can't get AI to do our alignment homework' because, by the time we have a very powerful AI that can solve the alignment homework, it is already too dangerous to use the fact that it can solve the homework as a safety plan. The other is the claim that there is some sort of 'intrinsic' reason why an AI built by humans could never solve the alignment homework.