Hard-Coding Neural Computation

(Moderation note: added to the Alignment Forum from LessWrong.)

I'm confused by your notation for feed-forward layers.

What justifies re-using the same labels ("apple" etc.) for

(1) the coordinates of $x$, and
(2) the coordinates of $x \cdot A$, i.e. the basis in which the nonlinearity operates?

If we want to express what the individual components of basis (2) mean in terms of the original space, we can either talk about which vectors/semes are mapped to them by $A$, or which vectors/semes they get mapped to by $B$.

But your labels don't correspond to either of these interpretations. Instead, it looks like you are following rules of the form "the 4th component of every basis is called 'yum'," which leads you to label a coordinate "yum" even though it's neither mapped from "yum" by $A$, nor mapped to "yum" by $B$.

This notation also seems to require the basis (2) to have the same number of elements as (1), which generally will not be the case. In transformers, (2) is typically larger by a factor of 4. The logic of your example, meanwhile, can be expressed using a smaller nonlinearity basis of 3 elements:

$\text{neuron}_1 = \text{ReLU}(\text{cherry} + \text{durian} - 1)$

$\text{neuron}_2 = \text{ReLU}(\text{apple} + \text{banana} - 1)$

$\text{neuron}_3 = \text{ReLU}(\text{apple} + \text{banana})$

$\text{yum} = \text{neuron}_3 - \text{neuron}_2$

$\text{yuck} = -1 \cdot \text{neuron}_1$

with some arbitrary choices about which multiplicative constants to absorb into $A$ and $a$ vs. which to absorb into $B$.
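To make the three-neuron version concrete, here is a minimal sketch in plain Python that writes the equations above as explicit $A$, $a$, and $B$ and evaluates them on 0/1 inputs. The coordinate orders (apple, banana, cherry, durian for the input; yum, yuck for the output), the zero output bias, and the helper names are assumptions made for illustration, not taken from the post.

```python
def relu(v):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, x) for x in v]

def affine(x, W, b):
    """Row vector x times matrix W, plus bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

# Assumed input order: apple, banana, cherry, durian.
# Columns are neuron_1, neuron_2, neuron_3.
A = [
    [0, 1, 1],  # apple  feeds neuron_2 and neuron_3
    [0, 1, 1],  # banana feeds neuron_2 and neuron_3
    [1, 0, 0],  # cherry feeds neuron_1
    [1, 0, 0],  # durian feeds neuron_1
]
a = [-1, -1, 0]  # the "- 1" offsets of neuron_1 and neuron_2

# Assumed output order: yum, yuck.
B = [
    [0, -1],  # yuck = -1 * neuron_1
    [-1, 0],  # yum gets -neuron_2
    [1, 0],   # yum gets +neuron_3
]
b_out = [0, 0]  # no output bias (assumption)

def feed_forward(x):
    """Compute ReLU(x A + a) B for a 4-element input x."""
    hidden = relu(affine(x, A, a))
    return affine(hidden, B, b_out)

for x in ([1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 1]):
    print(x, feed_forward(x))
```

On 0/1 inputs, the yum coordinate behaves like "apple OR banana", and the yuck coordinate is nonzero (equal to $-1$, following the sign in the equation above) only when cherry and durian are both present.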

[anonymous] 4y

Thanks for your comments/questions, they're very insightful.

In general, there are as many encoding spaces in a Transformer as there are computational nodes, and a traditional Transformer will have little incentive to use the same semantics for any two of the spaces. (There's a little bit of an incentive because of the residual connections, which will (I think?) kind of tie together the semantics of the various hidden-size-sized embedding spaces.)

In particular, the middle layer of the dense-relu-dense feedforward layer is usually chosen to be significantly larger (4x) than the hidden size, and so it's not even theoretically possible to represent it using the same basis. I've found that it sometimes makes sense to use anonymous seme names like x1, x2, x3, etc. in the feed-forward layer for this reason. In my experience so far I've found the feed-forward layers to be most useful for conjunctions and disjunctions, and the number of possible conjunctions and disjunctions of even two neurons is quadratic in the number of neurons, let alone 3 or 4. So it seems to me that this might give a tiny hint as to why people have found that the intermediate embedding space of the feed-forward layer needs to be so large.
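The counting argument can be made concrete with a small sketch. It assumes each coordinate of the hidden-size space is a single seme and counts one AND rule and one OR rule per unordered pair of coordinates; the hidden size of 512 and the function name are hypothetical values chosen only for illustration.

```python
from math import comb

def pairwise_rule_count(hidden_size):
    """Count two-input AND rules plus two-input OR rules over unordered pairs."""
    pairs = comb(hidden_size, 2)  # unordered pairs of coordinates
    return 2 * pairs              # one AND and one OR per pair

hidden_size = 512                      # hypothetical hidden size
intermediate_size = 4 * hidden_size    # the usual 4x feed-forward width

print("intermediate neurons:", intermediate_size)
print("possible pairwise rules:", pairwise_rule_count(hidden_size))
```

Even at 4x the hidden size, the intermediate layer has far fewer neurons than there are candidate pairwise rules, so under these assumptions any one layer can only implement a small subset of them.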

Of course, there is a potentially huge gap between what I am clever enough to think of as a use for them and what good old gradient descent is clever enough to think of. We can only easily lower-bound the potential uses of them; upper-bounding the capabilities of a component will prove much more challenging.

Jsevillamol 4y

I don't fully understand how the embeddings are done.

Can you spell out one of the examples?

It would be helpful for me to see how the semes map to the actual matrix.

[anonymous] 4y

Added an example sentence and its embeddings. Will add more examples overall. Thanks for commenting!

LawrenceC 4y

Re: how this interacts with Alignment Research:

I think that these ideas could prove useful in alignment research - if we understand how a language model works in excruciating detail, it seems drastically more likely that we will be able to reason about and predict various misunderstandings rooted in the ambiguity of language.

Another use is for sanity checking existing interpretability techniques. For example, to check if particular neurons identified as curve detectors via interpretability techniques were indeed curve detectors, Chris Olah spent a few hours replacing the curve-detecting neurons with handwritten curve detector neurons. (He found that the interpretability techniques were able to give qualitatively similar results for both the original neurons and the handwritten neurons. More impressively, he also found that replacing the curve detecting neurons with his handwritten neurons was able to recover ~60% of the drop in accuracy compared to removing the original neurons entirely [reported in footnote 9].)

pando 4y

Very nice post. It is certainly useful to do this exercise of manually encoding language rules into the weights of a transformer in order to better understand the machinery involved.

"The ultimate ambition of this work would be to go toe-to-toe with a comparably-sized Transformer model trained in the traditional way on a modern-sized data set. This might require several people-years of focused effort though."

There is a long history of attempting to parse natural language with hand-designed rules and heuristics. The general consensus now is that hand engineering is insufficient, and some learning from data is necessary. To me it seems that this direction inherits the problems of these old-fashioned language systems, since you are codifying your own hand-designed heuristics and rules into the network weights.

Do you see a way to introduce learning from data without sacrificing the interpretability that your approach provides?

[anonymous] 4y

There are a number of ways to combine this approach with learning, but I haven't had time to try any of them yet. Some ideas I have thought of:

- Use hard-coded weights, plus some random noise, to initialize the weights of a transformer that you then train in the traditional fashion
  - Doesn't really help with interpretability or alignment, but might(???) help with performance
- Write out all the weight and bias parameters as combinations of semes and outer products of semes, then learn seme embeddings by gradient descent
- Semantic seme embeddings could be initialized from something like WordNet relationships, or learned with word2vec, to automate those guys
- You could do smallish amounts of gradient descent to suggest new rules to add, but then add them by hand
  - Still would be very slow
- Perhaps it is possible to start with a strong learned transformer, gradually identify human-legible rules that it is using, and replace those specific parts with hard-coding
  - Could prove very difficult!!!
  - It seems almost certain to me that hard-coding weights would at least help us build the muscles needed to recognize what is going on, to the extent that we are able to
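As a rough illustration of the first two ideas, the sketch below builds a weight matrix as a sum of outer products of seme embeddings and then optionally perturbs it with Gaussian noise before training. The one-hot seme vectors, the rule format `(input_seme, output_seme, strength)`, and the function names are all hypothetical; no gradient descent is performed here. In a learned variant, the seme embeddings would be the trainable parameters.

```python
import random

def outer(u, v):
    """Outer product of two vectors, as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def weights_from_rules(rules, semes):
    """Sum strength * outer(input_seme, output_seme) over all rules."""
    dim = len(next(iter(semes.values())))
    W = [[0.0] * dim for _ in range(dim)]
    for src, dst, strength in rules:
        term = outer(semes[src], semes[dst])
        for i in range(dim):
            for j in range(dim):
                W[i][j] += strength * term[i][j]
    return W

def add_noise(W, scale, rng):
    """Return a copy of W with independent Gaussian noise added to each entry."""
    return [[w + rng.gauss(0.0, scale) for w in row] for row in W]

# Hypothetical one-hot seme embeddings; learned embeddings would replace these.
semes = {
    "apple":  [1.0, 0.0, 0.0],
    "banana": [0.0, 1.0, 0.0],
    "yum":    [0.0, 0.0, 1.0],
}
# Hypothetical rules: apple maps to yum, banana maps to yum.
rules = [("apple", "yum", 1.0), ("banana", "yum", 1.0)]

W = weights_from_rules(rules, semes)
W_init = add_noise(W, scale=0.01, rng=random.Random(0))  # noisy starting point
print(W)
```

With a row-vector convention ($x \cdot W$), multiplying the "apple" embedding by `W` yields the "yum" embedding, so each rule stays readable as one outer-product term even after the matrix is assembled.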