Transformer models are incredibly powerful for natural language tasks (and they are starting to find uses in many other fields of machine learning). Unfortunately, it is nigh-impossible to interpret what goes on inside them. OR IS IT???

In this post, I am trying to gauge potential community interest in a strand of research that I have been doing in my spare time off and on for the past year and a half (roughly).

I have found that I can, with a fair amount of effort, hard-code the weights of a transformer model so that it performs very crude versions of linguistic tasks. So far I have achieved English-to-French translation (on a toy corpus of about 150 sentences), text classification (deciding whether a sentence is grammatical, on a toy corpus of a couple hundred sentences), and sentiment analysis (again on a limited corpus). These results are obviously not impressive compared to the state of the machine learning field, but I am pretty sure that they can all be drastically scaled up with the investment of some time and energy. Unfortunately, I have a fairly demanding day job, and haven't found the time and energy yet.

All of this is done by inspection (no gradient descent!). The process is a lot like programming, although it is more difficult than programming, at least right now for me. I am fairly certain that better tools and better notation can be developed to make the process easier. It is also almost certainly possible to combine hard-coding with gradient descent approaches to be able to scale these methods up in a slightly less labor-intensive way.
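
To give a flavor of what setting weights by inspection can look like, here is a minimal sketch in NumPy. It is an illustration only and not the code from this project: the function name `previous_token_head`, the one-hot token and position embeddings, and the `sharpness` scale are all assumptions made for the example. It builds a single attention head whose query, key, and value matrices are written down directly so that each position copies the token before it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def previous_token_head(tokens, vocab_size, sharpness=50.0):
    """Hypothetical hand-set attention head.

    Position i outputs the token found at position i - 1.
    Position 0 has no predecessor, so it attends to itself.
    """
    n = len(tokens)
    tok = np.eye(vocab_size)[tokens]         # (n, vocab) one-hot tokens
    pos = np.eye(n)                          # (n, n) one-hot positions
    x = np.concatenate([tok, pos], axis=1)   # (n, vocab + n) residual stream
    d = vocab_size + n

    # Query: read the position block unchanged.
    W_q = np.zeros((d, n))
    W_q[vocab_size:, :] = np.eye(n)

    # Key: shift position j to j + 1, so key j matches query j + 1.
    shift = np.eye(n, k=1)
    shift[0, 0] = 1.0                        # lets position 0 attend to itself
    W_k = np.zeros((d, n))
    W_k[vocab_size:, :] = shift

    # Value: read the token block unchanged.
    W_v = np.zeros((d, vocab_size))
    W_v[:vocab_size, :] = np.eye(vocab_size)

    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(sharpness * (q @ k.T))    # large scale gives near-hard attention
    return (attn @ v).argmax(axis=1)

print(previous_token_head([3, 1, 4, 1, 5], vocab_size=6))
```

No weight here was learned; every entry is either 0 or 1 and was chosen by reasoning about which position should look at which. Real tasks like the ones above would need many such heads composed across layers, which is where the effort goes.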

I think that these ideas could prove useful in alignment research - if we understand how a language model works in excruciating detail, it seems drastically more likely that we will be able to reason about and predict various misunderstandings rooted in the ambiguity of language. Given that language is (arguably) a fully general means of interacting with an artificial intelligence, it seems plausible to me that this work is on the critical path to alignment.

Unfortunately, I have a fairly demanding day job, and haven't found the time and energy yet.

Have you considered applying for a grant from the Long-Term Future Fund to buy out your day job so you can spend all your time working on this? As a fund manager for the LTFF, this is definitely the sort of thing we're often happy to fund, and I think that the research you're describing sounds pretty exciting.

Yeah, I also thought it was pretty interesting. I only thought about it for a few minutes, but it seems interesting enough to give it a shot, IMO.

I have definitely not thought about that before. Feedback from people I have shown this work to has ranged from (literally) "you are a madman" to "that looks cool" (and then never engaging with it).

Any update on this (applying for funding)?

Sounds intriguing! You have a GitHub link? :)

It's very, very rough, but:

I'll make sure to run it when I get to a laptop. But if you ever get a chance to set the article up to run on heroku or something, that'll increase how accessible this is by an order of magnitude.

I (not the OP) put it up here for now:

I'll take it down if MadHatter asks me or once there is an official site.

Thanks for throwing it up there!!!

Any relation to RASP?

Thank you for sharing this. I know it's probably not why you posted it, but reading this paper was extremely helpful to me in understanding what Transformers are actually doing in the first place.

(Unrelated.) Have you considered putting an RSS feed of your Twitter account in its bio? This way people can follow you without you needing to approve them, and since it’s read-only, your burden won’t increase.

(Not to mention that RSS is a much better medium than Twitter in the first place.)

I don't think Twitter allows such RSS feeds.

It's a pretty similar style of work, but I haven't communicated at all with those authors and I started my work before they published.

I think this is very impressive and that we could learn a lot from this kind of effort.

Can you tell us more about your "training" process and the capabilities you can achieve, with examples?

Very cool!

A note of caution: when I handcoded weights of a neural network (in my case, to solve a gridworld RL problem), I was able to encode the optimal policy -- but the algorithm that was later learned by gradient descent was very different. Partly this was because I only required myself to produce the right action, so I often had the (equivalent of) Q-values for different actions be very very close to each other, whereas the neural network ended up having Q-values that were further apart from each other, which was incentivized by the loss function even though it didn't make a difference to the optimal policy.

So to the extent you're trying to learn what a neural net trained by gradient descent would do, I'd recommend that you spend some time looking at the trained neural net to see whether it is using a similar sort of algorithm as the one you're implementing.
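
A toy illustration of that point, with made-up numbers rather than anything from the gridworld described above: two Q-tables can pick exactly the same greedy action in every state while looking very different to a regression loss. The tables `q_hand` and `q_target` and the helper names are hypothetical.

```python
import numpy as np

# Hypothetical hand-coded Q-values: the right action wins by a hair.
q_hand = np.array([[1.00, 1.01],
                   [0.50, 0.49],
                   [0.20, 0.21]])

# Hypothetical targets a trained network might be regressed toward:
# same best actions, but with much larger gaps between actions.
q_target = np.array([[0.0, 1.0],
                     [0.9, 0.1],
                     [0.3, 0.8]])

def greedy_policy(q):
    # Best action per state (rows are states, columns are actions).
    return q.argmax(axis=1)

def mse(q, target):
    # Mean squared error, standing in for a value-regression loss.
    return float(np.mean((q - target) ** 2))

same_policy = np.array_equal(greedy_policy(q_hand), greedy_policy(q_target))
print("same greedy policy:", same_policy)
print("loss of hand-coded table:", mse(q_hand, q_target))
```

The policies agree, yet the hand-coded table has nonzero loss, so gradient descent would keep moving it. That is one way a hand-built solution and a trained one can end up with different internals despite identical behavior.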

Agree with this.

Building up toy transformer models by hand that work ... that's super interesting, both for interpretability and also education.

I put up the site here for now. MadHatter, let me know if you want me to take it down.