Teaser: Hard-coding Transformer Models

by MadHatter
12th Dec 2021
1 min read
19 comments, sorted by top scoring
evhub · 4y

> Unfortunately, I have a fairly demanding day job, and haven't found the time and energy yet.

Have you considered applying for a grant from the Long-Term Future Fund to buy out your day job so you can spend all your time working on this? As a fund manager for the LTFF, this is definitely the sort of thing we're often happy to fund, and I think that the research you're describing sounds pretty exciting.

habryka · 4y

Yeah, I also thought it was pretty interesting. I only thought about it for a few minutes, but it seems interesting enough to give it a shot, IMO.

MadHatter · 4y

I have definitely not thought about that before. Feedback from people I have shown this work to has ranged from (literally) "you are a madman" to "that looks cool" (and then never engaging with it).

Vivek Hebbar · 3y

Any update on this (applying for funding)?

mtaran · 4y

Sounds intriguing! You have a GitHub link? :)

MadHatter · 4y

It's very, very rough, but: https://github.com/epurdy/hand

mtaran · 4y

I'll make sure to run it when I get to a laptop. But if you ever get a chance to set the distill.pub article up to run on Heroku or something, that would make it an order of magnitude more accessible.

Igor Ostrovsky · 4y

I (not the OP) put it up here for now: https://igor0.github.io/hand/distill/

I'll take it down if MadHatter asks me or once there is an official site.

MadHatter · 4y

Thanks for throwing it up there!!!

gwern · 4y

Any relation to RASP?

gwern · 4y

https://transformer-circuits.pub/2021/framework/index.html

Kenoubi · 4y

Thank you for sharing this. I know it's probably not why you posted it, but reading this paper was extremely helpful to me in understanding what Transformers are actually doing in the first place.

Rudi C · 4y

(Unrelated.) Have you considered putting an RSS feed for your Twitter account in your bio? That way people can follow you without you needing to approve them, and since it's read-only, your burden won't increase.

(Not to mention that RSS is a much better medium than Twitter in the first place.)

gwern · 4y

I don't think Twitter allows such RSS feeds.

MadHatter · 4y

It's a pretty similar style of work, but I haven't communicated at all with those authors and I started my work before they published.

Jsevillamol · 4y

I think this is very impressive and that we could learn a lot from this kind of effort.

Can you tell us more about your "training" process and the capabilities you can achieve, with examples?

Rohin Shah · 4y

Very cool!

A note of caution: when I hand-coded the weights of a neural network (in my case, to solve a gridworld RL problem), I was able to encode the optimal policy -- but the algorithm that was later learned by gradient descent was very different. Partly this was because I only required myself to produce the right action, so the (equivalent of) Q-values I wrote down for different actions were often very close to each other. The trained network, by contrast, ended up with Q-values that were much further apart, because the loss function incentivized that separation even though it made no difference to the optimal policy.

So to the extent you're trying to learn what a neural net trained by gradient descent would do, I'd recommend that you spend some time looking at the trained neural net to see whether it is using a similar sort of algorithm as the one you're implementing.
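
A tiny made-up numerical illustration of the point above (hypothetical Q-values, not from the gridworld setup itself):

```python
import numpy as np

# Hypothetical numbers: a hand-coded policy only needs the argmax to be right,
# so its Q-values can be nearly tied; a network trained against a loss is
# typically pushed to separate them, even though both pick the same action.
hand_coded_q = np.array([1.00, 0.99, 0.98])    # razor-thin margins
trained_q    = np.array([3.10, -0.40, -1.20])  # same argmax, much larger gaps

assert hand_coded_q.argmax() == trained_q.argmax() == 0
```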

MadHatter · 4y

Agree with this.

Igor Ostrovsky · 4y

Building up toy transformer models by hand that actually work ... that's super interesting, both for interpretability and for education.

I put up the site [here](https://igor0.github.io/hand/distill/) for now. MadHatter, let me know if you want me to take it down.


Transformer models are incredibly powerful for natural language tasks (and they are starting to find uses in many other fields of machine learning). Unfortunately, it is nigh-impossible to interpret what goes on inside them. OR IS IT???

In this post, I am trying to gauge potential community interest in a strand of research that I have been doing in my spare time off and on for the past year and a half (roughly).

I have found that I can, with a fair amount of effort, hard-code the weights of a transformer model in order to perform some very crude versions of linguistic tasks. So far I have achieved English-to-French translation (on a toy corpus of about 150 sentences), text classification (is a sentence grammatical or not? on a toy corpus of a couple hundred sentences), and sentiment analysis (again on a limited corpus). These results are obviously not impressive compared to the state of the machine learning field, but I am pretty sure that they can all be drastically scaled up with the investment of some time and energy. Unfortunately, I have a fairly demanding day job, and haven't found the time and energy yet.

All of this is done by inspection (no gradient descent!). The process is a lot like programming, although it is more difficult than programming, at least right now for me. I am fairly certain that better tools and better notation can be developed to make the process easier. It is also almost certainly possible to combine hard-coding with gradient descent to scale these methods up in a slightly less labor-intensive way.
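
To give a rough flavor of what hand-set weights can look like, here is a minimal sketch (illustrative only, not the construction behind the results above; the vocabulary, one-hot embeddings, and word lists are made up): a single attention layer whose zeroed query/key matrices make it average the token vectors uniformly, plus a hand-picked readout direction that scores sentiment.

```python
import numpy as np

# Minimal sketch: an attention layer whose weights are chosen by hand rather
# than trained. Zero query/key matrices give uniform attention (a plain average
# of token vectors); a hand-picked readout direction then scores sentiment.
vocab = ["good", "great", "bad", "awful", "movie", "the"]
d = len(vocab)
E = np.eye(d)                       # one-hot "embeddings", chosen by hand

W_q = np.zeros((d, d))              # zero scores -> softmax gives uniform attention
W_k = np.zeros((d, d))
W_v = np.eye(d)                     # values pass token identities through unchanged

w_out = np.array([+1.0, +1.0, -1.0, -1.0, 0.0, 0.0])  # +1 positive words, -1 negative

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sentiment(tokens):
    X = E[[vocab.index(t) for t in tokens]]           # (seq, d)
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)     # all zeros here
    attended = softmax(scores) @ (X @ W_v)            # uniform average of token vectors
    return float(attended.mean(axis=0) @ w_out)       # > 0 means "positive"

print(sentiment(["the", "movie", "great"]))   # positive
print(sentiment(["the", "movie", "awful"]))   # negative
```

Real tasks obviously need far more structure than this toy, but the workflow is the same: decide what each matrix should do, then write those numbers down directly.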

I think that these ideas could prove useful in alignment research - if we understand how a language model works in excruciating detail, it seems drastically more likely that we will be able to reason about and predict various misunderstandings rooted in the ambiguity of language. Given that language is (arguably) a fully general means of interacting with an artificial intelligence, it seems plausible to me that this work is on the critical path to alignment.

Mentioned in: Hard-Coding Neural Computation