Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.

I think that getting really good at this might be a weird but good plan for learning some skills that might turn out to be really valuable for alignment research. (And it wouldn’t shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if you got a notable advantage from having a big head start on it.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don’t love math or computer science.

I'd probably fund people to spend at least a few months on this; email me if you want to talk about it.

Some main activities I’d do if I was a black-box LM investigator are:

  • Spend a lot of time with them. Write a bunch of text with their aid. Try to learn what kinds of quirks they have; what kinds of things they know and don’t know.
  • Run specific experiments. Do they correctly complete “Grass is green, egg yolk is”? Do they know the population of San Francisco? (For many of these experiments, it seems worth running them on LMs of a bunch of different sizes.)
  • Try to figure out where they’re using proxies to make predictions and where they seem to be making sense of text more broadly, by taking some text and seeing how changing it changes their predictions.
  • Try to find adversarial examples where they see some relatively natural-seeming text and then do something really weird.

The skills you’d gain seem like they have a few different applications to alignment:

  • As a language model interpretability researcher, I’d find it very helpful to talk to someone who had spent a long time playing with the models I work with (currently I’m mostly working with gpt2-small, which is a 12 layer model). In particular, it’s much easier to investigate the model when you have good ideas for behaviors you want to explain, and know some things about the model’s algorithm for doing such behaviors; I can imagine an enthusiastic black-box investigator being quite helpful for our research.
  • I think that alignment research (as well as the broader world) might have some use for prompt engineers–it’s kind of fiddly and we at Redwood would have loved to consult with an outsider when we were doing some of it in our adversarial training project (see section 4.3.2 here).
  • I’m excited for people working on “scary demos”, where we try to set up situations where our models exhibit tendencies which are the baby versions of the scary power-seeking/deceptive behaviors that we’re worried will lead to AI catastrophe. See for example Beth Barnes’s proposed research directions here. A lot of this work requires knowing AIs well and doing prompt engineering.
  • It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive (though I definitely wouldn’t want to rely on it to solve the whole problem). When we’re building actual AIs we probably need the red team to be assisted by various tools (AI-powered as well as non-AI-powered, eg interpretability tools). We’re working on building simple versions of these (e.g. our red-teaming software and our interpretability tools). But it seems pretty reasonable for some people to try to just do this work manually in parallel with us automating parts of it. And we’d find it easier to build tools if we had particular users in mind who knew exactly what tools they wanted.
    • Even transformatively intelligent and powerful AIs seem to me to be plausibly partially understandable by humans. Eg it seems plausible to me that these systems will communicate with each other at least partially in natural language, or that they'll have persistent memory stores or cognitive workspaces which can be inspected.

My guess is that this work would go slightly better if you had access to someone who was willing to write you some simple code tools for interacting with models, rather than just using the OpenAI playground. If you start doing work like this and want tools, get in touch with me and maybe someone from Redwood Research will build you some of them.

111

Ω 37

15 comments, sorted by Click to highlight new comments since: Today at 7:53 PM
New Comment

Maybe useful way to get feedback on how good you are at doing this would be trying to make predictions based on your experience with language models:

  • without looking at the results or running systematic experiments on the dataset, predict which tasks on BIG-bench will be doable
  • make bets of the form "we'll reach X% performance on task A before we reach Y% performance on task B"
  • predict for some prompt what percentage of samples will satisfy some property, then take a large number of samples and then rate them

Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.

There's a (set of) experiments I'd be keen to see done in this vein, which I think might produce interesting insights.


How well do capabilites generalise across languages?

Stuart Armstrong recently posted this example of GPT-3 failing to generalise to reversed text. The most natural interpretation, at least in my mind, and which is pointed at by a couple of comments, is that the problem here is that there's just been very little training data which contains things like:

m'I gnitirw ffuts ni esrever tub siht si yllautca ytterp erar
(I'm writing stuff in reverse but this is actually pretty rare)

Especially with translations underneath. In particular, there hasn't been enough data to relate the ~token 'ffuts' to 'stuff'. These are just two different things which have been encoded somewhere, one of them tends to appear near english words, the other tends to appear near other rare things like 'ekil'.

It seems that how much of a capability hit language models take when trying to work in 'backwards writing', as well as other edited systems like pig latin or simple cyphers, and how much fine tuning they would take to restore the same capabilities as English, may provide a few interesting insights into model 'theory of mind'.

The central thing I'm interested in here is trying to identify differences between cases where a LLM is both modelling a situation in some sort of abstract way, and then translating from that situation to language output, and cases where the model is 'only' doing language output.

Models which have some sort of world model, and use that world model to output language, should find it much easier to capably generalise from one situation to another. They also seem meaningfully closer to agentic reasoners. There's also an interesting question about how different models look when fine tuned here. If it is the case that there's a ~separate 'world model' and 'language model', training the model to perform well in a different language should, if done well, only change the second. This may even shed light on which parts of the model are doing what, though again I just don't know if we have any ways of representing the internals which would allow us to catch this yet.

Ideas for specific experiments:

  • How much does grammar matter?
    • Pig latin, reversed english, and simple substitution cyphers all use identical grammar to standard english. This means that generalising to these tasks can be done just by substituting each english word for a different one, without any concept mapping taking place.
    • Capably generalising to French, however, is substantially harder to do without a concept map. You can't just substitute word-by-word.
       
  • How well preserved is fine tuning across 'languages'?
    • Pick some task, fine tune the model until it does well on the task, then fine tune the model to use a different language (using a method that has worked in earlier experiments). How badly is performance on the task (now presented in the new language) hit?
    • What happens if you change the ordering - you do the same fine-tuning, but only after you've trained the model to speak the new language. How badly is performance hit? How much does it matter whether you do the task-specific fine tuning in english or the 'new' language?
    • In all of these cases (and variations of them), how big a difference does it make if the new language is a 1-1 substitution, compared to a full language with different grammar.
       

If anyone does test some of these, I'd be interested to hear the results!

I had a lot of fun testing PaLM prompts on GPT3, and I'd love to further explore the "psychology" of AIs from a black-box angle. I currently have a job, and not a lot of free time, but I'd be willing to spend a few hours here and there on this, if that would be beneficial.

Here's a fun paper I wrote along these lines. I took an old whitepaper of McCarthy from 1976 where he introduces the idea of natural language understanding and proposes a set of questions about a news article that such a system should be able to answer. I asked the questions to GPT 3 and looked at what it got right and wrong and guessed at why. 
What Can a Generative Language Model Answer About a Passage? 

It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive

Maybe I missed something here, but how is this supposed to help with deception? I thought the whole reason deceptive alignment is really hard to solve is that you can't tell if the AI's being deceptive via its behavior.

Curated. I've heard a few offhand comments about this type of research work in the past few months, but wasn't quite sure how seriously to take it. 

I like this writeup for spelling out details of why it blackbox investigators might be useful, what skills it requires and how you might go about it. 

I expect this sort of skillset to have major limitations, but I think I agree with the stated claims that it's a useful skillset to have in conjunction with other techniques.

One general piece of advice is that it seems like it might be useful to have an interface that shows you multiple samples for each prompt (the OpenAI playground just gives you one sample, if you use temperature > 0 then this sample could either be lucky or unlucky)

Yeah I wrote an interface like this for personal use, maybe I should release it publicly.

Please do!

PaLM’s recently got some attention for explaining jokes. Which is interesting. About a year ago I quizzed GPT-3 [through an intermediary, Phil Mohun] about a Jerry Seinfeld routine, the first one he ever performed on TV. It was about the South Bronx.

I thought GPT-3’s response was fascinating – I’ve put a lightly edited version of the conversation on my blog. As the conversation went on, GPT-3 came up with some interesting answers.

I think quizzing a languag model about jokes could be very revealing. Why? Because the force it into ‘strange ways’ that can reveal what’s going on under the hood.

Do you suspect that black-box knowledge will be transferable between different models, or that the findings will be idiosyncratic to each system? 

I suspect that some knowledge transfers. For example, I suspect that increasingly large LMs learn features of language roughly in order of their importance for predicting English, and so I'd expect that LMs that get similar language modeling losses usually know roughly the same features of English. (You could just run two LMs on the same text and see their logprobs on the correct next token for every token, and then make a scatter plot; presumably there will be a bunch of correlation, but you might notice patterns in the things that one LM did much better than the other.)

And the methodology for playing with LMs probably transfers.

But I generally have no idea here, and it seems really useful to know more about this.

There is already a sizable amount of research done in this direction, the so called bertology. I believe the methodology that is being developed is useful, but knowing about specific models is probably superfluous. In few months / years we will have new models and anything that you know will not generalize.

I’ve been thinking about ordinary arithmetic computation in this context. We know that models have trouble with it. The issue interests me because arithmetic calculation has well-understood procedures. We know how people do it. And by that I mean that there’s nothing important about the process that’s hidden, unlike our use of ordinary language. The mechanisms of both sentence-level grammar and discourse structure are unconscious.

It's pretty clear to me that arithmetic requires episodic structure, to introduce a term from old symbolic-systems AI and computational linguistics. That’s obvious from the fact that we don’t teach it to children until grammar school, which is roughly episodic level cognition kicks in (see the paper Hays and I did, Principles and Development of Natural Intelligence).

Arithmetic is not like ordinary language is for humans, which comes to us naturally without much specific training. Fluency in arithmetic requires years of drill. First the child must learn to count; that gives numbers meaning. Once that is well in hand, children are drilled in arithmetic tables for the elementary operations, and so forth. Once this is going smoothly one learns the procedures multiple-digit addition and subtraction, multiple-operand addition and then multiplication and division. Multiple digit division is the most difficult because it requires guessing, which is then checked by actual calculation (multiplication followed by subtraction). 

Why do such intellectually simple procedures require so much drill? Because each individual step must be correct. You can’t just go straight ahead. One mistake anywhere, and the whole calculation is thrown off.

Whatever a model is doing in inference mode, I doubt it’s doing anything like what humans do. Where would it pick that up on the web? 

I don’t know what’s going on inside model in inference mode, but I’d guess it’s something like this: The inference engine ‘consumes’ a prompt, which moves it to some position in its state space.

  1. It has a number of possibilities for moving to a new position. 
  2. It picks one and emits a word. 
  3. Is it finished? If so, stop. If not, return to 1.

And so it moves through its state space in a single unbroken traversal. You can’t do arithmetic that way. You have to keep track of partial results and stop to retrieve them so you can integrate them into the ongoing flow of the calculation. 

So now the question is: What other kinds of tasks require the computational style that arithmetic does? Perhaps generating a long strong of coherent prose does.

Let me think about that for awhile.