
Aryan Bhatt

MTS @ Redwood Research working on AI Control

Comments
An overview of control measures
Aryan Bhatt1mo10

Online training infrastructure: For diffuse (non-concentrated) issues, training the AIs to behave better online (on the actual tasks we are using the AIs for rather than just on separate training data) could be very useful.

Quick note: under distribution shift, online approaches (including training, but also other kinds of validation or iteration) could be very important even for concentrated threats. Because of this, I'd maybe consider online training infrastructure to be higher on the list of software infrastructure interventions (not very confident, though).

What’s up with LLMs representing XORs of arbitrary features?
Aryan Bhatt2y3-2

If it's easy enough to run, it seems worth re-training the probes exactly the same way, except sampling both your train and test sets with replacement from the full dataset. This should avoid that issue. It has the downside of allowing some train/test leakage, but that seems pretty fine, especially if you only sample like 500 examples for train and 100 for test (from each of cities and neg_cities). 

I'd strongly hope that after doing this, none of your probes would be significantly below 50%.
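For concreteness, here's a minimal sketch of the re-training suggestion above (my own hypothetical code, assuming the activations and labels for cities and neg_cities are available as arrays; not the original probing setup):

```python
# Hypothetical sketch (not the original code): re-train the probes exactly as
# before, but draw both the train and test sets with replacement, sampling
# from each of cities and neg_cities.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_with_replacement(acts, labels, n):
    """Draw n (activation, label) pairs uniformly with replacement."""
    idx = rng.integers(0, len(acts), size=n)
    return acts[idx], labels[idx]

def retrain_probe(cities_acts, cities_labels, neg_acts, neg_labels,
                  n_train=500, n_test=100):
    train, test = [], []
    for acts, labels in [(cities_acts, cities_labels), (neg_acts, neg_labels)]:
        train.append(sample_with_replacement(acts, labels, n_train))
        test.append(sample_with_replacement(acts, labels, n_test))
    X_train = np.concatenate([a for a, _ in train])
    y_train = np.concatenate([l for _, l in train])
    X_test = np.concatenate([a for a, _ in test])
    y_test = np.concatenate([l for _, l in test])

    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # With this setup I'd strongly expect the score to stay well above 0.5.
    return probe.score(X_test, y_test)
```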

Mech Interp Puzzle 2: Word2Vec Style Embeddings
Aryan Bhatt2y20

My rough guess for Question 2.1:

The model likely cares about the number of characters because that lets it better encode things rendered in fixed-width fonts that have some sort of spatial structure, such as ASCII art, plaintext tables, 2-D games like sudoku, tic-tac-toe, and chess, and maybe miscellaneous other things like some poetry, comments/strings in code[1], or the game of life.

A priori, storing this feature categorically is probably a far more efficient encoding/representation than linearly (especially since length likely has at most 10 common values). However, the most useful/common operation one might want to do with this feature is “compute the length of the concatenation of two tokens,” and so we also want our encodings to facilitate efficient addition. For a categorical embedding, we’d need to store an addition lookup table, which requires something like quadratic space[2], whereas a linear embedding would allow sums to be computed basically trivially[3].

This argument isn’t enough on its own, since we also need to move the stored length info between tokens in order to add them, which is severely bottlenecked by the low rank of attention heads. If this were “more of a bottleneck” than the type of MLP computation that’s necessary to implement an addition table, then it’d make sense to store length categorically instead. 

I don’t know if I could’ve predicted which bottleneck would’ve won out before seeing this post. I suspect I would’ve guessed the MLP computation (implying a linear representation), but I wouldn’t have been very confident. In fact, I wouldn’t be surprised if, despite length being linearly represented, there are still a few longer outlier tokens (that are particularly common in the context of length-relevant tasks) whose lengths are stored categorically and then added using something like a smaller lookup table.
 

  1. ^

    The code itself would, of course, be the biggest example, but I'm not sure how relevant non-whitespace token length is for most formatting.

  2. ^

    In particular, you’d need a lookup table of at least size N×M, where N is the longest single string you’d want to track the length of, and M is the length of the longest token. I expect N to be on the order of hundreds, and M to be at most about 10 (since we can ignore a few outlier tokens)

  3. ^

    Linear operations are pretty free, and addition of linearly represented features is as linear as it gets.

(tentatively) Found 600+ Monosemantic Features in a Small LM Using Sparse Autoencoders
Aryan Bhatt2y50

[Note: One idea is to label the dataset w/ the feature vector e.g. saying this text is a latex $ and this one isn't. Then learn several k-sparse probes & show the range of k values that get you whatever percentage of separation]

 

You've already thanked Wes, but just wanted to note that his paper may be of interest here. 
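In case it's useful, here's a rough sketch of how I'd read the quoted k-sparse-probe idea (my own hypothetical implementation, not taken from the post or from Wes's paper):

```python
# Hypothetical sketch: label each example (e.g. "is inside a LaTeX $...$" vs
# not), then for a range of k, fit a probe restricted to the k most informative
# dimensions and see how much separation each k buys you.
import numpy as np
from sklearn.linear_model import LogisticRegression

def k_sparse_probe_scores(acts, labels, ks=(1, 2, 4, 8, 16, 32)):
    """acts: (n_examples, n_dims) activations; labels: binary feature labels."""
    # Rank dimensions by absolute difference in class means (one simple heuristic).
    mean_diff = np.abs(acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0))
    ranked = np.argsort(-mean_diff)
    scores = {}
    for k in ks:
        dims = ranked[:k]
        probe = LogisticRegression(max_iter=1000).fit(acts[:, dims], labels)
        scores[k] = probe.score(acts[:, dims], labels)
    return scores  # e.g. read off the smallest k that reaches 90% separation
```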

How much do you believe your results?
Aryan Bhatt2y50

If you're interested, "When is Goodhart catastrophic?" characterizes some conditions on the noise and signal distributions (or rather, their tails) that are sufficient to guarantee being screwed (or in business) in the limit of many studies.

The downside is that because it doesn't make assumptions about the distributions (other than independence), it sadly can't say much about the non-limiting cases.
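A quick toy simulation of the limiting claim (my own sketch, not from the paper): among n candidates, pick the one with the best proxy value (signal plus independent noise) and look at its true signal as n grows, for light-tailed vs heavy-tailed noise.

```python
# Toy simulation (mine, not from the paper). With noise tails no heavier than
# the signal's, the selected true signal keeps growing with n; with much
# heavier-tailed noise, selection is dominated by noise and the true signal stalls.
import numpy as np

rng = np.random.default_rng(0)

def mean_selected_signal(n, noise_sampler, trials=2000):
    signal = rng.normal(size=(trials, n))
    noise = noise_sampler(size=(trials, n))
    best = np.argmax(signal + noise, axis=1)        # optimize the proxy
    return signal[np.arange(trials), best].mean()   # what you actually get

for n in (10, 100, 1000):
    light = mean_selected_signal(n, rng.normal)                                  # Gaussian noise
    heavy = mean_selected_signal(n, lambda size: rng.standard_t(2, size=size))   # heavy-tailed noise
    print(f"n={n}: light-tailed {light:.2f}, heavy-tailed {heavy:.2f}")
```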

An exploration of GPT-2's embedding weights
Aryan Bhatt2y20

Very small typo: when you define LayerNorm, you say $y_i = \sum_j (\mathrm{Id}_{ij} - \mathbf{1}_i \mathbf{1}_j)\, x_j$ when I think you mean $y_i = \sum_j \left(\mathrm{Id}_{ij} - \frac{\mathbf{1}_i \mathbf{1}_j}{768}\right) x_j$? Please feel free to ignore if this is wrong!!!
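For what it's worth, here's a quick numerical check (mine, not from the post) of why the 1/768 factor belongs there: subtracting the mean of a 768-dimensional vector is the same as multiplying by $\mathrm{Id} - \frac{\mathbf{1}\mathbf{1}^\top}{768}$.

```python
# Quick numerical check: mean-subtraction over a 768-dim residual stream vector
# is multiplication by Id - (1/768) * ones, so the 1/768 factor is needed.
import numpy as np

d = 768
x = np.random.default_rng(0).normal(size=d)
ones = np.ones((d, d))

centering = np.eye(d) - ones / d     # Id_ij - (1_i 1_j) / 768
assert np.allclose(centering @ x, x - x.mean())

no_factor = np.eye(d) - ones         # Id_ij - 1_i 1_j (the typo'd version)
assert not np.allclose(no_factor @ x, x - x.mean())
```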

We Found An Neuron in GPT-2
Aryan Bhatt2y10

I do agree that looking at WO alone seems a bit misguided (unless we're normalizing by looking at cosine similarity instead of dot product). However, the extent to which this is true is a bit unclear. Here are a few considerations:

  • At first blush, the thing you said is exactly right; scaling Win up and scaling WO down will leave the implemented function unchanged.
  • However, this'll affect the L2 regularization penalty. All else equal, we'd expect to see ∥Win∥=∥WO∥, since that minimizes the regularization penalty.
  • However, this is all complicated by the fact that you can also alternatively scale the LayerNorm's gain parameter, which (I think) isn't regularized.
  • Lastly, I believe GPT-2 uses GELU, not ReLU? This is significant: GELU isn't positively homogeneous, so you can no longer scale Win up and WO down without changing the implemented function (see the sketch below).
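Here's a tiny check of that last point (a hedged toy sketch, not GPT-2's actual MLP): with ReLU, scaling W_in by c and W_O by 1/c leaves the output unchanged; with GELU it doesn't.

```python
# Toy check (my own sketch): ReLU is positively homogeneous, so rescaling W_in
# and W_O against each other preserves the MLP's output; GELU is not, so the
# same rescaling changes the output.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_mlp, c = 16, 64, 3.0
x = torch.randn(d_model)
W_in = torch.randn(d_mlp, d_model)
W_O = torch.randn(d_model, d_mlp)

def mlp(act_fn, w_in, w_o):
    return w_o @ act_fn(w_in @ x)

print(torch.allclose(mlp(F.relu, W_in, W_O), mlp(F.relu, c * W_in, W_O / c), atol=1e-5))  # True
print(torch.allclose(mlp(F.gelu, W_in, W_O), mlp(F.gelu, c * W_in, W_O / c), atol=1e-5))  # False
```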
Residual stream norms grow exponentially over the forward pass
Aryan Bhatt2y70

We are surprised by the decrease in Residual Stream norm in some of the EleutherAI models.
...
According to the model card, the Pythia models have "exactly the same" architectures as their OPT counterparts


I could very well be completely wrong here, but I suspect this could primarily be an artifact of different unembeddings. 

It seemed to me from the model card that although the Pythia models have "exactly the same" architecture, they only have the same number of non-embedding parameters. The Pythia models all have more total parameters than their counterparts and therefore more embedding parameters, implying that they're using a different embedding/unembedding scheme. In particular, the EleutherAI models use the GPT-NeoX-20B tokenizer instead of the GPT-2 tokenizer (they also use rotary embeddings, which I don't expect to matter as much).

In addition, all the decreases in Residual Stream norm occur in the last 2 layers, which is exactly where I would've expected to see artifacts of the embedding/unembedding process[1]. I'm not familiar enough with the differences in the tokenizers to have predicted the decreasing Residual Stream norm ex ante, but it seems kinda likely ex post that whatever's causing this large systematic difference in EleutherAI models' norms is due to them using a different tokenizer. 

  1. ^

    I also would've expected to see these artifacts in the first layer, which we don't really see, so take this with a grain of salt, I guess. I do still think this is pretty characteristic of "SGD trying its best to deal with unembedding shenanigans by doing weird things in the last layer or two, leaving the rest mostly untouched," but this might just be me pattern-matching to a bad internal narrative/trope I've developed.

TurnTrout's shortform feed
Aryan Bhatt2yΩ230

Hmmm, I suspect that when most people say things like "the reward function should be a human-aligned objective," they're intending something more like "the reward function is one for which any reasonable learning process, given enough time/data, would converge to an agent that ends up with human-aligned objectives," or perhaps the far weaker claim that "the reward function is one for which there exists a reasonable learning process that, given enough time/data, will converge to an agent that ends up with human-aligned objectives."

Posts

Ctrl-Z: Controlling AI Agents via Resampling · 124 · Ω · 3mo · 0