Explaining SolidGoldMagikarp by looking at it from random directions

Robert_AIZI

This is a linkpost for https://aizi.substack.com/p/explaining-solidgoldmagikarp-by-looking

Summary

I conducted an experiment that provides evidence that many of the weird tokens demonstrating the SolidGoldMagikarp phenomenon are interior points in the token embedding cloud. This would be sufficient to make them “unspeakable”.

My experiment was equivalent to making GPT predict the next token from a randomized internal state, and in 3 million attempts there were 510/50257 tokens that it failed to output, but those 510 included 85/133 of the “weird tokens”, including “ SolidGoldMagikarp” itself. When I repeated this experiment on a “control group” of 50257 random embeddings, all 50257 were predicted at least 16 times, so failing to output a token is astronomically unlikely to be a fluke. This experiment provides some evidence that the weird tokens are embedded in interior points, but my ability to confirm that has been bottlenecked by my coding skill (help would be appreciated).

I believe this provides one step towards understanding the SolidGoldMagikarp phenomenon. Some of the “weird tokens” are not covered by this explanation, and it remains unclear why these token embeddings were learned in the first place.

Introduction

Like many others, I’m fascinated by the SolidGoldMagikarp phenomenon identified by Rumbelow and Watkins. In short, the GPT family has certain tokens, including " SolidGoldMagikarp", that produce weird behavior. One such behavior is being "unspeakable", where "GPT models seem largely incapable of repeating these anomalous tokens, and instead respond in a number of strange ways".

ChatGPT does not seem to understand the token “ SolidGoldMagikarp” and instead confuses it with the token “distribute”.

I was struck by the author's comments that the mysterious tokens "were among those closest to the centroid of the entire set of 50,257 tokens", since that suggests a simple explanation:

The Interior Conjecture: Unspeakable tokens are in the interior of the convex hull of the token embeddings.

The Interior Conjecture is sufficient for unspeakability

Let's first show that the Interior Conjecture would be sufficient to explain unspeakability:

Claim: If a token's embedding is on the interior of the convex hull of other tokens, then GPT cannot output it at temperature 0.

Proof: At temperature 0, GPT's output is the token with the largest logit in , where $h$ is the last row of the final state of the residual stream and $W_{e}$ is the token embedding matrix.
Equation 2 from the GPT paper, showing how they calculate probabilities of future tokens.
Proof (ct’d) Writing $t_{0}, . . ., t_{n}$ for tokens with embeddings $e_{0}, . . ., e_{n}$ , suppose $e_{0}$ is a convex linear combination of $e_{1}, . . ., e_{n}$ . That is, $e_{0} = a_{1} e_{1} + . . . + a_{n} e_{n}$ , with $a_{1} + . . . + a_{n} = 1$ . Writing $\cdot$ for the dot product, taking the dot product with h, and applying linearity, we have $h \cdot e_{0} = a_{1} (h \cdot e_{1}) + . . . + a_{n} (h \cdot e_{n})$ , which shows that the (real-number) logit of $t_{0}$ is a convex linear combination of logits of $t_{1}, . . ., t_{n}$ . But a convex linear combination of real numbers is bounded by its largest value, so $h \cdot e_{0} <= m a x_{i > 0} (h \cdot e_{i})$ , with equality if and only if all $h \cdot e_{i}$ with nonzero coefficients are equal. Since $t_{0}$ is strictly in the interior, this is not the case (unless $h = 0$ ), so $t_{0}$ cannot be the first choice token. QED

A visual description of the argument: $h$ is linearly projecting all embedding vectors onto a line, and the output token is the point furthest along this line. Since $e_{0}$ is an interior point before the projection, it will be an interior point after the projection, so it cannot be the first-choice token.

No matter what line you project onto, the interior point e_0 will remain interior. In this example, GPT would predict e_1 or e_3 as the next token, depending on whether it was choosing the rightmost or leftmost point on the line.

So is the Interior Conjecture true? I couldn't check because scipy.spatial.ConvexHull was unhappy with the size of the vectors, and I didn't see how I could implement an algorithm with good performance. If someone with a coding background wanted to help me implement an algorithm to check this, I’d be eternally grateful^[1].

A Different Approach: Random Directions

However, I did run a different test that sheds some light on the situation: I chose a direction vector $h$ at random and and found which token maximizes $h \cdot e_{i}$ , where $e_{i}$ ranges over the set of tokens. I used the token embeddings from GPT-J-6B (here), which were 50257 tokens in 4096-dimensional space^[2]. Direction vectors were generated by sampling the standard normal distribution independently for each dimension, which is equivalent to choosing points from a hypersphere uniformly at random^[3]. I drew 3 million samples. Code here.

By the argument in the previous section, any interior point will occur 0 times in this dataset. The converse is also “true in the limit”: if a token embedding is extremal, there is a hyperplane separating it from the other points, so the probability of it appearing at least once in the dataset approaches 1 as the number of samples approaches infinity.

As a “control group”, I also ran this experiment on randomized token embeddings, generated the same way as the direction vectors. I believe this is similar to how GPT’s weights were initiated, so this should be akin to what GPT would have predicted before any training, and will give us a baseline to see unusual patterns in the trained embeddings.

Experiment Results

I analyzed the resulting frequency dataset from a few perspectives.

Here’s a chart of the frequency of each token (sorted). Contrast the frequencies of GPT-J’s tokens (left column) with randomized token embeddings (right column).

Frequencies of the 50527 tokens among the 3000000 samples. The left column is GPT-J’s embeddings, the right column is randomized vector embeddings, and the top/bottom rows show the same data on linear/log scales.

We can see right away that GPT-J’s tokens are not similar to the random distribution, and in particular it covers a far wider range of frequencies (0-1760) than the random distribution (16-146).

Where are the weird tokens in this data? All across the distribution, but particularly concentrated at low frequencies (more on that later).

Frequency of GPT-J tokens, with weird tokens marked by an x.

The top 10 tokens by frequency don't hold any meaning to me. They were a combination of seemingly-random characters, accented capital vowels, and the words “gif” and “kids”^[4]. Frankly, it's bizarre to me that most of these were even tokens:

  index| frequency|token
-------------------------
  17433       1760   ĠãĤ
  27908        787   gif
    136        781     Ì
  47540        733   Ġ._
  37855        704  ĠâĢº
  46256        686    âģ
    146        667     Ö
  45235        667  kids
  28110        641  Ġtha
  25248        636   Ġ@@

Looking to the opposite end of the spectrum, 510/50257 tokens were never randomly generated (ie had a frequency of zero). What of the 133 candidate “weird tokens” described by Rumbelow and Watkins? Of those, 85 had zero frequency! To put it another way: $P (z e r o f r e q u e n c y) = 0.01$ , but $P (z e r o f r e q u e n c y | w e i r d t o k e n) \approx 64 %$ !

However, a majority of the zero frequency tokens are not in the list of 136 “weird tokens”. The other notable class to me was “tokens with low indices”: of the first 93 tokens, 72 (77%) had zero frequency, a rate even higher than the “weird tokens”! This part of the vocabulary consists of digits, letters (uppercase and lowercase), and the punctuation found on a standard American keyboard. To irresponsibly speculate about this, GPT was trained not to predict these characters because the tokenization algorithm tries to group characters together. For instance, if the next piece of text is “word”, this will be tokenized as “[word]” instead of “[w][o][r][d]”, and the embeddings learn to reflect that solitary characters are almost never the next token.

Here are all 510 tokens that appear with zero frequency:

[    0     1     3     6     7     8     9    10    11    12    13    14
    15    16    17    18    19    20    21    22    23    24    25    26
    27    28    29    30    31    32    33    34    35    36    37    40
    42    43    44    45    46    47    49    50    53    54    58    59
    60    62    64    65    66    67    69    70    72    73    74    76
    77    78    79    81    82    85    86    88    89    90    91    92
   124   125   153   173   174   177   178   179   180   181   182   183
   184   185   186   187   188   197   198   200   220   246   247   250
   257   259   262   263   264   269   270   271   272   274   276   278
   281   282   285   286   287   290   299   307   308   309   311   318
   319   326   327   329   338   339   340   345   347   351   352   355
   356   357   360   362   366   367   371   373   376   379   383   385
   389   390   393   399   402   406   410   412   416   418   422   423
   428   438   447   460   461   464   465   468   470   474   475   479
   481   484   494   502   508   509   510   511   513   515   517   526
   530   532   534   540   543   544   546   547   550   553   554   584
   588   602   607   611   616   617   618   621   625   628   632   642
   645   649   651   654   656   657   663   673   674   683   685   689
   705   706   714   718   720   737   750   760   765   766   767   770
   775   779   784   796   803   807   815   818   821   828   832   837
   843   851   860   878   910   921   940   960   981   986  1003  1026
  1101  1105  1114  1115  1135  1143  1168  1169  1174  1187  1194  1201
  1212  1222  1262  1314  1343  1391  1422  1462  1495  1511  1539  1550
  1566  1634  1635  1639  1776  1782  1946  2075  2091  2102  2215  2231
  2291  2399  2402  2534  2548  2608  2620  2751  2941  2996  3256  3336
  3467  3510  3695  3717  3901  4008  4060  4083  4357  4533  4690  4778
  5174  5332  5334  5357  5512  5808  5815  6438  7105  7782  8438  8735
  8755  8980  9364  9783 10298 11033 11273 11304 11537 11548 11689 11709
 11974 12340 12677 12781 13150 13171 13198 14574 14695 14827 15243 15272
 16142 16764 17629 17900 18125 18472 18945 19415 19510 20174 20554 22640
 22757 23090 23282 23513 23711 24847 24934 25193 25618 25658 25992 27006
 27013 27534 28666 29372 30072 30202 30208 30209 30210 30211 30212 30213
 30439 30684 30897 30898 30899 30905 30906 31032 31478 31573 31666 31765
 31886 31957 32047 32239 32382 32437 32917 33023 33434 33454 33813 33937
 34027 34206 34448 34504 34604 34832 35207 35307 35496 35579 35944 36130
 36173 36174 36481 36607 36726 36911 36926 36935 36938 36940 37389 37444
 37545 37574 37579 37631 37842 38016 38165 38250 38370 38653 39165 39177
 39253 39280 39374 39446 39693 39749 39752 39753 39755 39756 39757 39811
 39820 39821 39890 39906 40012 40219 40240 40241 40242 40516 41297 41380
 41383 41504 41538 41551 42066 42089 42090 42202 42424 42535 42586 42728
 42889 43010 43038 43065 43177 43361 43453 43569 43796 44555 45003 45228
 45392 45544 45545 46570 46600 47198 47571 47614 48193 48366 48396 48404
 49731 49781 49997 50009 50216 50256]

Conclusions

Because the logits used for prediction are determined by a linear function of the token embeddings, token embeddings that are within the interior of the embedding cloud can never be predicted by GPT at 0 temperature, regardless of the contents of the transformer layers.
The “Interior Conjecture” is my hypothesis that weird tokens such as “ SolidGoldMagikarp” are within the interior of the token embedding cloud.
I have conducted an experiment that chooses random directions to evaluate on, and which provides evidence that some weird tokens satisfy the Interior Conjecture, but shows that not all of them satisfy it. In particular, the “weird tokens” appear dramatically more often in the set of tokens with zero frequency.
My experiment shows that that Interior Conjecture is not true for all weird tokens (as some weird tokens had positive frequency), but is evidence that it might be true for many weird tokens. Several further experiments could prove it or provide additional evidence:
- Algorithmically compute which token embeddings are in the interior of the convex hull. Alternatively, for each token embedding compute the distance from it to the convex hull of the other points. (I would prefer the latter because it would be a richer dataset.)
  - Bottlenecked by: I couldn’t write an efficient implementation of the Gilbert–Johnson–Keerthi distance algorithm that operates in such a high-dimensional space.
- Run the same random direction experiment on other GPT embeddings or for more datapoints (this would be perfect for parallelization).
  - Bottlenecked by: I’m working from a laptop and don’t want to wait for jobs that last more than 8 hours.
- Analytically compute the exact probabilities that the random direction experiment approximates. To do this, for each token find the measure of the set of points in the 4096-dimensional hypersphere that results in that token being chosen.
  - Bottlenecked by: it seems hard to set up and evaluate those integrals.
If true, the Interior Conjecture would raise additional questions:
- Why does GPT learn to put some tokens on the interior of its point cloud?
  - My best guess is that this is the fate of all tokens that aren’t in the training set. Hiding a token in the center of the embedding cloud will guarantee that it is never predicted, which is a good behavior to learn if it is correct to never predict them!
- Which tokens does it learn to do this with? Why them?
- How do token embeddings evolve over the course of training? In particular, do the unspeakable tokens “move to the interior” or do speakable tokens “move to the extreme”?
- Why does the set of zero-frequency tokens overlap imperfectly with the set of weird tokens? Would a more careful study reveal a deeper overlap (e.g. set containment)?
- Can this be used for AI safety and if so how?
  - To be honest, I don’t see a use case at the moment.

[Edit: I've put my code and data up on Github. You can see the frequency data, the plots, and should be able to run my code to replicate the data generation and analysis. Please make use of this however you'd like.]

^{^}
I think the algorithm to use is the Gilbert–Johnson–Keerthi distance algorithm, or possibly a simplified variant (since we’re checking object-to-point distance instead of object-to-object). I’m worried that the NearestSimplex part of the code is infeasible since we need this to run in 4096-dimensional space. The original paper remarks that “since v [the set of vertices] is small, it is effective to take a combinatoric approach, where all [2^|v|-1] subsets are tested”, but in this case v could be as large as 4097…
^{^}
Technically there are 50400 tokens, but the additional 143 are extra tokens added just to make the number of tokens nicely divisible, and never came up in my evaluation.
^{^}
To sample from the hypersphere, you can generate vectors as described and then normalize them. Since we only care about the index of the maximum value, and this is unchanged by the normalizing step, I omitted that step in my code.
^{^}
Also, I believe in the GPT-J vocab list I’m working with, “Ġ” is used for spaces. This makes this token list marginally less weird, but it’s still confusing to me.