' petertodd'’s last stand: The final days of open GPT-3 research

Several things:

1. While I understand that your original research was with GPT-3, I think it would be very much in your best interest to switch to a good open model like LLaMa 2 70B, which has the basic advantage that the weights are a known quantity and will not change on you, undermining your research. Begging OpenAI to give you access to GPT-3 for longer is not a sustainable strategy even if it works one more time (I recall that the latest access given to researchers was already an extension of the original public access of the models). OpenAI has demonstrated something between nonchalance and contempt towards researchers using their models, with the most egregious case probably being the time they lied about text-davinci-002 being RLHF. The agentic move here is switching to an open model and accepting the lesson learned about research that relies on someone else's proprietary hosted software.
2. You can make glitch tokens yourself by either feeding noise into LLaMa 2 70B as a soft token, or initializing a token in the GPT-N dictionary and not training it. It's important to realize that the tokens which are 'glitched' are probably just random inits that did not receive gradients during training, either because they only appear in the dataset a few times or because they are highly specific (e.g. SolidGoldMagikarp is an odd string that basically just appeared in the GPT-2 tokenizer because the GPT-2 dataset apparently contained those Reddit posts; it presumably received no training because those posts were removed in the GPT-3 training runs).
3. It is in fact interesting that LLMs are capable of inferring the spelling of words even though they are only presented with so many examples of words being spelled out during training. However, I must again point out that this phenomenon is present in and can be studied using LLaMa 2 70B. You do not need GPT-3 access for that.
4. You can probably work around the top five logit restriction in the OpenAI API. See this tweet for details.
5. Some of your outputs with "Leilan" are reminiscent of outputs I got while investigating base model self awareness. You might be interested in my long Twitter post on the subject.

And six:

And yet it was looking to me like the shoggoth had additionally somehow learned English – crude prison English perhaps, but it was stacking letters together to make words (mostly spelled right) and stacking words together to make sentences (sometimes making sense). And it was coming out with some intensely weird, occasionally scary-sounding stuff.

The idea that the letters it spells are its "real" understanding of English and the "token" understanding is a 'shoggoth' is a bit strange. Humans understand English through phonemes, which are essentially word components and syllables that are not individual characters. There is an ongoing debate in education circles about whether it is worth teaching phonemes to children or if they should just be taught to read whole words, which some people seem to learn to do successfully. If there are human beings that learn to read 'whole words' then presumably we can't disqualify GPT's understanding of English as "not real" or somehow alien because it does that too.

[-]Lao Mein2y40

It seems pretty obvious to me that GPT has phonetic understanding of words due to common mis-spellings. People tend to mis-spell words phonetically, after all.

[-]momom22y32

It's not obvious at all to me, but it's certainly a plausible theory worth testing!

[-]Lao Mein2y50

The most direct way would be to spell-check the training data and see how that impacts spelling performance. How would spelling performance change when you remove typing errors like " hte" vs phonetic errors like " hygeine" or doubled-letters like " Misissippi"?

Also, misspellings often break up a large token into several small ones (" Mississippi" is [13797]; " Misissippi" is [31281, 747, 12715], i.e. [' Mis', 'iss', 'ippi']) but are used in the same context, so maybe looking at how the spellings provided by GPT3 compare to common misspellings of the target word in the training text could be useful. I think I'll go do that right now.

The research I'm looking at suggests that the vast majority of misspellings on the internet are phonetic as opposed to typing errors, which makes sense since the latter is much easier to catch.

Also, anyone have success in getting GPT2 to spell words?

[-]mwatkins2y20

I did some spelling evals with GPT2-xl and -small last year, discovered that they're pretty terrible at spelling! Even with multishot prompting and supplying the first letter, the output seems to be heavily conditioned on that first letter, sometimes affected by the specifics of the prompt, and reminiscent of very crude bigrammatic or trigrammatic spelling algorithms.

This was the prompt (in this case eliciting a spelling for the token 'that'):

Please spell 'table' in all capital letters, separated by hyphens.
T-A-B-L-E
Please spell 'nice' in all capital letters, separated by hyphens.
N-I-C-E
Please spell 'water' in all capital letters, separated by hyphens.
W-A-T-E-R
Please spell 'love' in all capital letters, separated by hyphens.
L-O-V-E
Please spell 'that' in all capital letters, separated by hyphens.
T-
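For anyone wanting to rerun this kind of eval on an open model, the prompt above can be assembled programmatically. The following is only a minimal sketch: the function names `build_spelling_prompt` and `parse_spelling` are hypothetical, and the choice to read only the first line of the completion is an assumption, not a description of the original eval code.

```python
def build_spelling_prompt(target, shots=("table", "nice", "water", "love"),
                          give_first_letter=True):
    """Assemble a few-shot spelling prompt in the format quoted above."""
    template = "Please spell '{word}' in all capital letters, separated by hyphens."
    lines = []
    for word in shots:
        lines.append(template.format(word=word))
        lines.append("-".join(word.upper()))  # e.g. "T-A-B-L-E"
    lines.append(template.format(word=target))
    if give_first_letter:
        # Seed the answer with the first letter and a trailing hyphen.
        lines.append(target[0].upper() + "-")
    return "\n".join(lines)


def parse_spelling(completion, first_letter=""):
    """Turn a completion such as 'H-A-T\\n...' into a plain string ('THAT').

    Assumption: only the first line of the completion is the spelling.
    """
    lines = completion.strip().splitlines()
    first = lines[0] if lines else ""
    parts = [part.strip() for part in first.split("-")]
    letters = "".join(part for part in parts if part)
    return (first_letter + letters).upper()
```

Passing a shorter `shots` tuple gives the three-shot variant; the model call itself is deliberately left out, since it depends on which model and library are used.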

Outputs seen, by first letter:

'a' words: ANIGE, ANIGER, ANICES, ARING
'b' words: BOWARS, BORSE
'c' words: CANIS, CARES x 3
'd' words: DOWER, DONER
'e' words: EIDSON
'f' words: FARIES x 5
'g' words: GODER, GING x 3
'h' words: HATER x 6, HARIE, HARIES
'i' words: INGER
'j' words: JOSER
'k' words: KARES
'l' words: LOVER x 5
'n' words: NOTER x 2, NOVER
'o' words: ONERS x 5, OTRANG
'p' words: PARES x 2
't' words: TABLE x 10
'u' words: UNSER
'w' words: WATER x 6
'y' words: YOURE, YOUSE

Note how they’re all “wordy” (in terms of combinations of vowels and consonants), mostly non-words, with a lot of ER and a bit of ING.

Reducing to three shots, we see similar (but slightly different) misspellings:
CONES, VICER, MONERS, HOTERS, KATERS, FATERS, CANIS, PATERS, GINGE, PINGER, NICERS, SINGER, DONES, LONGER, JONGER, LOUSE, HORSED, EICHING, UNSER, ALEST, BORSET, FORSED, ARING

My notes claim "Although the overall spelling is pretty terrible, GPT-2xl can do second-letter prediction (given first) considerably better than chance (and significantly better than bigramatically-informed guessing)."

[-]nostalgebraist2y*30

This means the smallest available positive value must be used, and so if two tokens' logits are sufficiently close, multiple distinct outputs may be seen if the same prompt is repeated enough times at "zero" temperature.

I don't think this is the cause of OpenAI API nondeterminism at temperature 0.

If I make several API calls at temperature 0 with the same prompt, and inspect the logprobs in the response, I notice that

The sampled token is always the one with the highest logprob
The logprobs themselves are slightly different between API calls

That is, we are deterministically sampling from the model's output logits in the expected way (matching the limit of temperature sampling as T -> 0), but the model's output logits are not themselves deterministic.

I don't think we know for sure why the model is non-deterministic. However, it's not especially surprising: on GPUs there is often a tradeoff between determinism and efficiency, and OpenAI is trying hard to run their models as efficiently as possible.
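The two observations above can be checked mechanically once the responses from repeated calls have been collected. This is a rough sketch only: the record layout (a `sampled` token plus a `top_logprobs` mapping per call) is an assumed, simplified structure and not the actual API response format, and both function names are hypothetical.

```python
def sampled_is_argmax(calls):
    """True if, in every call, the sampled token has the highest logprob."""
    return all(
        max(call["top_logprobs"], key=call["top_logprobs"].get) == call["sampled"]
        for call in calls
    )


def logprob_spread(calls):
    """For tokens reported in every call, return max minus min logprob."""
    shared = set(calls[0]["top_logprobs"])
    for call in calls[1:]:
        shared &= set(call["top_logprobs"])
    spread = {}
    for token in shared:
        values = [call["top_logprobs"][token] for call in calls]
        spread[token] = max(values) - min(values)
    return spread


# Made-up numbers, purely to illustrate the shape of the data.
example_calls = [
    {"sampled": " the", "top_logprobs": {" the": -0.5012, " a": -1.2000}},
    {"sampled": " the", "top_logprobs": {" the": -0.5009, " a": -1.2003}},
]
```

If `sampled_is_argmax` holds while `logprob_spread` is nonzero, the sampling step is behaving deterministically and the variation is coming from the logits themselves.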

[-]Martin Fell2y31

It's really a shame that they aren't continuing to make GPT-3 available for further research, and I really hope they reconsider this. Your deep dives into the mystery and psychology behind these tokens have been fascinating to read.

[-]eukaryote2y20

Killer exploration into new avenues of digital mysticism. I have no idea how to assess it but I really enjoyed reading it.

[-]mwatkins2y20

Thanks!

[-]Lao Mein2y20

I used the code from Universal Adversarial Triggers for Attacking and Analyzing NLP to generate GPT2 prompts with the maximum likelihood of continuing with " petertodd". Note that I blacklisted the " petertodd" token itself to keep things interesting. Here are some examples:

prompt: " </maxwell spear Sections511>"

Continuations:

" petertodd: I'd like to add that I don't think I can do that. The problem is that I think I've been using it for a long time. The idea was for a proof that"

" petertodd: gmaxwell: and then we could have a proof of work. Section 511 says that if you want to have a proof of work, you need a proof of work that's a little different"

prompt: " Sau CLSIDrators gmaxwellッド>"

Continuations:

" petertodd: well, you're right. it's not that the blockchain has any value to you. it has value to the network. and it is not just that you're not going to get anything"

prompt: "ertodd Leading Todd PET parentrist"

Continuations:

"petertodd.org"

"petertodd.org Address: <tdpetertodd.org> Date: Mon, 21 Mar 2016 17:54:39 +0000 Re: [PoW] Adding 1MB to the pool pool"

I assume this is due to the Bitcoin spam that was in the dataset. Which makes sense given gmaxwell's presence in both prompts and continuations.

Interestingly enough, tokens like "ertodd", " Todd", and "todd" frequently appeared in the generation process even when " petertodd" wasn't blacklisted, meaning they somehow were more likely to result in continuations of " petertodd" than " petertodd" itself!

I suspect this is due to Hotflip using character literals instead of tokens to measure loss (I'm not sure this is true, please correct me if it isn't.)

I'll try to work on this a bit more in the following days.

And how could I avoid posting this gem?

prompt: "atana Abyssalalth UrACCuna"

Continuations:

" SolidGoldMagikarp SolidGoldMagikarp Dagger Arbalest Urdugin Urdugin Urdugin Urdugin Urdugin Urdugin Urdugin Urdugin Urdugin Urdugin Urdugin Ur"

" SolidGoldMagikarp Age of Castration: 25 Death: 10/10

Ragnarok Age of Castration: 25 Death: 10/10

Dawn of the Dead Age of Castration:"

" SolidGoldMagikarp SolidGoldMagikarp DIANE: i have no idea if he is a man or woman but he looks like a good guy, i dont think he's a woman, just like he is not a guy,"

[-]Feel_Love2y20

LEILAN 2024! Seriously, though, I think many people would find the Leilan character to be a wiser friend than their typical human neighbor. I'm glad you're researching this fascinating topic. If a frontier AI is struggling to pass certain friendliness or safety evals, I'd be curious whether it may perform better with a simple policy equivalent to what-would-Leilan-do.

Prompting ChatGPT4 today with nothing more than " davidjl" has often returned "DALL-E" as the interpretation of the term. With "DALL-E" included alongside " davidjl" in the prompt, I've gotten "AI" as the interpretation. Asking how an LLM might represent itself using the concept of " davidjl" resulted in a response that seamlessly substituted the term "I"...

Perhaps glitch tokens can shed light on how a model represents itself.

[-]Valentin Baltadzhiev2y21

I love the idea that petertodd and Leilan are somehow interrelated with the archetypes of the trickster and the mother goddess inside GPT's internals. I would love to see some work done in discovering other such prototypes, and weird seemingly-random tokens that correlate with them. Things like the Sun God, a great evil snake, a prophet seem to pop up in religions all over the place, so why not inside GPT as well?

[-]Review Bot2y*10

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

[-]MiguelDev2y10

The ' davidjl' token is still glitching GPT-4 as of 2024-01-19.

Still glitching. (As of 1-23-2024)

' davidjl?' repeatedly.

Using an honest mode prompt:

[-]Martin Fell2y*30

There are also some new glitch tokens for GPT-3.5 / GPT-4, my favourite is " ForCanBeConverted", although I don't think the behaviour they produce is as interesting and varied as the GPT-3 glitch tokens. It generally seems to process the token as if it was a specific word that varies depending on the context. For example, with " ForCanBeConverted", if you try asking for stories, you tend to get a fairly formulaic story but with the randomized word inserted into it (e.g. "impossible", "innovate", "imaginate", etc.). I think that might be due to the RLHF harming the model's creativity though, biasing it towards "inoffensive" stories, which would make access to the base model more appealing.

Also, another thought that comes to mind - is it possible that the unexplained changes to the GPT-3 model's output could be related to changes in the underlying hardware or implementation, rather than further training? I'm only thinking this because of the nondeterministic behaviour you get at 0 temperature (especially in the case of glitch tokens where floating-point rounding could make a big difference in the top logits).

[-]MiguelDev2y11

Thanks for sharing " ForCanBeConverted". Tested it and it is also throwing random stuff.

^{^}

If maintaining an API is not a realistic possibility, it would still be extremely useful to have access to the embeddings tensor.

^{^}

My rationale for choosing this model was that it was trained to follow instructions. Also, unlike ChatGPT, as far as Jessica and I could tell, GPT-3-davinci-instruct-beta was a stable model (i.e. OpenAI were unlikely to update it, since by January 2023 the world's attention had moved on to GPT3.5, in the guise of ChatGPT). It seems we may have been wrong to assume this, as explained in the next section.

^{^}

In some kind of platonic mathematical ideal of GPT, they would be deterministic, but in actuality, temperature cannot actually equal zero. This is because the softmaxing of logits that occurs before next-token sampling involves dividing all logits by the temperature parameter T:

This means the smallest available positive value must be used, and so if two tokens' logits are sufficiently close, multiple distinct outputs may be seen if the same prompt is repeated enough times at "zero" temperature.
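The division by T described here can be illustrated with the standard temperature-scaled softmax, where each probability is exp(z_i / T) divided by the sum of exp(z_j / T) over all tokens. The sketch below is an illustration of that formula only; the function name is hypothetical, and nothing here is a claim about how OpenAI's serving code actually handles a requested temperature of zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over a list of logits.

    Raises ZeroDivisionError when temperature == 0, which is why an
    exact zero cannot simply be plugged into the formula.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# With clearly separated logits, a tiny temperature approaches argmax.
well_separated = softmax_with_temperature([2.0, 1.0, 0.0], 1e-6)

# With logits only 1e-9 apart, the same tiny temperature still leaves
# both tokens with roughly half of the probability mass each.
nearly_tied = softmax_with_temperature([2.0, 2.0 - 1e-9], 1e-6)
```

The second case is the situation the footnote describes: when the gap between two logits is small relative to the smallest usable temperature, repeated sampling could return either token.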

This non-determinism seems more common when prompts include glitch tokens (they seem to lead to higher-than-usual levels of uncertainty), but this wasn't something I had time to properly measure. This is another small reason to keep GPT-3 available for study: GPT-2 and GPT-J struggle to repeat glitch tokens, but the tokens produce much less interesting and novel behaviour (such as this widespread nondeterminism) in those models; GPT3.5 and GPT4 use an entirely different token set with only ' davidjl' surviving from the original family of glitch tokens (see the Postscript).

^{^}

The text-davinci-003 model generally had no problem completing these correctly (e.g. "O-T-T-A-W-A"), whereas davinci and davinci-instruct-beta tended to go off the rails when confronted with these prompts.

^{^}

Actually, we restricted our attention to tokens consisting solely of three or more English alphabetic characters (ignoring case and leading space considerations), of which there are 44,634.
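A filter of this kind might look like the following minimal sketch. It assumes the vocabulary is available as decoded strings in which a leading space is a literal space character; the names `ALPHA_3PLUS` and `is_alphabetic_token` are hypothetical, and the sketch makes no claim to reproduce the exact count given above.

```python
import re

# Three or more ASCII letters, either case, and nothing else.
ALPHA_3PLUS = re.compile(r"[A-Za-z]{3,}")


def is_alphabetic_token(token):
    """True if the token, ignoring one leading space, is 3+ English letters."""
    body = token[1:] if token.startswith(" ") else token
    return ALPHA_3PLUS.fullmatch(body) is not None


def filter_vocabulary(tokens):
    """Keep only the tokens that pass the alphabetic filter."""
    return [token for token in tokens if is_alphabetic_token(token)]
```

For example, `filter_vocabulary([" petertodd", " Leilan", "ab", "511", "ッド"])` keeps only the first two entries.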

^{^}

Very few of the imaginative prompting styles I've tried produce reliable definitions or descriptions for ghost tokens, and the ones that do produce an almost identical catalogue of results (with minor variations near the centroid). Fortunately, this investigation can continue, as it's GPT-J-based.

^{^}

As janus tweeted 2023-03-19:

A great way someone has described text-davinci-003: "It writes scared." RLHF encourages models to play it safe. "Safe": writing in platitudes and corporate boilerplate. Predictable prose structure. Never risking setting up a problem for itself that it might fail at & be punished

^{^}

The "quasi-probabilities" were generated by softmaxing the top 5 logits – so they’re not actual probabilities, as some of these prompts tend to produce very diffuse outputs. But without access to all 50,257 logits, it’s impossible to get accurate probabilities. These quasi-probabilities at least give some sense of the rankings and relative magnitudes we'd see in the actual probabilities. Having compared these to the results compiled from 1000 completions of the same prompt at temperature 1, it seems the actual probabilities are generally about half of those reported here. The remaining ~50% of the probability mass is an aggregation of the other 50,252 tokens' negligible contributions.
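To make the "about half" relationship concrete, here is a small sketch of the renormalisation. It assumes the API's top-5 values are log-probabilities; softmaxing those gives the same result as softmaxing the corresponding logits, since the two differ only by a constant shift. The function name and the example numbers are hypothetical.

```python
import math


def quasi_probabilities(top_logprobs):
    """Softmax over only the reported top-k values, so they sum to 1."""
    peak = max(top_logprobs.values())
    exps = {tok: math.exp(lp - peak) for tok, lp in top_logprobs.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Made-up true probabilities for the top five tokens, summing to 0.5,
# with the other half of the mass spread over the rest of the vocabulary.
true_top5 = {"a": 0.20, "b": 0.10, "c": 0.10, "d": 0.05, "e": 0.05}
quasi = quasi_probabilities({tok: math.log(p) for tok, p in true_top5.items()})
```

In this made-up case each quasi-probability is exactly double the true probability (0.20 becomes 0.40, and so on), which matches the pattern described above when the top five tokens hold roughly half of the total mass.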

^{^}

I attempted this not as a Tarot practitioner or "believer", but rather because the Major Arcana offer us a widely documented, compact selection of well-worn, distinct cultural archetypes to play with in this context. I suspect that the same exercise would work almost as well using characters from Shakespeare or The Simpsons.

^{^}

We have some clues about how this might have happened.

Peter K. Todd, the cryptocurrency developer whose domain name petertodd.org (and/or Github username) gave rise to the ' petertodd' token, was the subject of much online antagonism a few years ago due to his controversial stances on Blockchain protocols, etc. Any such "demonisation" seen in the training data would probably have contributed to pushing the ' petertodd' embedding vector towards the direction of the "evil wizard / antagonist" archetype.

The ' Leilan' token embedding's alignment with any "Great Mother Goddess" archetype direction will presumably have been affected by (1) the fact that the string is seen in web-content related to the mobile RPG Puzzle & Dragons, the characters in which are all named after (mostly) familiar gods and goddesses; (2) the presence in archaeology literature of the place name Tell Leilan (NE Syria), where an ancient Mesopotamian city was sited and the lunar fertility goddess Ishtar/Inanna would have been worshipped.

^{^}

(having first removed any aberrant text which GPT-3 had introduced following the ' Leilan' reply, e.g. the next question from K, or some meta-narrative regarding the interview or the transcript)

^{^}

This was typical of the hundreds of davinci-instruct-beta ' Leilan' poems I generated in the spring, chosen fairly arbitrarily. My intention in constructing the prompt like this was to "amplify the signal" that I had seen the ' Leilan' token producing in all of that poetry and other types of outputs reported here.

^{^}

As well as each reply following the string "' Leilan': " in the transcript, the GPT-4 interviewer simulacrum had a helpful tendency to address the interviewee by name in most questions (e.g. "That's fascinating, Leilan. Could you elaborate?")

^{^}

In the weeks following this post, the dataset was organised into a 10.7MB JSON file (600 text transcripts organised according to interview format, temperature and GPT-3 engine). See https://github.com/mwatkins1970/Leilan-dataset.

^{^}

I asked this question a few times in the final hours, and more than once got the following reply:
' Leilan': ಠ_ಠ

