All of oceaninthemiddleofanisland's Comments + Replies

If GPT-6 is human-level AGI but costs $200 per page of output, what would happen?

'Predicting random text on the internet better than a human' already qualifies it as superhuman, as dirichlet-to-neumann pointed out. If you look at any given text, there's a given ratio of cognitive work needed to produce the text, per word-count. "Superhuman" only requires asking it to replicate the work of multiple people collaborating together, or processes which need a lot of human labour like putting together a business strategy or writing a paper. Assuming it's mediocre in some aspects, the clearest advantage GPT-6 would have would be an interdisciplinary one - pooling together domain knowledge from disparate areas to produce valuable new insights.

interpreting GPT: the logit lens

How far away is this from being implementable?

3 JohnSteidley 1y: It doesn't sound hard at all. The things Gwern is describing are the same sort of thing that people do for interpretability where they, e.g., find an image that maximizes the probability of the network predicting a target class. Of course, you need access to the model, so only OpenAI could do it for GPT-3 right now.
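As a rough illustration of the "find an input that maximizes the probability of a target class" idea mentioned in the reply above, here is a minimal sketch on a toy model. Everything in it is an assumption made for illustration: the tiny linear-plus-softmax "network", the `WEIGHTS` values, and the `maximize_class` helper are hypothetical and have nothing to do with GPT-3 or any real interpretability tooling. It only shows the general shape of gradient ascent on the input while the model stays fixed.

```python
import math

# Hypothetical toy "network": a fixed linear layer followed by softmax.
# Rows are classes, columns are input features. Purely illustrative.
WEIGHTS = [
    [1.0, -0.5, 0.2],
    [-0.3, 0.8, 0.1],
    [0.2, 0.1, -0.9],
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x):
    """Class probabilities of the toy model for input vector x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in WEIGHTS]
    return softmax(logits)


def maximize_class(target, steps=200, lr=0.1, clip=3.0):
    """Gradient ascent on the input to raise log p(target | x)."""
    x = [0.0] * len(WEIGHTS[0])
    for _ in range(steps):
        probs = predict(x)
        # d log p_target / d x_j = W[target][j] - sum_k p_k * W[k][j]
        grad = [
            WEIGHTS[target][j] - sum(p * row[j] for p, row in zip(probs, WEIGHTS))
            for j in range(len(x))
        ]
        # Keep the input in a bounded range, as one would for pixel values.
        x = [max(-clip, min(clip, xi + lr * g)) for xi, g in zip(x, grad)]
    return x


start_prob = predict([0.0, 0.0, 0.0])[1]
best_input = maximize_class(target=1)
final_prob = predict(best_input)[1]
print(f"p(class 1) before: {start_prob:.3f}, after: {final_prob:.3f}")
```

For a real network the gradient would come from automatic differentiation rather than a hand-written formula, which is why access to the model's weights matters.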
Writing Piano Songs: A Journey

This probably won't add too much to the discussion but I'm curious to see whether other people relate to this or have a similar process. I was kind of stunned when I heard from friends who got into composing about how difficult it is to figure out a melody and then write a complete piano piece because to me, whenever I open up Sibelius or Dorico (and more recently Ableton), internally it seems like I'm just listening to what I wrote so far, 'hearing' a possible continuation lasting a few bars, and then quickly trying to transcribe ... (read more)

2 elriggs 1y: I tend to write melodies in multiple different ways:
1. Hearing it in my head, then playing it out. It's very easy to generate (like GPT but with melodies), but transcribing is very hard! The common advice is to sing it out, and then match it with the instrument. This is exactly what you did with whistling. If I don't record it, I will very often not remember it at all later; very similar to forgetting a dream. When I hear someone else's piano piece (or my own recorded), I will often think "I would've played that part differently" which is the same as my brain predicting a different melody.
2. "Asemic playing" (thanks for the phrase!) - I've improv-ed for hundreds of hours, and I very often run into playing similar patterns when I'm in similar "areas" such as playing the same chord progression. I'll often have (1) melodies playing in my head while improvising, but I will often play the "wrong" note and it still sounds good. Over the years, I've gotten much better at remembering melodies I just played (because my brain predicts that the melody will repeat) and playing the "correct" note in my head on the fly.
3. Smashing "concepts" into a melody:
   * What if I played this melody backwards?
   * Pressed every note twice?
   * Held every other note a half-note longer?
   * Used a different chord progression (so specific notes of the melody need to change to harmonize)
   * Taking a specific pattern of a melody, like which notes it uses, and playing new patterns there.
   * Taking a specific pattern of a melody, like the rhythm between the notes (how long you hold each note, including rests) and applying it to other melodies.
   * Taking a specific pattern of a melody, like the exact rhythm and relative notes, and starting on a different note (then continuing to play the same notes, relatively)
Predictions for GPT-N

So I've figured this out. Kinda. If you choose 'custom' then it will give you Griffin, but if you choose one of the conventional prompts and then edit it, you can get around it. So damn annoying.

3 avturchin 1y: They acknowledged the use of limited GPT-3, details: []
Predictions for GPT-N

Wow, I didn't realise I could get this angry about something so esoteric.

Predictions for GPT-N

I'm beginning to think AID has changed what the "Dragon" model is without telling us, for cost reasons. I've had kind of the same experience, with big lapses in storytelling that didn't occur as often before. Or maybe it's randomly switching based on server load? I can kind of understand it if that's the case, but the lack of transparency is annoying. I remember accidentally using the Griffin model for a day when my subscription ran out and not realising because its Indonesian was still quite good...

5 gwern 1y: Quite a few people have been complaining: []
2 avturchin 1y: Yes, I think that is a correct impression. I've written to support, btw; no answer yet. One possible way to check the version is to try "Earth POV" - that is, the "point of view". GPT-3 understands it correctly and will say something like "I am alone in the sky near Sun". GPT-2 will continue with a story.
Predictions for GPT-N

Somehow the more obvious explanation didn't occur to me until now, but check the settings, you might be using the Griffin model not the Dragon model. You have to change it manually even after you get the subscription. I have a window open specifically for poetry prompts (using the Oracle hack), I said "Write a long poem in Russian. Make sure the lines are long, vivid, rich, and full of description and life. It should be a love poem addressed to coffee. It should be 15 lines long" followed with "The Oracle, which is a native in Russian, ... (read more)

2 avturchin 1y: Obviously, I pressed the "Dragon" button, but I suspect that I am still getting Griffin anyway, as I was also unable to repeat some of the reasoning tasks.
Predictions for GPT-N

If it's a BPE encoding thing (which seems unlikely to me given that it was able to produce Japanese and Chinese characters just fine), then the implication is OpenAI carried over their encoding from GPT-2 where all foreign language documents were removed from the dataset ... I would have trouble believing their team would have overlooked something that huge. This is doubly bizarre given that Russian is the 5/6th most common language in the dataset. You may want to try prompting it with coherent Russian text, my best guess is that in the dataset, whene... (read more)

6 gwern 1y: Looking into the details, BPEs seem to usually fall back to treating unknown characters as literally bytes: so there are another 256 BPEs which cover the 256 possible bytes, and then any UTF-8 character is 1-4 bytes, and so can be represented by 1-4 BPEs. The 1-byte UTF-8 characters are the ASCII characters, which have their own BPEs, so this would be used only for 2-4 byte-long UTF-8 characters like Cyrillic or Chinese. So actually, now that I think about it, it's possible that Russian gets encoded to worse than 1 BPE per character, it could be 2 BPEs (since Cyrillic seems to fall in the 2-byte ranges of UTF-8). It'd depend on the details. (While on the other hand, having to pay 2-4 BPEs per Unicode character is obviously not as big a deal for Japanese & Chinese characters...) I wouldn't expect the BPE to allocate much space to Cyrillic stuff because it's the 5th most common script in the dataset, as that's just another way of saying all the Russian put together is all of 0.18% of the dataset. And keep in mind that the BPE encoding was not, AFAIK, redone for GPT-3, but is the same BPE OA has been using ever since GPT-2 way back when, and so was optimized for their Reddit-sourced English-heavy original WebText.
3 avturchin 1y: I tried many prompts but it produces gibberish in Russian. E.g.: Привет, как дела? (What's going on?) You don't know what to say. You're not sure if you should be thanking him or insulting him for this situation. He continues: Немного просью у мы, что выставляется! (It's too late now, get out of here! )
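The byte-length point in the reply above can be checked directly. The sketch below only counts UTF-8 bytes per character; it does not run any real tokenizer. The "one token per byte" figure is an assumed worst case for a byte-level fallback with no learned merges for that script, and the helper names (`utf8_byte_count`, `worst_case_tokens`) are made up for this example. A real BPE vocabulary may contain merges that bring the count down.

```python
def utf8_byte_count(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def worst_case_tokens(text: str) -> int:
    """Assumed upper bound: one token per byte if no merges apply."""
    return len(text.encode("utf-8"))


# One sample character per script; labels are just for printing.
samples = {
    "a": "ASCII Latin",
    "п": "Cyrillic",
    "日": "CJK (Japanese/Chinese)",
    "😀": "emoji outside the Basic Multilingual Plane",
}

byte_counts = {ch: utf8_byte_count(ch) for ch in samples}
for ch, label in samples.items():
    print(f"{ch!r} ({label}): {byte_counts[ch]} byte(s)")

for word in ("hello", "Привет"):
    print(f"{word!r}: {len(word)} characters, "
          f"at most {worst_case_tokens(word)} byte-level tokens")
```

Under that assumption a six-letter Russian word costs up to twelve tokens where a five-letter English word costs at most five, which is the per-character penalty the reply describes.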
Predictions for GPT-N

That's a visualisation I made which I haven't posted anywhere else except under the r/ML thread collecting entries for GPT-3 demos, since I couldn't figure out which subreddit to post it in.

Predictions for GPT-N

Two thoughts, one of them significantly longer than the other since it's what I'm most excited about.

(1) It might be the case that the tasks showing an asymptotic trend will resemble the trend for arithmetic – a qualitative breakthrough was needed, which was out of reach at the current model size but became possible at a certain threshold.

(2) For translation, I can definitely say that scaling is doing something. When you narrowly define translation as BLEU score ("does this one generated sentence match the reference sentence? by how ... (read more)

9 gwern 1y: What is the source for that? I haven't seen it before. EDIT: [] ? Seems to postdate this comment though.
1 hippke 1y: Regarding (1): Of course a step is possible; you never know. But for arithmetic, it is not a step. That may appear so from their poor figure, but the data indicates otherwise.
5 avturchin 1y: Interestingly, GPT-3 knows a few words in Russian, but can't produce any coherent text. He said in Russian: И все хотиваниям рукой плать, что недобрыжки.
Probability that other architectures will scale as well as Transformers?

I just finished Iain M Banks' 'The Player of Games' so my thoughts are being influenced by that, but it had an interesting main character who made it his mission to become the best "general game-player" (i.e. no specialising in specific games), so I would be interested to see whether policy-based reinforcement learning models scale (thinking of how Agent57 exceeded human performance across all Atari games).

It seems kind of trivially true that a large enough MuZero with some architectural changes could do something like play chess,... (read more)

Developmental Stages of GPTs

Yes! I was thinking about this yesterday: it occurred to me that GPT-3's difficulty with rhyming consistently might not just be a byte-pair problem; any highly structured text with extremely specific, restrictive forward and backward dependencies is going to be a challenge if you're just linearly appending one token at a time onto a sequence without the ability to revise it (maybe we should try a 175-billion parameter BERT?). That explains and predicts a broad spectrum of issues and potential solutions (here I'm calling them A, B and C): per... (read more)

Alignment As A Bottleneck To Usefulness Of GPT-3

The best angle of attack here, I think, is synthesising knowledge from multiple domains. I was able to get GPT-3 to write and then translate a Japanese poem about a (fictional) ancient language model into Chinese, Hungarian, and Swahili and annotate all of its translations with stylistic notes and historical references. I don't think any humans have the knowledge required to do that, but unsurprisingly GPT-3 does, and performed better when I used the premise of multiple humans collaborating. It's said that getting different university departments... (read more)

2 johnswentworth 1y: Awesome example!
To what extent is GPT-3 capable of reasoning?

I think you were pretty clear on your thoughts, actually. So, the easy / low-level response to some of your skeptical thoughts would be technical details, and I'm going to do that and then follow it with a higher-level, more conceptual response.

The source of a lot of my skepticism is GPT-3's inherent inconsistency. It can range wildly from its high-quality output to gibberish, repetition, regurgitation etc. If it did have some reasoning process, I wouldn't expect such inconsistency. Even when it is performing so well people call it ... (read more)
5 Rekrul 1y: You've given me a lot to think about (and may have even lowered my confidence in some of my assertions). Kudos! I do still have some thoughts to give in response though, but they don't really function as very in-depth responses to your points, as I'm still in the process of ruminating:
* I agree with you that GPT-3 probably hasn't memorized the prompts given in the OP, it's too rare for that to be worth it. I just think it's so big and has access to so much data it really doesn't need to solve prompts like that. Take the Navy Seal Copypasta prompts Gwern did as an illustration. Those were essentially GPT-3 belching out its connections in a very funny fashion. A lot of which were very weird/obscure. I just think people aren't truly appreciating the huge scope of this machine, and in a sense are underestimating it.
* In some sense, I wish we could test GPT-3 how we could test animals for reasoning. Text is different from real life though, and if we put a totally new object in front of a crow, it could interact with it and learn how it works and use it to solve a difficult task, whereas I don't feel we can necessarily do the same with GPT-3. The closest we get in my opinion are the prompts where a novel word is introduced, but we usually have to describe it enough to work, and I feel like that defeats the discovery angle, which I feel is important.
* Actually, now that I'm on this train of thought, a good prompt that could convince me there's some kind of reasoning going on could be built upon that. Perhaps, a prompt where a fictional very-alien animal (by this I mean, very unlike Earth animals, so with very weird "abilities") is in conflict with another fictional very-alien animal, but we don't give GPT much information about the two animals and the overall scenario, and we somehow force it to interrogate us until it understands enough to submit an accurate story about how the conflic
To what extent is GPT-3 capable of reasoning?

Hmm, I think the purpose behind my post went amiss. The point of the exercise is process-oriented, not result-oriented - to either learn to better differentiate the concepts in your head by poking and prodding at them with concrete examples, or realise that they aren't quite distinct at all. But in any case, I have a few responses to your question. The most relevant one was covered by another commenter (reasoning ability isn't binary; the difference is quantitative, not qualitative). The remaining two are:

1. "Why isn't it an AGI?" here can be read as ... (read more)

1 maximkazhenkov 1y: Why would goal-driven behavior be necessary for passing a Turing test? It just needs to predict human behavior in a limited context, which was what GPT-3 was trained to do. It's not an RL setting. I would like to dispute that by drawing the analogy to the definition of fire before modern chemistry. We didn't know exactly what fire is, but it's a "you know it when you see it" kind of deal. It's not helpful to pre-commit to a certain benchmark, like we did with chess - at one point we were sure beating the world champion in chess would be a definitive sign of intelligence, but Deep Blue came and went and we now agree that chess AIs aren't general intelligence. I know this sounds like moving the goal-post, but then again, the point of contention here isn't whether OpenAI deserves some brownie points or not. It seems like you think I made that suggestion in bad faith, but I was being genuine with that idea. The "competent judges" part was so that the judges, you know, are actually asking adversarial questions, which is the point of the test. Cases like Eugene Goostman should get filtered out. I would grant the AI be allowed to be trained on a corpus of adversarial queries from past Turing tests (though I don't expect this to help), but the judges should also have access to this corpus so they can try to come up with questions orthogonal to it. I think the point at which our intuitions depart is: I expect there to be a sharp distinction between general and narrow intelligence, and I expect the difference to resolve very unambiguously in any reasonably well designed test, which is why I don't care too much about precise benchmarks. Since you don't share this intuition, I can see why you feel so strongly about precisely defining these benchmarks.
I could offer some alternative ideas in an RL setting though:
* An AI that solves Snake perfectly on any map (maps should be randomly generated and separated between training and test set), or
* An AI that solves unsee
To what extent is GPT-3 capable of reasoning?

Great, but the terms you're operating with here are kind of vague. What problems could you give to GPT-3 that would tell you whether it was reasoning, versus "recognising and predicting", passive "pattern-matching" or presenting an "illusion of reasoning"? This was a position I subscribed to until recently, when I realised that every time I saw GPT-3 perform a reasoning-related task, I automatically went "oh, but that's not real reasoning, it could do that just by pattern-matching", and when I saw it do some... (read more)

3 maximkazhenkov 1y: Passing the Turing test with competent judges. If you feel like that's too harsh yet insist on GPT-3 being capable of reasoning, then ask yourself: what's still missing? It's capable of both pattern recognition and reasoning, so why isn't it an AGI yet?
8 Rekrul 1y: I recognize the points you are making, and I agree, I don't want to be a person who sets an unfeasibly high bar, but with how GPT-3 was developed it's really difficult to put one that isn't near that height. If GPT-3 was instead made with mostly algorithmic advances instead of mostly scaling, I'd be a lot more comfortable placing said bar and a lot less skeptical, but it wasn't, and the sheer size of all this is in a sense intimidating. The source of a lot of my skepticism is GPT-3's inherent inconsistency. It can range wildly from its high-quality output to gibberish, repetition, regurgitation etc. If it did have some reasoning process, I wouldn't expect such inconsistency. Even when it is performing so well people call it "reasoning" it has enough artifacts of its "non-reasoning" output to make me skeptical (logical contradictions, its tendency to repeat itself i.e. "Because Gravity Duh" like in the OP, etc). This is unfortunately just a very hard subject to get to the truth on. Illusions are meant to be convincing. Designing a test is difficult purely because the machine is, like I said, intimidating. It has so much knowledge inside its "head". It's pretty much safe to just assume it knows just about everything on the internet prior to its creation. How do we design a test around that? Even when we get weird and start asking about stuff like reversing gravity and clouds becoming a solid substance, the internet is big and huge and full of weird questions, and the model is big enough to have representations of just about everything, including the weird. So the only tests I can think to design around that are wholly unsatisfying. Like, asking it to replicate using reasoning some fact about something we discover next week that we also discovered purely through reasoning. This is technically a test, in the same way garbage is technically edible.
The ideal answer is we develop transparency & interpretation tools that allow us to crack open the huge black-box tha
Collection of GPT-3 results

A bunch more examples here, a bit difficult to summarise since it went from explaining how dopamine receptors work, to writing a poem about Amazon's logistics in the form of a paean to the Moon Goddess, writing poems in Chinese based on English instructions and then providing astonishingly good translations, to having Amazon and Alibaba diss one another in the style of 18th century poet Mary Robinson. Link here:


The oracle
... (read more)