I think that, memetically, we'll select hardest for cases where it's most difficult to see how much work the human is doing by pushing it out into the context, thus making GPT-3 look most impressive. Only a little of that is happening here; I mention it just because this is what sparked the thought.
Interesting to think about how this will evolve. Over time, humans will have to do less of the work, and the combined system will be able to do more. (Though the selection pressure that you mention will continue to be there.)
It seems to me that we might not be too far away from "natural language programming". With some combination of the above approach, plus the program-synthesis examples where you just specify a comment, plus some extra tricks, it seems like you could end up specifying your programs via an algorithm description in English.
You'd want to set it up so that it alerted you when it thought things were ambiguous, and that it auto-generated test cases for different possible interpretations and showed you the results.
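One way to make the "alert on ambiguity" idea concrete: run each candidate reading of a spec against auto-generated inputs and flag where they diverge. This is a minimal sketch with hypothetical names and a toy ambiguity (ascending vs. descending sort), not a description of any existing tool:

```python
def reading_ascending(lst):
    # Reading 1 of "sort the list": smallest first.
    return sorted(lst)

def reading_descending(lst):
    # Reading 2: largest first.
    return sorted(lst, reverse=True)

def flag_ambiguity(spec_readings, test_inputs):
    # Run each candidate reading on the generated inputs and report
    # cases where the readings disagree, so the user can pick one.
    flagged = []
    for inp in test_inputs:
        results = {name: f(list(inp)) for name, f in spec_readings.items()}
        if len({tuple(r) for r in results.values()}) > 1:
            flagged.append((inp, results))
    return flagged

readings = {"ascending": reading_ascending, "descending": reading_descending}
print(flag_ambiguity(readings, [[2, 1, 3], [1, 1]]))
```

Only the first input is flagged; the second gives the same result under both readings, so it tells the user nothing.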
I've personally started using TabNine in the last few weeks, and I'd say it's just barely over the edge of being useful. But I can imagine next-gen versions of these things pretty radically transforming the process of programming.
Possible startup idea -- design an IDE from the ground up to take advantage of GPT-like abilities.
I think just using the next-gen version of TabNine will be powerful, and I expect all major IDEs' autocomplete features to improve a lot in the coming years. But I also suspect that if you designed an IDE to really take advantage of what these systems can do, you might end up with something rather different from today's IDEs + better autocomplete.
With some combination of the above approach, plus the program-synthesis examples where you just specify a comment, plus some extra tricks
Another interesting direction to go with this -- can you get it to do a sort of distillation step, where you first get amplified-GPT to implement some algorithm, a la the recursion dialogue above, and then you get it to generate code that implements the same algorithm?
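As an illustration of what the target of such a distillation step might look like, here is the recursive list sum from the dialogue written as ordinary Python. This is my own sketch of the algorithm being discussed, not code GPT-3 actually generated:

```python
def recursive_sum(lst):
    # Base case: an empty list sums to zero.
    if not lst:
        return 0
    # Recursive case: the head plus the sum of the tail.
    return lst[0] + recursive_sum(lst[1:])

print(recursive_sum([1, 2, 3, 4]))  # 10
```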
I agree. Coming up with the right prompts was not trivial; I almost quit several times. Yet there is a science to this, and I think it'll become more important to turn our focus away from the spectacle aspects of GPT and more towards reproducibility. More so if the way forward is via interrelated instances of GPT.
As an aside, critique seems much easier than generation. I’m cautiously optimistic about prompting GPT instances to “check” output.
How are you actually doing this in AI Dungeon? I have Dragon mode enabled, everything else default.
I start a new single-player game and choose Custom mode (6). Then at the prompt I just paste (using Say mode)
Q: Say I want to sum the items in a list. How would I do this recursively? The answer involves two steps.
and I get
Q: Say I want to sum the items in a list. How would I do this recursively? The answer involves two steps. First, I need to know how many items there are in total. Second, I need to find out which item is at the top of that list. A: You could use recursive_sum() .
Similarly, when I tried to reproduce stuff from https://old.reddit.com/r/slatestarcodex/comments/hrx2id/a_collection_of_amazing_things_gpt3_has_done/ I didn't get anything near as impressive. Also, the responses get increasingly confused: if I ask it to translate something into French or Romanian, it will randomly translate later prompts as well.
Is there some basic tutorial for how you seed these AI Dungeon interactions?
You could prompt with “Q:” + (content) and then “A:”
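That prompt shape is easy to mechanize. A trivial sketch of a template helper (the function name is my own, not part of AI Dungeon or any API):

```python
def make_prompt(question):
    # Build a Q/A-style prompt; the trailing "A:" cues the model
    # to continue with an answer rather than another question.
    return f"Q: {question}\nA:"

print(make_prompt("Say I want to sum the items in a list. How would I do this recursively?"))
```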
I use the default temperature settings, but I do cut it off after it finishes an answer. However, you likely won’t get my exact results unless you literally copy the instances. Moreover, if you gave up after the first response, I think you might’ve given up too quickly. You can respond to it and communicate more information, as I did. The above really was what I got on the first try. It’s not perfect, but that’s the point. You can teach it. It’s not a matter of “it works” or “it doesn’t work”.
I don’t think there are tutorials, but perhaps in due time someone (maybe me) will get to that. I also feel like ‘trying’ to get it to do something might be a sub-optimal approach. This is a subtle difference, but my intent here was to get it to confirm it understood what I was asking by answering questions.
The approach I've been using (for different things, but I suspect the principle is the same) is
At that point it is much better than one where you prompted everything to begin with.
Your formatting makes it hard for me to tell which parts are your prompting vs. GPT-3’s output. One common format is “bold = you”, but the opening line wasn’t bold, so I was confused about what’s going on there.
Thanks! I forgot to do this. Luckily I can go back through the run and put this in. There is ambiguity whenever it auto-completes, but I hope I did a decent job of noting where this is happening.
This early example of inner-monologues in LLMs has now been covered in The Atlantic: https://www.theatlantic.com/technology/2026/04/4chan-ai-dungeon-thinking-reasoning/686794/
Epistemic status: The idea is unoriginal, but the implementation is the first I'm aware of. I'm reasonably confident anyone (patient) with an hour to burn could reproduce these results. You can probably amplify GPT-3 directly.
Let me cut the suspense: I can get GPT-3 to sum the items in a list using recursion. I have a notebook here and Dungeon instances for the parts: factoring, amplification1, and amplification2. I'm suggesting that I can combine different instances of GPT-3 to implement primitive recursive functions. Moreover, I can get GPT-3 to suggest approaches for implementation. Combined, this is a toy example showing how to amplify GPT-3's capabilities. At the moment the method is slow, but it does provide a proof of concept. Obvious extensions include prompting GPT-3 to factor questions, ask questions, generate examples, and induce belief states that encourage learning.
Factoring
In the following I don’t re-roll unless I say so. Anything omitted here is available in the links at the top. Mistakes have to be dealt with in-script; this simplifies reproducibility. I'll bold my prompts and use italics for emphasis. Occasionally, GPT will get ahead of me and auto-complete questions. Beforehand I decided that I'd cut off at the end of an answer, but that doesn't mean I can't ask the same question again if I think it was good. First, I state the problem,
Here GPT3 fails to explain step one properly,
Yet, with prompting recovers.
This is good. I go into more detail on what step two looks like. I go a bit meta to see if that works,
Amplification
The next stage is getting GPT to implement the components it's suggested. Things get a bit odd,
On the next example GPT3 objects and then completes everything bolded,
In practice, I need to run two different instances of GPT3 in order to calculate the list sum. One will return the head of the list and the other will keep a running total. For the head method we have a prompt cycling over things like,
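The division of labor between the two instances can be sketched as a driver loop. The two functions below are stand-ins for the two prompted GPT-3 instances (the real "calls" are pasted prompts in AI Dungeon, not Python); only the control flow is the point:

```python
def head_instance(lst):
    # Stand-in for the GPT-3 instance prompted to return the head
    # of a list, along with the remainder to feed back in.
    return lst[0], lst[1:]

def sum_instance(total, item):
    # Stand-in for the GPT-3 instance prompted to keep a running total.
    return total + item

def amplified_sum(lst):
    # Alternate between the two instances until the list is exhausted.
    total = 0
    while lst:
        item, lst = head_instance(lst)
        total = sum_instance(total, item)
    return total

print(amplified_sum([3, 1, 4, 1, 5]))  # 14
```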
This works very well and is an improvement on what you get if you try to freeze the instance inside the portion of the factoring conversation where I ask. Summation is a bit harder. I need something like this,
However, we get some mistakes at first,
To be clear, it can do simple addition at this point; it simply struggles with addition of larger numbers. I put examples in the notebook. I leave the answer here, but mention that when I originally ran this I had,
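To make "carrying" concrete, this is the digit-level procedure the model is implicitly being asked to learn for larger additions, written out as ordinary Python (my illustration, not anything GPT-3 produced):

```python
def add_with_carry(a, b):
    # Add two non-negative integers digit by digit, least-significant
    # first, tracking the carry between columns.
    digits_a = [int(d) for d in str(a)][::-1]
    digits_b = [int(d) for d in str(b)][::-1]
    carry, out = 0, []
    for i in range(max(len(digits_a), len(digits_b))):
        da = digits_a[i] if i < len(digits_a) else 0
        db = digits_b[i] if i < len(digits_b) else 0
        carry, digit = divmod(da + db + carry, 10)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    return int("".join(reversed(out)))

print(add_with_carry(987, 456))  # 1443
```

Failing to "carry" means dropping the `carry` term when a column sums past 9, which is exactly the kind of mistake that shows up on large sums but not small ones.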
It almost learned to carry, and then it overfit. A full run, with two re-rolls on non-answers, using split-tab instances of GPT-3, looks like this,