Going by GPT-2's BPEs , and based on the encoder downloaded via OpenAI's script, there are 819 (single) tokens/embeddings that uniquely map to the numbers from 0-1000, 907 when going up to 10,000, and 912 up to 200,000 . These embeddings of course get preferentially fed into the model in order to maximize the number of characters in the context window and thereby leverage the statistical benefit of BPEs for language modeling. Which bears to mind that the above counts exclude numeric tokens that have a space at the beginning .
My point here being that, IIUC, for the language model to actually be able to manipulate individual digits, as well as pick up on the elementary operations of arithmetic (e.g. carry, shift, etc.), the expected number of unique tokens/embeddings might have to be limited to 10 – the base of the number system – when counting from 0 to the largest representable number .
 From the GPT-3 paper, it was noted:
This [GPT'3 performance on some other task] could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2 which was developed for an almost entirely English training dataset.
 More speculatively, I think that this limitation makes extrapolation on certain abilities (arithmetic, algebra, coding) quite difficult without knowing whether its BPE will be optimized for the manipulation of individual digits/characters if need be, and that this limits the generalizability of studies such as GPT-3 not being able to do math.
 For such tokens, there are a total 505 up to 1000. Like the other byte pairs, these may have been automatically mapped based on the distribution of n-grams in some statistical sample (and so easily overlooked).
Is it in AI's interest (a big assumption that is has interests at all, I know) to become so human-specific that it loses its ability to generalize?
There's an approach called learning the prior through imitative generalization, that seemed to me a promising way to address this problem. Most relevant quotes from that article:
We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to questions, even if it knows the best answer is different. If we only have access to these human-like answers to questions, that probably doesn’t give us enough information to supervise a superhuman model.What we’re going to call ‘Imitative Generalization’ is a possible way to narrow the gap between the things our model knows, and the questions we can train our model to answer honestly. It avoids the pathological generalisation by only using ML for IID tasks, and imitating the way humans generalize. This hopefully gives us answers that are more like ‘how a human would answer if they’d learnt from all the data the model has learnt from’. We supervise how the model does the transfer, to get the sort of generalisation we want.
We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to questions, even if it knows the best answer is different. If we only have access to these human-like answers to questions, that probably doesn’t give us enough information to supervise a superhuman model.
What we’re going to call ‘Imitative Generalization’ is a possible way to narrow the gap between the things our model knows, and the questions we can train our model to answer honestly. It avoids the pathological generalisation by only using ML for IID tasks, and imitating the way humans generalize. This hopefully gives us answers that are more like ‘how a human would answer if they’d learnt from all the data the model has learnt from’. We supervise how the model does the transfer, to get the sort of generalisation we want.
Although I don't agree with everything in this site, I found this cluster of knowledge related advice (learning abstractions) and the rest of the site (made by a LW'er IIRC) very interesting if not helpful thus far; it seems to have advocated that:
That's most of what I took away from the resources that the site offered.
Some disclaimers/reservations (strictly opinions) based on personal experiences, followed by some open questions:
Edited for clarity and to correct misinterpretations of central arguments.
This response is to consider (contra your arguments) the ways in which the transformer might be fundamentally different from the model of a NN that you may be thinking about, which is as a series of matrix multiplications of "fixed" weight matrices. This is the assumption that I will first try to undermine. In so doing, I might hopefully lay some groundwork for an explanatory framework for neural networks that have self-attention layers (for much later), or (better) inspire transparency efforts to be made by others, since I'm mainly writing this to provoke further thought.
However, I do hope to make some justifiable case below for transformers being able to scale in the limit to an AGI-like model (i.e. which was an emphatic “no” from you) because they do seem to be exhibiting the type of behavior (i.e. few-shot learning, out of distribution generalization) that we'd expect would scale to AGI sufficiently, if improvements in these respects were to continue further.
I see that you are already familiar with transformers, and I will reference this description of their architecture throughout.
Epistemic Status: What follows are currently incomplete, likely fatally flawed arguments that I may correct down the line.
3. Onto the counterargument:
Unfortunately, I wouldn’t know what is precisely happening in (a-d) that allows for systematic meta-learning to occur, in order for the key proposition:
First, for the reason mentioned above, I think the sample efficiency is bound to be dramatically worse for training a Transformer versus training a real generative-model-centric system. And this [sample inefficiency] makes it difficult or impossible for it to learn or create concepts that humans are not already using.
to be weakened substantially. I just think that meta-learning is indeed happening given the few-shot generalization to unseen tasks that was demonstrated, which only looks like it has something to do with the dynamic weight matrix behavior suggested by (a-d). However, I do not think that it's enough to show the dynamic weights mechanism described initially (is doing such and such contrastive learning), or to show that it's an overhaul from ordinary DNNs and therefore robustly solves the generative objective (even if that were the case). Someone would instead have to demonstrate that transformers are systematically performing meta-learning (hence out-of-distribution and few-shot generalization) on task T, which I think is worthwhile to investigate given what they have accomplished experimentally.
Granted, I do believe that more closely replicating cortical algorithms is important for efficiently scaling to AGI and for explainability (I've read On Intelligence, Surfing Uncertainty, and several of your articles). The question, then, is whether there are multiple viable paths to efficiently-scaled, safe AGI in the sense that we can functionally (though not necessarily explicitly) replicate those algorithms.