hogwash9

Ω010

Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.

AFAIK, the idea behind this objective function is quite similar to contrastive learning, where you have two networks (or equivalently two sets of parameters), and the goal is to maximize agreement for pairs of inputs to each network that have the same ground truth class/label (and conversely to maximize disagreement for pairs that differ). With that in mind, there are various papers (e.g.) that explore the possibility of "collapsed" solutions like the one you mentioned (where both networks learn the same mapping, such that there's less benefit to propagating any examples through two networks), which makes this something that we want to minimize. In practice, though, this has been found to occur rarely (cf. [1]).
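To make the "maximize agreement / maximize disagreement" framing concrete, here is a minimal sketch of an InfoNCE-style contrastive loss in plain Python. This is my own illustration, not the objective from the post under discussion: the embeddings, the temperature value, and the function names are all assumptions. The "collapsed" case below is one simple degenerate situation (every input gets the same embedding), used only to show why such solutions score poorly under this kind of loss.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss: low when the anchor agrees with the positive
    and disagrees with every negative."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-softmax of the positive entry (index 0).
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[0] - log_norm)

# Hypothetical embeddings, as if produced by the two networks/heads.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
good = contrastive_loss(anchor, positive, negatives)

# Degenerate case: every input maps to the same embedding, so the
# positive cannot be told apart from the negatives.
same = [1.0, 1.0]
collapsed = contrastive_loss(same, same, [same, same])
```

With negatives present, the degenerate case cannot do better than chance (a loss of log 3 for one positive and two negatives), whereas the well-separated embeddings get a much lower loss.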

Nonetheless, since reading Paul's statement about the problem of the instrumental model, I've been thinking about issues that might arise with the proposed solution, even though similar approaches (i.e. the contrastive training objective) have proven effective for robustness in general (e.g. against adversarial perturbations, data limited scenarios). If I were committed to this stance, I would agree somewhat with the desire to explore alternatives, and I have thought about the extent to which some sort of reconstruction loss could be introduced; this is where the goal might instead be to "maximize agreement" with a set of non-trivial observations/facts that are guaranteed to be more "objective" (somehow) than the original training data (one inspiration being that reconstruction losses in vision deep learning papers like this one often turn out to be good regularizers). So far I haven't had any promising proposals come to light for generative LM.

I am still holding onto the thought, given the remote possibility that all of my above assumptions are correct, and also because "generative models" might reflect the ideal approach to unsupervised learning, whereas "contrastive learning" is sometimes seen as a sort of compromise since (unlike generative models) it's amenable to limited compute [2].

20

...assume that the likelihood of a given simulation to be run is inversely correlated with the computational complexity of the simulation, in the space of all the simulation ever run. We can call the latter the Simplicity Assumption (SA)...

Isn't it possible that "simplicity" (according to one or more definitions thereof) need not care about the amount of raw computation required [0] to run any patch of simulation, nor about the volume of space it simulates? E.g. Occam's Razor's measure of 'simplicity' (for AI) gives some function of the description length of a program running on a (universal) computer, so as to predict its own future percepts [1].

Now consider a civilization simulation A that is simulating in detail our solar system and mocking the rest of the universe and a simulation B which is simulating in detail the whole milky way and mocking the rest. Simulating in detail the milky way is about harder, if we count the number of stars and black holes. According to the SA with linear scaling, being in simulation B is about less likely than being in A.

This particular example was what threw me off. In particular, we can presume that programs with shorter descriptions might better (i.e. more plausibly) simulate a complex system, and are more likely to be found by a computer/AI that iterates over possible programs, starting with the simplest one (like in Solomonoff Induction IIUC). This finds the shortest program that nonetheless *sufficiently describes* some observation sequence, which would not necessarily favor encoding special cases (i.e. "mocking") for costly things to simulate generally. Instead, mocking (since it optimizes for computational cost) might map to a different thing in Solomonoff, having the tradeoff of making the description more complex than the shortest possible one.
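As a toy illustration of "iterate over possible programs, starting with the simplest one", here is a sketch in Python. It is emphatically not Solomonoff induction (which is uncomputable and weights all consistent programs rather than picking one); the "programming language" is an assumption of mine in which a program is just a bit string that gets repeated. The point is only that the search is ordered by description length, not by how much output the program has to produce.

```python
from itertools import product

def run(program, length):
    """Toy interpreter: a 'program' is a bit string, and running it
    repeats that string until `length` symbols have been emitted."""
    reps = length // len(program) + 1
    return (program * reps)[:length]

def shortest_program(observations, max_len=16):
    """Enumerate programs from shortest to longest and return the first
    one whose output reproduces the observations exactly."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            program = "".join(bits)
            if run(program, len(observations)) == observations:
                return program
    return None

obs = "011011011011"
best = shortest_program(obs)   # the 3-bit rule "011"
```

The observation string itself is also a valid 12-bit "program" (a lookup table of special cases), but the enumeration reaches the 3-bit rule first. In this toy setting, encoding special cases costs description length, which is the tradeoff described above.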

For example, to simulate a human being acting within a nontrivial universe [2], one might hold that there must exist some mathematical structure that describes the human in all the ways we care about, in which case the runtime of their cognitive algorithms, etc. might have to be quite costly [3]. It might be more algorithmically probable, then, for such a human to be mapped to an algorithm built out of simple priors (e.g. laws of physics) instead of high-level code describing what the human does in various edge cases.

This isn't by any means a refutation of your argument, but rather just a thought provoker concerning the main premise of what the Simplicity Assumption should mean [4]. I agree with you and others that "simplicity" should be an organizing principle (that conditions one's priors over the types of possible universes). However, your post didn't coincide with my implicit definition of "simplicity".

[0] (and possibly the amount of computation it *seems* to require)

[1] While your post isn't about AI generated universes, predictions made by an AI might well generate viable simulations (which might then become part of the hypothesis space under consideration).

[2] Another prior holds that we don't appear to be privileged observers within our own universe; in a similar vein, neither might one (rationally?) hold that solipsism is a valid ontology over observers, etc.

[3] Admittedly, the example of accurately simulating one or more humans doesn't rule out the possibility that only the observations that people notice are the ones that are simulated (per your view), the rest being "mocked." On this topic, I can only defer to AI related discussions like this and here as to how one might begin to condition the probability space over types of (simulated) universes.

[4] Though I don't personally know of a good argument in favor of the Speed Prior if we're talking about inductive inference leading to simulations.

70

Going by GPT-2's BPEs [1], and based on the encoder downloaded via OpenAI's script, there are 819 (single) tokens/embeddings that uniquely map to the numbers from 0-1000, 907 when going up to 10,000, and 912 up to 200,000 [2]. These embeddings of course get preferentially fed into the model in order to maximize the number of characters in the context window and thereby leverage the statistical benefit of BPEs for language modeling. Note also that the above counts exclude numeric tokens that have a space at the beginning [3].
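For anyone who wants to reproduce this kind of count, the check is roughly the following sketch. The vocabulary here is a tiny made-up set, not the real GPT-2 encoder, and the function names are mine; the only GPT-2-specific assumption is that a leading space is written as `Ġ` in the vocabulary file, so with the real encoder loaded you would pass its keys in place of `toy_vocab`.

```python
def count_single_token_numbers(vocab, upper):
    """Count integers n in [0, upper] whose decimal string is itself a
    single entry in `vocab` (a set of token strings)."""
    return sum(1 for n in range(upper + 1) if str(n) in vocab)

def count_space_prefixed(vocab, upper, space_marker="Ġ"):
    """Same count, but for tokens that begin with the marker the
    vocabulary uses for a leading space."""
    return sum(1 for n in range(upper + 1) if space_marker + str(n) in vocab)

# Hypothetical toy vocabulary, NOT the real GPT-2 encoder.
toy_vocab = {"0", "1", "2", "7", "10", "42", "100", "2020",
             "Ġ1", "Ġ10", "Ġ2020"}

plain_small = count_single_token_numbers(toy_vocab, 1000)
plain_large = count_single_token_numbers(toy_vocab, 10000)
spaced = count_space_prefixed(toy_vocab, 1000)
```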

My point here being that, IIUC, for the language model to actually be able to manipulate individual digits, as well as pick up on the elementary operations of arithmetic (e.g. carry, shift, etc.), the expected number of unique tokens/embeddings might have to be limited to 10 – the base of the number system – when counting from 0 to the largest representable number [2].
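A rough sketch of why multi-digit tokens might get in the way: below, a toy longest-match segmenter (which is not GPT-2's actual merge procedure) is compared against plain digit-level tokenization. The vocabulary and function names are hypothetical. The thing to notice is that adding 1 to 999 changes the chunk boundaries entirely under the multi-digit vocabulary, whereas with digit tokens the carry is visible position by position.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match segmentation (NOT GPT-2's merge procedure)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

def digit_tokenize(text):
    """One token per digit, i.e. only 10 distinct numeric tokens."""
    return list(text)

# Hypothetical vocabulary: all single digits plus a few multi-digit chunks.
toy_vocab = set("0123456789") | {"12", "123", "99", "100"}

chunks_before = greedy_tokenize("999", toy_vocab)    # ['99', '9']
chunks_after = greedy_tokenize("1000", toy_vocab)    # ['100', '0']
digits_before = digit_tokenize("999")                # ['9', '9', '9']
digits_after = digit_tokenize("1000")                # ['1', '0', '0', '0']
```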

[1] From the GPT-3 paper, it was noted:

This [GPT-3 performance on some other task] could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2 which was developed for an almost entirely English training dataset.

[2] More speculatively, I think that this limitation makes extrapolation on certain abilities (arithmetic, algebra, coding) quite difficult without knowing whether its BPE will be optimized for the manipulation of individual digits/characters if need be, and that this limits the generalizability of studies such as GPT-3 not being able to do math.

[3] For such tokens, there are a total 505 up to 1000. Like the other byte pairs, these may have been automatically mapped based on the distribution of n-grams in some statistical sample (and so easily overlooked).

10

Is it in AI's interest (a big assumption that is has interests at all, I know) to become so human-specific that it loses its ability to generalize?

There's an approach called learning the prior through imitative generalization, that seemed to me a promising way to address this problem. Most relevant quotes from that article:

We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to questions, even if it knows the best answer is different. If we only have access to these human-like answers to questions, that probably doesn’t give us enough information to supervise a superhuman model.

What we’re going to call ‘Imitative Generalization’ is a possible way to narrow the gap between the things our model knows, and the questions we can train our model to answer honestly. It avoids the pathological generalisation by only using ML for IID tasks, and imitating the way humans generalize. This hopefully gives us answers that are more like ‘how a human would answer if they’d learnt from all the data the model has learnt from’. We supervise how the model does the transfer, to get the sort of generalisation we want.

30

Although I don't agree with everything in this site, I found this cluster of knowledge related advice (learning abstractions) and the rest of the site (made by a LW'er IIRC) very interesting if not helpful thus far; it seems to have advocated that:

- Forced learning/too fast pacing (cramming) can be counterproductive since you're no longer learning for the sake of learning (*mostly* true in my experience).
- Abstract knowledge (math) tends to be most useful since it can be applied fruitfully. And you can actually readily use those abstractions for practical things, through honing intuitions about how to approach a lot of technical problems, mainly by mapping subproblems to mathematical abstractions. Those problems (coding/calculation) are made harder to forget how to solve.
- Being curiosity driven is instrumentally useful (since it does help with future learning, delaying aging, etc.), and is of course rational.
- Spaced repetition seems to work well for math and algorithms and is self-reinforcing if done in a curiosity driven approach. However, instead of using specific software to "gamify" this, I personally just recall certain key principles in my head, ask myself the motivations behind certain concepts, and keep a list of summarized points/derivations/copied diagrams around in a simple Notes document to review things "offline". (But I'll need to check out Anki sometime.)

That's most of what I took away from the resources that the site offered.

Some disclaimers/reservations (**strictly opinions**) based on personal experiences, followed by some **open questions**:

- I don't think the "forgetting curve" is as important as the site makes it sound, particularly when it comes to abstractions, but this curve might have been about "general" knowledge, i.e. learning facts in general. The situation with abstract knowledge seems to be the opposite.
- Hence, forgetting might not be as "precious" with abstractions, and might in fact impair ability to learn in the future. Abstractions, including lessons in rationality, are (IMO) meant to help with learning, not always for communicating/framing concepts.
- It might require a fair bit of object level experiences (recallable from long term memory) to integrate abstract knowledge meaningfully and efficiently. Otherwise that knowledge isn't grounded in experience, and we know that that's just as disadvantageous for humans as AI.
- **Q1:** It remains unclear whether there exists a broader applicable scope here (in terms of other ways that knowledge itself can be used to build competence) except by honing rationality, Bayesianism, and general mathematical knowledge. Would it make sense if there was or wasn't?
- **Q2:** It seems important to be able to figure out (on a self-supervised, intuitive level) when a learned abstraction is interfering with learning something new or being competent, in the sense that one has to detect whether it is being misapplied or is complicating the representation of knowledge more so than simplifying. Appropriate & deep knowledge of the motivations behind abstractions, their situations, and invariances would seem to help at first glance, in addition to prioritizing first-principles priors when approaching a problem instead of rigid assumptions.
- **Q3:** Doing this may not suit everyone who isn't a student or full-time autodidact (and reads textbooks for fun, *and* has a technical background). Also, I haven't come across an example of someone who prolonged their useful career, earned millions of dollars, etc., as a provable result of abstraction. Conversely, practitioners develop a lot of skills that directly help within a specialized economy. There still remain very obvious reasons to condense a whole bunch of mathy (and *some* computer-sciency) abstractions as flashcards and whatnot to save time.

50

**Edited for clarity and to correct misinterpretations of central arguments.**

This response is to consider (contra your arguments) the ways in which the transformer might be fundamentally different from the model of a NN that you may be thinking about, which is as a series of matrix multiplications of "fixed" weight matrices. This is the assumption that I will first try to undermine. In so doing, I might hopefully lay some groundwork for an explanatory framework for neural networks that have self-attention layers (for much later), or (better) inspire transparency efforts to be made by others, since I'm mainly writing this to provoke further thought.

However, I do hope to make some justifiable case below for transformers being able to scale in the limit to an AGI-like model (which was an emphatic “no” from you), because they do seem to be exhibiting the type of behavior (i.e. few-shot learning, out of distribution generalization) that we'd expect to scale sufficiently to AGI if improvements in these respects were to continue.

I see that you are already familiar with transformers, and I will reference this description of their architecture throughout.

**Epistemic Status:** What follows are currently incomplete, likely fatally flawed arguments that I may correct down the line.

**1. Caveats: It's reasonable to dismiss transformers/GPT-N as falling into the same general class of fully connected architectures in the sense that:**

- They're data hungry, like most DNNs, at least during the pre-training phase.
- They're not explicitly replicating neocortical algorithms (such as bottom-up feedback on model predictions) that we know are important for systematic generalization.
- They have extraneous inductive biases besides those in the neocortex, which hinder efficiency.
- Some closer approximation of the neocortex, such as hierarchical temporal memory, is necessary for efficiently scaling to AGI.

**2. Looking closer: How Transformers Depart "Functionally" from Most DNNs**

- Across two weight matrices of a fully-connected DNN, we see something like:
  - σ(Ax)^T * B^T, for some vector x and hidden layers {A, B}, which gives just another vector of activations, where σ is an element-wise activation function.
  - These activations are "dynamic". But I think you would be right to say that they do not in any sense modify the weights applied to activations downstream; however, this was a behavior you implied was missing in the transformer that one might find in the neocortex.
- In a Transformer self-attention matrix (A=QK^T), though, we see a dynamic weight matrix:
  - **(Skippable)** As such, the values (inner products) of this matrix are contrastive similarity scores between each (k∈K, q∈Q) vector pair.
  - **(Skippable)** Furthermore, matrix A consists of n*n inner products, where each row A_i of A has entries mapped to an ordered set S_i = {<q_i,k_1>..<q_i,k_n> | q_i∈Q, k_j∈K, i<=n}.
  - Crucially, the rows of softmax(A), computed from the rows S_i of A, are coefficients of a convex combination of the rows of the value matrix (V) when taking softmax(A)V. This is in order to compute the output (matrix) of a self-attention layer, which is also different from the kind of matrix multiplication that was performed when computing similarity matrix A.
  - Note: The softmax(A) function in this case is also normalizing the values of matrix A row-wise, not column-wise, matrix-wise, etc.
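To make the "dynamic weight matrix" point concrete, here is a minimal single-head self-attention sketch in plain Python. It omits scaling, masking, and multiple heads, and the identity projection matrices and the two inputs are assumptions chosen only to keep the arithmetic readable. It shows that softmax(A) is normalized row-wise (so each output row is a convex combination of rows of V), and that this mixing matrix changes with the input even though the learned projections stay fixed.

```python
import math

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def softmax_rows(A):
    """Row-wise softmax: every output row is non-negative and sums to 1."""
    out = []
    for row in A:
        m = max(row)
        exps = [math.exp(a - m) for a in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention without scaling or masking."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    A = matmul(Q, transpose(K))   # A[i][j] = <q_i, k_j>
    P = softmax_rows(A)           # input-dependent mixing weights
    return P, matmul(P, V)        # each output row mixes the rows of V

# Fixed "learned" projections; identity matrices here for readability.
I2 = [[1.0, 0.0], [0.0, 1.0]]
X1 = [[1.0, 0.0], [0.0, 1.0]]   # two different tokens
X2 = [[1.0, 0.0], [1.0, 0.0]]   # the same token twice

P1, out1 = self_attention(X1, I2, I2, I2)
P2, out2 = self_attention(X2, I2, I2, I2)
```

With the same `Wq`, `Wk`, `Wv`, the mixing matrix `P1` favors the diagonal while `P2` is uniform. In a fully connected layer, by contrast, the matrix applied to the activations is the same for every input.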

**3. Onto the counterargument**:

- Given what was described in (2), I mainly argue that softmax(A=QK^T)V is a very different computation from the kind that fully connected neural networks perform and what you may've envisioned.
- We specifically see that:
  - Ordinary, fully connected (as well as convolutional, most recurrent) neural nets don’t generate "dynamic" weights that are then applied to any activations.
  - It is possible that the transformer explicitly conditions a self-attention matrix **A_l** at layer l such that the downstream layers l+1..L (given L self-attention layers) are more likely to produce the correct embedding tokens. This is *because* we’re only giving the transformer L layers to compute the final result.
  - Regardless of whether the above is happening, the transformer is being “guided” to do implicit meta-learning as part of its pre-training, because:
    - (a) It’s conditioning its weight matrices (A_l) on the given context (X_l) to maximize the probability of the correct autoregressive output in X_L, in a different manner from learning an ordinary, hierarchical representation upstream.
    - (b) As it improves the conditioning described in (a) during pre-training, it gets closer to optimal performance on some downstream, unseen task (via 0-shot learning). This is assumed to be true on an evidential basis.
    - (c) I argue that such zero-shot learning on an unseen task T requires online-learning on that task, which is being described in the given context.
    - (d) Further, I argue that this online-learning improves sample-efficiency when doing gradient updates on an unseen task T, by approximately recognizing a similar task T’ given the information in the context (X_1). Sample efficiency is improved because the training loss on T can be determined by its few-shot performance on T, which is related to few-shot accuracy, and because training-steps-to-convergence is directly related to training loss.
  - So, when actually performing such updates, a better few-shot learner will take fewer training steps. Crucially, it improves the sample efficiency of its future training in not just a “prosaic” manner of having improved its held-out test accuracy, but through (a-c) where it “learns to adapt” to an unseen task (somehow).

Unfortunately, I wouldn’t know what is precisely happening in (a-d) that allows for systematic meta-learning to occur, in order for the key proposition:

First, for the reason mentioned above, I think the sample efficiency is bound to be dramatically worse for training a Transformer versus training a real generative-model-centric system. And this [sample inefficiency] makes it difficult or impossible for it to learn or create concepts that humans are not already using.

to be weakened substantially. I just think that meta-learning is indeed happening given the few-shot generalization to unseen tasks that was demonstrated, which only *looks like* it has something to do with the dynamic weight matrix behavior *suggested* by (a-d). However, I do not think that it's enough to show the dynamic weights mechanism described initially (is doing such and such contrastive learning), or to show that it's an overhaul from ordinary DNNs and therefore robustly solves the generative objective (even if that were the case). Someone would instead have to demonstrate that transformers are systematically performing meta-learning (hence out-of-distribution and few-shot generalization) on task T, which I think is worthwhile to investigate given what they have accomplished experimentally.

Granted, I do believe that more closely replicating cortical algorithms is important for efficiently scaling to AGI and for explainability (I've read *On Intelligence*, *Surfing Uncertainty*, and several of your articles). The question, then, is whether there are multiple viable paths to efficiently-scaled, safe AGI in the sense that we can functionally (though not necessarily explicitly) replicate those algorithms.

It makes sense that negative pairs would help to a large extent, but not all contrastive papers used negative examples, like BYOL (ref). Edit: but now I'm realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.