Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.

AFAIK, I always imagined the idea behind this objective function to be quite similar to contrastive learning, where you have two networks (or equivalently two sets of parameters), and the goal is to maximize agreement for pairs of inputs to each network that have the same ground truth class/label (conversely maximize disagreement for pairs...

22y

I haven't read the paper, but in contrastive learning, aren't these solutions prevented by the negative examples?

3y2

...assume that the

likelihood of a given simulation to be run is inversely correlated with the computational complexity of the simulation, in the space of all the simulation ever run. We can call the latter the Simplicity Assumption (SA)...

Isn't it possible that "simplicity" (according to one or more definitions thereof) need not care about the amount of raw computation required [0] to run any patch of simulation, nor with the volume of space it simulates? E.g. Occam's Razor's measure of 'simplicity' (for AI) gives some function of the description length o...

23y

My view is that Kolmogorov is the right simplicity measure for probabilistically or brute force generated universes, as you also mention. But for intentionally generated universes the length and elegance of the program is not that relevant in determining how likely is a simulation to be run, while computational power and memory are hard constraints that the simulators must face.
For instance while I would expect unnecessary long programs to be unlikely to be run, if a long program L is 2x more efficient than a shorter program S, then I expect L to be more likely (many more simulators can afford L, cheaper to run in bulk, etc.).

Going by GPT-2's BPEs [1], and based on the encoder downloaded via OpenAI's script, there are 819 (single) tokens/embeddings that uniquely map to the numbers from 0-1000, 907 when going up to 10,000, and 912 up to 200,000 [2]. These embeddings of course get preferentially fed into the model in order to maximize the number of characters in the context window and thereby leverage the statistical benefit of BPEs for language modeling. Which bears to mind that the above counts exclude numeric tokens that have a space at the beginning [3].

My point h...

Is it in AI's interest (a big assumption that is has interests at all, I know) to become so human-specific that it loses its ability to generalize?

There's an approach called learning the prior through imitative generalization, that seemed to me a promising way to address this problem. Most relevant quotes from that article:

...We might hope that our models will naturally generalize correctly from easy-to-answer questions to the ones that we care about. However, a natural pathological generalisation is for our models to only give us ‘human-like’ answers to ques

3y3

Although I don't agree with everything in this site, I found this cluster of knowledge related advice (learning abstractions) and the rest of the site (made by a LW'er IIRC) very interesting if not helpful thus far; it seems to have advocated that:

- Forced learning/too fast pacing (cramming) can be counterproductive since you're no longer learning for the sake of learning (
*mostly*true in my experience). - Abstract knowledge (math) tends to be most useful since it can be applied fruitfully. And you can actually readily use those abstractions for practical things

3y5

**Edited for clarity and to correct misinterpretations of central arguments.**

This response is to consider (contra your arguments) the ways in which the transformer might be fundamentally different from the model of a NN that you may be thinking about, which is as a series of matrix multiplications of "fixed" weight matrices. This is the assumption that I will first try to undermine. In so doing, I might hopefully lay some groundwork for an explanatory framework for neural networks that have self-attention layers (for much later), or (better) inspire transpare...

33y

I don't feel emphatic about it. Well, I have a model in my head and within that model a transformer can't scale to AGI, and I was describing that model here, but (1) I'm uncertain that that model is the right way to think about things, (2) even if it is, I don't have high confidence that I'm properly situating transformers within that model†, (3) even if I am, there is a whole universe of ways to take a Transformer architecture and tweak / augment it—like hook it up to a random access memory or tree search or any other data structure or algorithm, or give it more recurrency, or who knows what else—and I haven't thought through all those possibilities and would not be shocked if somewhere in that space was a way to fill in what I see as the gaps.
† The paper relating Hopfield networks to transformers came out shortly after I posted this, and seems relevant to evaluating my idea that transformer networks are imitating some aspects of probabilistic programming / PGM inference, but I'm not sure, I haven't really digested it.
I'm confused about how you're using the terms "online learning" and "meta-learning" here.
I generally understand "online learning" in the sense of this, where you're editing the model weights during deployment by doing gradient descent steps for each new piece of labeled data you get. If you're generating text with GPT-3, then there's no labeled data to update on, and the weights are fixed, so it's not online learning by definition. I guess you have something else in mind; can you explain?
I generally understand "meta-learning" to mean that there's an inner loop that has a learning algorithm, and then there's an outer loop with a learning algorithm that impacts the inner loop. I guess you could say that the 96 transformer layers involved in each word-inference is the inner loop. Is it really a learning algorithm though? It doesn't look like a learning algorithm. I mean, it certainly figures things out over the course of those 96 processing steps

It makes sense that negative pairs would help to a large extent, but not all contrastive papers used negative examples, like BYOL (ref). Edit: but now I'm realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.