Same person as <a href="https://www.lesserwrong.com/users/nostalgebraist2point0">nostalgebraist2point0</a>, but now I have my account back.
I should say first that I completely agree with you about the extreme data inefficiency of many systems that get enthusiastically labeled "AI" these days -- it is a big problem which calls into question many claims about these systems and their displays of "intelligence."
Especially a few years ago (the field has been getting better about this over time), there was a tendency to define performance with reference to some set collection of tasks similar to the training task without acknowledging that broader generalization capacity, and generalization speed in terms of "number of data points needed to learn the general rule," are key components of any intuitive/familiar notion of intelligence. I've written about this in a few places, like the last few sections of this post, where I talk about the "strange simpletons."
However, it's not clear to me that this limitation is inherent to neural nets or to "AI" in the way you seem to be saying. You write:
Comparing AI to human neurology is off the mark in my estimation, because AIs don't really learn rules. They can predict outcomes (within a narrow context), but the AI has no awareness of the actual "rules" that are leading to that outcome - all it knows is weights and likelihoods.
If I understand you correctly, you're taking a position that Marcus argued against in The Algebraic Mind. I'm taking Marcus' arguments there largely as a given in this post, because I agree with them and because I was interested specifically in the way Marcus' Algebraic Mind arguments cut against Marcus' own views about deep learning today.
If you want to question the Algebraic Mind stuff itself, that's fine, but if so you're disagreeing with both me and Marcus more fundamentally than (I think) Marcus and I disagree with one another, and you'll need a more fleshed-out argument if you want to bridge a gulf of this size.
The appearance of a disagreement in this thread seems to hinge on an ambiguity in the phrase "word choice."
If "word choice" just means something narrow like "selecting which noun you want to use, given that you are picking the inhabitant of a 'slot' in a noun phrase within a structured sentence and have a rough idea of what concept you want to convey," then perhaps priming and other results about perceptions of "word similarity" might tell us something about how it is done. But no one ever thought that kind of word choice could scale up to full linguistic fluency, since you need some other process to provide the syntactic context. The idea that syntax can be eliminatively reduced to similarity-based choices on the word level is a radical rejection of linguistic orthodoxy. Nor does anyone (as far as I'm aware) believe GPT-2 works like this.
If "word choice" means something bigger that encompasses syntax, then priming experiments about single words don't tell us much about it.
I do take the point that style as such might be a matter of the first, narrow kind of word choice, in which case GPT-2's stylistic fluency is less surprising than its syntactic fluency. In fact, I think that's true -- intellectually, I am more impressed by the syntax than the style.
But the conjunction of the two impresses me to an extent greater than the sum of its parts. Occam's Razor would have us prefer one mechanism to two when we can get away with it, so if we used to think two phenomena required very different mechanisms, a model that gets both using one mechanism should make us sit up and pay attention.
It's more a priori plausible that all the distinctive things about language are products of a small number of facts about brain architecture, perhaps adapted to do only some of them with the rest arising as spandrels/epiphenomena -- as opposed to needing N architectural facts to explain N distinctive things, with none of them yielding predictive fruit beyond the one thing it was proposed to explain. So, even if we already had a (sketch of a) model of style that felt conceptually akin to a neural net, the fact that we can get good style "for free" out of a model that also does good syntax (or, if you prefer, good syntax "for free" out of a model that also does good style) suggests we might be scientifically on the right track.
For the full argument from Marcus, read the parts about "training independence" in The Algebraic Mind ch. 2, or in the paper it draws from, "Rethinking Eliminative Connectionism."
The gist is really simple, though. First, note that if some input node is always zero during training, that's equivalent to it not being there at all: their contribution to the input of any node in the first hidden layer is the relevant weight times zero, which is zero. Likewise, the gradient of anything w/r/t these weights is zero (because you'll always multiply by zero when doing the chain rule), so they'll never get updated from their initial values.
Then observe that, if the nodes are any nonzero constant value during training, the connections add a constant to the first hidden layer inputs instead of zero. But we already have a parameter for an additive constant in a hidden layer input: the "bias." So if the input node is supposed to carry some information, the network still can't learn what it is; it just thinks it's updating the bias. (Indeed, you can go the other way and rewrite the bias as an extra input node that's always constant, or as N such nodes.)
The argument for constant outputs is even simpler: the network will just set the weights and bias to something that always yields the right constant. For example, it'd work to set the weights to zero and the bias to f−1(c) where f is the activation function and c is the constant. If the output has any relationship to the input then this is wrong, but the training data plus the update rule give you no reason to reject it.
None of this is controversial and it does indeed become obvious once you think about it enough; this kind of idea is much of the rationale for weight sharing, which sets the weights for constant input nodes using patterns learned from non-constant ones rather than randomly/arbitrarily.
Hmm... I think you are technically right, since "compositionality" is typically defined as a property of the way phrases/sentences/etc. in a language relate to their meanings. Since language modeling is a task defined in terms of words, without involving their referents at all, GPT-2 indeed does not model/exhibit this property of the way languages mean things.
But the same applies identically to every property of the way languages mean things! So if this is really the argument, there's no reason to focus specifically on "compositionality." On the one hand, we would never expect to get compositionality out of any language model, whether a "deep learning" model or some other kind. On the other hand, the argument would fail for any deep learning model that has to connect words with their referents, like one of those models that writes captions for images.
If we read the passage I quoted from 2019!Marcus in this way, it's a trivially true point about GPT-2 that he immediately generalizes to a trivially false point about deep learning. I think when I originally read the passage, I just assumed he couldn't possibly mean this, and jumped to another interpretation: he's saying that deep learning lacks the capacity for structured representations, which would imply an inability to model compositionality even when needed (e.g. when doing image captioning as opposed to language modeling).
Fittingly, when he goes on to describe the problem, it doesn't sound like he's talking about meaning but about having flat rather than hierarchical representations:
Surprisingly, deep learning doesn’t really have any direct way of handling compositionality; it just has information about lots and lots of complex correlations, without any structure.
In The Algebraic Mind, Marcus critiqued some connectionist models on the grounds that they cannot support "structured representations." Chapter 4 of the book is called "Structured Representations" and is all about this, mostly focused on meaning (he talks a lot about "structured knowledge") but not at all tied to meaning specifically. Syntax and semantics are treated as equally in need of hierarchical representations, equally impossible without them, and equally possible with them.
Unlike the point about meaning and language models, this is a good and nontrivial argument that actually works against some neural nets once proposed as models of syntax or knowledge. So when 2019!Marcus wrote about "compositionality," I assumed that he was making this argument, again, about GPT-2. In that case, GPT-2's proficiency with syntax alone is a relevant datum, because Marcus and conventional linguists believe that syntax alone requires structured representations (as against some of the connectionists, who didn't).
In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don't think of removing inductive biases as representational advances - or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we're doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).
I think it's misleading to view "amount of inductive bias" as a one-dimensional scale, with the transformer somewhere "between" CNNs and MLPs. As I said in that post, the move from vanilla MLPs to CNNs involves the introduction of two kinds of constraints/biases at once -- weight sharing between positions, and locality -- and these are two very different things, not just two (perhaps differently sized) injections of "more bias" on our hypothetical 1D bias scale.
For example, locality without weight sharing is certainly conceivable (I can't remember if I've seen it before), but I'd imagine it would do very poorly on text data, because it relaxes the CNN constraint that's appropriate for text while keeping the one that's inappropriate. If you compare that to the transformer, you've got two different ways of relaxing the CNN biases, but one works better and one (I would imagine) works worse. This shows that a given architecture's representational aptness for a given domain isn't just a function of some 1D "amount of inductive bias" in conjunction with data/compute volume; the specific nature of the biases and the domain matter too.
As as sidenote, most pre-transformer SOTA architectures for text were RNNs, not CNNs. So, having argued above that "moving to a superset" shouldn't be simplified to "reducing some 1D 'bias' variable," I'd also say that "moving to a superset" isn't what happened anyway.
Concretely, I'd predict with ~80% confidence that within 3 years, we'll be able to achieve comparable performance to our current best language models without using transformers - say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?
Disagree. Not that this seems deeply impossible or anything, but it's exactly what people were trying to do for many years before the introduction of the transformer; a lot of work has already gone into this, and now there's less incentive to do it.
On the general topic of transformer vs. CNN/LSTM, as well as the specific topic of my OP, I found the paper linked by steve2152 very interesting.
Thanks, the floor/ceiling distinction is helpful.
I think "ceilings as they exist in reality" is my main interest in this post. Specifically, I'm interested in the following:
In more detail: it's one thing to be able to assess quick heuristics, and it's another (and better) one to be able to assess quick heuristics quickly. It's possible (maybe) to imagine a convenient situation where the theory of each "speed class" among fast decisions is compressible enough to distill down to something which can be run in that speed class and still provide useful guidance. In this case there's a possibility for the theory to tell us why our behavior as a whole is justified, by explaining how our choices are "about as good as can be hoped for" during necessarily fast/simple activity that can't possibly meet our more powerful and familiar notions of decision rationality.
However, if we can't do this, it seems like we face an exploding backlog of justification needs: every application of a fast heuristic now requires a slow justification pass, but we're constantly applying fast heuristics and there's no room for the slow pass to catch up. So maybe a stronger agent could justify what we do, but we couldn't.
I expect helpful theories here to involve distilling-into-fast-enough-rules on a fundamental level, so that "an impractically slow but working version of the theory" is actually a contradiction in terms.
I don't understand Thing #1. Perhaps, in the passage you quote from my post, the phrase "decision procedure" sounds misleadingly generic, as if I have some single function I use to make all my decisions (big and small) and we are talking about modifications to that function.
(I don't think that is really possible: if the function is sophisticated enough to actually work in general, it must have a lot of internal sub-structure, and the smaller things it does inside itself could be treated as "decisions" that aren't being made using the whole function, which contradicts the original premise.)
Instead, I'm just talking about the ordinary sort of case where you shift some resources away from doing X to thinking about better ways to do X, where X isn't the whole of everything you do.
Re: Q/A/A1, I guess I agree that these things are (as best I can tell) inevitably pragmatic. And that, as EY says in the post you link, "I'm managing the recursion to the best of my ability" can mean something better than just "I work on exactly N levels and then my decisions at level N+1 are utterly arbitrary." But then this seems to threaten the Embedded Agency programme, because it would mean we can't make theoretically grounded assessments or comparisons involving agents as strong as ourselves or stronger.
(The discussion of self-justification in this post was originally motivated by the topic of external assessment, on the premise that if we are powerful enough to assess a proposed AGI in a given way, it must also be powerful enough to assess itself in that way. And contrapositively, if the AGI can't assess itself in a given way then we can't assess it in that way either.)
I don't see how (i) follows? The advantage of (internal) tree search during training is precisely that it constrains you to respond sensibly to situations that are normally very rare (but are easily analyzable once they come up), e.g. "cheap win" strategies that are easily defeated by serious players and hence never come up in serious play.
It's not really about doing well/better in all domains, it's about being able to explain how you can do well at all of the things you do, even if that isn't nearly everything. And making that explanation complete enough to be convincing, as an argument about the real world assessed using your usual standards, while still keeping it limited enough to avoid self-reference problems.
IIUC the distinction being made is about the training data, granted the assumption that you may be able to distill tree-search-like abilities into a standard NN with supervised learning if you have samples from tree search available as supervision targets in the first place.
AGZ was hooked up to a tree search in its training procedure, so its training signal allowed it to learn not just from the game trees it "really experienced" during self-play episodes but also (in a less direct way) from the much larger pool of game trees it "imagined" while searching for its next move during those same episodes. The former is always (definitionally) available in self-play, but the latter is only available if tree search is feasible.