I’m a bit confused about what you’re proposing. AlphaZero has an input (board state) and an output (move). Are you proposing to call this input-output function “a policy”?
If so, sure we can say that, but I think people would find it confusing—because there’s a tree search in between the input and output, and one ingredient of the tree search is the “policy network” (or maybe just “policy head”, I forget), but here the relation between the “policy network” and the final input-output function is very indirect, such that it seems odd to use (almost) the same term for them.
In my head, a policy is just a situation-dependent way of acting. Sometimes that way of acting makes use of foresight, sometimes that way of acting is purely reflexive. I mentally file the AlphaZero policy network + tree search combination as a "policy", one separate from the "reactive policy" defined by just using the policy network without tree search. Looking back at Sutton & Barto, they define "policy" similarly:
A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action.
(emphasis mine) along with this later description of planning in a model-based RL context:
The word planning is used in several different ways in different fields. We use the term to refer to any computational process that takes a model as input and produces or improves a policy for interacting with the modeled environment
which seems compatible with thinking of planning algorithms like MCTS as components of an improved policy at runtime (not just in training).
That being said, looking at the AlphaZero paper, a quick search did not turn up usages of the term "policy" in this way. So maybe this usage is less widespread than I had assumed.
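To make the distinction concrete, here is a minimal sketch of the two senses of "policy" discussed above. Everything in it is a toy assumption and not taken from the AlphaZero paper: the game is a Nim-like pile (take 1 to 3 stones, taking the last stone wins), `prior` is a hand-written stand-in for a learned policy head, and the lookahead is exhaustive search, not MCTS. The point is only that both functions have the same state-to-action type, even though one is reflexive and the other involves search at runtime.

```python
from functools import lru_cache
from typing import Callable, Dict, List

State = int    # toy state: stones left in the pile
Action = int   # stones to take: 1, 2, or 3
Policy = Callable[[State], Action]  # Sutton & Barto-style: states -> actions


def legal_actions(state: State) -> List[Action]:
    return [a for a in (1, 2, 3) if a <= state]


def prior(state: State) -> Dict[Action, float]:
    """Hypothetical stand-in for a policy head: prefers taking more stones."""
    acts = legal_actions(state)
    total = sum(acts)
    return {a: a / total for a in acts}


def reactive_policy(state: State) -> Action:
    """Purely reflexive: pick the prior's favorite move, no lookahead."""
    p = prior(state)
    return max(p, key=p.get)


@lru_cache(maxsize=None)
def wins(state: State) -> bool:
    """True if the player to move can force a win from this state."""
    return any(not wins(state - a) for a in legal_actions(state))


def search_policy(state: State) -> Action:
    """Prior plus lookahead: prefer moves that leave the opponent lost,
    and fall back on the prior to break ties or when no move wins."""
    p = prior(state)
    winning = [a for a in legal_actions(state) if not wins(state - a)]
    return max(winning or legal_actions(state), key=p.get)


# Both are "policies" in the broad sense, but they can disagree.
policies: List[Policy] = [reactive_policy, search_policy]
```

With 5 stones left, the reactive policy greedily takes 3, while the search policy takes 1 and leaves the opponent in a losing position.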
I think requiring a "common initialization + early training trajectory" is a pretty huge obstacle to knowledge sharing, and would de-facto make knowledge sharing among the vast majority of large language models infeasible.
Agreed. That part of my comment was aimed only at the claim about weight averaging only working for diffusion/image models, not about knowledge sharing more generally.
I do think stuff like stitching via cross-attention is kind of interesting, but it feels like a non-scalable way of knowledge sharing, unless I am misunderstanding how it works.
Not sure I see any particular argument against the scalability of knowledge exchange between LLMs in general or via cross-attention, though. Especially if we're comparing the cost of transfer to the cost of re-running the original training. That's why people are exploring this, especially smaller/independent researchers. There's a bunch of concurrent recent efforts to take frozen unimodal models and stitch them into multimodal ones (example from a few days ago https://arxiv.org/abs/2305.17216). Heck, the dominant approach in the community of LLM hobbyists seems to be transferring behaviors and knowledge from GPT-4 into LLaMa variants via targeted synthetic data generation. What kind of scalability are you thinking of?
The part where you can average weights is unique to diffusion models, as far as I can tell, which makes sense because the 2-d structure of the images is very local, and so this establishes a strong preferred basis for the representations of different networks.
Exchanging knowledge between two language models currently seems approximately impossible? Like, you can train on the outputs, but I don't think there is really any way for two language models to learn from each other by exchanging any kind of cognitive content, or to improve the internal representations of a language model by giving it access to the internal representations of another language model.
There's a pretty rich literature on this stuff, transferring representational/functional content between neural networks.
Averaging weights to transfer knowledge is not unique to diffusion models. It works on image models trained with non-diffusion setups (https://arxiv.org/abs/2203.05482, https://arxiv.org/abs/2304.03094) as well as on non-image tasks such as language modeling (https://arxiv.org/abs/2208.03306, https://arxiv.org/abs/2212.04089). Exchanging knowledge between language models via weight averaging is possible provided that the models share a common initialization + early training trajectory. And if you allow for more methods than weight averaging, simple stuff like Knowledge Distillation or stitching via cross-attention (https://arxiv.org/abs/2106.13884) are tricks known to work for transferring knowledge.
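For concreteness, the weight-averaging operation itself is very simple. Below is a rough sketch, assuming each model is represented as a hypothetical flat dict mapping parameter names to lists of floats (a stand-in for a real framework's state dict). The function name and representation are illustrative, not any library's API. The code only performs the interpolation; whether the merged model is any good depends on the shared initialization + early training trajectory condition mentioned above.

```python
from typing import Dict, List, Optional, Sequence

Params = Dict[str, List[float]]  # hypothetical: parameter name -> flattened weights


def average_weights(models: Sequence[Params],
                    coeffs: Optional[Sequence[float]] = None) -> Params:
    """Weighted average of parameters across models with identical architectures.

    With coeffs=None this is a uniform average of all the models.
    """
    if not models:
        raise ValueError("need at least one model")
    if coeffs is None:
        coeffs = [1.0 / len(models)] * len(models)
    if len(coeffs) != len(models):
        raise ValueError("one coefficient per model is required")

    names = models[0].keys()
    merged: Params = {}
    for m in models:
        if m.keys() != names:
            raise ValueError("models must share the same parameter names")
    for name in names:
        size = len(models[0][name])
        if any(len(m[name]) != size for m in models):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Elementwise weighted sum over the models.
        merged[name] = [
            sum(c * m[name][i] for c, m in zip(coeffs, models))
            for i in range(size)
        ]
    return merged


# Two toy "fine-tunes" that are assumed to descend from one shared checkpoint.
model_a: Params = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
model_b: Params = {"layer.weight": [3.0, 5.0], "layer.bias": [2.0]}
soup = average_weights([model_a, model_b])
```

Passing non-uniform `coeffs` gives a linear interpolation between checkpoints instead of a plain average.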
I wonder if this is related to how GPT-J runs the attention and MLP sublayers in parallel, as opposed to sequentially?
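For anyone unfamiliar with the difference, here is a schematic sketch of the two block layouts. The sublayers are toy stand-in functions on plain lists (nothing here is real attention or a real MLP), and the names are made up for illustration. In the sequential layout the MLP sees the residual stream after attention has written to it; in the parallel layout, as I understand GPT-J's design, both sublayers read the same normalized input and their outputs are summed into the residual.

```python
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def sequential_block(x: Vector, attn: Sublayer, mlp: Sublayer,
                     norm1: Sublayer, norm2: Sublayer) -> Vector:
    """GPT-2-style layout: the MLP reads the stream after attention's update."""
    x = add(x, attn(norm1(x)))
    return add(x, mlp(norm2(x)))


def parallel_block(x: Vector, attn: Sublayer, mlp: Sublayer,
                   norm: Sublayer) -> Vector:
    """Parallel layout: both sublayers read the same input, outputs are summed."""
    h = norm(x)
    return add(x, add(attn(h), mlp(h)))


# Toy stand-ins so the two layouts can be compared numerically.
def toy_norm(v: Vector) -> Vector:
    return list(v)                 # identity, in place of LayerNorm


def toy_attn(v: Vector) -> Vector:
    return [2.0 * x for x in v]    # placeholder for attention


def toy_mlp(v: Vector) -> Vector:
    return [x * x for x in v]      # placeholder nonlinearity


x0 = [1.0, 2.0]
seq_out = sequential_block(x0, toy_attn, toy_mlp, toy_norm, toy_norm)
par_out = parallel_block(x0, toy_attn, toy_mlp, toy_norm)
```

The outputs differ because in the parallel version the MLP cannot use anything attention computed in the same layer, which is one way the layout could plausibly change what each sublayer ends up doing.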
I didn't mean "learning from experience" to be restrictive in that way. Animals learn by observing others & from building abstract mental models too. But unless one acquires abstracted knowledge via communication, learning requires some form of experience: even abstracted knowledge is derived from experience, whether actual or imagined. Moreover, I don't think that some extra/different planning machinery was required for language itself, beyond the existing abstraction and model-based RL capabilities that many other animals share. But ultimately that's an empirical question.
Hmm, we may have reached the point from which we're not going to move on without building mathematical frameworks and empirically testing them, or something.
Yeah I am probably going to end my part of the discussion tree here.
My overall take remains:
I guess I don't see much support for such mutual dependence. Other animals have working memory + finite state control, and learn from experience in flexible ways. It appears pretty useful to them despite the fact they don't have language/culture. The vast majority of our useful computing is done by systems that have Turing-completeness but not language/cultural competence. Language models sure look like they have language ability without Turing-completeness and without having picked up some "universal planning algorithm" that would render our previous work w/ NNs ~useless.
Why choose a theory like "the capability gap between humans and other animals is because the latter is missing language/culture and also some binary GI property" over one like "the capability gap between humans and other animals is just because the latter is missing language/culture"? IMO the latter is simpler and better fits the evidence.
I think what I'm trying to get at, here, is that the ability to use these better, self-derived abstractions for planning is nontrivial, and requires a specific universal-planning algorithm to work. Animals et al. learn new concepts and their applications simultaneously: they see e. g. a new fruit, try eating it, their taste receptors approve/disapprove of it, and they simultaneously learn a concept for this fruit and a heuristic "this fruit is good/bad". They also only learn new concepts downstream of actual interactions with the thing; all learning is implemented by hard-coded reward circuitry.
Humans can do more than that. As in my example, you can just describe to them e. g. a new game, and they can spin up an abstract representation of it and derive heuristics for it autonomously, without engaging hard-coded reward circuitry at all, without doing trial-and-error even in simulations. They can also learn new concepts in an autonomous manner, by just thinking about some problem domain, finding a connection between some concepts in it, and creating a new abstraction/chunking them together.
Hmm I feel like you're underestimating animal cognition / overestimating how much of what humans can do comes from unique algorithms vs. accumulated "mental content". Non-human animals don't have language, culture, and other forms of externalized representation, including the particular human representations behind "learning the rules of a game". Without these in place, even if one was using the "universal planning algorithm", they'd be precluded from learning through abstract description and from learning through manipulation of abstract game-structure concepts. All they've got is observation, experiment, and extrapolation from their existing concepts. But lacking the ability to receive abstract concepts via communication doesn't mean that they cannot synthesize new abstractions as situations require. I think there's good evidence that other animals can indeed do that.
General intelligence is an algorithm for systematic derivation of such "other changes".
Does any of that make sense to you?
I get what you're saying but disbelieve the broader theory. I think the "other changes" (innovations/useful context-specific improvements) we see in reality aren't mostly attributable to the application of some simple algorithm, unless we abstract away all of the details that did the actual work. There are general purpose strategies (for ex. the "scientific method" strategy, which is an elaboration of the "model-based RL" strategy, which is an elaboration of the "trial and error" strategy) that are widely applicable for deriving useful improvements. But those strategies are at a very high level of abstraction, whereas the bulk of improvement comes from using strategies to accumulate lower-level concrete "content" over time, rather than merely from adopting a particular strategy.
(Would again recommend Hanson's blog on "The Betterness Explosion" as expressing my side of the discussion here.)
Ok I think this at least clears things up a bit.
To become universally capable, a system needs two things:
- "Turing-completeness": A mechanism by which it can construct arbitrary mathematical objects to describe new environments (including abstract environments).
- "General intelligence": an algorithm that can take in any arbitrary mathematical object produced by (1), and employ it for planning.
General intelligence isn't Turing-completeness itself. Rather, it's a planning algorithm that has Turing-completeness as a prerequisite. Its binariness is inherited from the binariness of Turing-completeness.
Based on the above, I don't understand why you expect what you say you're expecting. We blew past the Turing-completeness threshold decades ago with general purpose computers, and we've combined them with planning algorithms in lots of ways.
Take AIXI, which uses the full power of Turing-completeness to do model-based planning with every possible abstraction/model. To my knowledge, switching over to that kind of fully-general planning (or any of its bounded approximations) hasn't actually produced corresponding improvements in quality of outputs, especially compared to the quality gains we get from other changes. I think our default expectation should be that the real action is in accumulating those "other changes". On the theory that the gap between human and nonhuman-animal cognition comes from us accumulating better "content" (world model concepts, heuristics, abstractions, etc.) over time, it's no surprise that there's no big phase change from combining Turing machines with planning!
General intelligence is the capability that makes this possible, the algorithm you employ for this "abstract analysis". As I'd stated, its main appeal is that it doesn't require practical experience with the problem domain (simulated or otherwise), only knowledge of its structure.
I think what you describe here and in the content prior is more or less "model-based reinforcement learning with state/action abstraction", which is the class of algorithms that answer the question "What if we did planning towards goals but with learned/latent abstractions?" As far as I can tell, other animals do this as well. Yes, it takes a more impressive form in humans because language (and the culture + science it enabled) has allowed us to develop more/better abstractions to plan with, but I see no need to posit some novel general capability in addition.
I think I am confused where you're thinking the "binary/sharp threshold" is.
Are you saying there's some step-change in the architecture of the mind, in the basic adaptation/learning algorithms that the architecture runs, in the content those algorithms learn? (or in something else?)
If you're talking about...
Consider an algorithm implementing a simple arithmetic calculator, or a symbolic AI from a FPS game, or LLMs as they're characterized in this post. These cognitive systems do not have the machinery to arrive at understanding this way. There are no execution-paths of their algorithms such that they arrive at understanding; no algorithm-states that correspond to "this system has just learned a new physics discipline". [...]
If true generality doesn't exist, it would stand to reason that humans are the same. There should be aspects of reality such that there's no brain-states of us that correspond to us understanding them; there should only be a limited range of abstract objects our mental representations can support.
When you say "machinery" here it makes me think you're talking about architecture, but in that case the lack of execution-paths that arrive at learning new physics seems like it is explained by "simple arithmetic calculators + FPS AIs + LLMs are not Turing-complete systems / have too little memory / are not running learning algorithms at all", without the need to hypothesize a separate "general intelligence" variable.
(Incidentally, it doesn't seem obvious to me that scaffolded LLMs are particularly non-general in their understanding 🤔 Especially if we are willing to say yes to questions like "Can humans understand how 16-dimensional space works?" despite the fact that we cannot natively/reliably manipulate those in our minds whereas there are computer programs that can.)
To me, this sounds like you're postulating the existence of a simple algorithm for general-purpose problem-solving which is such that it would be convergently learned by circa-1995 RNNs. Rephrasing, this hypothetical assumes that the same algorithm can be applied to efficiently solve a wide variety of problems, and that it can usefully work even at the level of complexity at which 1995-RNNs were operating.
Sounds like I miscommunicated here. No, my position (and what I was asking about in the hypothetical) is that there are general-purpose architectures + general-purpose problem-solving algorithms that run on those architectures, that they are simple and inefficient (especially given their huge up-front fixed costs), that they aren't new or mysterious (the architectures are used already, far predating humans, & the algorithms are simple), and that we already can see that this sort of generality is not really "where the action is", so to speak.
Conversely, my position is that the algorithm for general intelligence is only useful if it's operating on a complicated world-model + suite of heuristics: there's a threshold of complexity and compute requirements (which circa-1995 RNNs were far below), and general intelligence is an overkill to use for simple problems (so RNNs wouldn't have convergently learned it; they would've learned narrow specialized algorithms instead).
Agreed? This is compatible with an alternative theory, that many other animals do have "the algorithm for general intelligence" you refer to, but that they're running it with less impressive content (world models & heuristics). And likewise with a theory that AI folks already had/have the important discrete generalist algorithmic insights, & instead what they need is a continuous pileup of good cognitive content. Why do you prefer the theory that in both cases, there is some missing binary thing?
Having known some of Conjecture's founders and their previous work in the context of "early-stage EleutherAI", I share some[1] of the main frustrations outlined in this post. At the organizational level, even setting aside the departure of key researchers, I do not think that Conjecture's existing public-facing research artifacts have given much basis for me to recommend the organization to others (aside from existing personal ties). To date, only[2] a few posts like their one on the polytope lens and their one on circumventing interpretability were at the level of quality & novelty I expected from the team. Maybe that is a function of the restrictive information policies, maybe a function of startup issues, maybe just the difficulty of research. In any case, I think that folks ought to require more rigor and critical engagement from their future research outputs[3].
[1] I didn't find the critiques of Connor's "character and trustworthiness" convincing, but I already consider him a colleague & a friend, so external judgments like these don't move the needle for me.
[2] The main other post I have in mind was their one on simulators. AFAICT the core of "simulator theory" predated (mid-2021, at least) Conjecture, and yet even with a year of additional incubation, the framework was not brought to a sufficient level of technical quality.
[3] For example, the "cognitive emulation" work may benefit from review by outside experts, since the nominal goal seems to be to do cognitive science entirely inside of Conjecture.