What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems?

[-]Thane Ruthenis3y2910

An observation that I think is missing here is that this world is biased towards general-purpose search too. As in, it is frequently the case that agents operating in reality face the need to problem-solve in off-distribution circumstances; circumstances to which they could not have memorized correct responses (or even near-correct responses), because they'd never faced them. And if failure is fatal, that creates a pressure towards generality. Not simply a "bias" towards it; a direct pressure.

And we're already doing something similar with ML models today, where we're not repeating training examples.

A supercharged version of that pressure is when the agent is selected for the ability to thrive not only in off-distribution tasks in some environment, but in entire off-distribution environments, which I suspect is how human intelligence was incentivized.

[-]johnswentworth3y93

You're right, that was missing. Very good and important point.

[-]Ivan Vendrov3y144

Agreed that the existence of general-purpose heuristic-generators like relaxation is a strong argument for why we should expect to select for inner optimizers that look something like A*, contrary to my "gradient descent doesn't select for inner search" post.

Recursive structure creates an even stronger bias toward things like A* but only in recurrent neural architectures (so notably not currently-popular transformer architectures, though it's plausible that recurrent architectures will come back).

I maintain that the compression / compactness argument from "Risks from Learned Optimization" is wrong, at least in the current ML regime:

In general, evolved/trained/selected systems favor more compact policies/models/heuristics/algorithms/etc. In ML, for instance, the fewer parameters needed to implement the policy, the more parameters are free to vary, and therefore the more parameter-space-volume the policy takes up and the more likely it is to be found. (This is also the main argument for why overparameterized ML systems are able to generalize at all.)

I believe the standard explanation is that overparametrized ML finds generalizing models because gradient descent with weight decay finds policies that have low L2 norm, not low description length / Kolmogorov complexity. See Neel's recent interpretability post for an example of weight decay slowly selecting a generalizable algorithm over (non-generalizable) memorization over the course of training.
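As a toy illustration of the "weight decay finds low L2 norm" claim, here is a minimal sketch on an underdetermined linear regression rather than a neural network. The setup (5 examples, 20 parameters, the penalty strength, the function name `gradient_descent`) is an assumption made for illustration. Without the penalty, gradient descent keeps whatever part of the random initialisation lies in the directions the data does not constrain; with a small L2 penalty, that part decays and the result lands close to the minimum-norm interpolating solution.

```python
import numpy as np

def gradient_descent(X, y, w0, weight_decay, steps=20000):
    """Plain gradient descent on mean squared error plus an L2 penalty (toy setup)."""
    n = X.shape[0]
    # Step size chosen from the largest curvature of the penalised loss, for stability.
    lr = 1.0 / (np.linalg.eigvalsh(X.T @ X / n).max() + weight_decay)
    w = w0.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + weight_decay * w
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # 5 examples, 20 parameters: overparameterised
y = rng.normal(size=5)
w0 = rng.normal(size=20)       # random initialisation

w_plain = gradient_descent(X, y, w0, weight_decay=0.0)
w_decay = gradient_descent(X, y, w0, weight_decay=0.01)
w_min_norm = np.linalg.pinv(X) @ y  # minimum-L2-norm solution that fits the data exactly

print("norm without weight decay:", np.linalg.norm(w_plain))
print("norm with weight decay:   ", np.linalg.norm(w_decay))
print("norm of min-norm solution:", np.linalg.norm(w_min_norm))
```

This says nothing about description length by itself; it only shows the L2-norm selection effect in the simplest setting where it can be checked directly.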

I don't understand the parameter-space-volume argument, even after a long back-and-forth with Vladimir Nesov here. If it were true, wouldn't we expect to be able to distill models like GPT-3 down to 10-100x fewer parameters? In practice we see maybe 2x distillation before dramatic performance losses, meaning most of those parameters really are essential to the learned policy.

Overall though this post updated me substantially towards expecting the emergence of inner A*-like algorithms, despite their computational overhead. Added it to the list of caveats in my post.

[-]Lucius Bushnaq3y*30

I believe the standard explanation is that overparametrized ML finds generalizing models because gradient descent with weight decay finds policies that have low L2 norm, not low description length / Kolmogorov complexity.

I have some math that hints that those may be equivalent-ish statements.

I don't understand the parameter-space-volume argument, even after a long back-and-forth with Vladimir Nesov here. If it were true, wouldn't we expect to be able to distill models like GPT-3 down to 10-100x fewer parameters?

Why would we expect a 10x distillation factor? Half the directions of the basin being completely flat seems like a pretty big optimum to me.

Also, I'm not sure if you can always manage to align the free directions in parameter space with individual parameters, such that you can discard p parameters if you had p free directions.

[-]Ivan Vendrov3y10

Would love to see your math! If L2 norm and Kolmogorov complexity provide roughly equivalent selection pressure, that's definitely a crux for me.

[-]Lucius Bushnaq3y70

There should be a post with some of it out soon-ish. Short summary:

You can show that at least for overparametrised neural networks, the eigenvalues of the Hessian of the loss function at optima, which determine the basin size within some approximation radius, are basically given by something like the number of independent, orthogonal features the network has, and how "big" these features are.

The fewer independent, mutually orthogonal features the network has, and the smaller they are, the broader the optimum will be. Size and orthogonality are given by the Hilbert space scalar product for functions here.

That sure sounds an awful lot like a kind of complexity measure to me. Not sure it's Kolmogorov exactly, but it does seem like something related.
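A minimal sketch of the simplest case where this can be checked by hand, assuming a linear model with mean squared error (not the general neural-network result, which is not derived here). For that model the Hessian is exactly (2/n)·FᵀF, where F is the feature matrix, so the number of non-flat directions equals the number of linearly independent features. The helper name `flat_directions` and the sizes are made up for the example.

```python
import numpy as np

def flat_directions(features: np.ndarray, tol: float = 1e-9) -> int:
    """Count (near-)zero Hessian eigenvalues of MSE loss for the linear model y = features @ w."""
    n = features.shape[0]
    hessian = 2.0 / n * features.T @ features  # exact Hessian of mean squared error
    eigenvalues = np.linalg.eigvalsh(hessian)
    return int(np.sum(eigenvalues < tol * eigenvalues.max()))

rng = np.random.default_rng(0)
base = rng.normal(size=(50, 3))               # 3 independent features
redundant = base @ rng.normal(size=(3, 6))    # 6 parameters, but only 3 independent features
independent = rng.normal(size=(50, 6))        # 6 parameters, 6 independent features

print("flat directions, redundant features:  ", flat_directions(redundant))
print("flat directions, independent features:", flat_directions(independent))
```

Note that the flat directions in the redundant case are combinations of parameters (the null space of the feature matrix), not individual parameters, which is in line with the earlier caveat about not being able to simply discard p parameters.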

And while I haven't formalised it yet, I think there's quite a lot to suggest that the less information you pass around in the network, the fewer independent features you'll tend to have. E.g., if you have 20 independent bits of input information, and you only pass on 10 of them to the deeper layers of the network, you'll be much more likely to get fewer unique features than if you'd passed all of them on. Because you're making the Hilbert space smaller.

So if you introduce a penalty on exchanging too much information between parts of the network, like, say, with L2 regularisation, you'd expect the optimiser to find solutions with fewer independent features ("description length"), and broader basins.

Empirically, introducing "connection costs" does seem to lead to broader basins in our group's experiments, IIRC. Also, there's a bunch of bio papers on how connection costs lead to modularity, and our own experiments support the idea that modularity means broader basins. I'm not sure I've seen it implemented with L2 regularisation as the connection cost specifically, but my guess would be that it'd do the same thing.

(Our hope is actually that these orthogonalised features might prove to be a better fundamental unit of DL theory and interpretability than neurons, but we haven't gotten to testing that yet)

[-]Lucius Bushnaq3y92

Another general-purpose search trick which someone will probably bring up if I don’t mention it is caching solutions to common subproblems. I don’t think of this as an heuristic; it mostly doesn’t steer the search process, just speed it up.

Terminology quibble, but this totally seems like a heuristic to me. "When faced with a problem that seems difficult to solve directly, first find the most closely related problem that seems easy to solve" seems like the overriding general heuristic generator that encompasses both problem relaxation and solution memorisation.

In one case the related problem is easier because it has fewer constraints, in the other it's easier because you already know the answer, but it's the same principle.

[-]Antoine de Scorraille3y30

The difference (here) between "Heuristic" and "Cached-Solutions" seems to me analogous to the difference between lazy evaluation and memoization:

Lazy evaluation ~ Heuristic: aims to guide the evaluation/search by reducing its space.
Memoization ~ Cached Solutions: stores in memory the values/solutions already discovered to speed up the calculation.
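To make the memoization half of this analogy concrete, here is a small sketch (the lattice-path counting problem and the call counters are chosen purely for illustration). Caching does not change which subproblems the recursion wants solved or what answer comes back; it only avoids recomputing subproblems that were already solved.

```python
from functools import lru_cache

calls = {"plain": 0, "cached": 0}

def paths_plain(r: int, c: int) -> int:
    """Count monotone lattice paths from (0, 0) to (r, c) by plain recursion."""
    calls["plain"] += 1
    if r == 0 or c == 0:
        return 1
    return paths_plain(r - 1, c) + paths_plain(r, c - 1)

@lru_cache(maxsize=None)
def paths_cached(r: int, c: int) -> int:
    """Same recursion, but each distinct subproblem is computed only once."""
    calls["cached"] += 1
    if r == 0 or c == 0:
        return 1
    return paths_cached(r - 1, c) + paths_cached(r, c - 1)

plain_answer = paths_plain(8, 8)
cached_answer = paths_cached(8, 8)
print(plain_answer, cached_answer)  # same answer either way
print(calls)                        # far fewer function bodies executed with the cache
```

A heuristic, by contrast, would change which parts of the search space get visited at all.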

[-]TurnTrout3yΩ682

Well, I’d say that a “general-purpose search” process is something which:
Takes in a problem or goal specification (from a fairly broad range of possible problems/goals)
… and returns a plan which solves the problem or scores well on the goal

Why not call this "general-purpose planning"? That seems to more directly describe what I think you're describing -- a goal specification comes in, a plan comes out. I think "search" has some inappropriate connotations to it, possibly evoking BFS/DFS/MCTS/A*/etc, whereas this planning process -- as you point out -- doesn't have to look like "babble/prune."

Although now that I've written this, "planning" imports similar unwanted connotations from the similarly titled AI subfield. Hm. I don't feel like I've produced a good enough alternative, but I still feel there's a terminological issue here. I'll just leave this comment for now.

[-]johnswentworth3yΩ220

I do want to evoke BFS/DFS/MCTS/A*/etc here, because I want to make the point that those search algorithms themselves do not look like (what I believe to be most people's conception of) babble and prune, and I expect the human search algorithm to differ from babble and prune in many similar ways to those algorithms. (Which makes sense - the way people come up with things like A*, after all, is to think about how a human would solve the problem better and then write an algorithm which does something more like a human.)

[-]TurnTrout3yΩ220

OK, then I once again feel confused about what this post is arguing as I remember it. (Don't feel the need to explain it as a reply to this comment, I guess I'll just reread if it becomes relevant later.)

[-]Daniel Murfet2y70

There is some preliminary evidence in favour of the view that transformers approximate a kind of Bayesian inference in-context (by which I mean something like, they look at in-context examples and process them to represent in their activations something like a Bayesian posterior for some "inner" model based on those examples as samples, and then predict using the predictive distribution for that Bayesian posterior). I'll call the hypothesis that this is taking place "virtual Bayesianism".

I'm not saying you should necessarily believe that, for current generation transformers. But fwiw I put some probability on it, and if I had to predict one significant capability advance in the next generation of LLMs it would be to predict that virtual Bayesianism becomes much stronger (in-context learning being a kind of primitive precursor).

Re: the points in your strategic upshots. Given the above, the following question seems quite important to me: putting aside transformers or neural networks, and just working in some abstract context where we consider Bayesian inference on a data distribution that includes sequences of various lengths (i.e. the kind of distribution that elicits in-context learning), is there a general principle of Bayesian statistics according to which general-purpose search algorithms tend to dominate the Bayesian posterior?
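For readers unfamiliar with the terms, a toy sketch of what "form a posterior from in-context examples, then predict with the posterior predictive" means in the simplest possible case. It assumes a coin-flip "inner" model with a Beta prior; this is only an illustration of the vocabulary, not a claim about what any transformer actually computes, and the function name is made up.

```python
from fractions import Fraction

def posterior_predictive(examples, prior_heads=1, prior_tails=1):
    """Probability that the next sample is 1, given binary in-context examples.

    Uses a Beta(prior_heads, prior_tails) prior over the coin's bias; the
    posterior is Beta(prior_heads + heads, prior_tails + tails), and the
    predictive probability is the posterior mean.
    """
    heads = sum(examples)
    return Fraction(prior_heads + heads, prior_heads + prior_tails + len(examples))

print(posterior_predictive([]))            # no context: falls back to the prior
print(posterior_predictive([1, 1, 1, 0]))  # context shifts the prediction
```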

[-]Leon Lang3y74

Summary

This article asks what “general-purpose search” is and why we should expect it in advanced machine learning systems.

In general, we expect gradient descent to find “simple solutions” with lots of varying parameters (since they take a larger part in solution space) and “general solutions” that are helpful broadly (since we will put the system in diverse environments). Thus, we do expect search processes to emerge.

However, babble and prune will likely not be the resulting process: it’s not compute- and memory-efficient enough. Instead, John imagines a search process that starts with a constraint/problem and iteratively produces broad strokes of solutions that form new constraints of subproblems, until the problem is solved. If this is roughly correct, it will also mean that the search process is retargetable.

This leaves open how the broad strokes of solutions to constraints are found, which John expects requires heuristics that will often either output a solution itself, or a different problem whose solution is easier to generate, instead of babbling and then pruning. Some heuristics:

Relaxed problem: only consider time-constraint, or only consider immediately reducing Euclidean distance.

The specific relaxed problems are “heuristics”. The procedure to relax the problem is a “heuristic generator”.

One could consider this a “meta-heuristic”. However, the type-signature of “heuristic” is “problem in, solution out” or “problem in, other problem out”, and the type-signature of the meta heuristic is “problem in, heuristic out”, so these are different.
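The type-signature distinction can be sketched in code. This is a minimal illustration, assuming a grid path-finding problem; the names `GridProblem`, `relax`, `no_heuristic` and `a_star` are hypothetical. Here `relax` is the heuristic generator ("problem in, heuristic out"): it drops the wall constraints, and the exact cost of the relaxed problem (Manhattan distance) becomes the heuristic that steers A*.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

Cell = Tuple[int, int]
Heuristic = Callable[[Cell], int]  # estimated remaining cost from a cell

@dataclass(frozen=True)
class GridProblem:
    width: int
    height: int
    walls: FrozenSet[Cell]
    start: Cell
    goal: Cell

def relax(problem: GridProblem) -> Heuristic:
    """Heuristic generator: ignore the walls and solve the relaxed problem exactly."""
    gx, gy = problem.goal
    return lambda cell: abs(cell[0] - gx) + abs(cell[1] - gy)

def no_heuristic(problem: GridProblem) -> Heuristic:
    """Baseline with no guidance (A* then behaves like uniform-cost search)."""
    return lambda cell: 0

def a_star(problem: GridProblem, heuristic: Heuristic) -> Tuple[int, int]:
    """Return (cost of a shortest path, number of cells expanded)."""
    frontier = [(heuristic(problem.start), 0, problem.start)]
    best = {problem.start: 0}
    closed = set()
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell in closed:
            continue
        closed.add(cell)
        if cell == problem.goal:
            return cost, len(closed)
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_bounds = 0 <= nxt[0] < problem.width and 0 <= nxt[1] < problem.height
            if not in_bounds or nxt in problem.walls:
                continue
            if cost + 1 < best.get(nxt, float("inf")):
                best[nxt] = cost + 1
                heapq.heappush(frontier, (cost + 1 + heuristic(nxt), cost + 1, nxt))
    raise ValueError("goal unreachable")

problem = GridProblem(10, 10, frozenset((5, y) for y in range(8)), (0, 0), (9, 0))
cost_relaxed, expanded_relaxed = a_star(problem, relax(problem))
cost_blind, expanded_blind = a_star(problem, no_heuristic(problem))
print(cost_relaxed, expanded_relaxed, cost_blind, expanded_blind)
```

Both runs find a path of the same cost; the relaxation-derived heuristic only changes how much of the grid gets explored. Changing `goal` and calling `relax` again yields a new heuristic without redesigning anything, which is one way to read the retargetability point.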

Finally, John gestures at the observation that heuristics seem to be environment-dependent but not goal-dependent, at least for similar types of goals (e.g. for the type “reach X city” or “do X thing this week”). This makes them more generalizable.

Other Thoughts

- Don’t chess players sometimes do babble and prune? They might look at the board and literally “babble” possible moves, evaluate them, and search further in the best of those.
  - An alternative to that process would be to think “I want to capture the queen, how do I do this?” and then to explicitly think about moves that achieve that “constraint”. The original constraint/problem is just “win this game”, of which “capture the queen” is already a subconstraint/problem.
  - Still, I do remember Magnus Carlsen saying in an interview that he actually does do relatively exhaustive search in some chess situations. So it seems to at least be some search process of many he applies. But I also remember him saying that this is effortful.
- The description of John leaves open the process with which the solutions to constraints are found. Doesn’t that process usually involve babbling?
  - In the case of finding stores, we may say “there is no babbling, the computer program just shows me the open stores that satisfy the constraint.”
    - But doesn’t the computer internally need to babble? I.e., to go through a database of all the options to find the ones satisfying the constraint!
    - In general, I would say babbling is required unless a solution to the constraint can already be retrieved in a somewhat cached form.
- I’m not sure if “relax the problem” is a clear instruction. I feel like you already need to have something like a “natural abstraction of problems” in your head in order to be able to do that. This doesn’t really contradict what John is saying, but it highlights that there is some hidden complexity in this.

[-]johnswentworth3y93

The description of John leaves open the process with which the solutions to constraints are found. Doesn’t that process usually involve babbling?

The important part here is that babble scales extremely poorly with dimensionality of the problem (or, more precisely, the fraction of problem-space which is filled with solutions). So babble is fine once we've reduced to a low-dimensional subproblem; most of the algorithmic work is in reducing the big problem to a bunch of low-dimensional subproblems.
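A back-of-the-envelope sketch of that scaling, under a deliberately idealised assumption: the problem has `dimension` independent binary choices, exactly one full assignment is a solution, and after decomposition each one-dimensional subproblem can be checked on its own. The function names are made up for the example. Uniform babbling over the whole problem then needs 2^d guesses in expectation, while babbling within each subproblem needs about 2 guesses per dimension.

```python
import random

def expected_babble_guesses(dimension: int, options: int = 2) -> int:
    """Expected uniform random guesses to hit the single solution of the full problem."""
    return options ** dimension

def expected_decomposed_guesses(dimension: int, options: int = 2) -> int:
    """Expected guesses when each one-dimensional subproblem is babbled separately."""
    return dimension * options

def simulate_babble(dimension: int, trials: int, rng: random.Random) -> float:
    """Empirical average number of whole-problem guesses until the target is hit."""
    total = 0
    for _ in range(trials):
        target = [rng.randrange(2) for _ in range(dimension)]
        guesses = 1
        while [rng.randrange(2) for _ in range(dimension)] != target:
            guesses += 1
        total += guesses
    return total / trials

for d in (2, 4, 8, 16, 32):
    print(d, expected_babble_guesses(d), expected_decomposed_guesses(d))

print("simulated average for d=4:", simulate_babble(4, 2000, random.Random(0)))
```

Real problems rarely decompose this cleanly, which is where the algorithmic work of finding the low-dimensional subproblems comes in.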

[-][anonymous]3y80

When I search for terms that I find on here, like "finite factored set" or "babble and prune" and many others, I can't find them anywhere except on here or other EA platforms. It always makes me wonder "Are we meming?" It seems like the meme culture is deep here. I think that is also what makes it hard for new users to get accustomed to the site. They have to read up on so much prerequisite material in order to even participate in a conversation.

[-]jacob_cannell3y7-2

Modern ML is increasingly general search over circuit (program) space. Circuit space obviously includes everything - including general search algorithms, which are also often obviously useful. So it is nearly tautological that we should expect general search in (sufficiently) trained ML systems.

[-]Nathan Helm-Burger3y54

I'm quite in agreement with this, and surprised that there are people imagining only babble and prune when general search for problem solving is being discussed. I'd like to add that I think a useful approach for evaluating the generality of a problem-solving agent would be to test for heuristic generation and use. I would expect an agent which can generate new heuristics in a targeted way to be far better at generalizing to novel tasks than one which has managed to discover and reuse just a few heuristics over and over. Maybe it's worth someone putting some thought into what a test set that could distinguish between these two cases would look like.

[-]Oliver Sourbut3y40

This is a fantastic point well articulated, reminiscent of some conversations we had a few months ago at Lightcone.

I’d say that a “general-purpose search” process is something which:

Takes in a problem or goal specification (from a fairly broad range of possible problems/goals)

… and returns a plan which solves the problem or scores well on the goal

I think we probably agree on what things there actually are, but I think this particular definition of 'general purpose search' is slightly too general to be a most useful pointer/carving.

This is because it seems to include things like matrix inversion for least-squares solutions (unless 'from a fairly broad range of possible problems/goals' is taken to preclude this meaningfully?) which I deem importantly different. I'd class matrix inversion least-squares as a (powerful) heuristic^[1] (a 'proposal' in my deliberation terminology), but not as (proper) search itself.

I think it remains useful to distinguish algorithms which evaluate/promote or otherwise weigh proposals^[2]. This is what I've started calling 'proper deliberation' and it's generally what I mean when I talk about search.

In the case of applying matrix inversion to ordinary least squares, for me, the 'general deliberation' consists of something like

1. noticing the relevant features of the problem (this is 'abstraction/pattern-matching magic')
2. cognitively retrieving the OLS abstraction and matrix-inversion as a cached heuristic (this is 'propose')
3. thinking 'yes, this will work' (this is 'promote')
4. applying matrix inversion to solve

A clever/practised enough deliberator does steps 1, 2 and 3 'right' and doesn't need to iterate for this particular problem (my point here is that if your heuristics are good enough you can deliberate with only one proposal and say 'yep, good enough, let's go'). But counterfactually step 2 might make various alternative proposals, or step 3 might think 'actually there are too many dimensions in this case for inversion to be tractable' or something, and thus there's an evaluation and an internal update.
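A rough sketch of that propose/promote breakdown for the least-squares case, using NumPy. The function name, the conditioning threshold, and the choice of fallback are assumptions made for illustration; the conditioning check stands in for whatever evaluation the deliberator actually runs (the comment's own example is tractability in high dimensions).

```python
import numpy as np

def solve_least_squares(X: np.ndarray, y: np.ndarray, max_condition: float = 1e8):
    """Return (coefficients, name of the proposal that was promoted)."""
    gram = X.T @ X

    # Propose: the cached heuristic is "solve the normal equations".
    # Promote: accept it only if the check says it will work here.
    enough_data = X.shape[0] >= X.shape[1]
    if enough_data and np.linalg.cond(gram) < max_condition:
        return np.linalg.solve(gram, X.T @ y), "normal_equations"

    # The first proposal was rejected, so fall back to an alternative proposal.
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefficients, "lstsq"

rng = np.random.default_rng(0)
X_good = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
w_good, method_good = solve_least_squares(X_good, X_good @ true_w)

# Duplicating a column makes the normal equations singular, so the check rejects them.
X_bad = np.column_stack([X_good, X_good[:, 0]])
y_bad = X_bad @ np.array([1.0, -2.0, 0.5, 1.0])
w_bad, method_bad = solve_least_squares(X_bad, y_bad)

print(method_good, method_bad)
```

In the well-conditioned case the loop runs once and accepts the first proposal, matching the 'yep, good enough, let's go' description; the second case shows the counterfactual where evaluation triggers an internal update.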

^[1] Peter Barnett and Ian McKenzie coined 'God-level heuristic' for really solid mathematically-justified heuristics like this, which I quite like
^[2] I don't require this to be a 'full consequentialist model-based valuation', but that would be one example. See my deliberation simple examples for less sophisticated versions which are quite pervasive and nevertheless embody the 'propose;promote' breakdown

