Are minimal circuits deceptive?

Interesting proof. A couple off-the-cuff thoughts...

First, this is proving something much more general than just deceptiveness. We can cross out the three or four English sentences with the word "deceptive" in them and just say "$C(π, x)$ is a predicate": the proof doesn't actually use any properties of $C$. So this is a fairly general result about minimal circuits on MDPs, and I'm wondering what other predicates could be plugged into it to yield interesting results.

Second, the proof doesn't actually use the assumption that the minimal circuit on the original set of tasks is performing search. All that matters is that there is some MDP on which the circuit is deceptive, and that the circuit is minimal subject to an average performance bound on the full set of tasks.

Third, we can state the overall conclusion as roughly "if there exists a set of MDPs for which the minimal circuit achieving some average performance on the set is deceptive on at least one of the MDPs, then there exists a single MDP whose minimal circuit is deceptive." On the other hand, the reverse implication holds trivially: if there exists a single MDP whose minimal circuit is deceptive, then there exists a set of MDPs for which the minimal circuit achieving some average performance on the set is deceptive on at least one of the MDPs. Proof: take the set to contain just the one MDP whose minimal circuit is deceptive. So we don't just have a one-way implication here; we actually have equivalence between two open problems. Obvious next question: what other minimal-circuit problems are equivalent to these two?
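To make the two statements easier to compare, here is one way to write them down. The notation is my own shorthand, not the post's: $M^*(\mathcal{T}, \delta)$ stands for the minimal circuit achieving average performance at least $\delta$ on a set of MDPs $\mathcal{T}$, and $D(M, T)$ stands for "circuit $M$ is deceptive on MDP $T$" (or any other predicate, per the first point).

$$
\text{(A)}\quad \exists\, \mathcal{T},\ \delta,\ T \in \mathcal{T} :\ D\big(M^*(\mathcal{T}, \delta),\ T\big)
$$

$$
\text{(B)}\quad \exists\, T,\ \delta :\ D\big(M^*(\{T\}, \delta),\ T\big)
$$

(B) implies (A) by choosing $\mathcal{T} = \{T\}$, since the average over a one-element set is just performance on $T$. (A) implies (B) is the content of the post's proof, and holds only under whatever assumptions that proof needs.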

evhub (6y)

I agree that the proof here can be made significantly more general—and I agree that exploring that definitely seems worthwhile—though I also think it's worth pointing out that the proof rests on assumptions that I would be a lot less confident would hold in other situations. The point of explaining the detail regarding search algorithms here is that it gives a plausible story for why the assumptions made regarding $π_{learn}$ and $T_{bad}$ should actually hold.

Rohin Shah (6y)

if there exists some set of natural tasks for which the fastest way to solve them is to do some sort of machine learning to find a good policy

This would be pretty strange -- why not just directly hardcode the policy? Wouldn't that be faster? We need to use machine learning because we aren't able to write down a program that "directly" solves e.g. image recognition, but the "direct" program would be faster if we had some way of finding it. The general reason for optimism is that the "fastest" requirement implies that any extraneous computation (e.g. deception) is removed; that same reason implies that any search would be removed and replaced with a correct output of the search.

Another way to think of this: when you don't have a simplicity bias, you have to compete with the GLUT (Giant Lookup Table), which can be very fast. Even if you take into account the time taken to perform the lookup, in a deterministic environment the GLUT only has to encode the optimal trajectory, not the full policy. In a stochastic environment, the GLUT may need to be exponentially larger, so it may be too slow, but even so you can have GLUT-like things built out of higher-level abstractions, which might be enough to avoid deception. Basically you can do a lot of weird stuff when you don't require simplicity; it's not clear that meta learning should be modeled the same way.

Rohin Shah (6y)

Planned summary:

While it has been argued that the simplest program that solves a complex task is likely to be deceptive, it hasn't yet been argued whether the fastest program that solves a complex task will be deceptive. This post argues that fast programs will often be forced to learn a good policy (just as we need to do today), and the learned policy is likely to be deceptive (presumably due to risks from learned optimization). Thus, there are at least some tasks where the fastest program will also be deceptive.

Planned opinion:

This is an intriguing hypothesis, but I'm not yet convinced: it's not clear why the fastest program would have to learn the best policy, rather than directly hardcoding the best policy. If there are multiple possible tasks, the program could have a nested if structure that figures out which task needs to be done and then executes the best policy for that task. More details in this comment.

evhub (6y)

A couple of things. First, a minimal circuit is not the same as a speed-prior-minimal algorithm. Minimal circuits have to be minimal in width + depth, so a GLUT would definitely lose out. Second, even if you're operating on pure speed, I think there are sets of tasks that are large enough that a GLUT won't work. For example, consider the task of finding the minimum of an arbitrary convex function. Certainly for the infinite set of all possible convex functions (on the rationals, say), I would be pretty surprised if something like gradient descent weren't the fastest way to do that. Even if you restrict to only finitely many convex functions, if your set is large enough it still seems hard to do better than gradient descent, especially since looking up the solution in a huge GLUT could be quite expensive (how do you do the lookup? a hash table? a binary tree? I'm not sure if those would be good enough here).
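A toy sketch of the scaling intuition here, under loud assumptions: the task family is one-dimensional quadratics $f(x) = a(x - b)^2$, the learning rate and step count are arbitrary, and `build_glut`, `quadratic_grad`, and `gradient_descent` are illustrative names. It says nothing about actual circuit size; it only shows that the table grows with the number of tasks and has no answer for an unseen task, while the descent routine stays the same size.

```python
from typing import Callable, Dict, List, Tuple

Task = Tuple[float, float]  # (a, b) describes f(x) = a * (x - b)**2 with a > 0


def quadratic_grad(a: float, b: float) -> Callable[[float], float]:
    """Gradient of the convex function f(x) = a * (x - b)**2."""
    return lambda x: 2.0 * a * (x - b)


def gradient_descent(grad: Callable[[float], float],
                     x0: float = 0.0, lr: float = 0.1, steps: int = 200) -> float:
    """Plain fixed-step gradient descent; same code for every task."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def build_glut(tasks: List[Task]) -> Dict[Task, float]:
    """Precompute the minimizer of every known task; one entry per task."""
    return {task: gradient_descent(quadratic_grad(*task)) for task in tasks}


# A small finite task set: the table needs one entry per (a, b) pair.
tasks = [(a, b) for a in (1.0, 2.0, 3.0) for b in (-1.0, 0.0, 2.5)]
glut = build_glut(tasks)

# A task outside the table: the lookup has nothing, descent still works.
unseen = (4.0, 7.0)
print(len(glut), unseen in glut, gradient_descent(quadratic_grad(*unseen)))
```

Whether the lookup or the search wins for a given finite set depends on the cost model, which is exactly the point under dispute in this thread.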

Rohin Shah (6y, edited)

First, a minimal circuit is not the same as a speed-prior-minimal algorithm. Minimal circuits have to be minimal in width + depth, so a GLUT would definitely lose out.

I don't really understand this -- my understanding is that with a minimal circuit, you want to minimize the number of gates in the circuit, and the circuit must be a DAG (if you're allowed to have loops + clocks, then you can build a regular computer, and for complex tasks the problem should be very similar to finding the shortest program and implementing it on a universal circuit).

But then I can create a Turing Machine that interprets its input as a circuit, and simulates each gate of the circuit in sequence. Then the running time of the TM is proportional to the number of gates in the circuit, so an input with minimal running time should be a circuit with the minimal number of gates. This is not technically a Universal TM, since loop-free circuits are not Turing-complete, but I would expect a speed prior using such a Turing Machine would be relatively similar to a speed prior with a true UTM.
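The construction in this paragraph can be sketched as an ordinary program instead of a Turing Machine. This is a minimal sketch under assumptions: the gate encoding, the `OPS` table, and the `simulate` function are all made up for illustration. The circuit is given as a list of gates in topological order, each gate is evaluated exactly once, and the step count equals the gate count.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A gate is (operation name, indices of the wires it reads).
# Wires 0..n-1 hold the circuit inputs; gate k writes wire n + k.
Gate = Tuple[str, Tuple[int, ...]]

OPS: Dict[str, Callable[..., int]] = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a: 1 - a,
}


def simulate(gates: List[Gate], inputs: Sequence[int]) -> Tuple[int, int]:
    """Evaluate a loop-free circuit gate by gate.

    Returns (value of the last wire, number of simulation steps).
    """
    wires = list(inputs)
    steps = 0
    for op, args in gates:
        # A gate may only read wires that already exist, so the circuit is a DAG.
        if not all(0 <= i < len(wires) for i in args):
            raise ValueError("gate reads an undefined wire")
        wires.append(OPS[op](*(wires[i] for i in args)))
        steps += 1
    return wires[-1], steps


# Two circuits for the same function (XOR of two bits):
xor_long: List[Gate] = [
    ("OR", (0, 1)),   # wire 2
    ("AND", (0, 1)),  # wire 3
    ("NOT", (3,)),    # wire 4
    ("AND", (2, 4)),  # wire 5 = (a OR b) AND NOT (a AND b)
]
xor_short: List[Gate] = [("XOR", (0, 1))]

print(simulate(xor_long, [1, 0]))   # same output, four steps
print(simulate(xor_short, [1, 0]))  # same output, one step
```

Under this particular interpreter, the input with the shortest running time is the circuit with the fewest gates, which is the sense in which a speed prior could be made to line up with circuit minimality.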

EDIT: I no longer endorse the above point. Just because there exists a speed prior that aligns with minimal circuits doesn't mean that this is the "typical" speed prior. In general there really should be a difference between a "minimal circuit" and a "fastest program" (precisely the depth + width vs. depth point Evan made in the parent comment).

For example, consider the task of finding the minimum of an arbitrary convex function. Certainly for the infinite set of all possible convex functions (on the rationals, say), I would be pretty surprised if something like gradient descent weren't the fastest way to do that.

I agree that if you have to work on an infinite set of inputs, a GLUT is not the way to do that. I was thinking of the case where you have to work on a small finite set of inputs (hence why I talk about the optimal trajectory instead of the optimal policy), which is always going to be the case in the real world. But this is too pedantic, we can certainly think of the theoretical case where you have to work on an infinite set of inputs. I was mostly trying to use the GLUT as an intuition pump, not arguing that it was always better.

In the case with infinite inputs, I still have the intuition that meta learning is what you do when you don't know enough about the problem to write down the good heuristics straight away, as opposed to being the fastest way of solving the problem. But I agree that the fastest solution won't be a GLUT; I'm thinking more a combination of really good heuristics that "directly" solve the problem. (Part of this is an intuition that for any reasonably structured set of inputs, value-neutral optimization of a fixed objective is very inefficient.)

Ofer (6y)

Very interesting!

I'm confused about why the "spontaneous meta-learning" in Ortega et al. is equivalent to (or a special case of?) mesa-optimization; which was also suggested in MIRI's August 2019 Newsletter. My understanding of Ortega et al. is that "spontaneous meta-learning" describes a scenario in which training on a sequence from a single generator is equivalent to training on sequences from multiple generators. I haven't seen them discuss this issue in the context of the trained model itself doing search/optimization.

evhub (6y)

Regarding Ortega et al., I agree that the proof presented in the paper is just about how a single generator can be equivalent to sequences of multiple generators. The point that the authors are using that proof to make, however, is somewhat more broad, which is that your model can learn a learning algorithm even when the task you give it isn't explicitly a meta-learning task. Since a learning algorithm is a type of search/optimization algorithm, however, if you recast that conclusion into the language of Risks from Learned Optimization, you get exactly the concern regarding mesa-optimization, which is that models can learn optimization algorithms even when you don't intend them to.

jacob_cannell (4y)

This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure? If you train network N on dataset D and optimization pressure causes N to internally develop a planning system (mesa-optimizer) M, aren't all questions of whether M is aligned with N's optimization objective just generalization questions?

More specifically if N is sufficiently overcomplete and well regularized, and D is large enough, then N can fully grok the dataset D, resulting in perfect generalization. It's also straightforward as to why this can happen - when N is large enough to contain enough individually regularized sub-model solutions (lottery tickets) that it is approximating a solomonoff style ensemble.

Anyway, if N has a measurably low generalization gap on D, then it doesn't seem to matter whether M exists or what it's doing with regard to generalization on D. So does the risk of 'inner alignment failure' involve out-of-distribution generalization?

evhub (4y)

This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure?

Yes, inner alignment is a subset of robustness. See the discussion here and my taxonomy here.

If you train network N on dataset D and optimization pressure causes N to internally develop a planning system (mesa-optimizer) M, aren't all questions of whether M is aligned with N's optimization objective just generalization questions?

This reflects a misunderstanding of what a mesa-optimizer is—as we say in Risks from Learned Optimization:

Possible misunderstanding: “mesa-optimizer” does not mean “subsystem” or “subagent.” In the context of deep learning, a mesa-optimizer is simply a neural network that is implementing some optimization process and not some emergent subagent inside that neural network. Mesa-optimizers are simply a particular type of algorithm that the base optimizer might find to solve its task. Furthermore, we will generally be thinking of the base optimizer as a straightforward optimization algorithm, and not as an intelligent agent choosing to create a subagent.

More specifically if N is sufficiently overcomplete and well regularized, and D is large enough, then N can fully grok the dataset D, resulting in perfect generalization. It's also straightforward as to why this can happen - when N is large enough to contain enough individually regularized sub-model solutions (lottery tickets) that it is approximating a solomonoff style ensemble.

I don't think this characterization is correct. A couple of points:

There are models with perfect performance on any training dataset that you can generate that nevertheless have catastrophic behavior off-distribution. For example: a deceptive model that purposefully always takes minimal-loss actions to prevent the training process from modifying it but starts acting catastrophically when it sees a factorization of RSA-2048.
I don't think that's a good characterization of lottery tickets. Lottery tickets just says that, for any randomly initialized large neural network, there usually exists a pruning of that network with very good performance on any problem (potentially after some small amount of training). That doesn't imply that all those possible prunings are in some sense “active” at initialization, any more than all possible subgraphs are active in a complete graph. It just says that pruning is a really powerful training method and that the space of possible prunings is very large due to combinatorial explosion.

jacob_cannell (4y)

This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure?

Yes, inner alignment is a subset of robustness. See the discussion here and my taxonomy here.

Ahh thanks I found that post/discussion more clear than the original paper.

This reflects a misunderstanding of what a mesa-optimizer is—as we say in Risks from Learned Optimization:

So in my example, the whole network N is just a mesa-optimizer according to your definition. That doesn't really change anything, but your earlier links already answered my question.

There are models with perfect performance on any training dataset that you can generate that nevertheless have catastrophic behavior off-distribution.

I should have clarified that I meant grokking only shows perfect generalization on-distribution. Yes, off-distribution failure is always possible (and practically unavoidable in general considering adversarial distributions) - that deceptive RSA example is interesting.

I don't think that's a good characterization of lottery tickets. Lottery tickets just says that, for any randomly initialized large neural network, there usually exists a pruning of that network with very good performance on any problem (potentially after some small amount of training). That doesn't imply that all those possible prunings are in some sense “active” at initialization, any more than all possible subgraphs are active in a complete graph. It just says that pruning is a really powerful training method and that the space of possible prunings is very large due to combinatorial explosion.

I mentioned lottery tickets as examples of minimal effective sub-models embedded in larger trained over-complete models. I'm not sure what you mean by "active" at initialization, as they are created through training.

I'm gesturing at the larger, almost self-evident generalization hypothesis: that overcomplete ANNs, with proper regularization and other features, can efficiently explore an exponential/combinatoric space of possible solutions (sub-models) in parallel, where each sub-model corresponds roughly to a lottery ticket. Weight or other regularization is essential and helps ensure the training process comes to approximate Bayesian inference over the space of sub-models.

It is worth noting that this example does not straightforwardly generalize to actual machine learning systems—see “Risks from Learned Optimization” for an analysis of whether such a generalization holds. ↩︎
An important question here is why the minimal circuit wouldn't just look like a lookup table. The simple answer is that it probably will at first, but if we give it enough meta-learning problems, then at some finite time the lookup table should become more expensive than an actual learning algorithm, since the learning algorithm should always take a constant amount of time per problem, whereas lookup gets more and more expensive as you have to encode more and more problems. ↩︎
We will refer to the different components of a task $T$ with subscripts such that $T = (T_{X}, T_{A}, T_{p}, T_{P}, T_{r})$ . ↩︎
One way to think about this problem is that it is the ML version of the Tiling Agents Problem—the “Tiling Training Processes Problem,” perhaps. The question is, for the operation of going from a training process to the training process that is incentivized for models trained via that process to implement, what are the fixed points? That is, what sorts of training processes produce models that would use the same training process? ↩︎
Interestingly, this counterexample points at a distinctly different way in which such models might be dangerous, however: whereas “Risks from Learned Optimization” deals with the dangers of a learned objective, this counterexample suggests that the learned search process itself might also pose a problem. ↩︎
