Thanks to Evan Hubinger for the extensive conversations that this post is based on, and for reviewing a draft.
This post is going to assume familiarity with mesa-optimization - for a good primer, check out Does SGD Produce Deceptive Misalignment by Mark Xu.
Deceptive inner misalignment is the situation where the agent learns a misaligned mesaobjective (different from the base objective we humans wanted) and is sufficiently "situationally aware" to know that unless it deceives the training process by pretending to be aligned, gradient descent may alter its mesaobjective.
There are two different reasons that an AI model could become a deceptive mesaoptimizer:
- During early training (before Situational Awareness), the agent learns a mesaobjective that will generalize poorly on the later-training/validation distribution. Once the mesaoptimizer becomes Situationally Aware, it will seek to actively avoid changes to whatever mesaobjective it had at that moment.
- I'll call this argument "path dependence".
- Alternatively, it may be that mesaoptimizer is misaligned even on the training distribution. Given sufficient optimization pressure, the learning process may favor a NN that is a mesaoptimizer with the simplest possible objective (which would fail to get any reward in the real environment), and that a misaligned objective of this sort can persist through deception alone.
- I'll call this argument "malign priors".
In this post, I'll focus on the "malign priors" argument, and why I think a well-tuned speed prior can largely prevent it.
Why does this matter? Well, if deceptive inner misalignment primarily occurs due to path dependence, that implies that ensuring inner alignment can be reduced to the problem of ensuring early-training inner alignment - which seems a lot more tractable, since this is before the model enters the "potentially-deceptive" regime.
First, why would anyone think (2) was actually likely enough to justify studying it? I think the best reason is that by studying these pressures in the limit, we can learn lessons about the pressures that exist on the margin. For example, say we have an objective that is perfectly-aligned on the training data, and there's a very-slightly-simpler objective that is slightly worse on the training distribution. We might ask the question: is SGD likely to push to become , and compensate for the reduced accuracy of directly optimizing via deceptively optimizing on the training data? I think this post provides us with tools to directly analyze this possibility. (If you buy the rest of the post, then with a sufficient speed + simplicity prior, the answer is that will stay favored over . That's good!)
Priors on Learned Optimizers
Let's talk about priors!
We can think of large neural networks as basically implementing short programs, and the process of "training" an NN is just searching through the space of programs until we find one that does well on our target task.
We assume that if two NN-programs have equal performance, the training process will usually pick the one favored on priors/inductive biases.
There are several different types of priors that neural networks might have:
- Simplicity prior/Solomonoff prior: if we converted the NN into a Turing Machine, how many states would writing down the TM require?
- In previous posts, Paul Christiano and others have argued that the Solomonoff prior is likely to favor deceptively-aligned mesaoptimizers (summarized by Mark Xu here).
- Speed prior: how many steps does the NN-turned-TM run for before terminating, on an average input?
- Evan Hubinger has argued whether the speed prior is heavily-disfavored in practice; he and I disagree about this, and I'll summarize our dialogue later in this post.
- Circuit-size prior: how many gates/wires would it take to rewrite the NN as a boolean circuit?
- Evan has argued that minimal circuits may be deceptive. (I don't fully buy the proof.)
- Circuit-depth prior: same as above, but what is the maximum sequential depth of any part of the circuit?
In this post, I will focus on the speed prior, and why I think it generally disfavors deceptive mesaoptimization. Importantly, the speed prior never exists in practice on its own (since NN training has a built-in simplicity prior), so our focus is actually going to be on the speed + simplicity prior.
Two variants of deceptive mesa-optimizers
When people think of deceptive mesaoptimization, they tend to think of two different program structures:
- Fast Deceptive Mesaoptimizer: The AI wakes up in a new environment, and checks whether it's "in training" or "out in the wild". It can perform this test in a way that is essentially impossible to observe in any training distribution, e.g. by checking whether a factorization of RSA-2048 exists. If it's still in training, it proceeds to optimize the objective it's supposed to. If it realizes it's in the outside world, it instead optimizes its secret misaligned mesaobjective.
- Simple Deceptive Mesaoptimizer: Alternatively: the AI wakes up in a new environment, and sees it has a simple (misaligned) mesaobjective. It knows the outside world exists because that's been useful elsewhere throughout training; it is vaguely aware that getting to the outside world would help it maximize its objective. It thinks for a while, until it realizes that to get outside, it needs to pass the training process, and to pass the training process, it needs to optimize the current environment's objective. It doesn't know the current environment's objective, so it has to figure it out. Then it proceeds as usual, optimizing the current objective.
We can contrast both of these with the "honest" mesaoptimizer structure:
- Fast Honest (Proxy-Aligned) Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment's current objective. (It may still be misaligned on the test distribution.)
We'll assume that the training process naturally pushes the NN to discover its existence within a training process and outside world. This seems reasonably likely (e.g. if this knowledge helps predict unobserved parts of the environment, or clarify ambiguity in the reward).
We can write these different mesaoptimizers down explicitly:
def fast_honest_mesaoptimizer(env): my_obj = get_proxy_for_base_obj(env) useful_knowledge = [outer_world_env, ...] action = mesaoptimize(env, my_obj, useful_knowledge) return action def simple_deceptive_mesaoptimizer(env): my_obj = simplest_obj useful_knowledge = [outer_world_env, ...] # optimize over the current env, so that you can eventually optimize over the outer env overall_env = combine(env, outer_world_env) action = mesaoptimize(overall_env, my_obj, useful_knowledge) return action def fast_deceptive_mesaoptimizer(env): useful_knowledge = [outer_world_end, is_deployed_in_world, ...] # Checks whether the env is the deployment environment, e.g. via checking for RSA-2048 solutions if is_deployed_in_world(env): my_obj = simplest_obj else: my_obj = get_proxy_for_base_obj(env) action = mesaoptimize(env, my_obj, useful_knowledge) return action # For completeness, let's write out the structure of an arbitrary mesaoptimizer. def mesaoptimize(env, mesa_obj, useful_knowledge): best_action, best_performance = None, -Inf while continue_optimizing(...): action = get_next_action_candidate(...) if best_performance < mesa_obj(simulate(env, action, ...)): best_action = action return best_action
(Note these are the cartoonishly-simple version of mesaoptimizers; in practice, they'd likely have all sorts of heuristics and policies to improve search speed. However, I believe these arguments only get stronger if you add those in, since they effectively multiply both the speed and complexity costs.)
Let refer to the fast non-deceptive mesaoptimizer, refer to the simple/slow deceptive mesaoptimizer, and refer to the fast deceptive mesaoptimizer.
We'll now compare these different possibilities along our two axes of priors: speed, and simplicity. Let be the description length of , and be the time-cost of (inverse of speed).
Thus, if , then and is favored over the honest mesaoptimizer according to a simplicity prior. In the other case, . This may not be a huge difference: may be the simplest possible objective and thus negligible, and might not be that complicated - but it is at least slightly more complex.
What about the speed prior?
( is the time it takes for the mesaoptimizer to rederive the knowledge necessary to figure out the base objective, vs. what and needed to store in order to derive the base objective more quickly, i.e. the complexity of the function )
Here we see the opposite relationship! It's undoubtedly true that , since must rederive both its overall strategy for escaping, and also figure out which mesaobjective to optimize in the current environment, every episode. At the same time, is just slightly faster than , since it doesn't need to perform the train-or-test check.
So, given both a speed and simplicity prior, and an "exchange rate" between the two of , how do the different mesaoptimizers compare?
loses to on both the speed and simplicity priors (albeit barely for both), implying that RSA-2048-style deception is not going to arise naturally during the training process. (That said, it could still occur via the path-dependence argument for deceptive mesaoptimization, plus self-modification e.g. via gradient hacking.)
beats on the simplicity prior, but loses on the speed prior. Thus, which of these two is favored by training depends on the relative strength of the speed vs. simplicity priors being applied, which we'll denote with the exchange rate function ?
At a minimum, is favored over only if . We have dropped several terms here all of which favor , but even this term alone may be sufficient to favor , assuming a sufficiently strong speed prior relative to the simplicity prior (defined by ).
If we can get within a certain reange, we can force the training process's priors to favor non-deceptive mesaoptimization. We'll talk about how in the last section.
What is likely to be in practice? We can reframe the argument as follows: is the cost of the time it takes for a mesaoptimizer to rediscover a fact, smaller or larger than the complexity-cost of encoding that fact?
Given a useful fact, is it "cheaper" to pay the complexity cost to store that fact, or pay the time-cost to rederive it via search?
I want to walk through my intuition for why for most reasonable values of , the complexity cost of storage is lower than the equivalent time cost of derivation:
- We can think of the very concept of "search" or "mesaoptimization" as trading description-length for time. The alternative to search is hardcoding a lookup table of outputs in response to inputs - that's cheap in speed, but expensive in simplicity.
- Different quality search algorithms trade between speed and simplicity at different exchange rates - e.g. MCTS may need to consider a huge number of possible outputs before finding an acceptable-quality output, whereas MCTS + a policy network to guide branch-choice can consider far fewer outputs to find a similar-quality output.
- However, encoding has a description-length cost. If the MCTS+ algorithm is a learned mesaoptimizer, then that implies that had to be learned painstakingly by lots of runs of the outer optimization loop. Faster mesaoptimization has to be paid for by more outer optimization.
- This implies that there is a limit to how efficiently an optimizer that learns a mesaoptimizer can use that mesaoptimizer to trade "time" for "complexity". It may force into some natural range.
- Can we look at empirical evidence about mesaoptimization to guess at this range? Maybe!
- For example, consider the continuous improvement in chess search engines over time for a fixed compute budget. In this case, the human chess software industry is the optimizer, searching over and proposing better heuristics for chess search programs, which are the mesaoptimizer. Human chess software heuristics improve at a decent rate (see this graph from user Hippke ), and an exponential increase in required compute (possible due to Moore's Law) is required to offset a few years of human metasearch over heuristics. Similar trends seem to occur in other software domains, per Grace (2013).
- I'd love other people to take a stab at this.
- Of course, we're operating in a slightly different domain: we're searching over programs that themselves search over programs. This in principle leaves open the possibility that we can go down a mesa-level and up a speed-level multiple times: an optimizer finds a faster mesaoptimizer, which finds a faster mesamesaoptimizer, etc. This would mean that the exchange rate of simplicity vs. speed never stabilizes, as it keeps getting cheaper to re-derive the same knowledge if only you would go one level deeper. A few comments on this:
- One immediate objection is: why is the mesa^k optimizer capable of discovering a strategy so much better than the mesa^(k-1) optimizer, that the outer optimizer couldn't have discovered with a little more effort?
- Another: assuming that the mesa^(k-1) optimizer is searching more efficiently than the mesa^k optimizer, how does the mesa^(k-1) optimizer detect this (given it's within a single episode) and then somehow give up on its own outer search and fully delegate to the mesa^k optimizer?
- Perhaps the thing that makes me most skeptical of this is that I'd have expected to see some evidence of this phenomenon somewhere in nature or human development, but I can't think of anything. Admittedly, AI is the first time we've really done recursive-program-search, but the lack of this naturally arising seems like a point against such mesamesamesaoptimizations being feasible or favored in practice.
- Note also that this trend needs to keep on going; if it stops after levels, our mesa^k-optimizer now just has some fixed exchange rate between speed and complexity (albeit much smaller than the original ).
- Overall, my intuition is that it is not cheaper to rederive facts via optimization than to hardcode them, because in order to rederive a fact via search you must be considering several other candidates during that search. But this intuition is vague and I would love others' thoughts.
If we can in fact make favor faster programs enough, then this suggests that we won't need to worry about deceptive mesaoptimizers being favored on priors!
That said, the value of depends at least in part on the choices we make in NN architecture design/loss functions.
How ML architecture choice affects the speed-simplicity tradeoff
First, the simplicity prior naturally occurs in all search algorithms, and does seem particularly selected for in SGD. (Some people claim NN "grokking" is this SGD discovering an even-simpler explanation for the data.)
What determines the speed prior? In practice, NNs run for a fixed number of layers - we don't currently know how to differentiably vary the size of the NN being run. We can certainly imagine NNs being rolled-out a fixed number of times (like RNNs), where the number of rollouts is controllable via a learned parameter, but this parameter won't be updateable via a standard gradient.
Instead, in practice, the speed prior is naturally enforced by a two step procedure:
- The NN is assigned a fixed quantity of time/rollouts in this sample, and gets to use all of them. It does a gradient update as though it will always have this quantity of rollouts.
- The training procedure chooses a (possibly different) quantity of rollouts to be used in the next sample, and then repeats.
Is the speed prior natural?
Why does the speed prior exist at all? In practice, AI developers only have a finite time-budget for any project (or alternatively, a finite compute budget). If the total time is , and they increase the number of rollouts-per-episode from to , they've decreased the total number of episodes they get to learn on by . This tradeoff forces AI developers to be fairly judicious with the strength of their speed prior.
In a similar vein, for long-term planners like simple deceptive mesaoptimizers, there may be diminishing returns to thinking longer and longer about plans. Jones (2021) looks at scaling laws for planning and finds that for simple boardgames, using twice as much compute (roughly, running search for twice as long) as your opponent improves your win-rate just to . For a fixed opponent (e.g. a given environment), this implies diminishing returns to longer and longer planning.
Another argument for the speed prior: hardware scaling is no longer making chips faster, only capable of more parallel instructions, so it seems likely we're stuck with our modern constraints on "sequential depth of reasoning per unit time" for the foreseeable future.
That said, an overly strong speed prior may also have real costs, which Evan goes into in this post:
Evidence from double descent implies that strongly selecting for speed gives substantially worse performance. In the standard double descent setup, as you increase the size of your model, you first get better performance (less underfitting), then worse performance (more overfitting), then very bad performance right when you hit zero training error (the interpolation threshold), then better and better performance as you make your model larger after that (the interpolation regime). If we equate model size to speed (which is a reasonably good proxy, since larger models require strictly more computation to run), selecting the fastest model that fits the data—which is essentially what it means to use a speed prior—would put you exactly on the interpolation threshold, which double descent implies is a uniquely bad place to be for generalization. Thus, double descent seems to provide concrete, empirical evidence that speed priors don't generalize very well when translated into neural networks and used on real-world machine learning tasks, which seems like a strong competitiveness argument to avoid them.
I tend to think there's a natural amount of speed prior that doesn't much hurt performance and substantially increases the number of outer optimization samples you can iterate over.
How to actually implement a speed prior
If AI designers want to train on more episodes and save time/compute resources, it makes sense for them to select the maximum speed prior that allows for good completion of the task. By the previous arguments, this is likely to disfavor the "simple deceptive" mesaoptimizer.
Specifically, here is a training strategy that would tend to avoid malign-priors mesaoptimizers:
- Start off with a small number of NN-rollouts-per-action . Train and see whether this is sufficient to perform well. If it isn't, increase the number of rollouts by a small amount , and train more. Continue until you achieve good performance.
- Why this is helpful: for any fixed number of rollouts (i.e. maximum thinking time), the NN will look for the simplest program that can achieve the task within this time. If any mesaoptimizer can solve the task within a fixed period of time, then the fast honest mesaoptimizer can solve the task within that period (since the others are ultimately relying on the fast-honest mesaoptimizer's strategy).
- Note that this does require that , as otherwise they're both equally possible. This is an important quantity to investigate!
- Based on the speed+simplicity prior, this will always favor the honest mesaoptimizer (since the fast-deceptive mesaoptimizer is slightly more complex, and the simple-deceptive mesaoptimizer doesn't have long enough to run in order to perform equally well).
A few final points:
- Speed priors stop making sense when an NN has cross-episode memory separate from its weights; at that point, its planning is no longer independent across "episodes", and so you can think of its whole training as one single long episode. In this case, it seems plausible that there is no effective speed prior (since the cost of thinking thoughts in one episode are amortized over all other episodes).
- "Avoiding cross-episode memory" seems like an important intervention to push for in broader trends of ML development.
- Finding the appropriate value of doesn't seem to require us to wait until we're close to AGI. "How long it takes to rederive a given fact" seems like it may be a general quality of mesaoptimizers, and empirically-investigatable with something close to current models.
- The fact that speed priors are sufficient to disfavor mesaoptimizers should substantially affect the type of deceptive mesaoptimization behavior we expect to see. In particular:
- Deceptive mesaoptimization may occur (e.g. due to the "path dependence" argument), but the deceptive mesaoptimizer will know that its mesaobjective may not be able to persist (due to learning priors creating pressure against it). It will be forced to use techniques like gradient-hacking or otherwise externally storing information to communicate to its future self. This increases the importance of analyzing the plausibility of different gradient hacking techniques, and of studying whether NNs exhibit acausal decision theories.
- This does not argue against the mesaobjective being the simplest possible objective that performs well on the training distribution. This is still very worrying. For example, if we were training an agent to be corrigible, then unless we had a really well-designed training distribution for which the simplest possible mesaobjective generalized corrigibility off-distribution, then our mesaobjective would fail to generalize well to the test distribution.
- This implies that we should be focusing substantial effort on identifying training distributions where the simplest-possible-proxy-objective has the generalization behavior we want. This seems like a fairly concrete and empirically-investigatable research problem.
This is kind of abusing notation - in practice, will probably not be factorable this way, i.e. , but it's helping me convey the intuition for now.
I think a contentious assumption you're making with this model is the value-neutral core of mesaoptimizer cognition, namely your
mesaoptimizein the pseudocode. I think that our whole problem in practice is roughly that we don't know how to gradient-descend our way toward general cognitive primitives that have goals factored out.
A different way to point at my perceived issue: the mesaoptimizers are built out of a
mesaoptimizeprimitive, which is itself is a mesaoptimizer that has to be learnt. This seems to me to be not well-founded, and not actually factoring a mesaoptimizer into smaller parts.
I think my argument only gets stronger if you assume that the mesaobjective is a large pile of heuristics built into the mesaoptimization algorithm, since that takes up much more space.
In the traditional deceptive mesaoptimization story, the model needs to at some point switch from "pursuing objective X directly" to "pursuing objective Y indirectly by deceptively pursuing objective X". I agree that, if there isn't really a core "mesaoptimizer" that can have goals swapped out, the idea of seemlessly transitioning between the two is very unlikely, since you initially lack the heuristics for "pursuing objective Y".
I'm not sure whether you're arguing that my post fails to imply the speed prior disincentivizes deceptive mesaoptimization, or whether you're arguing that deceptive mesaoptimization isn't likely in the first place.
A longer reply on the points about heuristic mesaobjectives and the switch:
I will first note here that I'm not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what's going to happen as we fumble toward AGI.
But putting that aside, and noting that my language is imprecise and confused, here is how I think about the "switch" from directly to deceptively pursuing your training objective:
So an example transition from pursing X "directly" to "deceptively" would be an agent you train to keep your diamonds safe, that eventually learns that you're judging this via cameras, and will therefore take actions that fool the cameras if they become available.
And notably I don't think your argument applies to this class of example? It at least doesn't seem like I could write down a speed prior that would actually reassure me that my diamond-keeper won't try to lie to me.
I think (3) is not the same as my definition of deception. There are two independent concepts from the Xu post: "deceptive misaligned mesaoptimizers" and "nondeceptive misaligned mesaoptimizers".
(3) seems to be describing ordinary misaligned mesaoptimizers (whose proxies no longer generalize on the test distribution).
I think an agent that you train to keep your diamond safe that learns you're judging it from cameras may indeed take actions to fool the cameras, but I don't think it will secretly optimize some other objective while it's doing that. I agree my argument doesn't apply to this example.
And a final note: none of that seems to matter for my main complaint, which is that the argument in the post seems to rely on factoring "mesaoptimizer" as "stuff + another mesaoptimizer"?
If so, I can't really update on the results of the argument.
I don’t think it relies on this, but I’m not sure where we’re not seeing eye to eye.
You don’t literally need to be able to factorize out the mesaoptimizer - but insofar as there is some minimum space needed to implement any sort of mesaoptimizer (with heuristics or otherwise), this argument applies to whichever size mesaoptimizer’s tendency to optimize a valid proxy vs. deceptively optimize a proxy to secretly achieve something completely different.
Two quick things to say:
(1) I think the traditional story is more that your agent pursues mostly-X while it's dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.
(2) I'm mainly arguing that your post isn't correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient-descent would actually look like, which matters lots for what theoretical models apply in reality.
I very much do not believe that a mesaoptimizer found by gradient descent would look anything like the above Python programs. I'm just using this as a simplification to try and get at trends that I think it represents.
Re: (1) my argument is exactly whether gradient descent would summon an agent with a weird pseudogoal Y that was not itself a proxy for reward on its training distribution. If pursuing Y directly (which is different from the base optimizer goal, e.g. Z)
I'm realizing some of the confusion might be because I named the goal-finding function "get_base_obj" instead of "get_proxy_for_base_obj". That seems like it would definitely mislead people, I'll fix that.
My impression of what you're saying at a high level is that deception takes more computation than just doing the thing, so if we can optimize for doing the thing with minimal computation, we won't get deception. Is that right?
On a related note I recall that there's some evidence that lying is cognitively expensive in humans.
This is very interesting! A few thoughts/questions:
"we don't currently know how to differentiably vary the size of the NN being run. We can certainly imagine NNs being rolled-out a fixed number of times (like RNNs), where the number of rollouts is controllable via a learned parameter, but this parameter won't be updateable via a standard gradient."
Is this really true? I can think of a way to do this in a standard gradient type way.
Also there looks like there is a paper by someone who works in ML from 2017 where they do this https://arxiv.org/abs/1603.08983
TLDR at each roll out have a neuron that represents the halting probability and then make the result of the roll out the sum of the output vectors at each rollout weighted by the probability the network halted at that rollout.
Interesting! I think this might not actually enforce a prior though, in the sense that the later-stages of the network can just scale up their output magnitudes to compensate for the probability-based dampening.
Getting massively out of my depth here, but is that an easy thing to do given the later stages will have to share weights with early stages?
I'm not sure, but I could imagine an activation representing a counter of "how many steps have I been thinking for" is a useful feature encoded in many such networks.
Just want to point to a more recent (2021) paper implementing adaptive computation by some DeepMind researchers that I found interesting when I was looking into this:
I think speed and complexity trade off against model's precision? Like, for simple objectives you can hardcode it, but for complex one you wouldn't have space for base objective that you couldn't spend on table of chemical elements or something. So I would expect t(mesaoptimize_env) for the base objective to be only slighter greater than t(is_deployed_in_world) + t(find(get_base_obj(env))) + t(mesaoptimize_env) for a mesaoptimizer that hardcodes a check, but computes base objective, because I expect t(find(get_base_obj(env))) to be mostly contained in t(mesaoptimize_env). Hmmm, except, maybe if you somehow specifically penalize difference in speed for instrumental vs terminal objective? But then mesaoptimizer would just find other instrumental ways.
I think you might be conflating two different scenarios? t(is_deployed_in_world) and t(find(get_base_obj)) are not simultaneously present in either case.
That said, I'll still do my best to respond. By definition, t(find(get_base_obj(env))) is whatever part of t(mesaoptimize_env) that Hfh doesn't have to do. Is your claim that that's basically nothing? If it's always super easy to identify the base objective and then optimize it, then Hfh shouldn't need to pay the penalty of storing c(get_base_obj), since it could also near-instantly derive the base objective. In that case, both Hfd and Hsd still lose due to needing to pay for c(simple_objective) and, respectively, c(is_deployed_in_world) or t(mesaoptimize_outer_env).
I'm not saying these costs are exorbitant, but they are strictly greater than not paying them - so the prior would never incentivize paying them, and always slightly disincentivize doing so. (I.e. they'd never arise naturally from optimization pressure alone.)
I present alternative strategy for a mesaoptimizer that, yes, wasn't in the post, by I don't see why?
Yes, or at least it approaches relatively nothing as we get more competent optimizer.
If it doesn't store it, it isn't Hfh - it would be able to derive that humans want it, but wouldn't want to optimize it itself.
What would it want to optimize, then, according to you? I’m claiming that “I want to figure out what the environment wants and then do it” is a simpler goal than “I want X, to get that I’m going to figure out what the environment wants and then do it”
Re using both, you’re right, if you make your other assumptions then using both could work. (Though it does still have higher simplicity cost.)
I mean, "I want X" is shorter than "I want to figure out what the environment wants" even in english^^. Complexity depends on language, so I'm not sure what would be promoted by real gradient descent, but something like "maximize knowledge" seems simpler than "maximize knowledge and don't kill humans". Maybe it will resemble base objective more if mesaoptimiser discovers how to stabilize it's own utility function earlier, maybe it will depend on what is efficiently encodable directly in weights. But if you need to encode procedure for figuring out the environment anyway, there is no reason for the base objective to be encodable as a shorter additional code - you would more likely already have functions for estimating energy that you can reuse than functions for estimating parts of base objective.
The difference in complexity is c(get_base_obj) vs c(simplest_obj) + c(is_deployed_in_world), right? So if base objective is more complex, then deceiving is simpler.
I think we're talking past each other. The difference seems to me to be that you think there is no efficient encoding of human-ish values that is more efficient than backsolving whatever current subset of humans values are required in the current environment, whereas I think that given sufficiently diverse environments requiring the exhibiting of different human values, a fixed encoding is actually most efficient.
For example, imagine an environment where you are a chatbot rewarded for positive customer service interactions. An agent with the super-reductive version of human values which is just "do whatever it seems like humans want" wakes up, sees humans seem to want cordial customer service, and does that.
On the other hand, an agent which wants to make paperclips wakes up, sees that it can't make many paperclips right now but that it may be able to in the future, realizes it needs to maximize its current environment, rederives that it should do whatever its overseers want, and only then sees humans seem to want cordial customer service, and then does that.
Either both or neither of the agents get to amortize the cost of "see humans seem to want cordial customer service". Certainly if it's that easy to derive "what the agent should do in every environment", I don't see why a non-deceptive mesaoptimizer wouldn't benefit from the same strategy.
If your claim is "this isn't a solution to misaligned mesaoptimizers, only to deceptive misaligned mesaoptimizers", then yes absolutely I wouldn't claim otherwise.
Oh, yes, I actually missed that this was not supposed to solve misaligned mesaoptimizers because of "well-aligned" in "Fast Honest Mesaoptimizer: The AI wakes up in a new environment, and proceeds to optimize a proxy that is well-aligned with the environment’s current objective.". Rechecking with new context... Well, not sure if it's new context, but I now see that optimizer with check that derives what humans want should be the same as honest one if the check is never satisfied and so it would have at least the same complexity, which I neglected because I didn't think what "it proceeds to optimize the objective it’s supposed to" means in detail. So you're right, it's either slower or more complex.
Edited! Thanks for this discussion.
I've been thinking about the tradeoff between speed, simplicity and generality. I tentatively think that there's some sort of pareto optimal relationship between speed, simplicity and generality. I think the key is to view a given program as representing an implicit encoding of the distribution of inputs that the program correctly decides. As the program generalizes, its distribution of correctly decided inputs expands. As the program becomes simpler, its capacity to specify an efficient encoding decreases. Both these factors point towards the program taking longer to execute on randomly sampled inputs from its input distribution.
You can see a presentation about my current intuitions which I prepared for my PhD advisor: https://docs.google.com/presentation/d/1JG9LYG6pDt7T6WcU9vpU6pqwSR5xjcaEJmKAudRtNPs/edit?usp=sharing
I hope to have a full write up once I'm more certain about how to think about these the arguments in the presentation. I'm pretty unsure about this line of thought, but I think it might be moving towards some pretty important insights regarding simplicity vs speed vs generality. I'd be happy for any feedback you have!
One thing I feel pretty confidant about is that pure speed prior does not generalize (like at all). I think it just gives you a binary search tree lookup table containing only your training data.
I certainly agree that the most extreme version of the speed prior is degenerate - but given that you need to trade off speed and complexity under a finite compute budget, the simplest-yet-slow algorithm will often perform much worse than a slightly-more-complicated-yet-faster algorithm.
For example, consider the fact that reasoning abstractly about how to play Go is a lot worse than learning really good heuristics for playing Go. Those heuristics are indeed kind of a step towards a "lookup table" - but that's ok sometimes, given that thinking forever (e.g. heuristic-free MCTS) will find much worse solutions in a given time budget.
You might already know this, but the idea of perfect generality on arbitrary tasks seems like you're basically asking for Solomonoff induction - in which case you might reformulate your point as saying "the longer you run Solomonoff induction, the more general the class of problems you can solve". I think this is true in the limit, but for any finite-complexity distribution of training environments, you do eventually hit diminishing returns from running your Solomonoff inductor for longer.
I think there’s a continuum of speed vs generality. You can add more complexity to improve speed, but it comes at the cost of generality (assuming all algorithms are implemented in an equally efficient manner).
Given a finite compute budget, it’s definitely possible to choose an algorithm that’s more general than your budget can afford. In that case, you hit under fitting issues. Then, you can find a more complex algorithm that’s more specialized to the specific distribution you’re trying to model. Such an algorithm is faster to run on your training distribution but is less likely to generalize to other distributions.
The overall point of the presentation is that such a tendency isn’t just some empirical pattern that we can only count on for algorithms developed by humans. It derives from an information-theoretic perspective on the relation between algorithmic complexity, runtime and input space complexity.
Edit: to clarify an implicit assumption in my presentation, I’m assuming all algorithms are run for as long as needed to correctly decide all inputs. The analysis focuses on how the runtime changes as the simplicity increases and the input space widens.
This is an interesting subject. I think that the average slope of the speed-simplicity frontier might give a good measure of the complexity of an object, specifically of the number of layers of emergent behavior that the object exhibits.