Thanks to Paul Christiano, Mark Xu, Abram Demski, Kate Woolverton, and Beth Barnes for some discussions which informed this post.
In the ELK report, Paul, Mark, and Ajeya express optimism about penalizing computation time as a potentially viable way to select the direct translator over the human imitator:
Human imitation requires doing inference in the entire human Bayes net to answer even a single question. Intuitively, that seems like much more work than using the direct translator to simply “look up” the answer.
Compared to all our previous counterexamples, this one offers much more hope. We can’t rule out the possibility of a clever dataset where the direct translator has a large enough computational advantage to be preferred, and we leave it as an avenue for further research.
I am more skeptical—primarily because I am more skeptical of the speed prior's ability to do reasonable things in general. That being said, the speed prior definitely has a lot of nice things going for it, and I do think it's worth taking a careful look at both the good and the bad that the speed prior has to offer. Conceptually, what we want to pay attention to when evaluating a prior from an AI safety perspective is threefold: it needs to favor good models over bad models (e.g. the direct translator over the human imitator), it needs to be competitive to implement, and it needs to favor models with good generalization over models with bad generalization (e.g. the resulting models need to themselves be performance competitive).
Before I do that, however, an important preliminary: there are multiple different forms/types of speed priors, so when I say “the speed prior,” I really mean a class of priors including:
- various combinations of circuit-size and circuit-depth complexity and
- various combinations of Turing machine time and description complexity, where Turing machine time complexity can be measured either by taking the max number of steps taken across all inputs or the average number of steps taken on a particular distribution of inputs.
The basic structure of this post will be a dialog of sorts between a pro-speed-prior and an anti-speed-prior perspective. I'll start with some of the arguments in favor of the speed prior and how the anti-speed-prior perspective might respond, then give some arguments against the speed prior and how the pro-speed-prior perspective might respond.
Why you should love the speed prior
- As in the ELK report, there is a plausible argument for why there exists a speed prior that would prefer the direct translator to the human imitator. Naively, the problem with the speed prior here is that the computation required for the human imitator is proportional to the size of the human's Bayes net, whereas the computation required for the direct translator is proportional to the size of the model's Bayes net—and in the superhuman limit we should expect the latter to be substantially larger than the former. The argument in the ELK report, however, is that while this argument is valid in the limit, there's reason to believe it might be invalid for all the finite cases that we care about. That's because perfect inference in either Bayes net, and thus perfect loss, shouldn't be possible in any finite case. Thus, the performance of the ontology mapping function, and thus its loss, should be proportional to how much computation it puts into its inference task—for which the direct translator has a big advantage, since it gets to reuse the computation performed by the model.
- Response: The obvious response here, and the response that is given in the ELK report, is that the above argument is very fragile—it relies on inference in the human's Bayes net being too hard to always get right on the training distribution, which is a strong assumption both about the difficulty of inference and the complexity of the training data. Furthermore, as the ELK report also notes, it's not enough for the direct translator to just be more efficient than the human imitator: the direct translator has to be a cost-effective improvement (in terms of how much loss/computation it saves per increase in description complexity) compared to all other possible mappings. In my opinion, I think this is where this argument really fails. To start with, note that if you push your speed bias far enough (e.g. a strong enough circuit depth complexity or Turing machine time complexity penalty), you just get a lookup table that memorizes everything. Thus, to get this to work, you have to use a prior with a pretty meaningful simplicity component—but then you're back at the original problem that the direct translator could be substantially more complex than the human imitator, potentially so much so that it outweighs whatever loss/time advantage the direct translator might have. One thing I have learned in spending a lot of time thinking about ELK is that there are a lot of extra terms that you can add to a simplicity prior that help the direct translator more than the human imitator—the problem is that none of them that I have found so far work on their own without the simplicity prior term there as well, which means that as long as you're working in the worst-case world where the direct translator can be arbitrarily more complex than the human imitator, they don't constitute full solutions—and I think that speed-prior-based solutions should be put in this same category.
- Not only might there exist a speed prior which solves ELK, there might exist a speed prior that prevents deceptive alignment as well. That's because planning how to be deceptive is one step removed from planning how to accomplish the intended goal. Thus, while both deceptive and non-deceptive models have to spend time on how to optimize for the base objective, deceptive models also have to spend time deducing that optimizing for the base objective is the right thing for them to be doing.
- Response: Unfortunately, there are three pretty major issues with the above argument. The first is the same issue that we had with the ELK argument, which is that once you include a simplicity component, it's very hard to know whether the non-deceptive model is actually a cost-effective improvement in speed in terms of how much additional description complexity that improvement in speed costs. The second problem here is that the above argument becomes substantially less strong in the presence of memory—if the deceptive model can just cache its notion of the base objective, the speed penalty reduces to the difficulty of that lookup rather than the difficulty of fully rediscovering the base objective each time. Finally, the last problem with the above argument is that it assumes that both the deceptive and non-deceptive models will in some sense be performing optimization “directly” rather than doing their own search over algorithms to find one which solves the problem effectively—with such a search being a pretty plausible mechanism if the model is trained to solve complex learning problems. The problem with this, however—as I detail in “Are minimal circuits deceptive?” (also summarized here)—is that the fastest way to search for an algorithm to solve a problem isn't to find the fastest possible algorithm. Rather, the fastest search procedure over algorithms is one that's going to be heavily biased towards finding the most “reachable” algorithms that are easiest for the search procedure to find quickly—which is highly likely to be simple, easy-to-specify algorithms. Thus, if you use a speed prior and find a model that's doing its own internal such over algorithms, that search might be heavily simplicity-biased, thus resulting in the eventual algorithm being executed being strongly simplicity-biased and thus highly likely to be deceptive.
- Speed priors are much closer to what our brains do than simplicity priors. Compared to computers, the human brain has a huge amount of accessible, local memory, but runs at a very slow serial computation speed. Thus, storing lots of bits of information is quite cheap for the brain, whereas doing lots of serial computation is quite expensive for it. Therefore, if we think of the brain's architecture as imposing a prior over what algorithms it can learn, that prior is pretty heavily speed-weighted. Thus, to the extent that we like the prior used to select us—e.g. to the extent that humans are aligned, good at generalization, etc.—we should like the speed prior.
- Response: The obvious response to the above is that it's not clear that we should actually like the prior used by human brains all that much once it's divorced from all of the random hard-coded bits that produce human values. That being said, this does seem like a reasonably strong argument for why the speed prior has the potential to be competitive at solving complex tasks, since human brains are clearly capable of that—though the key word there is “potential,” since it's very unclear how important the other aspects of the human brain prior are relative to the speed part.
Why you should hate the speed prior
- Speed priors often don't generalize properly even for very simple learning problems. In particular, speed priors heavily incentivize procedures like lookup-table-style memorization (which doesn't generalize to anything not in the lookup table) and loop inlining (which doesn't generalize to any cases where more iterations of the loop than were ever needed previously are necessary). As a simple example, suppose you want to use a speed prior to model Newtonian physics. We'll say you have perfect spheres bouncing around in a fixed cube. However, suppose that, during training, most spheres are only locally interacting with their close neighbors. In such a case, a simplicity prior should learn a general algorithm for handling collisions between any two spheres—trying to only handle local collisions would be more complex from a simplicity perspective. For a speed prior, however, the exact opposite is the case—the speed prior is heavily incentivized to avoid checking for collisions between any pair of objects and instead only check for local collisions, which substantially improves speed during training but completely breaks down once the spheres start mixing.
- Response: It's unclear how bad this generalization problem really is, especially for a sufficiently diverse training set and a reasonable speed/simplicity mixture. For example, a circuit size complexity prior on the physics example above should generalize correctly for collisions at new positions within a local region, so it's not as if it has no generalization ability—and for a dataset diverse enough to include substantial non-local mixing, it should be able to fully learn the problem. Furthermore, though it might seem like this is a major competitiveness hit relative to a simplicity prior, remember that an actual Kolmogorov-style simplicity prior is impossible to implement competitively, so that's not really a valid basis for comparison. Once you start comparing to more concrete priors, it is entirely plausible that speed penalties can be introduced while still staying on the Pareto frontier between competitiveness and generalization.
- Most speed priors come with tunable parameters, e.g. how to balance between speed and simplicity, that you have to get right. In many of the arguments in favor of the speed prior above, I led with “there exists a speed prior such that”—but that's hardly comforting if we don't know how to find such a speed prior, a problem we saw in many of my responses above. It's worth pointing out that this problem is much more pronounced for Turing-machine-style speed priors than it is for circuit-complexity-style speed priors, as Turing machine speed alone is insufficient to give you an integrable probability distribution, requiring the inclusion of something like simplicity—whereas circuit size does give you an integrable distribution all on its own. Conceptually, however, that's just because circuit size complexity includes its own implicit simplicity prior, as the total number of logic gates is a type of description length.
- Response: As above, I think the best way out of this issue is through circuit size complexity, since it uniquely doesn't have a tunability problem. As I just pointed out, however, that's not because it has no simplicity component, but rather because the simplicity component is implicit in circuit size. Unfortunately, it's not currently clear why the specific balance struck by circuit size complexity would end up in the right spot to solve the ELK and deception problems above—but I also don't currently have a strong reason why circuit size complexity wouldn't do the right thing in those cases, which leaves open the possibility that circuit size really is the right way to go here.
- Evidence from double descent implies that strongly selecting for speed gives substantially worse performance. In the standard double descent setup, as you increase the size of your model, you first get better performance (less underfitting), then worse performance (more overfitting), then very bad performance right when you hit zero training error (the interpolation threshold), then better and better performance as you make your model larger after that (the interpolation regime). If we equate model size to speed (which is a reasonably good proxy, since larger models require strictly more computation to run), selecting the fastest model that fits the data—which is essentially what it means to use a speed prior—would put you exactly on the interpolation threshold, which double descent implies is a uniquely bad place to be for generalization. Thus, double descent seems to provide concrete, empirical evidence that speed priors don't generalize very well when translated into neural networks and used on real-world machine learning tasks, which seems like a strong competitiveness argument to avoid them.
- Response: While the above does seem true for overall model size—and thus total computation time—the same does not seem to be true if you just look at model depth as your proxy for speed instead. That does mean we're looking at circuit depth rather than circuit size and max Turing machine steps rather than average Turing machine steps, so we do have to give up half of our possible speed priors. If we do that, however, scaling laws find that you can train a model with a very large range of possible width/depth ratios and get equivalent performance, indicating that heavily penalizing depth is clearly compatible with good performance across a wide range of possible situations.
- As I pointed out previously, the speed prior has the unfortunate property that the fastest way to do a search over algorithms isn't to search for the fastest algorithm, as I detail in “Are minimal circuits deceptive?” (again also summarized here). Furthermore, it seems likely that fast methods of searching over algorithms would bias towards simple algorithms, since simple algorithms are by definition the easiest to specify and thus likely to be the easiest to find. Therefore, any guarantees regarding the speed prior that we might start with could collapse if we end up with a model doing its own internal search over algorithms, resulting in the prior that actually determines the final algorithm being far more simplicity-biased than was originally intended.
- Response: I think the best response to this problem is that it's unclear if this problem is really meaningfully unique to the speed prior. Even for a simplicity prior, the simplest search algorithm isn't necessarily going to use the exact same criteria for simplicity in its search compared to the original simplicity prior.
- The simplicity prior does a much better job at predicting modern physics than the speed prior. Quantum mechanics is notoriously hard to simulate, but quite compact to specify, with the full standard model Lagrangian being small enough to fit on a single page. Furthermore, despite there being many known techniques of approximating quantum systems to get essentially the same result much faster, we see no evidence that nature ever actually takes such shortcuts—and before anyone claims that wave function collapse is such a shortcut, in fact it's the exact opposite: the phenomenon of apparent collapse is a facet of the universe computing an entire Everettian multiverse, which is definitely not a fast thing to do. Thus, if you want to effectively model the physical world, it seems that the simplicity prior is likely to do a much better job.
- Response: While it seems clear that fundamental physics is heavily simplicity-biased, I think it's very unclear whether the same thing is true for more high-level phenomenon. There are clearly a lot of things in the world that are heavily speed-biased—e.g. humans, as I pointed out previously—such that using a speed prior might do very well at modeling them. Furthermore, there might even be a general principle behind this (and the same general principle behind why the brain is so speed-biased): energy efficiency is heavily correlated with speed, so when trying to model phenomena that were selected to be highly energy efficient, a speed prior seems like a pretty good choice.
I don't really have a single, unified conclusion to take away from all of this. Like I said at the beginning, I think I tend towards skepticism of the speed prior's ability to solve AI safety problems, at least singlehandedly, but I can't dismiss it completely and I think there are clearly strong and compelling reasons to like it. I do feel like moving in the direction of speed bias is likely to increase safety all things considered, though I also feel like there's a reasonable chance that doing so might also reduce competitiveness—in which case it's very unclear to me if that's where we want to place our alignment tax.