On October 5, 1960, the American Ballistic Missile Early-Warning System station at Thule, Greenland, indicated a large contingent of Soviet missiles headed towards the United States. Fortunately, common sense prevailed at the informal threat-assessment conference that was immediately convened: international tensions weren't particularly high at the time. The system had only recently been installed. Khrushchev was in New York, and all in all a massive Soviet attack seemed very unlikely. As a result no devastating counter-attack was launched. What was the problem? The moon had risen, and was reflecting radar signals back to earth. Needless to say, this lunar reflection hadn't been predicted by the system's designers.
Over the last ten years, the Defense Department has spent many millions of dollars on a new computer technology called "program verification" - a branch of computer science whose business, in its own terms, is to "prove programs correct". [...]
What, we do well to ask, does this new technology mean? How good are we at it? For example, if the 1960 warning system had been proven correct (which it was not), could we have avoided the problem with the moon? If it were possible to prove that the programs being written to control automatic launch-on-warning systems were correct, would that mean there could not be a catastrophic accident? In systems now being proposed computers will make launching decisions in a matter of seconds, with no time for any human intervention (let alone for musings about Khrushchev's being in New York). Do the techniques of program verification hold enough promise so that, if these new systems could all be proven correct, we could all sleep more easily at night?
- The Limits Of Correctness, by Brian Cantwell-Smith
Spoiler: the answer to Cantwell-Smith’s question is “no”. No amount of program verification would solve the moonrise problem. The fundamental problem with program verification, as Cantwell-Smith explains quite well, is that one can only verify that the program matches a specification, not that the specification does what one wants in the real world. And in practice, the specification is not much more likely to be correct than the program itself. If the engineers didn’t realize they needed to account for moonrise when writing the program, they’re not going to realize they need to account for moonrise when writing the spec either.
… and yet, I claim that the moonrise problem is not completely intractable in-principle. It is possible in-principle to write nuclear warning software which will correctly handle moonrises and all the other complications of the real world, without the engineers needing to know about all those complications in advance, without needing exhaustive data on all those complications in advance, and without just guessing that e.g. a machine learning system will generalize off-distribution in the intended way. It is possible in-principle to prove “robust correctness” - to prove that a nuclear warning system will (with high probability) continue to operate as intended even in novel unplanned-for situations. Though of course such a proof would look nothing like today’s “program verification”, and indeed would probably not draw on program verification theory or tools at all.
This post will sketch what I imagine such a system and proof might look like, and its limitations.
I’m going to assume a particular ML architecture, not because I think it’s a necessary or likely way for a moonrise-robust system to work, but rather just to have a concrete picture in mind.
At a very high level, today’s image models take in a whole bunch of independent normally-distributed noise, run it through a learned function, and spit out a realistic-looking image. We’ll imagine something like that, but with more data modalities: a generative model which takes in a bunch of independent noise, runs it through a learned function, and spits out realistic-looking data across all of its modalities, whether images (e.g. satellite surveillance), text (e.g. newspaper headlines mentioning Khrushchev's travels), radar readings, etc., etc.
Like today’s image generators, such a system can also be used in-principle for prediction rather than generation. It’s a generative probabilistic model; the function defining its distribution P[data] may be complicated, but in principle one can do all the normal probability things with it.
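To make the “noise in, data out” picture concrete, here’s a minimal sketch with a hand-coded stand-in for the learned function (a real system would be a trained network; all names, shapes, and numbers here are invented for illustration). The point is just that once you have the pushforward of noise through a function, you can estimate probabilities of events by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned generative network: a fixed function mapping
# independent Gaussian noise to "data" (here, a 2-D point with two modes).
# Purely illustrative, not a real architecture.
def generator(z):
    mode = (z[:, 0] > 0).astype(float)                  # noise picks the mode
    centers = np.stack([4 * mode - 2, np.zeros_like(mode)], axis=1)
    return centers + 0.5 * z

# "All the normal probability things": estimate P[data lands in a region]
# by pushing noise samples through the model and counting.
z = rng.standard_normal((100_000, 2))
x = generator(z)
p_right_mode = np.mean(x[:, 0] > 0)
print(f"P[x_0 > 0] ≈ {p_right_mode:.3f}")               # ~0.5 by symmetry
```

The learned function may be arbitrarily complicated, but the same sampling trick applies: any event you can evaluate on the output, you can estimate the probability of.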
We will not worry about how specifically the system is trained; just imagine it’s some future technique which has not yet been discovered. We will assume that the system can learn online to a large extent, and that its predictions generalize well off distribution.
Now, one immediate idea would be to include among the system’s many data modalities a one-bit channel indicating whether a nuclear exchange had been initiated. Then the system could be used to predict that bit just like everything else it predicts.
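That immediate idea can be sketched in a toy setting: hard-code a joint distribution over observables and the one-bit channel (standing in for the learned generative model), then predict the bit by conditioning. The variable names `blips` and `exchange` and every number below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical joint model over (radar blip count, exchange bit). In the
# real proposal this joint distribution would be learned; here it is
# hard-coded so the conditioning step is easy to follow.
def sample_world(n):
    exchange = rng.random(n) < 0.001            # exchange is rare a priori
    # Blip count is noisy, and much higher when an exchange is underway.
    blips = rng.poisson(np.where(exchange, 50, 3))
    return blips, exchange

blips, exchange = sample_world(1_000_000)

# Predict the bit by conditioning: P[exchange | blips >= 30],
# estimated by simple rejection sampling from the joint model.
mask = blips >= 30
p_exchange_prior = exchange.mean()
p_exchange_given_blips = exchange[mask].mean()
print(f"prior: {p_exchange_prior:.4f}, given heavy blips: {p_exchange_given_blips:.3f}")
```

Mechanically this works fine; the problem, as the next paragraph explains, is where the training examples for the bit would come from.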
Alas, we have zero real-world training examples of initiation of nuclear exchange. (Ok, technically there’s Hiroshima and Nagasaki, but that’s probably a pretty misleading example for learning about a hypothetical present-day nuclear exchange.) We could cook up some artificial examples, but then we’re right back to moonrise problems: just like engineers trying to make a spec will inevitably miss important real phenomena like moonrise, engineers trying to make artificial data will inevitably miss important real phenomena like moonrise. When trained on real-world data, I’m willing to assume the system will generalize well to predict new real-world raw data even off-distribution. I am not willing to assume that a system trained on artificial examples would generalize well to real-world data off-distribution, for exactly the same reasons I’m not willing to assume that a spec someone hand-coded is correct. Humans are not robustly good at that.
Upshot: our system will have to do something else, rather than the most basic supervised learning setup, for detecting nuclear exchange.
Our core hypothesis is that, somewhere in the ML system, whether in “one spot” or “spread out” or embedded in a complicated way, the system represents its own internal indicator for whether nuclear exchange has been initiated. Maybe it’s the activation of one “neuron” (but probably not). Maybe it’s the activation along a certain linear direction of activation space in one layer (more plausible, but still probably not). Most likely, it’s some embedding which we still haven’t figured out yet. Whatever it is, we’ll call that representation the “latent variable” representing initiation of nuclear exchange.
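To make the “activation along a certain linear direction” case concrete, here’s a toy linear-probe sketch on synthetic “activations” in which such a direction exists by construction. Everything here is illustrative; real models need not be anywhere near this tidy, which is exactly why this is the “plausible, but still probably not” case:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic activations: a binary concept shifts activations along one
# fixed direction ([1, -1, 0, 0]), plus Gaussian noise.
n, d = 2000, 4
true_dir = np.array([1.0, -1.0, 0.0, 0.0])
concept = rng.integers(0, 2, n)                      # hidden binary concept
acts = rng.standard_normal((n, d)) + np.outer(2 * concept - 1, true_dir)

# Logistic-regression probe via plain gradient descent; the learned
# weight vector is the candidate "linear direction" for the concept.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-acts @ w))
    w -= 0.1 * acts.T @ (p - concept) / n

probe_dir = w / np.linalg.norm(w)
alignment = abs(probe_dir @ (true_dir / np.linalg.norm(true_dir)))
print(f"probe/true direction alignment: {alignment:.2f}")
```

When the representation is genuinely a linear direction, a probe like this recovers it; the interesting (and likely) case is when no such direction exists and the representation is some embedding we haven’t figured out yet.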
Now, the point of the Moonrise Problem is not merely to produce a system which works, although that is a necessary part of the goal. The point is also not merely to produce a system which robustly works, although that is also a necessary part of the goal. The point is to produce a system which provably robustly works, i.e. a system in which we can prove the core load-bearing assumptions on which the system’s functionality relies. That’s why we had all that talk about program verification upfront: program verification purports to prove that programs will work. The tools of program verification aren’t particularly useful for this problem, but the problem of proving our core load-bearing assumptions still stands.
So the question is: what properties could we in-principle prove mathematically (assuming they in fact turn out to be true), about some latent variable supposedly representing initiation of nuclear exchange inside a future ML system, which together would be enough to establish that our nuclear warning system based on that latent variable will actually robustly work? That’s our Moonrise Problem.
In this section I’ll present a candidate list of properties and assumptions. I’m not sure that this is the ideal list, or even if it works at all; maybe I am missing key things! The important point is that we have a very concrete problem on hand, so we can talk very concretely about whether the list of properties and assumptions given would be sufficient - and if not, how it could be changed, or what the nearest unblocked option might be. My main goal here is not to advocate for this particular list, but rather to illustrate the Moonrise Problem as a useful test-case for considering properties of “safe” AI.
I am willing to assume that our ML system is capable of predicting raw real-world data, even off distribution. This is important, because it rules out some not-very-interesting failure modes which would otherwise be impossible to disprove - like e.g. the possibility that at some particular time the physics of our entire universe suddenly changes to something completely different, and the entire concept of nuclear exchange ceases to make sense at all. (It’s still possible that the entire concept of nuclear exchange will cease to make sense at all for more mundane reasons! We’ll address that possibility shortly.)
The reason I’m willing to assume this is because it’s a pure, core capabilities problem. Insofar as the ML system doesn’t satisfy this assumption, it’s probably not that dangerous (yet).
First, we’d like to establish that a particular pattern exists in the training data (and therefore in whatever world generated the training data, i.e. our physical world). There are lots of tools for the simplest version of this already; one simply needs to be able to predict better than maxent. But we’d like something with a little more oomph.
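The simplest version of the test, “predict better than maxent,” can be sketched directly: generate data with a real pattern in it, and compare the log-loss of a model exploiting the pattern against the maximum-entropy (uniform) baseline. The 90% persistence rate below is an invented toy number:

```python
import numpy as np

rng = np.random.default_rng(3)

# A "real pattern": binary data where each bit copies the previous bit
# 90% of the time.
n = 50_000
bits = np.empty(n, dtype=int)
bits[0] = 0
flips = rng.random(n) < 0.1
for i in range(1, n):
    bits[i] = bits[i - 1] ^ flips[i]

# Log-loss (bits per symbol) on symbols 1..n-1 for each model.
maxent_loss = 1.0                                   # -log2(0.5)
p_repeat = 0.9
pred = np.where(bits[1:] == bits[:-1], p_repeat, 1 - p_repeat)
pattern_loss = -np.mean(np.log2(pred))
print(f"maxent: {maxent_loss:.3f} bits/symbol, pattern model: {pattern_loss:.3f} bits/symbol")
```

The gap in log-loss is the simplest evidence that the pattern really is in the data; the “more oomph” we want is something stronger than this bare comparison.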
Ideally, we’d like a notion of “real pattern” which comes with a Correspondence Principle. The Correspondence Principle is the idea that new models should agree with old models in places where the old models predicted correctly - e.g. General Relativity must reduce to Newtonian Gravity in the broad regime where Newtonian Gravity predicted correctly; otherwise our “General Relativity” model has a capability shortcoming. We’d like a version of the Correspondence Principle for whatever notion of “real pattern” we use - i.e. some principle saying that, if we’ve found a pattern which fits certain rules, then any future model must agree with that pattern at least in the regime where it has predicted correctly thus far; otherwise the new model has a capability shortcoming.
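A crude sketch of checking correspondence, with toy stand-ins for the old and new models (the functions, regime boundaries, and tolerance are all invented for illustration):

```python
import numpy as np

# "old_model" predicted correctly on the old regime; any candidate
# "new_model" must agree with it there, or it has a capability shortcoming.
def old_model(x):
    return x                        # accurate for small x, like Newtonian gravity

def new_model_good(x):
    return x / (1 - 0.01 * x)       # reduces to the old model as x -> 0

def new_model_bad(x):
    return 0.5 * x                  # disagrees even inside the old regime

def corresponds(new, old, old_regime_x, tol=0.05):
    """True iff `new` agrees with `old` (to relative tolerance) on the old regime."""
    o, n = old(old_regime_x), new(old_regime_x)
    return bool(np.max(np.abs(n - o) / np.maximum(np.abs(o), 1e-9)) < tol)

old_regime = np.linspace(0.1, 1.0, 50)   # where the old model was verified
print(corresponds(new_model_good, old_model, old_regime))  # True
print(corresponds(new_model_bad, old_model, old_regime))   # False
```

The mathematical principle we actually want would be proven, not checked empirically like this; the sketch just shows what “must agree in the old regime” cashes out to.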
This property would be proven mathematically, from a set of assumptions which could be verified for the system at hand.
As a toy example, imagine that the one real pattern our ML system found looks like clusters in some space. Then we might guess that a wide range of systems trying to predict that pattern would internally represent the cluster identifier for a datapoint. In particular, a wide range of systems would factor out the cluster identifier, i.e. represent it in such a way that it’s not entangled with a bunch of other stuff, because the system will probably need to use that id a bunch in order to do its prediction, and if the cluster id representation is too entangled with other stuff then that other stuff will interfere with the predictions.
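The toy cluster story can be made concrete: with well-separated clusters, even a tiny k-means recovers a one-bit cluster id, and that id is exactly the kind of compact, reusable variable we’d guess a wide range of predictors would factor out internally. All data and numbers below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)

# 2-D data with two well-separated clusters.
x = np.concatenate([rng.normal(-3, 0.5, (200, 2)),
                    rng.normal(+3, 0.5, (200, 2))])

# Two-means: a handful of Lloyd iterations recovers a one-bit cluster id.
centers = np.array([[-1.0, 0.0], [1.0, 0.0]])
for _ in range(10):
    ids = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([x[ids == k].mean(axis=0) for k in range(2)])

# The recovered id matches the true cluster structure (first 200 points
# were drawn from the first cluster).
purity = max(np.mean(ids[:200] == 0), np.mean(ids[:200] == 1))
print(f"cluster-id purity on first cluster: {purity:.2f}")
```

The convergent-factorization claim is about something stronger than this: not just that the id is recoverable, but that many different predictor architectures would all end up representing it in a disentangled way.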
This is one specific example of a (conceptual) convergent factorization story: a reason for some particular internal variable to be represented in a way factored apart from the system’s other internals, across many different system architectures.
Ideally, we’d like to identify a convergently-factored representation corresponding to initiation of nuclear exchange, within the ML system’s machinery for predicting a specific real pattern (i.e. some pattern having to do with nuclear strategy). The mathematical challenge would be to prove convergent factorization for that representation. Ideally, we’d like the “wide range of systems” across which the factorization is convergent to include both the ML system and the human brain. That way, we’d have mathematical justification for the assumption that the representation we’ve identified in the ML system matches some representation in the human brain.
If we got really fancy and brought in brain scanning tech, we might even be able to directly establish that the representation identified in the ML system matches the mental representation in a human brain which the words “initiation of nuclear exchange” point to. And then we’d really, properly have a solution to the Moonrise Problem.
Conceptually, the assumptions and properties above are sufficient to establish that:
… but that doesn’t mean that humans’ concept of initiation of nuclear exchange will itself generalize off-distribution.
A canonical toy example: imagine that, on distribution, we have two clusters, corresponding to cars and trucks:
But then, someone comes along and builds a bunch of new vehicles which are sort of in-between cars and trucks, on a whole spectrum, so the distribution of vehicles now looks like this:
At that point, it doesn’t even make sense to talk about “cars” and “trucks” as distinct categories any more, at least among new vehicles. (The car/truck labels would still be useful on older vehicles, which is where the proofs would still be able to make some guarantees: a Correspondence Principle would say that any new model must still reduce to the old cluster model for predicting old vehicles, else the new model is suboptimally-capable. And convergent factorization would then say that the new model will still convergently use car/truck labels for older vehicles.)
That said, if the original real pattern on which the representation of initiation of nuclear exchange is based fails to generalize to new data, that’s at least a detectable problem in principle. So presumably the system would throw an error and bring in a human if and when the nuclear-exchange equivalent of the car/truck issue became relevant.
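The car/truck failure is detectable in exactly this sense. Here’s a crude sketch on a 1-D “vehicle size” axis (all distributions and thresholds invented for illustration): when the clusters are real, the gap between them is nearly empty; when the spectrum of in-between vehicles arrives, the gap fills up, and the system can flag that its two-category model has stopped making sense:

```python
import numpy as np

rng = np.random.default_rng(5)

# On-distribution: two clusters of "vehicle sizes" (cars near 2, trucks
# near 8). Off-distribution: a uniform spectrum of in-between vehicles.
old_vehicles = np.concatenate([rng.normal(2, 0.3, 500),    # cars
                               rng.normal(8, 0.3, 500)])   # trucks
new_vehicles = rng.uniform(2, 8, 1000)                     # the spectrum

def gap_occupancy(x, lo=4.0, hi=6.0):
    """Fraction of points in the region between the two old cluster means.
    Near zero while the clusters are real; large once they've merged."""
    return np.mean((x > lo) & (x < hi))

print(f"old: {gap_occupancy(old_vehicles):.3f}, new: {gap_occupancy(new_vehicles):.3f}")
```

A real system would want a principled version of this check (derived from whatever notion of “real pattern” the proofs use), but the basic move is the same: monitor whether the pattern underlying the latent variable still holds, and escalate to a human when it doesn’t.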