ELK is described as a problem we wish to solve in the "worst case"; IE, for any potential failure case we can describe, we want to have a strong argument about why we won't run into that particular problem. The ELK document lists several cases we want to handle in that way. First in this list is the case where the correct translation is arbitrarily computationally complex.
However, this assumption implies that ELK is intractable. At best, we might hope to define some notion of approximate solution, and prove that there are efficient approximate solutions. So how are we supposed to solve ELK, if we are to assume that it's intractable?
The answer (at least, as I see it) is by arguing that the correct translation cannot be too computationally complex. (Or, failing that, by providing a strong argument that it is possible, at which point we would have to narrow the scope of ELK ambitions.)
A proposed algorithm for solving ELK which (1) has a strong argument for its correctness, and (2) is itself a tractable algorithm, is usually already an argument that correct translations are tractable. For example, a search procedure usually has to evaluate candidate translations by running them; so if we could prove that the search terminates in a reasonable amount of time, we've probably already proven that the solution it finds is tractable (at least in certain cases, EG, on the training data).
But this doesn't really answer the question. Since we don't already have such a solution, we may need to explicitly think about tractability in order to produce one.
So, in this post, I'm going to think about how we might possibly bound the computational complexity of ELK.
Level 3: Probabilistic Knowledge
The third level of difficulty for ELK (IE, the most computationally difficult that I'll discuss here) is the most conceptually simple to understand: we define the "correct translation" of a pattern of activity in the Predictor to be any and all things correlated with such activity.
In other words, we define the "meaning" of Predictor activity as just whatever we can predictably infer from it. In my previous post I formalized this by imagining probability distributions called "third person perspectives" which can be conditioned on the full execution trace of the Predictor, and infer as much as possible about the world from this information.
This seems intractable for two main reasons:
- It requires logical omniscience. For example, this criterion seemingly requires that the correct translation knows all the predictable consequences of a plan, not just the immediate consequences. So long as the consequences can be predicted by running the world-model forward, they count as probabilistic implications of the Predictor's knowledge. In particular, the future outcomes of any computations running in the world are included in this. Therefore, computing this form of ELK requires access to a halting oracle in the worst case.
- Even if we can side-step the first issue, this form of ELK is at least as hard as finding marginals in arbitrary graphical models, which is #P-complete. As complexity classes go, this is worse than NP-complete.
By trying to detect any correlation, we impose an arbitrarily high requirement of detailed simulation. As a consequence, this Reporter would know far more than the Predictor can realistically be said to know. If the Predictor knows how the dominoes have been set up, the Reporter is required to know how they fall.
From a safety perspective, we're fine with the reporter knowing too much -- we're only concerned with cases where the Predictor knows something which the humans don't know. However, this does seem like a sign that we don't have to solve such a hard version of the problem.
So, can we specify a more tractable version?
Level 2: Computationally Accessible Knowledge
The intuition I'm trying to capture with 'level 2' is: to "know" something means that a (computationally) relatively close version of you could report the information. This rules out computationally intractable "knowledge" which follows probabilistically from the Predictor's state, but which cannot be easily computed from it.
To put it a different way: knowledge is only knowledge if it can feasibly be used to do something. For example, if I have a book telling me how to use the SuperSort algorithm (a hypothetical sorting algorithm that’s better than linear time), then there’s a pragmatic sense in which I know how to write SuperSort -- if you ask me to optimize some code which includes a sorting step, I’ll look up the SuperSort algorithm and implement it, to see if it improves the performance of your code.
However, if I instead had an encrypted copy of the book, then I wouldn’t possess the knowledge in the pragmatic sense. If you asked me to optimize the same code, I would try a different approach.
This could be formalized in many different ways. Here is a provisional attempt:
Definition: The L-knowledge of a latent variable X based on the existence of a program, as follows. The program takes in an execution trace of the Predictor in a given situation. It outputs a pair of booleans. The first boolean indicates whether the program is able to decide the latent variable in this case. (Think of this as “whether the predictor has the latent knowledge”.) If so, the second boolean indicates the state of the latent variable X. (This could be easily generalized to non-boolean variables.) Call the length of the program K, and the time to run the program T. L-knowledge exists (in a specific situation) when K+log(T)<L, and the first output bit is true.
The first bit is needed to allow programs to opt out of telling us the status of the latent variable in cases where it is impossible for them to do so. Otherwise, ‘knowledge’ would only exist when a reporter could infallibly tell us the status of the latent variable in every case. We can’t just define L-knowledge as the existence of a program which tells us the status of the latent variable in this specific case (and satisfies K+log(T)<L), because if we defined it in this way, L-knowledge would be too easy. There’s always a short program that outputs 1 and a short program which outputs 0, so whichever value the variable actually has, a program certifying L-knowledge for very low L would exist.
Possibly, the above definition is still too easy to achieve, because we can still do that exact same trick by making the first bit of the program recognize only one target situation. (This allows us to create a bound above which L-knowledge always exists; but, at least the bound is not constant. It depends on the Levin complexity of recognizing the specific scenario.)
However, as I've already argued for Level 3, it’s fine (for safety purposes) if our definition of knowledge includes too many cases. If we can construct a reporter who can tell us L-knowledge whenever it exists, and we have confidence that knowledge implies L-knowledge, then we can create a reporter at least as informative as the direct reporter. (It might sometimes tell us the truth when the predictor doesn’t actually know the truth, but that’s fine!)
However, unfortunately, it's not obvious that meaningful "knowledge" implies L-knowledge. My Level 3 definition seemed to safely over-estimate the Predictor's knowledge; but that is not nearly so obvious with L-knowledge.
Knowledge Without L-Knowledge?
L-knowledge allows the translator to self-select cases when knowledge exists and when it doesn't; but, within those cases, knowledge is required to be perfect. A translation which predicts the target variable 95% of the time does not count for anything (because it is wrong infinitely often). We could also define some sort of relaxed p,L-knowledge which only required the knowledge to be p% correct, or something along those lines. I won't discuss this possibility any further in the current essay, however.
Another potential failure case I'm worried about relates to using knowledge for planning.
There are roughly two types of concerning cases for ELK. In one case, the SafeVault AI uses latent knowledge which the human lacks, to enact a plan which drives apart the proxy reward from what humans would like to optimize. Call this “malign use of latent knowledge.” Example: the SafeVault AI executes an undetectable exploit to jettison itself (with the diamond) into space. Everything looks fine on the sensors, and from the outside of the vault, but if someone tries to go inside (eg, to retrieve the diamond) they’ll discover that the building is now totally empty.
In another case, the SafeVault AI fails to use latent knowledge. For example, a burglar executes an undetectable sensor exploit in order to take the diamond. The SafeVault understands what’s going on but doesn’t care, because its reward is not impacted. Call this “malign negligence of latent knowledge.”
One way to understand malign negligence is, “AI systems should not ignore mission-critical information that they could have taken into account”. But how do we judge when they “could have” taken information into account? L-knowledge seems like a reasonable way to cash out this “could”. Where L-knowledge exists, there is a “close” version of the system which takes this knowledge into account. (This close version is the policy we want to incentivise, using the reporter’s information.) Where L-knowledge doesn’t exist, such a variant is either too costly to run or too hard to find.
However, malign use of latent knowledge is in some sense more concerning. Can we provide any guarantees against it?
Informal definition: planning-knowledge of X is the ability to make plans that importantly rely on X. This means something like: in worlds where X is different, the actions that the SmartVault AI takes are different (while, presumably, achieving similar ends).
Conjecture: Planning-knowledge implies L-knowledge (for some fairly low L).
If something like this can be established, it would provide significant reassurance that uncovering the Predictor's L-knowledge protects us from malign use of latent knowledge.
Efficiency of L-knowledge-based algorithms?
L-knowledge, or some suitable variation, could formalize the term "knowledge" in ELK without requiring the Reporter to be logically omniscient. However, might it still imply severe computational problems for ELK?
Although the concept of L-knowledge implies that it is relatively easy to run the correct translation once found, it could still be computationally intractable to find the correct translation.
- Rice's Theorem says that non-trivial 'semantic' questions (questions about a program's behavior) are always undecidable (ie, lack decision procedures). Attempts to determine L-knowledge may run into this.
- Even if this issue can be sidestepped, it still seems like determining L-knowledge (or something similar) could be computationally intractable. Naively, to test a candidate program, we will have to check how reliably its (second) output correlates with the target variable X. This involves nearly the same kind of intractable calculations discussed in level 3. We may have only given ourselves more work, rather than improving the situation!
Can these concerns help us to further refine our definition of knowledge?
Level 1: Knowledge of Knowledge
The intuition behind level 1 is: for computational knowledge ("level 2") to really count, the Predictor has to know what it knows. Consider belief-in-belief. If I'm deluded into thinking that water is made of tiny water-demons, but all of my predictions line up with the standard hydrogen hydroxide hypothesis, then at some level you might say I "know" the hydrogen hydroxide hypothesis. There's probably a computationally close version of me who would answer questions in line with the hydrogen hydroxide hypothesis, instead of the water-demon hypothesis. However, I may not know that this is the case.
I guess another way to put it is: deception requires knowledge of deception. To deceive you, I need to know that there's a computationally close version of me who could tell the truth. We may be able to eliminate deception (which seems good!) without solving harder versions of ELK.
The main advantage of this idea is that it could allow us to bound the difficulty of finding the correct translation, rather than just bounding the difficulty of executing the correct translation. This could solve the computational complexity problem entirely (perhaps at the cost of delivering a weaker version of ELK, with weaker safety guarantees).
I don't have any idea how to formalize level 1 at the moment, so I'll leave it at this for now.