Sorted by New

Wiki Contributions


The part about the reasoners having an arbitrary amount of time to think wasn't obvious to me. The TM can run for arbitrarily long but if it is simulating a universe and using the universe to determine its output then the TM needs to specify a system for reading from the universe.

If that system involves a start-to-read time that is long enough for the in-universe life to reason about the universal prior then that time specification alone would take a huge number of bits.

On the other hand, I could imagine a scheme that looks for a specific short trigger sequence at a particular spatial location then starts reading out. If this trigger sequence is unlikely to occur naturally then the civilization would have as long as they want to reason about the prior. So overall it does seem plausible to me now to allow for arbitrarily long in-universe time.

I won't deny probably misunderstanding parts of LDA but if the point is to learn corrigibility from H couldn't you just say that corrigibility is a value that H has? Then use the same argument with "corrigibility" in place of "value"? (This assumes that corrigiblity is entirely defined with reference to H. If not, replace with the subset that is defined entirely from H, if that is empty then remove H).

If A[*] has H-derived-corrigibility then so must A[1] so distillation must preserve H-derived-corrigibility so we could instead directly distill H-derived-corrigibility from H which can be used to directly train a powerful agent with that property, which can then be trained from some other user.

Point 1: Meta-Execution and Security Amplification

I have a comment on the specific difficulty of meta-execution as an approach to security amplification. I believe that while the framework limits the "corruptibility" of the individual agents, the system as a whole is still quite vulnerable to adversarial inputs.

As far as I can tell, the meta-execution framework is Turing complete. You could store the tape contents within one pointer and the head location in another, or there's probably a more direct analogy with lambda calculus. And by Turing complete I mean that there exists some meta-execution agent that, when given any (suitably encoded) description of a Turing machine as input, executes that Turing machine and returns its output.

Now, just because the meta-execution framework is Turing complete, this doesn't mean that any particular agent created in this manner is Turing complete. If our agents were in practice Turing complete, I feel like that would defeat the security-amplification purpose of meta-execution. Maybe the individual nodes cannot be corrupted by the limited input they see, but the system as a whole could be made to perform arbitrary computation and produce arbitrary output on specific inputs. The result of "interpret the input as a Turing machine and run it" is probably not the correct or aligned response to those inputs.

Unfortunately, it seems empirically the case that computation systems become Turing complete very easily. Some examples:

In particular, return oriented programming is interesting as an adversarial attack on pre-written programs by taking advantage of the fact that a limited control over execution flow in the presence of existing code often forms a Turing complete system, despite the attacker having no control over the existing code.

So I suspect that any meta-computation agent that is practically useful for answering general queries is likely to be Turing complete, and that it will be difficult to avoid Turing completeness (up to a resource limit, which doesn't help the arbitrary code execution problem).

An addition to this argument thanks to William Saunders: We might end up having to accept that our agent will be Turing complete and hope that the malicious inputs are hard to find or work with low probability. But in that case, limiting the amount of information seen by individual nodes may make it harder for the system to detect and avoid these inputs. So what you gain in per-node security you lose in overall system security.

Point 2: IDA in general

More broadly, my main concern with IDA isn't that it has a fatal flaw but that it isn't clear to me how the system helps with ensuring alignment compared to other architectures. I do think that IDA can be used to provide modest improvement in capabilities with small loss in alignment (not sure if better or worse then augmenting humans with computational power in other ways), but that the alignment error is not zero and increases the larger the improvement in capabilities.


  1. It is easy and tempting for the amplification to result in some form of search ("what is the outcome of this action?" "what is the quality of this outcome?" repeat), which fails if the human might misevaluate some states.
  2. To avoid that, H needs to be very careful about how they use the system.
  3. I don't believe that it is practically possible for formally specify the rules H needs to follow in order to produce an aligned system (or if you can, it's just as hard as specifying the rules for a CPU + RAM architecture). You might disagree with this premise in which case the rest doesn't follow.
  4. If we can't be confident of the rules H needs to follow, then it is very risky just asking H to act as best as they can in this system without knowing how to prevent things from going wrong.
  5. Since I don't believe specifying IDA-specific rules is any easier than for other architectures, it seems unlikely to me that you'd have a proof about the alignment or corrigibility of such a system that wouldn't be more generally applicable, in which case why not use a more direct architecture with fewer approximation steps?

To expand on the last point, if A[*], the limiting agent, is aligned with H then it must contain at least implicitly some representation of H's values (retrievable through IRL, for example). And so must A[i] for every i. So the alignment and distillation procedures must preserve the implicit values H. If we can prove that the distillation preserves implicit values, then it seems plausible that a similar procedure with similar proof would be able to just directly distill the values of H explicitly and then we can train an agent to behave optimally with respect to those.