Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This critique is an addendum to my distillation for application problem 3 for Nate Soares and Vivek Hebbar's SERI MATS stream. Reading my distillation is not required if you are familiar with PreDCA.

Disclaimer: This is just my model of how Vanessa thinks and I might be misrepresenting some views. Furthermore, as mentioned below, I'm sure Vanessa is well aware of these issues, and plans on trying to solve those which constitute real obstacles. [Edit: See her comments below]

Vanessa (and Diffractor)'s work generally feels different from other Alignment theory, and I have usually attributed this to its radical focus on foundations (shared by some other researchers) and the complexity of its technical mathematical results (shared by few). But upon momentarily coarsening these fine technical details, and presenting PreDCA more conceptually in a language similar to that of other proposals, it becomes clear that it really is fundamentally different from them. As a consequence of Vanessa's opinions and approach, PreDCA has different objectives and hopes from most proposals.

PreDCA is fundamentally concrete. Yes, it still includes some "throw every method you have at the problem" (as in Classification). But the truly principal idea involves betting everything on a specific mathematical formalization of some instructions. This concreteness is justified for Vanessa because nothing short of a watertight solution to agent foundations will get us through Alignment[1]. Some AGI failure modes are obviously problems humanity would have to deal with in the long run anyway (with or without AGI), but on a time trial. And Vanessa is pessimistic about the viability of some intuitively appealing Alignment approaches that try to delegate these decisions to future humanity. In short, she believes some lock-in is inevitable, and so tries to find the best lock-in possible by fundamentally understanding agency and preferences.

Now, if PreDCA works, we won't get a naively narrow lock-in: the user(s)'s utility function will care about leaving these decisions to future humanity. But it will, in other subtler but very real ways, leave an indelible fingerprint on the future. I'm certain Vanessa is aware of the implications, and I suppose she's just willing to take yet another chance (on this lock-in being broadly beneficial), since otherwise we have no chance at all. But still, PreDCA seems to me to downplay the extent to which we wouldn't be pointing at a well-established and comprehended minimal base of values that will keep us alive, but at extremely messy patterns and behaviors (which presuppose a vast number of uncertain claims to be true). Of course, the idea is that the correct theoretical enunciation of preferences will do away with the unwanted details. But there might be too many and too systematic biases for even a very capable framework to tell apart obvious consensus from civilizational quirks, and the latter could have awful consequences.

Another (maybe unfair) consequence of PreDCA's concreteness is that we can more easily find weak links in the conjunctive plan. The protocol is still a work in progress, and so what I'm about to say might end up getting fixed in some retrospectively obvious way, but I now present some worries about the technical aspects of the proposal (which Vanessa is certainly aware of).

The method to find which computations are running in the real world doesn't actually do so. Consider, for instance, a program trying to prove a formal system consistent (by finding a model of it), and another one trying to prove it inconsistent (by deriving a contradiction from it). The outputs of these two programs are acausally related: if the first halts, the second will not, and conversely[2]. Our AGI will acknowledge this acausal relation (and much more complicated ones), as Vanessa expects, since otherwise it won't know basic math and won't produce a correct world model. But this seems to fundamentally mess with the setup. If only one of these programs is running in the universe, counterfactually changing the output of the other will still change the physical universe. And so, as Vanessa mentions in conversation with Jack, we're not really checking which computations run, but which computations' outputs the universe has information about. This is a problem to fix, since we actually care about the former. But this quirk seems to me too inherently natural in the Infra-Bayesian framework. The whole point of such a framework is finding a canonical grounding for these theoretical concepts, from which our common sense ideas and preferences follow naturally, and so changing this quirk ad hoc, against the framework's simplicity, won't cut it. So we need a theoretical breakthrough on this front, and I find it unlikely. That is, I think any accommodation for this quirk will change the framework substantially (but then again, this is very speculative given the unpredictable nature of theoretical breakthroughs).
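The two-program example above can be made concrete with a toy sketch. It assumes a propositional theory in clause form as a stand-in for the formal system, so that both programs provably halt (in the spirit of footnote 2). The names `find_model` and `derive_contradiction`, and the clause encoding, are illustrative choices and not part of PreDCA.

```python
from itertools import product

# Assumed encoding: a clause is a frozenset of non-zero ints, where k stands
# for variable k and -k for its negation. A theory is a set of clauses (CNF).

def find_model(clauses, num_vars):
    """Program P: brute-force search for a satisfying assignment (a model)."""
    for bits in product([False, True], repeat=num_vars):
        def holds(lit):
            return bits[abs(lit) - 1] == (lit > 0)
        if all(any(holds(lit) for lit in clause) for clause in clauses):
            return True  # a model exists, so the theory is consistent
    return False

def derive_contradiction(clauses):
    """Program Q: saturate under resolution, looking for the empty clause."""
    known = set(clauses)
    if frozenset() in known:
        return True
    while True:
        new = set()
        for a in known:
            for b in known:
                for lit in a:
                    if -lit not in b:
                        continue
                    resolvent = (a - {lit}) | (b - {-lit})
                    if any(-l in resolvent for l in resolvent):
                        continue  # tautologies never help a refutation
                    if not resolvent:
                        return True  # empty clause: contradiction derived
                    new.add(resolvent)
        if new <= known:
            return False  # saturated without finding a contradiction
        known |= new

consistent = {frozenset({1, 2}), frozenset({-1})}
inconsistent = {frozenset({1}), frozenset({-1})}
```

For any such theory, `find_model` returns `True` exactly when `derive_contradiction` returns `False`, so knowing the output of one program fixes the output of the other even if only one of them is ever executed.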

On a similar note, as Vanessa mentions, the current framework only allows for our AGI to give positive value to computations (so that it will always choose for more computations to be run), which is completely counter to our intuitions. And again, I feel like fixing this ad hoc won't cut it and we need a breakthrough. The framework has another quirk around computations: the number of instances of a concrete computation run in a universe isn't well defined. This is in fact a defining property of the framework, and is clearly deeply related to the above paragraph (that's in part why I feel the above issue won't get easily solved). Now, Vanessa has valid philosophical arguments in favor of this being the case in the real world. But even provided she's right, to me these examples point towards a more worrying general problem about how we're dealing with computations. We are clearly confused about the role of computations at a fundamental level. And yet PreDCA architecturally implements a specific understanding of them. Even if this understanding is the right one (even if future humanity would arrive at the conclusion that the intuitive "number of computations" doesn't matter ethically), the user(s) will retain many of our ethical intuitions about computations, and these will conflict with the architecture. In the ideal case, the theoretical framework will perfectly deal with this issue and extrapolate as well as future humanity. But I'm not certain it has the right tools to handle this correctly. On the contrary, I feel like this might be another potential source of errors in the utility function inference/extrapolation due to the nature of the framework.
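One way to read the "only positive value" constraint from the start of the paragraph above is as monotonicity: a utility over sets of running computations may never decrease when more computations run. The following toy check is a sketch under that reading, not the Infra-Bayesian formalism itself; the computation labels and the additive utilities are made up for illustration.

```python
from itertools import combinations

def subsets(universe):
    """Yield every subset of a small finite universe as a frozenset."""
    items = sorted(universe)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_monotone(utility, universe):
    """True iff utility(A) <= utility(B) whenever A is a subset of B."""
    all_sets = list(subsets(universe))
    return all(
        utility(a) <= utility(b)
        for a in all_sets
        for b in all_sets
        if a <= b
    )

# Hypothetical labels for computations that might be running.
UNIVERSE = {"flourishing_mind", "suffering_mind", "idle_loop"}

def additive_utility(values):
    """Build a utility that sums a fixed value per running computation."""
    return lambda running: sum(values[c] for c in running)

only_positive = additive_utility(
    {"flourishing_mind": 1.0, "suffering_mind": 0.0, "idle_loop": 0.0}
)
with_disvalue = additive_utility(
    {"flourishing_mind": 1.0, "suffering_mind": -1.0, "idle_loop": 0.0}
)
```

Under this reading, `only_positive` passes the check while `with_disvalue` fails it: a preference that treats some computation as actively bad is exactly the kind of preference the constraint rules out.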

Other concerns can be raised that are more standard. In particular, Vanessa is aware that more ideas are needed to ensure Classification works. For example, my intuition is that implementing a model of human cognitive science won't help much: the model would need to be very precise to defend against the vast space of acausal attackers, since these can get as similar to humans as contingent details of our AGI's hypothesis search permit. Furthermore, we can only implement computational properties of human cognition, and not physical ones (this would require solving ontology identification, and open up a new proxy to Goodhart).

This is actually a particular case of a more general (and my main) worry: we might not have eradicated the need to prune a massive search, and this will remain the main source of danger. Even if all mesa-optimizers are acausal attackers, does that leave us much better off, when our best plan at avoiding those is applying various approximate pruning mechanisms (as already happens in many other Alignment proposals)? PreDCA clearly helps with other failure modes, but maybe this hard kernel remains. In particular, our AGI's hypothesis search needs to be massive for it to converge on the right ones, which is necessary for other aspects of the protocol. And since our AGI's actions are determined by its hypotheses, we might be searching a space as big as that of possible code for our AGI, which is the original problem. Maybe pruning the hypothesis search for acausal attackers is way more tractable for some reason, but I don't see why we should expect that to be the case.

  1. It's not surprising that these MIRI-like concerns have led to what is probably the most serious shot at making something similar to Coherent Extrapolated Volition viable.

  2. Even if we only deal with provably halting programs, this scenario can be replicated.

8 comments

Some quick responses:

  • About the "lock-in" problem: I don't think lock-in is a meaningful concern. I'm an agent with specific preferences, and I'm making decisions based on these preferences. The decision of which AI to run is just another decision. Hence, there's no philosophical reason I shouldn't make it based on my current preferences. The confusion leading to viewing this as a problem comes from conflating long-term terminal preferences with short-term instrumental preferences based on some possibly erroneous beliefs. Notice also that in IBP, preferences can directly depend on computations, so they can be abstract / "meta" / indirect.
  • About "we're not really checking which computations run, but which computations' outputs the universe has information about... this is a problem to fix". I don't think it's a problem. The former is the same as the latter. If, like in your example, P is a program that outputs a bit and Q is a program that outputs not-P, then P is running iff Q is running (as long as we're at an epistemic vantage point that knows that Q = not-P). This seems intuitive to me: it makes no sense to distinguish between computing the millionth digit of pi in binary, and computing the not of the millionth digit of pi in binary. It's essentially the same computation with a different representation of the result. To put it differently, if P is an uploaded human and Q is a different program which I know to be functionally equivalent to P, then Q is also considered to be an uploaded human. This is a philosophical commitment, but I consider it to be pretty reasonable.
  • About "the current framework only allows for our AGI to give positive value to computations": yes, this is a major problem. There might be good answers to this, but currently all the candidate answers I know are pretty weird (i.e. require biting some philosophical bullet). I believe that we'll understand more one way or the other as we progress in our mathematical inquiry.
  • About "a model of human cognitive science" and "pruning mechanisms": I no longer believe these are necessary. I now think we don't need to explicitly filter acausal attackers. Instead, in IBP every would-be mesa-optimizer is toothless because it automatically has to contain a simulation of the user and therefore it is (i) a valid hypothesis from the user's POV and (ii) induces the correct user preferences.

Thank you for taking the time to read and answer!

About the "lock-in" problem: I don't think lock-in is a meaningful concern

I understand you only care about maximizing your current preferences (which might include long-term flourishing of humanity), and not some vague "longtermist potential" independent of your preferences. I agree, but it would seem like most EAs would disagree (or maybe this point just hasn't been driven home for them yet).

About "a model of human cognitive science" and "pruning mechanisms": I no longer believe these are necessary

That's interesting, thank you! I'll give some thought to whether, even if this development holds, the massive search might have sneaked in through some other avenue. Even without a coarse-grained simulated user, I don't immediately see why some simulation hypotheses (maybe specifically tailored to the way in which the AI encodes its physical hypotheses) would not be able to alter underlying physics in such a way as to provide a tighter causal loop between AI and simulator, so that User Detection yields a simulator. More concretely: a simulator might introduce microscopic variations in the simulation (affecting the AI's perceptions) depending on its moment-to-moment behavior, and also perceive the AI's outputs "even faster" than the simulated human user does (in the simulator's world, maybe just by slowing down the simulation?).

To put it differently, if P is an uploaded human and Q is a different program which I know to be functionally equivalent to P, then Q is also considered to be an uploaded human.

Say P searches for a model of a theory T. Say Q simulates a room with a human, and a computer which distributes an electric shock to the human iff it finds a contradiction derived from T, and Q outputs whether the human screamed in pain (and suppose the human screams in pain iff they are shocked). Both reject at time t if they haven't accepted yet, but suppose we know one of the two searches will finish before t.

I guess you will tell me "even if P = not-Q, the programs are not functionally equivalent, because the first carries more information (for instance, from it way more information can be computed, if we rearrange what it chooses as output, or similarly peek into its computations)". But where is the boundary drawn between "rearranging what the program outputs or peeking into it to extract more information which was already there" and "rearranging the program in a way that outputs information that wasn't already there, or peeking and processing what we see to learn information that wasn't already there"?

I understand you only care about maximizing your current preferences (which might include long-term flourishing of humanity), and not some vague "longtermist potential" independent of your preferences. I agree, but it would seem like most EAs would disagree

Yes, I think most EAs are confused about ethics (see e.g. 1 2 3), which is why I'm not sure I count as EA or merely as "EA-adjacent"[1].

I don't immediately see why some simulation hypotheses (maybe specifically tailored to the way in which the AI encodes its physical hypotheses) would not be able to alter underlying physics in such a way as to provide a tighter causal loop between AI and simulator, so that User Detection yields a simulator

We design user detection so that anything below a threshold is a "user" (rather than only the extreme agent being the user), and if there are multiple (or no) "users" we discard the hypothesis. So, yes, there is still some filtering going on, just not as complex as before.
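The thresholding rule described above can be sketched as a small filter. Everything here is a hypothetical illustration: the `Hypothesis` container, the scalar score per candidate agent, and the function names are assumptions, and what the score actually measures is left abstract.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    name: str
    # Assumed: one scalar score per candidate agent found in this hypothesis.
    # An agent counts as a "user" when its score is below the threshold.
    agent_scores: dict

def detect_user(hypothesis, threshold):
    """Return the unique agent below the threshold, or None to discard."""
    candidates = [
        agent
        for agent, score in hypothesis.agent_scores.items()
        if score < threshold
    ]
    # Multiple users or no user at all: the hypothesis is discarded.
    return candidates[0] if len(candidates) == 1 else None

def filter_hypotheses(hypotheses, threshold):
    """Keep only hypotheses with exactly one user; map name -> detected user."""
    kept = {}
    for h in hypotheses:
        user = detect_user(h, threshold)
        if user is not None:
            kept[h.name] = user
    return kept

hypotheses = [
    Hypothesis("null", {"human": 0.1, "bystander": 5.0}),
    Hypothesis("simulation", {"human": 0.1, "simulator": 0.05}),
    Hypothesis("empty", {"rock": 9.0}),
]
```

With a threshold of `1.0`, only the `"null"` hypothesis survives: the `"simulation"` hypothesis has two agents below the threshold and `"empty"` has none, so both are discarded instead of yielding a user.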

But maybe this is not your main concern. You said "our best plan at avoiding those is applying various approximate pruning mechanisms (as already happens in many other Alignment proposals)". This is not how I would put it. My goal is to have an algorithm for which we know its theoretical guarantees (e.g. having such-and-such regret bound w.r.t. such-and-such prior). I believe that deep learning has theoretical guarantees, we just don't know what they are. We will need to either (i) understand what guarantees DL has and how to modify it in order to impose the guarantees we actually want or (ii) come up with a completely different algorithm for satisfying the new guarantees. Either will be challenging, but there are reasons to be cautiously optimistic (mainly the progress that's already happening, and the fact that looking for algorithms is easier when you know exactly which mathematical property you need).

Say P searches for a model of a theory T. Say Q simulates a room with a human, and a computer which distributes an electric shock to the human iff it finds a contradiction derived from T, and Q outputs whether the human screamed in pain (and suppose the human screams in pain iff they are shocked). Both reject at time t if they haven't accepted yet, but suppose we know one of the two searches will finish before t.

The difference is, if there's actually a room with a human (or the simulation of a room with a human), then there are other computations that are running (all the outputs of the human throughout the process), not just the one bit about whether the human screamed in pain. That's how we know that in this situation a human exists, whereas if we only have a computer running P then no human exists. We can't just "rearrange the program in a way that outputs information that wasn't already there", because if it isn't already there, the bridge transform will not assert this rearranged program is running.


  1. Due to the unfortunate timing of this discussion, I feel the need to clarify: this has absolutely nothing to do with FTX. I would have said exactly the same before those recent events.

Thank you again for answering!

We design user detection so that anything below a threshold is a "user"

Yes, but simulators might not just "alter reality so that they are slightly more causally tight than the user", they might even "alter reality so that they are inside the threshold and the user no longer is", right? I guess that's why you mention some filtering is still needed.

I believe that deep learning has theoretical guarantees, we just don't know what they are

I understand now. I guess my point would then be restated as: given the amount of room that simulators (intuitively seem to) have to trick the AGI (even with all the above developments), it would seem like no training procedure implementing PreDCA can be modified/devised so as to achieve the guarantee of (almost surely) avoiding acausal attacks. Not because of formal guarantees being impossible to prove about that training procedure (e.g. DL), but because pruning attacks from the space of hypotheses is too complicated a search for any human-made algorithm/procedure to carry out (because of the variety of attacks and the vastness of the space of hypotheses).

We can't just "rearrange the program in a way that outputs information that wasn't already there", because if it isn't already there, the bridge transform will not assert this rearranged program is running.

Of course! I understand now, thank you.

Yes, but simulators might not just "alter reality so that they are slightly more causally tight than the user", they might even "alter reality so that they are inside the threshold and the user no longer is", right?

No. The simulation needs to imitate the null hypothesis (what we understand as reality), otherwise it's falsified. Therefore, it has to be computing every part of the null universe visible to the AI. In particular, it has to compute the AI responding to the user responding to the AI. So, it's not possible for the attacker to make the user-AI loop less tight.

...it would seem like no training procedure implementing PreDCA can be modified/devised so as to achieve the guarantee of (almost surely) avoiding acausal attacks... because of the variety of attacks and the vastness of the space of hypotheses.

The variety of attacks doesn't imply the impossibility of defending from them. In cryptography, we have protocols immune to all attacks[1] despite a vast space of possible attacks. Similarly, here I'm hoping to gradually transform the informal arguments above into a rigorous theorem (or well-supported conjecture) that the system is immune.


  1. As long as the assumptions of the model hold, ofc. And, assuming some (highly likely) complexity-theoretic conjectures.

No. The simulation needs to imitate the null hypothesis (what we understand as reality), otherwise it's falsified. Therefore, it has to be computing every part of the null universe visible to the AI. In particular, it has to compute the AI responding to the user responding to the AI. So, it's not possible for the attacker to make the user-AI loop less tight.

Yes, I had understood that, but this is only the case in the limit when the AI is completely certain about every minute detail about its immediate physical reality, right? Otherwise, as in my above example, the simulator could introduce microscopic variations (wherever the AI isn't yet completely certain about reality, for instance in some parts of the user's brain) which subtly alter reality in such a way that the information between AI and user from counterfactual actions takes longer to arrive. Or am I missing something?

The variety of attacks doesn't imply the impossibility of defending from them.

You're right, thank you!

If the information takes a little longer to arrive, then the user will still be inside the threshold.

A more concerning problem is: what if the simulation only contains a coarse-grained simulation of the user, such that it doesn't register as an agent? To account for this, we might need to define a notion of "coarse-grained agent" and allow such entities to be candidate users. Or, maybe any coarse-grained agent has to be an actual agent with a similar loss function, in which case everything works out on its own. These are nuances that probably require uncovering more of the math to understand properly.

Oh, so it seems we need a coarse-grained user (a vague enough physical realization of the user) for threshold problems to arise. I understand now, thank you again!