Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is an expanded version of my answer to application problem 3 for Nate Soares and Vivek Hebbar's SERI MATS stream.

Why this post?

Vanessa's presentation has already been turned into a post by carado (as part of Refine) and a video in Connall and Jack's excellent series (as part of SERI MATS). But I felt like carado's post was still too mathy and disconnected, and the video glossed over some interesting details. So my motivation was to write a summary that explains the proposal in more common Alignment language (especially so that it can be compared more easily with other proposals), without presupposing familiarity with Infra-Bayesianism (providing only the required motivating details).

Summary

Infra-Bayesianism (tentatively) provides a robust framework for talking about agent foundations, including formalizations of how to identify agents and their utility functions. PreDCA is a protocol we can have an artificial agent follow (given that its top-level architecture implements Infra-Bayesianism), so that the agent uses these tools to identify who created it ("the user", a human) and infer their utility function. The implementation is strict enough to outright rule out some of the usual failure modes (if the theoretical tools work as intended), but the implemented framework is general enough (including belief updating and world modeling) to allow for general intelligence.

What needs to happen for this to work

  • The most important requirement is that the agency framework provides the correct, common-sense answers when computationally applied to fuzzy real-world scenarios.
  • The AGI needs to be fit into a very specific top-level architecture (without drastically losing capabilities).
  • It needs to be the case that maximizing the utilities of a group of present humans allows for future moral development without lock-in (or that we find this lock-in ethically acceptable).
  • We need a good enough model of what constitutes a human (or different patches or theoretical breakthroughs to prevent acausal attacks).
  • Relatedly, the further technical developments still required need to turn out to be applicable in practice.

Key points

A common failure mode is wireheading: the AGI tampering with the communication channel through which the human specifies preferences/rewards, the AGI tampering with the human's preferences directly, or similar. PreDCA makes this impossible by eliminating this communication channel. But then, how can the AGI infer the human's utility function? This is possible because Infra-Bayesian Physicalism provides a bird's-eye view of reality, in which the AGI is not privileged as a protagonist, and so the human's actions inside reality itself are sufficient (once the AGI achieves a good enough model of reality and identifies said human), without needing to be directly transmitted to the AGI.
 
More concretely, an AGI with this framework continuously holds hypotheses that make up its world model. These come in two forms. Some are hypotheses about the real physical world: what the laws of Physics are, and what the contingent, specific evolution of our timeline is (which particles are where at what time, etc.). Others are hypotheses about computations: which programs produce which outputs when run.
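To make the two kinds of hypotheses concrete, here is a minimal illustrative sketch (purely my own, with made-up names; the real framework uses infra-distributions rather than plain Python objects) of what one joint hypothesis over physics and computations might look like:

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class JointHypothesis:
    # Physical side: a guess about the laws and about the concrete,
    # contingent evolution of our timeline (here just an opaque label
    # and a state description indexed by time).
    physics_laws: str
    trajectory: Dict[int, str]

    # Computational side: a guess about what each program outputs when run,
    # keyed by (program source, input).
    program_outputs: Dict[Tuple[str, str], str]

# Example: one hypothesis claims program "P" outputs "1" on empty input,
# and that the robot moves right at time 3.
h = JointHypothesis(
    physics_laws="toy Newtonian world",
    trajectory={3: "robot moves right"},
    program_outputs={("P", ""): "1"},
)
```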
 
Why would we care to differentiate those? Because it allows for identifying which programs are running in the real world. More concretely, thanks to the generality of Infra-Bayesian belief updating, the AGI can reason about counterfactuals[1] (although more theoretical work is still required on this). So the AGI can consider what the world would look like (given its current information) under different computational hypotheses: for instance, under the hypothesis that program P outputs 0, or that it outputs 1. If the physical evolution of the world differs between these two counterfactuals (if P outputs 0 then this robot moves left, but if it outputs 1 it moves right), that means program P is being run in the real world. Using this reasoning, together with the inputs it's receiving, the AGI can also position itself inside its world model (without losing its un-privileged bird's-eye view). This also allows the AGI to reason about which agents (programs) causally affect which others: if an agent behaving differently (a program producing a different output) would lead to another agent receiving different observations (inputs), then the first has affected the second.
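As a toy illustration of that test (all names are hypothetical, and the world model is drastically simplified), one could check whether a program is instantiated in the world by comparing the predicted physical evolution under the two computational counterfactuals:

```python
class ToyWorldModel:
    """Stand-in world model: in this toy world the robot copies P's output."""
    def predict(self, program, inputs, program_output):
        # Predicted physical evolution under the counterfactual that
        # `program` returns `program_output` on `inputs`.
        if program == "P":
            return "robot moves right" if program_output == 1 else "robot moves left"
        return "robot stays still"  # other programs make no physical difference here

def program_is_running(world_model, program, inputs=""):
    # Compare the predicted evolution under "program outputs 0" vs "outputs 1".
    evolution_if_0 = world_model.predict(program, inputs, program_output=0)
    evolution_if_1 = world_model.predict(program, inputs, program_output=1)
    # If the predictions differ, the program's output is doing physical work,
    # i.e. the program is being run somewhere in the world.
    return evolution_if_0 != evolution_if_1

print(program_is_running(ToyWorldModel(), "P"))  # True: P is instantiated
print(program_is_running(ToyWorldModel(), "Q"))  # False: Q makes no difference
```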
 
Some of these programs will be especially agentic or intelligent, and the framework provides a way of identifying them as well. First off, it allows for inferring which utility function a program is maximizing (if any). To do this, we search over all possible functions and see which one the agent's actions are maximizing. But wait, won't this trivially produce an absurd utility function which is just "always do exactly what the agent is doing"? To prevent this overfitting, we penalize more complex functions (give them negative points for being too convoluted and unnatural).
 
For example, suppose Alice has a big and varied produce basket, and from it she sequentially picks an apple, a pomegranate, a strawberry and a tomato. The function "Pick an apple, then a pomegranate, then a strawberry, then a tomato" completely fits her data, but is very complex (contains a lot of information). Alternatively, the simple function "Pick red things" models her behavior more naturally.
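A minimal sketch of that idea applied to the Alice example (the numbers and the scoring rule are made up; the actual proposal uses a description-length prior over utility functions rather than this toy penalty):

```python
observed_picks = ["apple", "pomegranate", "strawberry", "tomato"]
RED_THINGS = {"apple", "pomegranate", "strawberry", "tomato", "red pepper"}

# Candidate explanations of Alice's behavior, each with a hand-assigned
# "complexity" standing in, very roughly, for description length in bits.
candidates = {
    "pick exactly apple, pomegranate, strawberry, tomato": {
        "complexity": 40.0,  # memorizes the whole sequence
        "fits": lambda picks: picks == ["apple", "pomegranate", "strawberry", "tomato"],
    },
    "pick red things": {
        "complexity": 8.0,   # a simple regularity
        "fits": lambda picks: all(p in RED_THINGS for p in picks),
    },
}

def score(candidate):
    # Reward fitting the observed behavior, penalize complexity (toy numbers).
    fit_bonus = 100.0 if candidate["fits"](observed_picks) else 0.0
    return fit_bonus - candidate["complexity"]

best = max(candidates, key=lambda name: score(candidates[name]))
print(best)  # "pick red things": same fit as the memorized sequence, much simpler
```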
 
Secondly, once we know the program's utility function, we can rate its agenticness or intelligence by calculating how many policies (sequences of actions) perform better than its policy (on that function). That is, the extent to which its policy correctly optimizes its utility function.
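A rough sketch of that rating (my own paraphrase; Vanessa's actual definition works with the prior probability mass of policies that do at least as well, not a uniform count over a finite list):

```python
import math

def agenticness(agent_policy, candidate_policies, utility):
    """Toy 'intelligence' score: how small is the fraction of candidate
    policies that achieve at least as much utility as the agent's policy?"""
    u_agent = utility(agent_policy)
    at_least_as_good = sum(1 for p in candidate_policies if utility(p) >= u_agent)
    fraction = at_least_as_good / len(candidate_policies)
    return -math.log2(fraction)  # in bits; 0 means "no better than a random policy"

# Tiny example: policies are numbers, and utility is just the number itself.
policies = list(range(100))
print(agenticness(90, policies, utility=lambda p: p))  # ~3.3 bits: quite good
print(agenticness(10, policies, utility=lambda p: p))  # ~0.15 bits: barely agentic
```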
 
With the above and other tools, the following protocol can be implemented (a rough code sketch follows the list).

  • Precursor detection: A "precursor" is any agent (that is, any agentic-enough program) that could have prevented the AGI from existing. For instance, the human who pressed the button to start the AGI is a precursor. But so are many other humans and beings: maybe their cat is too, because it could have gnawed the power cable, or distant aliens, because they could have invaded us long ago.
  • Classification: From this big group of precursors, we need to select the actual agent we want to assist ("the user", the human). There are different and complementary ad hoc methods for this. For instance, we can demand that the agent have a "very close" causal effect on the AGI, in the sense that the agent's actions can very quickly affect the AGI (maybe because they are in the same room). Also, we can implement a rough model of what constitutes a human (using cognitive science), to discard all non-human agents. Other ideas exist.
  • Assistance: Once the agent is fixed (the human), the AGI will extrapolate their utility function from their actions (through the above method). But only from their past actions (before the AGI came into existence), so as to avoid the AGI tampering with their preferences. This can be generalized to include the weighted utility functions of a group of humans.
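Putting the three steps together, a highly simplified sketch of the protocol's top level might look as follows. Every helper (and the threshold) is a placeholder for the pieces described above, passed in as a parameter rather than actually implemented here:

```python
from typing import Callable

AGENCY_THRESHOLD = 10.0  # placeholder cutoff for "agentic enough"

def predca_top_level(
    world_model,
    agi_birth_time: int,
    detect_running_programs: Callable,  # counterfactual test from earlier
    agenticness_of: Callable,           # the rating sketched above
    could_have_prevented: Callable,     # "could this agent have stopped me from existing?"
    looks_human: Callable,              # rough cognitive-science model of a human
    causally_close: Callable,           # e.g. "was in the room"
    infer_utility: Callable,            # complexity-penalized extrapolation, past actions only
    combine_and_maximize: Callable,     # weighted mix of utilities -> best policy
):
    # 1. Precursor detection: agentic-enough programs running in the world
    #    that could have prevented the AGI from coming into existence.
    agents = [p for p in detect_running_programs(world_model)
              if agenticness_of(p, world_model) > AGENCY_THRESHOLD]
    precursors = [a for a in agents
                  if could_have_prevented(a, world_model, agi_birth_time)]

    # 2. Classification: keep precursors that look human and had a very
    #    close causal effect on the AGI.
    users = [a for a in precursors
             if looks_human(a) and causally_close(a, world_model)]

    # 3. Assistance: extrapolate each user's utility function from their
    #    actions *before* the AGI existed, then act to maximize the mix.
    utilities = [infer_utility(u, world_model, before=agi_birth_time) for u in users]
    return combine_and_maximize(utilities, world_model)
```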

An acausal attacker is a hypothetical agent who simulates the whole world model of our AGI to make it believe that it lives inside the attacker's simulation. More concretely, regardless of the actual existence (in this or other universes) of these attackers, these kinds of simulation hypotheses will make their way into the AGI's hypothesis space, because some of them will be way simpler than the actual physical existence of our complex world, and the AGI selects for simpler hypotheses (by Occam's razor, which it needs in order to converge on the actual laws of Physics)[2]. There are several approaches that try to prevent this, especially through Classification.
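As a toy illustration of why the simplicity bias lets such hypotheses in (the description lengths are made-up numbers, only meant to show the direction of the effect), compare the prior weights of two hypotheses under an Occam-style 2^(-length) prior:

```python
def prior(length_bits: float) -> float:
    # Occam-style prior: the weight halves for every extra bit of description.
    return 2.0 ** (-length_bits)

# Made-up description lengths (in bits) of two competing hypotheses.
length_raw_physics = 1000  # "the true laws plus all the contingent detail of our world"
length_simulation = 400    # "a simple universe containing an attacker who simulates us"

print(prior(length_simulation) / prior(length_raw_physics))
# ~2**600: if its total description really is shorter, the simulation hypothesis
# can dominate the prior, even if we assign it no special credence ourselves.
```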
 
How might this protocol solve Inner Alignment? The only way to change our AGI's actions is by changing its world model, because of its strict architecture that completely pins down a utility function to maximize (and the actions that maximize it) given a world model. So, allegedly, the only possible mesa-optimizers will take the form of acausal attackers (that is, simulation hypotheses), or at least something that can be very naturally modelled as an acausal attack (any false hypothesis about the world that changes the precursor that is chosen as the user, or a property relevant to actions maximizing their utility). And also allegedly, the methods implemented against radical acausal attacks will be sufficient to avoid this (and other less radical wrong hypotheses will be naturally dealt with by our AGI converging on the right physical world model).
 
How might this protocol solve Outer Alignment? As mentioned above, the AGI now has no channel to hack and no user to modify, and so has no wrong proxy to learn. We do need to ensure that the utility function extrapolation works as intended, but provided it does, Outer Alignment seems trivially solved.
 
On that note, this protocol of course takes the approach of pointing at human values (through the behavior of the user) instead of directly specifying a utility function. That's because Vanessa finds the latter far less tractable, because of corrigibility and other problems. And that's why she believes full Alignment to be easier than, for instance, the strawberry problem[3]: no pre-made pointer to this latter task actually exists in the real world (no human cares about copying strawberries), while some humans can serve as a pointer for the former.
 
So why does this approach seem to solve the Pointers Problem? Mainly because, when extrapolating the user's utility function, the selection for simpler functions will more naturally capture the real physical expression of the user's values than a version that includes their shortcomings and biases, which don't correspond to physical reality (and so are allegedly more complex).

  1. ^

    This generality also prevents some problems in Embedded Agency.

  2. ^

    This issue apparently applies to any powerful enough AGI, and PreDCA directly addresses it.

  3. ^

    Having an AGI construct an exact cellular-level copy of a strawberry, and then do nothing else.

Comments

I understand that the AGI may only infer the utility function from what happened before its existence. That seems to mean that it has to interpret evidence about past actions, because there is no way to objectively access the past: for example, a photo taken of something an agent did, or the arrangement of their house now, resulting from purchases they made. This suggests two avenues of attack: erasure of information and fake evidence. How does this approach deal with these?

I think your concern is a special case of this paragraph:

How might this protocol solve Inner Alignment? The only way to change our AGI's actions is by changing its world model, because of its strict architecture that completely pins down a utility function to maximize (and the actions that maximize it) given a world model. So, allegedly, the only possible mesa-optimizers will take the form of acausal attackers (that is, simulation hypotheses), or at least something that can be very naturally modelled as an acausal attack (any false hypothesis about the world that changes the precursor that is chosen as the user, or a property relevant to actions maximizing their utility). And also allegedly, the methods implemented against radical acausal attacks will be sufficient to avoid this (and other less radical wrong hypotheses will be naturally dealt with by our AGI converging on the right physical world model).

We need to prevent our agent from developing false hypotheses because of adversarial inputs (through its sensors). You mention the particular case in which the false hypotheses are about the past (a part of physical reality), and the adversarial input is provided as certain arrangements of present physical reality (which our AGI perceives through its sensors). These can be understood as very basic causal attacks. I guess all these cases are supposed to be dealt with by our AGI being capable enough (at modeling physical reality and updating its beliefs) so as to end up noticing the real past events. That is, given the messiness/interconnectedness of physical reality (carrying out procedures like "erasure of information" or "fake evidence" actually leaves many physical traces that an intelligent enough agent could identify), these issues would probably fall on the side of "less radical wrong hypotheses", and they are supposed to "be naturally dealt with by our AGI converging on the right physical world model".

I agree that the interconnectedness of physical reality will leave traces - the question is: Enough? Can we put bounds on that? I imagine blowing up a lot of stuff at once will destroy more than you can recover from elsewhere.

I am somewhat certain PreDCA requires a specific human, but there should be enough information recorded about anyone with a large enough digital footprint to reconstruct a plausible simulacrum of them.

Keep in mind the ultimate goal is to get a good understanding of their preferences, not to actually recreate their entire existence with perfect fidelity.

PreDCA requires a human "user" to "be in the room" so that they are correctly identified as the user, but then it only infers their utility from the actions they took before the AGI existed. This is achieved by inspecting the world model (which includes the past) on which the AGI converges. That is, the AGI is not "looking for traces of this person in the past": it is reconstructing the whole past (and afterwards seeing what that person did there). Allegedly, if capabilities are high enough (to be dangerous), it will be able to reconstruct the past pretty accurately.

I guess the default answer would be that this is a problem for (the physical possibility of certain) capabilities, and we are usually only concerned with our Alignment proposal working in the limit of high capabilities. Not (only) because we might think these capabilities will be achieved, but because any less capable system will a priori be less dangerous: it is way more likely that its capabilities fail in some non-interesting way (unrelated to Alignment), or affect many other aspects of its performance (rendering it unable to achieve dangerous instrumental goals), than for its capabilities to fail in just the right way for most of its potential achievements to remain untouched while the goal is relevantly altered. In your example, if our model truly can't converge with moderate accuracy on the right world model, we'd expect it not to have a clear understanding of the world around it, and so, for instance, to be easily turned off.

That said, it might be interesting to more seriously consider whether efficient prediction of the past being literally physically impossible could make PreDCA slightly more dangerous for super-capable systems.

Thanks for the long answer. I agree that my question is likely more tangential.

I really feel like "you aren't in a superintelligent reality warper's simulation" is kind of the assumption that's necessary for any AI safety proposal to work.

If we assume no limits, then the whole project becomes pointless.

I'm not sure I completely understand your comment.

If you are talking about us actually living in a simulation, Vanessa doesn't say "maybe we live in a simulation, and then the AGI will notice". She says, "independently of the epistemological status and empirical credence of simulation hypotheses, the AGI's model might converge on them (because of the way in which we punish complexity, which is necessary to arrive at the laws of Physics), and this is a problem".

If on the contrary you are talking about instilling into the AGI the assumption that simulation hypotheses are false, then this would be great but we can't do it easily, because of the problem of ontology identification and other complications. Or in other words, how would you specify which reality counts as a simulation?

Specifically, if there's a high chance we are in an adversary's simulation, that's equivalent to showing that you can't actually win. We are no more able to deal with such simulators than we are able to deal with real life Avengers or Justice League coming from the comics to attack us.

Thus, the exercise is pointless: no AI safety proposal could survive such forces.

You might not have understood my above comment. A simulation hypothesis having high credence (let alone being the case) is not necessary for acausal attacks to be a problem for PreDCA. That is, this worry is independent of whether we actually live in a simulation (and whether you know that).

Thank you for clarifying things, since I got pretty confused on the acausal attack issue.

After having chosen a utility function to maximize, how would it maximize? I'm thinking that the search/planning process for finding good policies naturally introduces mesa-optimizers, regardless of everything that came before in PreDCA (detecting precursors and extrapolating their utility function).

Once the AGI has some utility function and hypothesis (or hypotheses) about the world, it just employs counterfactuals to decide which is the best policy (set of actions). That is, it performs some standard and obvious procedure like "search over all possible policies, and for each compute how much utility exists in the world if you were to perform that policy". Of course, this procedure will always yield the same actions given a utility function and hypotheses; that's why I said:

The only way to change our AGI's actions is by changing its world model, because of its strict architecture that completely pins down a utility function to maximize (and the actions that maximize it) given a world model.
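(For concreteness, a toy version of that procedure, with placeholder names; `world_model.counterfactual` is assumed to return the world's evolution if the AGI were to follow a given policy:)

```python
def best_policy(policies, world_model, utility):
    # Brute force: for each candidate policy, evaluate the counterfactual
    # "how much utility would the world contain if I followed this policy?"
    # and return the maximizer. Given a fixed utility function and a fixed
    # world model, this always yields the same answer.
    return max(policies, key=lambda pi: utility(world_model.counterfactual(pi)))
```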

That said, you might still worry that due to finite computing power our AGI might not literally search over all possible policies, but just employ some heuristics to get a good approximation of the best policy. But then this is a capabilities shortcoming, not misalignment. And as I mentioned in another comment:

this is a problem for capabilities, and we are usually only concerned with our Alignment proposal working in the limit of high capabilities. Not (only) because we might think these capabilities will be achieved, but because any less capable system will a priori be less dangerous: it is way more likely that its capabilities fail in some non-interesting way (unrelated to Alignment), or affect many other aspects of its performance (rendering it unable to achieve dangerous instrumental goals), than for its capabilities to fail in just the right way for most of its potential achievements to remain untouched while the goal is relevantly altered.

Coming back to our scenario, if our model just finds an approximate best policy, it would seem very unlikely that this policy consistently brings about some misaligned goal (which is not the AGI's goal) like "killing humans", instead of just being the best policy with some random noise in all directions.

AGI might not literally search over all possible policies, but just employ some heuristics to get a good approximation of the best policy. But then this is a capabilities shortcoming, not misalignment

...

Coming back to our scenario, if our model just finds an approximate best policy, it would seem very unlikely that this policy consistently brings about some misaligned goal

In my model this isn't a capabilities failure, because there are demons in imperfect search; what you would get out of a heuristic-search-to-approximate-the-best-policy wouldn't only be something close to the global optimum, but something that has also been optimized by whatever demons (which don't even necessarily have to be "optimizers") emerged through the selection pressures.

Maybe I'm still misunderstanding PreDCA and it somehow rules out this possibility, but afaik it only seems to do so in the limit of perfect search.

I think you're right, and I wasn't taking this into account, and I don't know how Vanessa would respond to this. Her usual stance is that we might expect all mesa-optimizers to be acausal attackers (that is, simulation / false hypotheses), since in this architecture the only way to determine actions is by determining hypotheses (and in fact, she now believes these acausal attackers can all be dealt with in one fell swoop in light of one single theoretical development). But that would seem to ignore the other complex processes going on to update these hypotheses from one time step to the next (as if the updates happened magically and instantaneously, without any further subcomputations). And we don't even need to employ possibly non-perfect heuristics for these demons to appear: I think they would also appear even if we (in the ideal, infinite compute scenario) brute-forced the problem by searching over all possible hypothesis updates and assessing each one on some metric. In a sense the two appearances of demons are equivalent, but in the latter limit they are more clearly encoded in certain hypotheses (that game the assessment of hypotheses), while in the former their relationship to hypotheses will be less straightforward, since there will be non-trivial "hypothesis updating" code inside the AI which is not literally equivalent to the hypothesis chosen (and so, parts of this code which aren't the final chosen hypothesis could also be part of a demon).

I'm not 100% sure the existence of these demons already implies inner misalignment, since these demons will only be optimized for their continued existence (and this might be achieved by some strategy that, by sheer luck, doesn't disrupt the outer performance of the AI too much, or at most makes the hypothesis search a bit less efficient). But I think this is just what always happens with mesa-optimizers, and the worry for inner alignment is that any one of these mesa-optimizers can be arbitrarily disruptive to outer performance (and there are some disruptive strategies that are very efficient for continued existence).

This might be a way in which PreDCA misses a hard bit of Alignment. More concretely, our problem is basically that the search space of possible AGI designs is too vast, and our search ability too limited. PreDCA tries to reduce this space by considering a very concrete protocol which can be guaranteed to behave in certain ways. But maybe all (or most) of the vastness of the search space has been preserved, only now disguised as the search space over possible inner heuristics that can implement said protocol. Or, put another way, whether the model implements simplifying heuristics or carries out a brute-force search, the space of possible hypothesis updates remains (approximately) as vast and problematic. Implementing heuristics approximately preserves this vastness: even if, once the heuristic is implemented, the search is considerably smaller, before that we already had to search over possible heuristics.

In fact, generalizing such arguments could be a piece in an argument that "abstracted perfect Alignment", in the sense of "a water-tight solution that aligns with arbitrary goals agents of arbitrary capability (arbitrarily close-to-perfect consequentialist)", is unsolvable. That is, if we abstract away all contextual contingencies that can make (even the strongest) AIs imperfect consequentialists, then (almost "by definition") they will always outplay our schemes (because the search space is being conceptualized as unboundedly vast).

Update: Vanessa addressed this concern.

Wait, so PreDCA solves inner-misalignment by just ... assuming that "we will later have an ideal learning theory with provable guarantees"?

By the claim "PreDCA solves inner-misalignment" as implied by the original protocol / distillation posts, I thought it somehow overcame the core problem of demons-from-imperfect-search. But it seems like the protocol already starts with an assumption of "demons-from-imperfect-search won't be a problem because of amazing theory" and instead tackles a special instantiation of inner-misalignment that happens because of the nature of the protocol itself (i.e. simulation hypotheses due to simplicity bias + assuming an ideal/perfect search or learning theory).

If my understanding is correct, I think the implication regarding inner-misalignment is misleading, because PreDCA is operating at a whole different level of abstraction (a different problem level) than most of the discourse around inner-misalignment.

I share this intuition that the solution as stated is underwhelming. But from my perspective that's just because that key central piece is missing, and this wasn't adequately communicated in the available public resources about PreDCA (even if it was stressed that it's a work in progress). I guess this situation doesn't look as worrisome to Vanessa simply because she has a clearer picture of that central piece, or good reasons to believe it will be achievable, which she hasn't yet made public. Of course, while this is the case, we should treat optimism with suspicion.

Also, let me note that my a priori understanding of the situation is not

let's suppose amazing theory will solve imperfect search, and then tackle the other inner misalignment directly stemming from our protocol

but more like

given our protocol, we have good mathematical reasons to believe it will be very hard for an inner optimizer to arise without manipulating the hypothesis update. We will use amazing theory to find a concrete learning setup and prove/conjecture that said manipulation is not possible (or that the probability is low). We then hope that the remaining inner optimization problems are rare/few/weak enough that other more straightforward methods can render them highly unlikely (like having the core computing unit explicitly reason about the risk of inner optimization).