Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a linkpost for https://carado.moe/predca.html

PreDCA: vanessa kosoy's alignment protocol


> `Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]` is the probability that, for a random policy `π∈ξ`, that policy has worse utility than the policy `G*` its program dictates; in essence, how good `G`'s policies are compared to random policy selection

What prior over policies?

> given `g(G|U)`, we can infer the probability that an agent `G` has a given utility function `U`, as `Pr[U] ∝ 2^-K(U) / Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]` where `∝` means "is proportional to" and `K(U)` is the kolmogorov complexity of utility function `U`.

Suppose the prior over policies is max-entropy (uniform over all action sequences). If the number of "actions" is greater than the number of bits it takes to specify my brain[1], it seems like it would conclude that my utility function is something like "1 if {acts exactly like [insert exact copy of my brain] would}, else 0".

[1] Idk if this is plausible

> What prior over policies?

Some kind of simplicity prior, as mentioned here.

> Suppose the prior over policies is max-entropy (uniform over all action sequences). If the number of "actions" is greater than the number of bits it takes to specify my brain[1], it seems like it would conclude that my utility function is something like "1 if {acts exactly like [insert exact copy of my brain] would}, else 0".

Yes. In fact I'm not even sure we need your assumption about bits. Say policies are sequences of actions, and suppose at each time step we have `n` actions available. Then, in our process of approximating your perfect/overfitted utility "1 if {acts exactly like [insert exact copy of my brain] would}, else 0", adding one more specified action to our `U` can be understood as adding one more symbol to its generating program, and so incrementing `K(U)` by 1. But also, adding one more (perfect) specified action multiplies the denominator probability by `1/n` (since the prior is uniform). So as long as `n > 2`, `Pr[U]` will be unbounded when approximating your utility.

And of course, this is solved by the simplicity prior, because this makes it easier for simple `U`s to achieve low denominator probability. So a way simpler `U` (less overfitted to `G*`) will achieve almost the same low denominator probability as your function, because the only policies that maximize `U` better than `G*` are too complex.
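A minimal sketch of the arithmetic behind this argument, under assumptions that are not stated in the comment itself: `n_actions` is a hypothetical name for the number of actions available per time step, each extra specified action is taken to cost exactly one bit of `K(U)`, and the policy prior is uniform. Under those assumptions the unnormalized weight `2^-K(U) / Pr` changes by a factor of `n_actions / 2` per specified action.

```python
from fractions import Fraction


def overfit_weight(n_actions: int, k_specified: int) -> Fraction:
    """Unnormalized weight 2^-K(U) / Pr for a utility that pins down k actions.

    Assumes K(U) = k bits and a uniform prior over action sequences, so the
    chance that a random policy matches all k specified actions is (1/n)^k.
    """
    complexity_penalty = Fraction(1, 2) ** k_specified
    match_probability = Fraction(1, n_actions) ** k_specified
    return complexity_penalty / match_probability


# compare how the weight evolves as more actions are specified
for n in (2, 3, 4):
    print(n, [float(overfit_weight(n, k)) for k in (1, 5, 10)])
```

With two actions the weight stays at 1 however many actions are specified; with more than two it grows without bound, which is the divergence described above.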

In `Pr[U] ≈ 2^-K(U) / Pr π∈ξ [U(⌈G⌉,π) ≤ U(⌈G⌉,G*)]`: Shouldn't the inequality sign be the other way around? I am assuming that we want to maximize `U` and not minimize `U`.
As currently written, a good agent `G` with utility function `U` would be better than most random policies, and therefore `Pr π∈ξ [U(⌈G⌉,π) ≤ U(⌈G⌉,G*)]` would be close to 1 and therefore `Pr[U]` would be rather small.

If the sign should indeed be the other way around, then a similar problem might be present in the definition of `g(G|U)`, if you want `g(G|U)` to be high for more agenty programs `G`.

after thinking about it and asking vanessa, it seems that you're correct; thanks for noticing. the mistake comes from the fact that i express things in terms of utility functions and vanessa expresses things in terms of loss functions, and they are reversed. the post should be fixed now.

note that in the `g(G|U)` definition, i believe it is also ≥, because `-log` flips the function.

(Copied partially from here)

My intuition is that PreDCA falls short on the "extrapolated" part in "Coherent extrapolated volition". PreDCA would extract a utility function from the flawed algorithm implemented by a human brain. This utility function would be coherent, but might not be extrapolated: the extrapolated utility function (i.e. what humans would value if they were much smarter) is probably more complicated to formulate than the un-extrapolated utility function.

For example, the policy implemented by an average human brain probably contributes more to total human happiness than most other policies. Let's say `U_1` is a utility function that values human happiness as measured by certain chemical states in the brain, and `U_2` is "extrapolated happiness" (where "putting all human brains in vats to make them feel happy" would not be good for `U_2`). Then it is plausible that `K(U_1) < K(U_2)`. But the policy implemented by an average human brain would do approximately equally well on both utility functions. Thus, `Pr[U_1] > Pr[U_2]`.

I am starting to learn theoretical stuff about AI alignment and have a question. Some of the quantities in your post contain the Kolmogorov complexity of U. Since it is not possible to compute the Kolmogorov complexity of a given function or to write down a list of all functions whose complexity is below a certain bound, I wonder how it would be possible to implement the PreDCA protocol on a physical computer.

Like all the other uncomputable or intractable logic in the post, the AI is to make increasingly informed guesses about them using something like logical induction, where one can estimate the likelihood of a logical statement without having to determine its truth value for sure.

(this post has been written for the second Refine blog post day. thanks to vanessa kosoy, adam shimi, sid black, artaxerxes, and paul bricman for their feedback.)

in this post, i try to give an overview of vanessa kosoy's new alignment protocol, Precursor Detection, Classification and Assistance or PreDCA, as she describes it in a recent youtube talk.

keep in mind that i'm not her and i could totally be misunderstanding her video or misfocusing on what the important parts are supposed to be.

the gist of it is: the goal of the AI should be to assist the user by picking policies which maximize the user's utility function. to that end, we characterize what makes an agent and its utility function, then detect agents which could potentially be the user by looking for precursors to the AI, and finally we select a subset of those which likely contains the user. all of this is enabled by infra-bayesian physicalism, which allows the AI to reason about what the world is like and what the results of computations are.

the rest of this post is largely a collection of mathematical formulas (or informal suggestions) defining those concepts and tying them together.

an important aspect of PreDCA is that the mathematical formalisms are theoretical ones which could be given to the AI as-is, not necessarily specifications as to what algorithms or data structures should exist inside the AI. ideally, the AI could just figure out what it needs to know about them, to what degree of certainty, and using what computations.

the various pieces of PreDCA are described below.

infra-bayesian physicalism, in which an agent has a hypothesis `Θ ∈ □(Φ×Γ)` (note that `□` is actually a square, not a character that your computer doesn't have a glyph for) where:

- `Φ` is the set of hypotheses about how the physical world could be; for example, different hypotheses could entail different truthfulness for statements like "electrons are lighter than protons" or "norway has a larger population than china".
- `Γ` is the set of hypotheses about what the outputs of all programs are; for example, a given hypothesis could contain a statement such as "2+2 gives 4", "2+2 gives 5", "the billionth digit of π is 7", or "a search for proofs that either P=NP or P≠NP would find that P≠NP". note that, as the "2+2 gives 5" example demonstrates, these don't have to be correct hypotheses; in fact, PreDCA relies a lot on entertaining counterfactual hypotheses about the results of programs. a given hypothesis `γ∈Γ` would have type `γ : program → output`.
- `Φ×Γ` is the set of pairs of hypotheses: in each pair, one hypothesis about the physical world and one hypothesis about computations. note that a given hypothesis `φ∈Φ` or `γ∈Γ` is not a single statement about the world or computationspace, but rather an entire description of it. a given `φ∈Φ` would say everything there is to say about the world, and a given `γ` would specify the output of all possible programs. they are not to be stored inside the AI in their entirety of course; the AI would simply make increasingly informed guesses as to what correct hypotheses would entail, given how they are defined.
- `□(Φ×Γ)` assigns degrees of belief to those various hypotheses; in infra-bayesianism, those degrees are represented as "infra-distributions". i'm not clear on what those look like exactly, and a full explanation of infra-bayesianism is outside the scope of this post, but i gather that (as opposed to scalar bayesian probabilities) they're meant to encode not just the probability but also uncertainty about said probability.
- `Θ` is one such infra-bayesian distribution.

vanessa emphasizes that infra-bayesian physicalist hypotheses are described "from a bird's eye view" as opposed to being agent-centric, which helps with embedded agency: the AI has guesses as to what the whole world is like, which just happens to contain itself somewhere. in a given hypothesis, the AI is simply described as a part of the world, same as any other part.

next, a measure of agency is then defined: a "g-factor" `g(G|U)` for a given agent `G` and a given utility function (or loss function) `U`, which is defined as `g(G|U) = -log(Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)])` where:

- `ξ` is the set of policies which an agent could counterfactually implement.
- `G` is an agent; it is composed of a program implementing a specific policy, along with its cartesian boundary. the policy which the agent `G` actually implements is written `G*`, and the cartesian boundary of the agent is written `⌈G⌉`; think of it as the outline separating the agent from the rest of the world, across which its inputs and outputs happen.
- `U : cartesian-boundary × policy → value` is a utility function, measuring how much utility the world would have if a given agent's cartesian boundary contained a program implementing a given policy. its return value is typically a simple scalar, but could really be any ordered quantity such as a tuple of scalars with lexicographic ordering.
- `U(⌈G⌉,G*)` is the utility produced by agent `G` if it would execute the actual policy `G*` which its program implements.
- `U(⌈G⌉,π)` is the utility produced by agent `G` hypothetically executing some counterfactual policy `π`, that is, if the cartesian boundary `⌈G⌉` contained a program implementing policy `π` instead of implementing the policy `G*`.
- `Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]` is the probability that a random policy `π∈ξ` has utility at least as good as the policy `G*` which `G`'s program dictates; the smaller that probability, the better `G`'s policy is compared to random policy selection.

so `g(G|U)` measures how good agent `G` is at satisfying a given utility function `U`.

given `g(G|U)`, we can infer the probability that an agent `G` has a given utility function `U`, as `Pr[U] ∝ 2^-K(U) / Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]` where `∝` means "is proportional to" and `K(U)` is the kolmogorov complexity of utility function `U`.

so an agent `G` probably has utility function `U` if it's relatively good at satisfying that utility function and if that utility function is relatively simple. we penalize arbitrarily complex utility functions notably to avoid hypotheses such as "woah, this table is really good at being the exact table it is now" (a complete description of the world would be an extremely complex utility function).

we also get the ability to detect what programs are agents, or more precisely, how agenty a given program is: `g(G|U) - K(U)` tells us how agenty a program `G` with utility function `U` is: its agentyness is its g-factor minus the complexity of its utility function.

"computationalism and counterfactuals": given a belief `Θ ∈ □(Φ×Γ)`, the AI can test whether it thinks the world contains a given program by examining the following counterfactual: "if the result of that program was a different result than what it actually is, would the world look different?"

for example, we can consider the AKS prime number testing algorithm. let's say `AKS(2^82589933-1)` returns `TRUE`. we can ask "if it returned `FALSE` instead, would the universe (according to our computational hypothesis about it) look different?" if it would look different, then that means that someone or something in the world is running the program `AKS(2^82589933-1)`.

to offer a higher-level example: if we were to know the true name of suffering, described as a program, then we can test whether the world contains suffering by asking a counterfactual: let's say that every time suffering happened, a goldfish appeared (somehow as an output of the suffering computation). if that were the case, would the world look different? if it would, then it contains suffering.

this ability to determine which programs are running in the world, coupled with the ability to measure how agenty a given program is, lets us find what agents exist in the world.
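to make the two ingredients above more concrete, here is a minimal toy sketch of the g-factor, the `Pr[U]` rule, the agentyness score and the counterfactual "is this program running?" test. everything specific in it is an assumption made for illustration and not part of vanessa's formalism: policies are tuples of four binary actions, `ξ` is uniform over them, the log is taken in base 2, the cartesian boundary argument of `U` is dropped, `K(U)` is replaced by hand-picked bit counts, and the world models are made-up functions.

```python
import math
from itertools import product

# toy assumptions, none of them from the talk: a policy is a tuple of four
# binary actions, ξ is the uniform distribution over all such tuples, the log
# is base 2, and K(U) is replaced by a hand-picked number of bits.
POLICIES = list(product([0, 1], repeat=4))


def g_factor(utility, actual_policy, policies=POLICIES):
    """g(G|U) = -log2 Pr_{π∈ξ}[U(π) ≥ U(G*)] under a uniform ξ."""
    threshold = utility(actual_policy)
    at_least_as_good = sum(1 for pi in policies if utility(pi) >= threshold)
    return -math.log2(at_least_as_good / len(policies))


def agentyness(utility, bits, actual_policy):
    """g(G|U) - K(U), with `bits` standing in for the uncomputable K(U)."""
    return g_factor(utility, actual_policy) - bits


def utility_posterior(candidates, actual_policy):
    """Pr[U] ∝ 2^-K(U) / Pr_{π∈ξ}[U(π) ≥ U(G*)], normalized over candidates."""
    weights = {
        name: 2.0 ** agentyness(utility, bits, actual_policy)
        for name, (utility, bits) in candidates.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def program_is_running(world_model, actual_output, counterfactual_outputs):
    """True if some counterfactual program output would change the world."""
    baseline = world_model(actual_output)
    return any(world_model(out) != baseline for out in counterfactual_outputs)


# hypothetical utility functions, each paired with a made-up complexity in bits
CANDIDATES = {
    "count_ones": (lambda pi: sum(pi), 3),
    "exact_match": (lambda pi: int(pi == (1, 1, 1, 1)), 6),
    "constant": (lambda pi: 0, 1),
}
G_STAR = (1, 1, 1, 1)  # the policy the toy agent actually runs

posterior = utility_posterior(CANDIDATES, G_STAR)
print(max(posterior, key=posterior.get))

# a toy world that reacts to the program's output, and one that ignores it
print(program_is_running(lambda out: ("lab notebook", out), True, [False]))
print(program_is_running(lambda out: "empty room", True, [False]))
```

with these made-up numbers, `count_ones` gets the highest posterior: the agent is exactly as good at it as at `exact_match` (both have a g-factor of 4), but it is cheaper to describe, while `constant` is cheap but has a g-factor of 0. with a base-2 log, the log of the unnormalized `Pr[U]` is the same quantity as the agentyness score `g(G|U) - K(U)`.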

agentic causality: to determine whether an agent `H`'s executed policy `H*` can causate onto another agent `G`, we can ask whether, if `H` had executed a different policy `π≠H*`, the agent `G` would receive different inputs. we can apparently get an information-theoretic measure of "how impactful" `H*` is onto agent `G` by determining how much mutual information there is between `H*` and `G`'s input.

precursor detection: we say that an agent `H` is a precursor of agent `G` if, counterfactually, `H` could have prevented `G` from existing by executing a policy which is different from its actual policy `H*`.

we can now start to build a definition that lets the AI detect and then classify who its user is.

user detection: the AI is trying to determine who its precursor program could be. but, given a hypothesis for "the thing producing these policies is the precursor", there are infinitely many different programs which could output the observed policies. so we choose the one which is the most agenty, using the function described above: `g(H|U) - K(U)`.

note that while we extrapolate the user's actions into the future, the user is defined as an instant-agent which precedes the AI's existence, such that the actual physical person's actual future actions do not change what utility function the AI should try to maximize. this stops the AI from influencing the user's utility function: we define the user strictly in the past, causally outside of the AI's light-cone. the AI is maximizing the utility function of the instant-user which causated its existence, not that of the continuously existing user-over-time.

user classification: for each potential precursor hypothesis, we have now selected a program that models them and their respective utility functions. we then eliminate some hypotheses as to what the user could be (notably to avoid acausal attacks by remote aliens or counterfactual demons) using the following criteria:

finally, we end up with a hopefully small set of hypotheses as to who the user could be; at that point, we simply compose their utility functions, perhaps weighted by the infra-distribution of each of those hypotheses. this composition is the utility function that the AI should want to maximize, by selecting policies which maximize the utility that the world would have if they were enacted, to the best of the AI's ability to evaluate.
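the causality, precursor and composition steps above can be sketched on a toy example. this is only an illustration under assumptions of my own: `H` has four equally likely policies, `toy_world` is a made-up deterministic world model, plain scalar weights stand in for the infra-distribution over user hypotheses, and the two user utility functions and the AI policies `a`, `b`, `c` are hypothetical.

```python
import math
from collections import Counter

# toy assumptions, none of them from the talk: H has four equally likely
# policies, the world is a small deterministic function of H's policy, and
# scalar weights stand in for the infra-distribution over user hypotheses.
H_POLICIES = [0, 1, 2, 3]
H_STAR = 0  # the policy H actually executed


def toy_world(h_policy):
    """Made-up world model: (does G exist?, which input does G receive?)."""
    g_exists = h_policy != 3
    g_input = h_policy % 2 if g_exists else None
    return g_exists, g_input


def is_precursor(world, actual, policies):
    """H is a precursor of G if G exists and another policy would prevent it."""
    exists_now, _ = world(actual)
    return exists_now and any(not world(pi)[0] for pi in policies if pi != actual)


def mutual_information(pairs):
    """I(X;Y) in bits, from a list of equally likely (x, y) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    p_x = Counter(x for x, _ in pairs)
    p_y = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2((c * n) / (p_x[x] * p_y[y]))
        for (x, y), c in joint.items()
    )


def compose(hypotheses):
    """Weighted sum of the utility functions of the remaining user hypotheses."""
    return lambda policy: sum(w * u(policy) for w, u in hypotheses)


print(is_precursor(toy_world, H_STAR, H_POLICIES))
print(mutual_information([(pi, toy_world(pi)[1]) for pi in H_POLICIES]))

# two hypothetical user hypotheses, scored over three hypothetical AI policies
u1 = {"a": 1.0, "b": 0.0, "c": 0.6}.get
u2 = {"a": 0.0, "b": 1.0, "c": 0.6}.get
combined = compose([(0.5, u1), (0.5, u2)])
print(max(["a", "b", "c"], key=combined))
```

here `H` counts as a precursor because its policy `3` would have prevented `G` from existing, and the composed utility prefers the compromise policy `c` over either hypothesis's own favorite. a real implementation would not enumerate policies like this; the sketch only shows how the definitions fit together.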

vanessa tells us how far along her protocol is, as a collection of pieces that have been completed to various degrees — green parts have gotten some progress, purple parts not as much. "informal PreDCA" is the perspective that she provides in her talk and which is hopefully conveyed by this post.

finally, some takeaways from this informal PreDCA perspective:

my own opinion is that PreDCA is a very promising perspective. it offers, if not full "direct alignment", at least a bunch of pieces that might be of significant use to general AI risk mitigation.