this work was done by Tamsin Leake and Julia Persson at Orthogonal. thanks to mesaoptimizer for his help putting together this post.
what does the QACI plan for formal-goal alignment actually look like when formalized as math? in this post, we'll be presenting our current formalization, which we believe fills in most of the critical details.
1. math constructs
in this first part, we'll be defining a collection of mathematical constructs which we'll be using in the rest of the post.
1.1. basic set theory
we'll be assuming basic set theory notation; in particular, A×B×C is the set of tuples whose elements are respectively members of the sets A, B, and C, and for n∈N, S^n is the set of tuples of n elements, all members of S.
B={⊤,⊥} is the set of booleans and N is the set of natural numbers including 0.
given a set X, P(X) will be the set of subsets of X.
#S is the cardinality (number of distinct elements) of a set S.
for some set X and some total ordering <∈X^2→B, min_< and max_< are two functions of type P(X)∖{∅}→X returning respectively the minimum and maximum element of a non-empty set when they exist, using < as the ordering.
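as a purely illustrative sketch (the function names and the comparison-function representation here are ours, not part of the formalism), min_< and max_< over a finite non-empty set might look like this in python:

```python
# minimal sketch: min_< and max_< over a finite non-empty set,
# parameterized by a comparison function lt(a, b) playing the role of <.
def min_by(lt, s):
    assert len(s) > 0, "min_< is only defined on non-empty sets"
    best = None
    for x in s:
        if best is None or lt(x, best):
            best = x
    return best

def max_by(lt, s):
    assert len(s) > 0, "max_< is only defined on non-empty sets"
    best = None
    for x in s:
        if best is None or lt(best, x):
            best = x
    return best

print(min_by(lambda a, b: a < b, {3, 1, 4}))  # 1
print(max_by(lambda a, b: a < b, {3, 1, 4}))  # 4
```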
1.2. functions and programs
if n∈N, then we'll write f^∘n for the repeated composition of f: f∘…∘f (n times), with ∘ being the composition operator: (f∘g)(x)=f(g(x)).
λx:X.B is an anonymous function defined over set X, whose parameter x is bound to its argument in its body B when it is called.
A→B is the set of functions from A to B, with → being right-associative (A→B→C is A→(B→C)). if f∈A→B→C, then f(x)(y) is simply f applied once to x∈A, and then the resulting function of type B→C applied to y∈B. A→B is sometimes denoted B^A in set theory.
A →ᴴ B is the set of always-halting, always-succeeding, deterministic programs taking as input an A and returning a B.
given f∈A→ᴴB and x∈A, R(f,x)∈N∖{0} is the runtime duration of executing f with input x, measured in compute steps doing a constant amount of work each — such as turing machine updates.
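as a toy sketch (our own code, not part of the formalism): n-fold composition f^∘n, plus a crude stand-in for R(f,x) that counts python-level calls rather than actual low-level compute steps:

```python
# minimal sketch: f∘n as n-fold composition, plus a toy step counter.
# counting python function calls is only a stand-in for R(f, x), which the
# post defines in terms of low-level compute steps (e.g. turing machine updates).
def compose_n(f, n):
    def g(x):
        for _ in range(n):
            x = f(x)
        return x
    return g

steps = 0
def succ(x):
    global steps
    steps += 1  # toy "compute step" counter
    return x + 1

add_three = compose_n(succ, 3)
print(add_three(4))  # 7
print(steps)         # 3 -- our crude stand-in for a runtime measure
```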
1.3. sum notation
i'll be using a syntax for sums ∑ in which the sum iterates over all possible values of the variables listed above it, given that the constraints below it hold.
$$\sum^{x,\,y}_{\substack{y \,=\, x \bmod 2 \\ x \,\in\, \{1,2,3,4\} \\ x \,\le\, 2}} y \;=\; 1$$
says "for any value of x and y where these three constraints hold, sum y".
1.4. distributions
for any countable set X, the set of distributions over X is defined as:
$$\Delta X \coloneqq \left\{\, f \;\middle|\; f \in X \to [0;1],\ \sum^{x}_{x \in X} f(x) \le 1 \,\right\}$$
a function f∈X→[0;1] is a distribution over X (a member of ΔX) if and only if its sum over all of X is at most 1. we call "mass" the scalar in [0;1] which a distribution assigns to any value. note that in our definition of distribution, we do not require that the masses over all elements of the domain sum to exactly 1, but merely to at most 1. this means that different distributions can have different "total mass".
we define Δ^0_X∈ΔX as the empty distribution: Δ^0_X(x)=0.
we define Δ^1_X∈X→ΔX as the distribution entirely concentrated on one element:

$$\Delta^1_X(x)(y) = \begin{cases} 1 & \text{if } y = x \\ 0 & \text{if } y \neq x \end{cases}$$
we define Normalize_X∈ΔX→ΔX which modifies a distribution to make it sum to 1 over all of its elements, except for empty distributions:

$$\mathrm{Normalize}_X(\delta)(x) \coloneqq \begin{cases} \dfrac{\delta(x)}{\sum^{y}_{y \in X} \delta(y)} & \text{if } \delta \neq \Delta^0_X \\[1ex] 0 & \text{if } \delta = \Delta^0_X \end{cases}$$

we define Uniform_X as a distribution attributing equal value to every different element in a finite set X, or the empty distribution if X is infinite:

$$\mathrm{Uniform}_X(x) \coloneqq \begin{cases} \frac{1}{\#X} & \text{if } \#X \in \mathbb{N} \\ 0 & \text{if } \#X \notin \mathbb{N} \end{cases}$$

we define max_ΔX∈ΔX→P(X) as the function finding the elements of a distribution with the highest value:

$$\max_{\Delta X}(\delta) \coloneqq \{\, x \mid x \in X,\ \forall x' \in X : \delta(x') \le \delta(x) \,\}$$
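as an illustrative sketch of how these distribution constructs behave (representing a distribution over a finite set X as a python dict, which is our own choice of representation and only covers finite X):

```python
# minimal sketch of the distribution helpers over a *finite* set X,
# representing a distribution as a dict mapping elements to mass in [0, 1].
def delta0(X):
    # the empty distribution: mass 0 everywhere
    return {x: 0.0 for x in X}

def delta1(X, x0):
    # all mass concentrated on x0
    return {x: (1.0 if x == x0 else 0.0) for x in X}

def normalize(d):
    total = sum(d.values())
    if total == 0:
        return dict(d)          # the empty distribution stays empty
    return {x: m / total for x, m in d.items()}

def uniform(X):
    return {x: 1.0 / len(X) for x in X}

def max_delta(d):
    # the set of elements with the highest mass
    top = max(d.values())
    return {x for x, m in d.items() if m == top}

d = {"a": 0.2, "b": 0.3}        # total mass 0.5 -- allowed, since it's <= 1
print(normalize(d))             # {'a': 0.4, 'b': 0.6}
print(max_delta(d))             # {'b'}
print(uniform({"a", "b", "c"})) # each element gets 1/3
```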
1.5. constrained mass
given distributions, we will define a notation which i'll call "constrained mass".
it is defined as a syntactic structure that turns into a sum:

$$\underset{\substack{x_1 : X_1 \\ \vdots \\ x_n : X_n \\ C_1 \\ \vdots \\ C_m}}{\overset{v_1,\ \ldots,\ v_p}{\mathbb{M}}}\,[V] \;\coloneqq\; \sum^{v_1,\ \ldots,\ v_p}_{\substack{x_1 \,\in\, \mathrm{domain}(X_1) \\ \vdots \\ x_n \,\in\, \mathrm{domain}(X_n) \\ C_1 \\ \vdots \\ C_m}} X_1(x_1) \cdot \ldots \cdot X_n(x_n) \cdot V$$

in which the variables x are sampled from their respective distributions X, such that each instance of V is multiplied by X(x) for each x. the constraints C and the iterated variables v are kept as-is.
it is intended to weigh its expression body V by the various assignments of values to the variables, weighted by how much mass the X distributions return and filtered by whether the C constraints hold.
to take a fairly abstract but fully calculable example,

$$\begin{aligned}
\underset{\substack{x : \lambda n : \{1,2,3\}.\, \frac{n}{10} \\ f : \mathrm{Uniform}_{\{\min,\max\}} \\ x \bmod 2 \neq 0}}{\overset{x,\,f}{\mathbb{M}}}\,[f(x,2)]
&\coloneqq \sum^{x,\,f}_{\substack{x \,\in\, \mathrm{domain}(\lambda n : \{1,2,3\}.\, \frac{n}{10}) \\ f \,\in\, \mathrm{domain}(\mathrm{Uniform}_{\{\min,\max\}}) \\ x \bmod 2 \neq 0}} \left(\lambda n : \{1,2,3\}.\, \tfrac{n}{10}\right)(x) \cdot \mathrm{Uniform}_{\{\min,\max\}}(f) \cdot f(x,2) \\
&= \sum^{x,\,f}_{\substack{x \,\in\, \{1,2,3\} \\ f \,\in\, \{\min,\max\} \\ x \bmod 2 \neq 0}} \frac{x}{10} \cdot \frac{1}{2} \cdot f(x,2) \\
&= \frac{1 \cdot \min(1,2)}{10 \cdot 2} + \frac{3 \cdot \min(3,2)}{10 \cdot 2} + \frac{1 \cdot \max(1,2)}{10 \cdot 2} + \frac{3 \cdot \max(3,2)}{10 \cdot 2} \\
&= \frac{1 \cdot 1 + 3 \cdot 2 + 1 \cdot 2 + 3 \cdot 3}{20} = \frac{1+6+2+9}{20} = \frac{18}{20} = \frac{9}{10}
\end{aligned}$$
in this syntax, what appears below the M can include an arbitrary number of logical constraints or new variable bindings, in addition to the variables being sampled from distributions.
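the worked example above can be brute-forced in python (our own illustrative code): enumerate each variable over the domain of its distribution, keep the assignments satisfying the constraint, and weigh the body by the product of the masses:

```python
# brute-force the constrained-mass example:
# x is drawn from the distribution n -> n/10 on {1,2,3},
# f from the uniform distribution on {min, max},
# subject to the constraint x mod 2 != 0, with body f(x, 2).
x_dist = {1: 1/10, 2: 2/10, 3: 3/10}
f_dist = {min: 1/2, max: 1/2}

total = 0
for x, x_mass in x_dist.items():
    for f, f_mass in f_dist.items():
        if x % 2 != 0:                     # the constraint
            total += x_mass * f_mass * f(x, 2)
print(total)  # 0.9, i.e. 9/10
```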
1.6. bitstrings
B∗ is the set of finite bitstrings.
bitstrings can be compared using the lexicographic order <B∗, and concatenated using the ∥ operator. for a bitstring x∈B∗, |x|∈N is its length in number of bits.
for any countable set X, Encode_X∈X→B∗ will be some reasonable function to convert values to bitstrings, such that ∀(x,y)∈X^2: Encode_X(x)=Encode_X(y)⇔x=y. "reasonable" entails constraints such as the following (see the toy sketch after this list):
it can be computed efficiently.
it can be inverted efficiently and unambiguously.
its output's size is somewhat proportional to the actual amount of information. for example, integers are encoded in binary, not unary.
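as a toy illustration of what a "reasonable" encoding can look like (this is our own example scheme, not the Encode_X the post posits): a natural number is written in binary and prefixed by its bit-length, making the encoding injective, cheap to compute and invert, and roughly proportional in size to the information content:

```python
# toy self-delimiting encoding: a natural number is encoded as
# <length of its binary form, in 8 bits> ++ <its binary form>.
# this is injective, cheap to compute and invert, and roughly proportional
# in size to the information content (binary rather than unary).
def encode_nat(n: int) -> str:
    bits = bin(n)[2:]
    assert len(bits) < 256, "toy scheme: the length field is one byte"
    return format(len(bits), "08b") + bits

def decode_nat(s: str) -> tuple[int, str]:
    # returns the decoded number and the remaining, unread bits
    length = int(s[:8], 2)
    return int(s[8:8 + length], 2), s[8 + length:]

def encode_pair(a: int, b: int) -> str:
    # concatenation stays unambiguous because each part is self-delimiting
    return encode_nat(a) + encode_nat(b)

print(encode_nat(5))                  # '00000011' + '101'
print(decode_nat(encode_nat(5)))      # (5, '')
print(decode_nat(encode_pair(5, 9)))  # (5, '000001001001')
```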
1.7. cryptography
we posit σ≔B^σ̄, the set of "signatures", sufficiently large bitstrings for cryptographic and uniqueness purposes, with their length defined as σ̄=2^31 for now. this feels to me like it should be enough, and if it isn't then something is fundamentally wrong with the whole scheme, such that no manageable larger size would do either.
we posit a function ExpensiveHash∈B∗→ᴴσ, to generate fixed-size strings from seed bitstrings, which must satisfy the following:
it must be too expensive for the AI to compute in any way (including through superintelligently clever tricks), but cheap enough that we can compute it outside of the AI — for example, it could require quantum computation, and the AI could be restricted to classical computers
it should take longer to compute (again, in any way) than the expected correct versions of Loc's f,g functions (as will be defined later) could afford to run
it should tend to be collision-resistant
at some point, we might come up with more formal ways to define ExpensiveHash in a way that checks that it isn't being computed inside Loc's f,g functions, nor inside the AI.
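purely to pin down the intended type B∗→ᴴσ, and emphatically not a candidate construction (no classical iterated hash satisfies the "too expensive for the AI, cheap enough for us" requirement above), here is a toy stand-in in python:

```python
# toy stand-in for ExpensiveHash: fixed-size output derived from a seed
# bitstring by iterating a standard hash many times (key stretching).
# this only illustrates the intended type B* -> sigma; it does NOT satisfy
# the actual requirements (asymmetric cost between us and the AI).
import hashlib

SIGNATURE_BITS = 1024          # placeholder; the post posits a much larger sigma-bar

def expensive_hash_stand_in(seed: bytes, rounds: int = 10**6) -> str:
    digest = seed
    for _ in range(rounds):    # cost scales with `rounds`, but only linearly
        digest = hashlib.sha256(digest).digest()
    # stretch to the signature length by hashing counter-tagged copies
    out = b""
    counter = 0
    while len(out) * 8 < SIGNATURE_BITS:
        out += hashlib.sha256(digest + counter.to_bytes(8, "big")).digest()
        counter += 1
    return "".join(format(byte, "08b") for byte in out)[:SIGNATURE_BITS]

sig = expensive_hash_stand_in(b"example seed", rounds=1000)
print(len(sig))  # 1024 bits
```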
1.8. text and math evaluation
for any countable