this work was done by Tamsin Leake and Julia Persson at Orthogonal.
thanks to mesaoptimizer for his help putting together this post.

what does the QACI plan for formal-goal alignment actually look like when formalized as math? in this post, we'll be presenting our current formalization, which we believe has most critical details filled in.

1. math constructs

in this first part, we'll be defining a collection of mathematical constructs which we'll be using in the rest of the post.

1.1. basic set theory

we'll be assuming basic set theory notation; in particular, is the set of tuples whose elements are respectively members of the sets , , and , and for , is the set of tuples of elements, all members of .

is the set of booleans and is the set of natural numbers including .

given a set , will be the set of subsets of .

is the cardinality (number of different elements) in set .

for some set and some complete ordering , and are two functions of type finding the respective minimum and maximum element of non-empty sets when they exist, using as an ordering.

1.2. functions and programs

if , then we'll denote as repeated composition of : ( times), with being the composition operator: .

is an anonymous function defined over set , whose parameter is bound to its argument in its body when it is called.

is the set of functions from to , with being right-associative ( is ). if , then is simply applied once to , and then the resulting function of type being applied to . is sometimes denoted in set theory.

is the set of always-halting, always-succeeding, deterministic programs taking as input an and returning a .

given and , is the runtime duration of executing with input , measured in compute steps doing a constant amount of work each — such as turing machine updates.

1.3. sum notation

i'll be using a syntax for sums in which the sum iterates over all possibles values for the variables listed above it, given that the constraints below it hold.

says "for any value of and where these three constraints hold, sum ".

1.4. distributions

for any countable set , the set of distributions over is defined as:

a function is a distribution over if and only if its sum over all of is never greater than 1. we call "mass" the scalar in which a distribution assigns to any value. note that in our definition of distribution, we do not require that the distribution over all elements in the domain sums up to 1, but merely that it sums up to at most 1. this means that different distributions can have different "total mass".

we define as the empty distribution: .

we define as the distribution entirely concentrated on one element:

we define which modifies a distribution to make it sum to 1 over all of its elements, except for empty distributions:

we define as a distribution attributing equal value to every different element in a finite set , or the empty distribution if is infinite.

we define as the function finding the elements of a distribution with the highest value:

1.5. constrained mass

given distributions, we will define a notation which i'll call "constrained mass".

it is defined as a syntactic structure that turns into a sum:

in which variables are sampled from their respective distributions , such that each instance of is multiplied by for each . constraints and iterated variables are kept as-is.

it is intended to weigh its expression body by various sets of assignments of values to the variables , weighed by how much mass the distributions return and filtered for when the constraints hold.

to take a fairly abstract but fully calculable example,

in this syntax, the variables being sampled from distributions are allowed to be bound by an arbitrary amount of logical constraints or new variable bindings below it, other than the variables being sampled from distributions.

1.6. bitstrings

is the set of finite bitstrings.

bitstrings can be compared using the lexicographic order , and concatenated using the operator. for a bitstring , is its length in number of bits.

for any countable set , and will be some reasonable function to convert values to bitstrings, such that . "reasonable" entails constraints such as:

  • it can be computed efficiently.
  • it can be inverted efficiently and unambiguously.
  • its output's size is somewhat proportional to the actual amount of information. for example, integers are encoded in binary, not unary.

1.7. cryptography

we posit , the set of "signatures", sufficiently large bitstrings for cryptographic and uniqueness purposes, with their length defined as for now. this feels to me like it should be enough, and if it isn't then something is fundamentally wrong with the whole scheme, such that no manageable larger size would do either.

we posit a function , to generate fixed-sized strings from seed bitstrings, which must satisfy the following:

  • it must be too expensive for the AI to compute in any way (including through superintelligently clever tricks), but cheap enough that we can compute it outside of the AI — for example, it could require quantum computation, and the AI could be restricted to classical computers
  • it should take longer to compute (again, in any way) than the expected correct versions of 's functions (as will be defined later) could afford to run
  • it should tend to be collision-resistant

at some point, we might come up with more formal ways to define in a way that checks that it isn't being computed inside 's functions, nor inside the AI.

1.8. text and math evaluation

for any countable