Introduction: A rational number can represent any sequence of bits whose pattern either terminates or eventually repeats forever (its binary expansion). This provides a mathematical toy model for reasoning about AGI safety: a bit sequence can be translated into a sequence of actions, and the goal is to prove that some bit sequence is safe with no bound on its length. While this toy model does not allow the AGI to receive input, it offers useful insight into the complexity of safe AGI in the case where the entire deterministic world state and human values have been internalized.
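
To make the first claim concrete, here is a minimal Python sketch that uses long division to split the binary expansion of a rational number into a finite prefix and a repeating period. It assumes the rational lies in [0, 1), and the helpers `binary_expansion` and `first_bits` are hypothetical names of my own. How bits map to actions is left open here.

```python
from fractions import Fraction


def binary_expansion(x: Fraction) -> tuple[str, str]:
    """Return (prefix, period): the binary digits of x after the point.

    Assumes 0 <= x < 1. An empty period means the expansion terminates.
    """
    if not 0 <= x < 1:
        raise ValueError("x must satisfy 0 <= x < 1")
    num, den = x.numerator, x.denominator
    digits: list[str] = []
    seen: dict[int, int] = {}  # remainder -> index of the digit it produces
    while num != 0 and num not in seen:
        seen[num] = len(digits)
        num *= 2
        digits.append(str(num // den))  # next binary digit
        num %= den                      # remainder carried to the next step
    if num == 0:
        return "".join(digits), ""      # terminating expansion
    start = seen[num]                   # a remainder repeated: period found
    return "".join(digits[:start]), "".join(digits[start:])


def first_bits(x: Fraction, n: int) -> str:
    """First n binary digits of x, repeating the period (or zeros) as needed."""
    prefix, period = binary_expansion(x)
    out = prefix
    while len(out) < n:
        out += period if period else "0"
    return out[:n]
```

Because the remainder is always smaller than the denominator, the loop must stop, and every bit after the prefix is fixed by a finite period. For example, 1/6 gives prefix `0` and period `01`, i.e. 0.0010101... in binary.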

I have been thinking about numbers of the form:

$$\left(\frac{a}{b}\right)^{1/c}$$

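where $a$, $b$, $c$ are natural numbers, with $b$ and $c$ nonzero. Numbers like this are the building blocks of what number theory calls radical extensions of the rational numbers; I will call the set of them RER ("Radical Extensions of Rational numbers") for short in this blog post.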

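RER is closed under multiplication, but not under addition. The product of two RER numbers is again an RER number:

$$\left(\frac{a_0}{a_1}\right)^{1/a_2} \cdot \left(\frac{b_0}{b_1}\right)^{1/b_2} = \left(\frac{a_0^{b_2}\, b_0^{a_2}}{a_1^{b_2}\, b_1^{a_2}}\right)^{1/(a_2 b_2)}$$

A sum, on the other hand, can leave the set: for example, $\sqrt{2} + \sqrt{3}$ is not in RER, because no positive integer power of it is rational.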
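
A minimal Python sketch of the multiplication rule, using exact `fractions.Fraction` arithmetic. The `RER` class is a hypothetical representation of my own: it stores $a/b$ as `base` and $c$ as `root`, standing for `base ** (1 / root)`.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class RER:
    """Hypothetical representation of base ** (1 / root)."""
    base: Fraction  # the rational a / b, assumed non-negative
    root: int       # the natural number c, assumed >= 1

    def __post_init__(self) -> None:
        if self.root < 1 or self.base < 0:
            raise ValueError("need root >= 1 and base >= 0")

    def __mul__(self, other: "RER") -> "RER":
        # x^(1/m) * y^(1/n) = (x^n * y^m)^(1/(m*n)), computed exactly.
        return RER(self.base ** other.root * other.base ** self.root,
                   self.root * other.root)

    def __float__(self) -> float:
        # Floating-point approximation of the value, for checking only.
        return float(self.base) ** (1 / self.root)
```

For instance, `RER(Fraction(2), 2) * RER(Fraction(3), 3)` gives `RER(base=Fraction(72, 1), root=6)`, i.e. $\sqrt{2} \cdot \sqrt[3]{3} = 72^{1/6}$. The representation is not canonical ($9^{1/4} = 3^{1/2}$), so `==` compares representations rather than values; comparing values would require reducing to a normal form first.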

RER covers the non-negative part of the real line a little better than the rationals, since it includes some irrational numbers such as $\sqrt{2}$. It still does not cover it completely. Even a generalization of RER that allows sums, products, quotients and nested radicals cannot reach every solution of a polynomial equation: by the Abel-Ruffini theorem, some polynomial equations of degree five or higher have roots that cannot be expressed by radicals. A standard example is the real root of $x^5 - x - 1 = 0$.

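These "more irrational" numbers are algebraic numbers that cannot be expressed by radicals. Number theory has no single standard name for them, and they are not transcendental: a transcendental number such as π is, by definition, not algebraic at all. In this post I will call them "higher-order algebraic numbers".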
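
The real root of $x^5 - x - 1 = 0$ is a standard example of an algebraic number with no expression in radicals, yet it can still be approximated to any precision by rationals. A minimal Python sketch using bisection with exact `Fraction` arithmetic (the names `quintic` and `bracket_root` are hypothetical) shows this. It only approximates the root; the fact that no radical expression exists comes from Galois theory, not from this code.

```python
from fractions import Fraction


def quintic(x: Fraction) -> Fraction:
    """The polynomial x^5 - x - 1, evaluated exactly."""
    return x**5 - x - 1


def bracket_root(steps: int) -> tuple[Fraction, Fraction]:
    """Bisect [1, 2] `steps` times, keeping quintic(lo) <= 0 < quintic(hi)."""
    lo, hi = Fraction(1), Fraction(2)  # quintic(1) = -1, quintic(2) = 29
    for _ in range(steps):
        mid = (lo + hi) / 2
        if quintic(mid) <= 0:
            lo = mid  # the root lies in [mid, hi]
        else:
            hi = mid  # the root lies in [lo, mid)
    return lo, hi
```

Each step halves the bracket, so after 40 steps the root, about 1.16730, is pinned down to within $2^{-40}$. Every finite guess, however, is still a rational number.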

Translated as an analogy to computer science:

- A natural number might be thought of as the "source" of some program.
- A rational number might be thought of as a "computationally reducible" calculation that uses some source as input.
- An RER might be thought of as a simple encryption algorithm.
- A higher-order algebraic number might be thought of as a recursive encryption scheme that can make some message arbitrarily difficult to interpret.

Translated as an analogy to AI safety:

- A natural number might be thought of as a sequence of actions.
- A rational number might be thought of as the operational safety of a sequence of actions.
- An RER might be thought of as simple yet non-trivial contextual safety of a sequence of actions.
- A higher-order algebraic number might be thought of as a human goal.

Since human goals are difficult to encode, this analogy suggests thinking of encoding them as guessing a higher-order algebraic number.

The analogy also gives some insight into predictability. For example, consider applying it to a Large Language Model (LLM). A rational number corresponds to a sequence of tokens that is predictable and unsurprising, even if it is infinitely long. Doing useful work in new environments requires capabilities at the level of RER or higher, so that the sequence of tokens is not predictable and appears to cover the domain of application (RER contains some irrational numbers that, at first sight, look indistinguishable from more irrational numbers). However, at some point we discover that the LLM outputs an unsafe sequence of tokens: it deviates from the analogue of an infinite sequence of tokens that is both unpredictable in advance and safe (which is like guessing some higher-order algebraic number).

Under this analogy, then, the complexity of safe AGI corresponds to the complexity of guessing a higher-order algebraic number. It is very hard to tell during training whether the system will be safe in the real world, because a system can output sequences of actions that look safe during training without being safe beyond it. Only a limited amount of data is available to aim the AGI in the right direction, and we want it to keep going in that direction, correcting itself as it goes. This means it must not only output sequences of actions that look safe at first, but also "really get" what safety means in the long term.
