Yes, it seems like both creating a "New Shortform" when hovering over my user name and commenting on "Leon Lang's Shortform" will do the exact same thing. But I can also reply to the comments.

## Making the Telephone Theorem and Its Proof Precise

This short form distills the Telephone Theorem and its proof. It will not be at all "intuitive"; the only goal is to be mathematically precise at every step.

Let M0,M1,… be jointly distributed finite random variables, meaning they are all functions

Mi:Ω→𝓜i

starting from the same finite sample space Ω, with a given probability distribution P, and mapping into respective finite value spaces 𝓜i. Additionally, assume that these random variables form a Markov chain M0→M1→⋯.
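To make the setup concrete, here is a small numpy sketch (my own illustration; the value-space sizes are arbitrary) of a joint distribution over three finite random variables that forms a Markov chain by construction:

```python
import numpy as np

# A small joint distribution P(M0, M1, M2) over finite value spaces,
# built so that M0 -> M1 -> M2 is a Markov chain by construction:
# P(m0, m1, m2) = P(m0) * P(m1 | m0) * P(m2 | m1).
rng = np.random.default_rng(0)
p0 = rng.dirichlet(np.ones(3))           # P(M0); |value space| = 3 (arbitrary)
t01 = rng.dirichlet(np.ones(4), size=3)  # P(M1 | M0); each row sums to 1
t12 = rng.dirichlet(np.ones(2), size=4)  # P(M2 | M1)

joint = np.einsum("i,ij,jk->ijk", p0, t01, t12)  # P(m0, m1, m2)
assert np.isclose(joint.sum(), 1.0)

# Markov check: P(m2 | m0, m1) does not depend on m0.
cond = joint / joint.sum(axis=2, keepdims=True)
print(np.allclose(cond, cond[0]))  # True
```

Everything below can be phrased in terms of such a joint table; conditionals and mutual informations are then finite sums.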

Lemma: For a Markov chain M0→M1→M2, the following two statements are equivalent:

(a) I(M1;M0)=I(M2;M0)

(b) For all m1,m2 with P(m1,m2)>0: P(M0∣m1)=P(M0∣m2).

Proof:

Assume (a): Inspecting an information diagram of M0,M1,M2 shows that we then also have the Markov chain M0→M2→M1. Since Markov chains can be reversed, we obtain the two chains

M1→M2→M0, M2→M1→M0.

Factorizing along these two chains, we obtain:

P(M0∣M2)⋅P(M1,M2)=P(M0,M1,M2)=P(M0∣M1)⋅P(M1,M2)

and thus, for all m1,m2 with P(m1,m2)>0: P(M0∣m1)=P(M0∣m2). That proves (b).

Assume (b): We have

P(M0,M1∣M2) = P(M0∣M1,M2)⋅P(M1∣M2) = P(M0∣M1)⋅P(M1∣M2) = P(M0∣M2)⋅P(M1∣M2),

where, in the second step, we used the Markov chain M0→M1→M2 and, in the third step, we used assumption (b). This conditional independence of M0 and M1 given M2 gives us the vanishing of conditional mutual information:

I(M0;M1∣M2)=0.

Together with the Markov chain M0→M1→M2, this results, by inspecting an information diagram, in the equality I(M1;M0)=I(M2;M0). □
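As a numerical sanity check of the lemma (my own addition, not part of the original argument), consider a hypothetical chain in which M2 is an invertible relabeling of M1, so no information about M0 is lost; both (a) and (b) should then hold:

```python
import numpy as np

def mutual_information(pxy):
    """I(X;Y) in nats, from a joint table pxy[x, y]."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px * py)[mask])).sum())

# Hypothetical chain M0 -> M1 -> M2 where M2 is an invertible
# relabeling of M1, so no information about M0 is lost.
rng = np.random.default_rng(1)
p0 = rng.dirichlet(np.ones(3))
t01 = rng.dirichlet(np.ones(4), size=3)  # P(M1 | M0)
t12 = np.eye(4)[[2, 0, 3, 1]]            # P(M2 | M1): a permutation
joint = np.einsum("i,ij,jk->ijk", p0, t01, t12)

p01, p02, p12 = joint.sum(axis=2), joint.sum(axis=1), joint.sum(axis=0)
print(np.isclose(mutual_information(p01), mutual_information(p02)))  # (a): True

post1 = p01 / p01.sum(axis=0)  # column m1 is P(M0 | m1)
post2 = p02 / p02.sum(axis=0)  # column m2 is P(M0 | m2)
b_holds = all(np.allclose(post1[:, m1], post2[:, m2])
              for m1 in range(4) for m2 in range(4) if p12[m1, m2] > 0)
print(b_holds)  # (b): True
```

Replacing the permutation by a lossy map (e.g., one that merges two values of M1 with different posteriors) breaks both conditions at once, as the lemma predicts.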

Theorem: Let n≥1. The following are equivalent:

(a) I(Mn;M0)=I(Mn+1;M0)

(b) There are functions fn,fn+1 defined on 𝓜n,𝓜n+1, respectively, such that:

1. fn(Mn)=fn+1(Mn+1) with probability 1, i.e., the measure of all ω∈Ω such that the equality doesn't hold is zero.

2. For all mn∈𝓜n, we have the equality P(M0∣Mn=mn)=P(M0∣fn(Mn)=fn(mn)), and the same for n+1.

Proof: The Markov chain immediately also gives us a Markov chain M0→Mn→Mn+1, meaning we can without loss of generality assume that n=1. So let's consider the simple Markov chain M0→M1→M2.

Assume (a): By the lemma, this gives us for all m1,m2 with P(m1,m2)>0: P(M0∣m1)=P(M0∣m2).

Define the two functions fi:𝓜i→Δ(𝓜0), i=1,2, by:

fi(mi):=P(M0∣mi).

Then we have f1(m1)=f2(m2) with probability 1,[1] giving us the first condition we wanted to prove.

For the second condition, we use a trick from Probability as Minimal Map: set p:=f1(m1), which is a probability distribution. We get

P(M0∣f1(M1)=f1(m1)) = P(M0∣f1(M1)=p)
= P(M0, f1(M1)=p) / P(f1(M1)=p)
= (∑_{m1′:f1(m1′)=p} P(M0,m1′)) / (∑_{m1′:f1(m1′)=p} P(m1′))
= (∑_{m1′:f1(m1′)=p} P(M0∣m1′)⋅P(m1′)) / (∑_{m1′:f1(m1′)=p} P(m1′))
= (∑_{m1′:f1(m1′)=p} p⋅P(m1′)) / (∑_{m1′:f1(m1′)=p} P(m1′))
= p = f1(m1) = P(M0∣M1=m1).

The same argument applies to f2. That proves (b).

Assume (b): For the other direction, let m1,m2 be given with P(m1,m2)>0. Let ω∈Ω be such that (M1(ω),M2(ω))=(m1,m2) and with P(ω)>0. We have

f1(m1) = [f1(M1)](ω) = [f2(M2)](ω) = f2(m2)

and thus

P(M0∣m1) = P(M0∣f1(M1)=f1(m1)) = P(M0∣f2(M2)=f2(m2)) = P(M0∣m2).[2]

The result follows from the Lemma. □

[1] Somehow, my brain didn't find this obvious. Here is an explanation:

P({ω ∣ f1(M1(ω))≠f2(M2(ω))}) ≤ P({ω ∣ P(M1(ω),M2(ω))=0})
= P((M1,M2)⁻¹({(m1,m2) ∣ P(m1,m2)=0}))
= ∑_{m1,m2:P(m1,m2)=0} P((M1,M2)⁻¹(m1,m2))
= ∑_{m1,m2:P(m1,m2)=0} P(m1,m2) = 0.

[2] There is some subtlety about whether the random variable f1(M1) can be replaced by f2(M2) in that equation. But given that they are "almost" the same random variables, I think this is valid inside the probability equation.
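The pooling step in the derivation, P(M0∣M1=m1)=P(M0∣f1(M1)=f1(m1)), can be checked numerically. The following sketch (my own toy numbers) builds a chain in which two values of M1 share the same posterior over M0, so that f1 genuinely pools distinct values:

```python
import numpy as np

# Construction f1(m1) := P(M0 | m1), checked on toy numbers where
# columns 1 and 2 of P(M1 | M0) induce the same posterior over M0.
p0 = np.array([0.5, 0.5])            # P(M0)
t01 = np.array([[0.6, 0.2, 0.2],     # P(M1 | M0)
                [0.2, 0.4, 0.4]])
p01 = p0[:, None] * t01              # P(M0, M1)
p1 = p01.sum(axis=0)                 # P(M1)
posteriors = p01 / p1                # column m1 is f1(m1) = P(M0 | m1)

def cond_given_f1(m1):
    # Pool all m1' with f1(m1') == f1(m1), then recompute the conditional,
    # mirroring the sum over {m1' : f1(m1') = p} in the proof.
    same = [m for m in range(3) if np.allclose(posteriors[:, m], posteriors[:, m1])]
    return p01[:, same].sum(axis=1) / p1[same].sum()

for m1 in range(3):
    assert np.allclose(cond_given_f1(m1), posteriors[:, m1])
print("P(M0 | M1=m1) == P(M0 | f1(M1)=f1(m1)) for all m1")
```

Here f1 maps the three values of M1 to only two distinct posteriors, yet conditioning on the coarser event f1(M1)=f1(m1) recovers exactly the same distribution over M0.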

These are rough notes trying (but not really succeeding) to deconfuse me about Alex Turner's diamond proposal. The main thing I wanted to clarify: what's the idea here for how the agent remains motivated by diamonds even while doing very non-diamond-related things like "solving mazes" that are required for general intelligence?

Summarizing Alex's summary:

- multimodal SSL initialization
- recurrent state, action head
- imitation learning on humans in simulation, + sim2real
- low sample complexity
- humans move toward diamonds
- policy-gradient RL: reward the AI for getting near diamonds
- the recurrent state retains long-term information
- after each task completion: the AI is near diamonds
- SSL will make sure the diamond abstraction exists

Proto Diamond shard:

There is a diamond abstraction that will be active once a diamond is seen. Imagine this as being a neuron.

Then, hook up the "move forward" action with this neuron being active. Give reward for being near diamonds. Voilà, you get an agent which obtains reward! This is very easy to learn, more easily than other reward-obtaining computations.

Also, other such computations may be reinforced, like "if shiny object seen, move towards it"; do adversarial training to rule those out.

This is all about prototypical diamonds. Thus, the AI may not learn to create a diamond as large as a sun, but that's also not what the post is about.

Preserving diamond abstraction/shard:

In Proto planning, the AI primarily thinks about how to achieve diamonds. Such thinking is active across basically all contexts, due to early RL training.

Then, we will give the AI other types of tasks, like "maze solving" or "chess playing" or anything else, from very easy to very hard.

At the end of each task, there will be a diamond and reward.

By default, at the start of this new training process, the diamond shard will be active since training so far ensures it is active in most contexts. It will bid for actions before the reward is reached, and therefore, its computations will be reinforced and shaped. Also, other shards will be reinforced (ones that plan how to solve a maze, since they also steer toward the reinforcement event), but the diamond shard is ALWAYS reinforced.

The idea here is that the diamond shard is contextually activated BY EVERY CONTEXT, and so it is basically one huge circuit thinking about how to reach diamonds that simply gets extended with more sub-computations for how to reach diamonds.

Caveat: another shard may be better at planning toward the end of a maze than the diamond shard, which "isn't specialized". And if that's the case, then reinforcement events may make the diamond shard continuously less active in maze-solving contexts until it doesn't activate anymore at start-of-maze contexts. It's unclear to me what the hypothesis is for how to prevent this.

Possibly the hypothesis is captured in this paragraph of Alex's, but I don't understand it: "In particular, even though online self-supervised learning continues to develop the world model and create more advanced concepts, the reward events also keep pinging the invocation of the diamond-abstraction as responsible for reward (because insofar as the agent's diamond-shard guides its decisions, then the diamond-shard's diamond-abstraction is in fact responsible for the agent getting reward). The diamond-abstraction gradient starves the AI from exclusively acting on the basis of possible advanced "alien" abstractions which would otherwise have replaced the diamond abstraction. The diamond shard already gets reward effectively, integrating with the rest of the agent's world model and recurrent state, and therefore provides "job security" for the diamond-abstraction. (And once the agent is smart enough, it will want to preserve its diamond abstraction, insofar as that is necessary for the agent to keep achieving its current goals which involve prototypical-diamonds.)"

I don't understand what it means to "ping the invocation of the diamond-abstraction as responsible for reward". I can imagine what it means to have subcircuits whose activation is strengthened on certain inputs, or whose computations (if they were active in the context) are changed in response to reinforcement. And so, I imagine the shard itself to be shaped by reward. But I'm not sure what exactly is meant by pinging the invocation of the diamond abstraction as responsible for reward.

what's the idea here for how the agent remains motivated by diamonds even while doing very non-diamond-related things like "solving mazes" that are required for general intelligence?

I think that the agent probably learns a bunch of values, many related to gaining knowledge and solving games and such. (People are also like this; notice that raising a community-oriented child does not require a proposal for how the kid will only care about their community, even as they go through school and such.)

Also, other shards will be reinforced (ones that plan how to solve a maze, since they also steer toward the reinforcement event), but the diamond shard is ALWAYS reinforced.

The idea here is that the diamond shard is contextually activated BY EVERY CONTEXT, and so it is basically one huge circuit thinking about how to reach diamonds that simply gets extended with more sub-computations for how to reach diamonds.

I think this is way stronger of a claim than necessary. I think it's fine if the agent learns some maze-/game-playing shards which do activate while the diamond-shard doesn't -- it's a quantitative question, ultimately. I think an agent which cares about playing games and making diamonds and some other things too, still ends up making diamonds.

I don't understand what it means to "ping the invocation of the diamond-abstraction as responsible for reward".

Credit assignment (AKA policy gradient) credits the diamond-recognizing circuit as responsible for reward, thereby retaining this diamond abstraction in the weights of the network.

Credit assignment (AKA policy gradient) credits the diamond-recognizing circuit as responsible for reward, thereby retaining this diamond abstraction in the weights of the network.

Thanks for your answer!

This is different from how I imagine the situation. In my mind, the diamond-circuit remains simply because it is a good abstraction for making predictions about the world. Its existence is, in my imagination, not related to an RL update process.

Other than that, I think the rest of your comment doesn't quite answer my concern, so I'll try to formalize it more. Let's work in the simple setting where the policy network has no world model and is simply a non-recurrent function f:O→Δ(A) mapping from observations to probability distributions over actions. I imagine a simple version of shard theory would claim that f decomposes as follows:

f(o)=SM(∑iai(o)⋅fi(o)),

where i is an index for enumerating shards, ai(o) is the contextual strength of activation of the i-th shard (maybe with 0≤ai(o)≤1), and fi(o) is the action-bid of the i-th shard, i.e., the vector of log-probabilities it would like to see for different actions. Then SM is the softmax function, producing the final probabilities.
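For concreteness, here is a minimal Python sketch of this decomposition (all shard names, activations, and numbers are made up for illustration):

```python
import numpy as np

def softmax(x):
    z = x - x.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def shard_policy(o, activations, action_bids):
    """Toy instance of f(o) = SM(sum_i a_i(o) * f_i(o)).

    activations[i](o) is in [0, 1]; action_bids[i](o) is the i-th shard's
    vector of log-probabilities over actions."""
    logits = sum(a(o) * f(o) for a, f in zip(activations, action_bids))
    return softmax(logits)

# Shard 0 ("diamond"): active in every context, prefers action 0.
# Shard 1 ("maze"): active only in maze contexts, prefers action 2.
activations = [lambda o: 1.0,
               lambda o: 1.0 if o == "maze" else 0.0]
action_bids = [lambda o: np.log([0.8, 0.1, 0.1]),
               lambda o: np.log([0.05, 0.05, 0.9])]

print(shard_policy("open field", activations, action_bids))  # diamond shard alone
print(shard_policy("maze", activations, action_bids))        # both shards bid
```

In the "open field" context only the diamond shard bids, so its preferred distribution is returned unchanged; in the "maze" context the two log-probability bids add, shifting mass toward the maze shard's preferred action while the diamond shard retains influence.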

In your story, the diamond shard starts out as very strong. Let's say it's indexed by 0 and that a0(o)≈1 for most inputs o and that f0 has a large "capacity" at its disposal so that it will in principle be able to represent behaviors for many different tasks.

Now, if a new task pops up, like solving a maze, in a specific context om, I imagine that two things could happen to make this possible:

- f0(om) could get updated to also represent this new behavior.
- The strength a0(o) could get weighed down and some other shard could learn to represent this new behavior.

One reason why the latter may happen is that f0 possibly becomes so complicated that it's "hard to attach more behavior to it"; maybe it's just simpler to create an entirely new module that solves this task and doesn't care about diamonds. If something like this happens often enough, then eventually, the diamond shard may lose all its influence.

One reason why the latter may happen is that f0 possibly becomes so complicated that it's "hard to attach more behavior to it"; maybe it's just simpler to create an entirely new module that solves this task and doesn't care about diamonds. If something like this happens often enough, then eventually, the diamond shard may lose all its influence.

I don't currently share your intuitions for this particular technical phenomenon being plausible, but I imagine there are other possible reasons this could happen, so sure? I agree that there are some ways the diamond-shard could lose influence. But mostly, again, I expect this to be a quantitative question, and I think experience with people suggests that trying a fun new activity won't wipe away your other important values.

This is my first comment on my own, i.e., Leon Lang's, shortform. It doesn't have any content, I just want to test the functionality.

Unfortunately not, as far as my interface goes, if you wanted to comment here.



This is my first short form. It doesn't have any content, I just want to test the functionality.