Maxime Riché



This is a big reason why GPT-4 is likely not that big but is instead trained on much more data :)

Do you also have estimates of the fraction of resources in our light cone that we expect to be used to create optimised good stuff?

Maybe the use of prompt suffixes can do a great deal to decrease the probability of chatbots turning into Waluigis. See the "insert" functionality of the OpenAI API.
Chatbot developers could use suffix prompts in addition to prefix prompts to make falling into a Waluigi completion less likely.
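A minimal sketch of the idea (the `build_prompt` helper and all prompt text are hypothetical, not from the original comment): the developer controls both a prefix before the user's message and a suffix that pins down how the exchange must end, so the model completes the middle rather than free-running. With OpenAI's legacy completions API, such a suffix could be passed via the `suffix` parameter of the insert functionality.

```python
def build_prompt(user_message: str) -> tuple[str, str]:
    """Build a (prefix, suffix) pair around the user's message.

    The suffix constrains how the completion must end, which makes a
    'Waluigi' derailment of the continuation less likely.
    All prompt text here is illustrative, not an official template.
    """
    prefix = (
        "You are a helpful, honest assistant.\n"
        f"User: {user_message}\n"
        "Assistant:"
    )
    # With the legacy insert functionality, this string would be passed
    # as the `suffix` parameter so the model fills in the middle.
    suffix = "\nUser: Thanks, that was helpful and harmless.\n"
    return prefix, suffix

prefix, suffix = build_prompt("How do I bake bread?")
```

The point is that the completion is sandwiched between a benign prefix and a benign suffix, rather than only being conditioned from one side.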

Indeed, empirical results show that filtering the data helps quite a lot in aligning with some preferences: Pretraining Language Models with Human Preferences

What about the impact of dropout (of parameters or layers), normalisation (batch or layer, with a batch containing several episodes), asynchronous distributed data collection (making batch aggregation more stochastic), weight decay (which impacts every weight), multi-agent RL training with independent agents, etc.?
And other possible techniques that don't exist at the moment: online pruning and growth while training, or population training where the gradient hackers are exploited.

Shouldn't that naively make gradient hacking very hard?

We see a lot of people die, in reality, in fiction, and in dreams.

We also see a lot of people having sex or experiencing sexual desire in fiction or dreams before experiencing it ourselves.

IDK how strong a counterargument this is to the claim that the alignment in us is powerful. Maybe a biological reward system + imitation + fiction (and later dreams) is simply what is at play in humans.

Should we expect these decompositions to be even more interpretable if the model were trained to output a prediction as soon as possible (after any block, instead of only after the full network)?
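As a toy illustration of the architecture the question points at (everything here is a hypothetical sketch, not from the original comment): a stack of residual blocks where each block has its own prediction head, so a class distribution is available at every depth. Training each head to be correct as early as possible is the extra ingredient; here only the forward structure is shown, with random weights.

```python
import math
import random

random.seed(0)
d, n_blocks, n_classes = 4, 3, 2  # toy sizes, chosen arbitrarily

def rand_matrix(rows, cols, scale=0.5):
    return [[random.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def vecmat(v, M):
    """Row vector v (len r) times matrix M (r x c) -> list of length c."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

blocks = [rand_matrix(d, d) for _ in range(n_blocks)]       # residual blocks
heads = [rand_matrix(d, n_classes) for _ in range(n_blocks)]  # one head per block

def early_exit_predictions(x):
    """Return a class distribution after every block, not only the last one."""
    preds = []
    h = x
    for W, H in zip(blocks, heads):
        h = [hi + math.tanh(zi) for hi, zi in zip(h, vecmat(h, W))]  # residual update
        preds.append(softmax(vecmat(h, H)))  # prediction available at this depth
    return preds

preds = early_exit_predictions([random.uniform(-1, 1) for _ in range(d)])
```

With such per-depth heads, one could then ask whether the per-block contributions to the final prediction decompose more cleanly.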

Some quick thoughts about "Content we aren’t (yet) discussing":


Shard theory should be about the transmission of values by SL (teaching, cloning, inheritance) more than about learning them using RL

SL (cloning) is more important than RL. Humans learn a world model through SSL, then they bootstrap their policies through behavioural cloning, and finally they fine-tune their policies through RL.

Why? For theoretical reasons and from experimental data points, this is the cheapest way to generate good general policies…

  • SSL before SL because you get much more frequent and much denser data about the world by trying to predict it. => SSL before SL because of a bottleneck on the data available for SL.
  • SL before RL because this removes half (on a log scale) of the search space, by removing the need to discover|learn your reward function at the same time as your policy function. In addition, it removes the need to do very expensive exploration and temporal and "agential" (when multi-agent) credit assignment. => SL before RL because of the cost of doing RL.


  • In cloning, the behaviour comes first and the biological reward is then observed or not. Behaviours that give no biological reward to the subject can still be learned. The subject will still learn some kind of values associated with these behaviours.
  • Learning with SL, instead of RL, doesn’t rely as much on credit assignment and exploration. What are the consequences of that?

What values are transmitted?

1) The final values

The learned values known by the previous generation.


  • Because it is costly to explore your reward-function space by yourself
  • Because it is beneficial to the community to help you improve your policies quickly

2) Internalised instrumental values

Some instrumental goals are learned as final goals; they are "internalised".


  • exploration is too costly
    • finding an instrumental goal is too rare or too costly
  • exploitation is too costly
    • having to make the choice of pursuing an instrumental goal in every situation is too costly or not quick enough (reaction time)
  • when being highly credible is beneficial
    • implicit commitments to increase your credibility

3) Non-internalised instrumental values


  • Because it is beneficial to the community to help you improve your policies quickly



Shard theory is not about the 3rd level of reward function

We have here 3 levels of reward functions:

1) The biological rewards

Hardcoded in our body

Optimisation process creating it: Evolution

  • Universe + Evolution ⇒ Biological rewards

Not really flexible

  • Without “drugs” and advanced biotechnologies

Almost no generalization power

  • Physical scope: We feel stuff when we are directly involved
  • Temporal scope: We feel stuff when they are happening
  • Similarity scope: We feel stuff when we are directly involved

Called sensations, pleasure, pain

2) The learned values | rewards | shards

Learned through life

Optimisation process creating it: SL and RL relying on biological rewards

  • Biological rewards + SL and RL ⇒ Learned values in the brain

Flexible on the timescale of years

Medium generalization power

  • Physical scope: We learn to care even in cases where we are not involved (our close circle)
  • Temporal scope: We learn to feel emotions about the future and the past
  • Similarity scope: We learn to feel emotions for other kinds of beings

Called intuitions, feelings

Shard theory may explain only this part

3) (optional) The chosen values

Decided upon reflection

Optimisation process creating it: Thinking relying on the brain

  • Learned values in the brain + Thinking ⇒ Chosen values “on paper” | “in ideas”

Flexible on the timescale of minutes

Can have up to very high generalization power

  • Physical scope: We can choose to care without limits on distance in space
  • Temporal scope: We can choose to care without limits on distance in time
  • Similarity scope: We can choose to care without limits in terms of similarity to us

Called values, moral values


Why was a 3rd level created?

In short, to get more utility OOD.

A bit more details:

Because we want to design policies far OOD (out of the space of our lived experiences). To do that, we know that we need a value function|reward model|utility function that generalizes very far. Thanks to this chosen general reward function, we can plan and try to reach a desired outcome far OOD. After reaching it, we update our learned utility function (lvl 2).

Thanks to lvl 3, we can design public policies, or dedicate our life to exploring the path towards a larger reward that will never be observed in our lifetime.


One impact of the 3-level hierarchy:

This could explain why most philosophers can support scope sensitive values but never act on them.

You can see the sum of the votes and the number of votes (by hovering your mouse over the number). This should be enough to give you a rough idea of the ratio between + and - votes :)

If you look at the logit given a range that is not [0.0, 1.0] but [low perf, high perf], then you get a bit more predictive power, but it is still confusingly low.

A possible intuition here is that the scaling is producing a transition from non-zero performance to non-perfect performance. This seems right since the random baseline is not 0.0 and reaching perfect accuracy is impossible. 

I tried this only with PaLM on NLU and I used the same adjusted range for all tasks:

[0.9 * overall min. acc., 1.0 - 0.9 * (1.0 - overall max acc.)] ~ [0.13, 0.95]
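The rescaled logit described above can be sketched as follows, assuming the adjusted range [0.13, 0.95]: the accuracy is first mapped from [low, high] back to (0, 1), then the usual logit is taken.

```python
import math

def adjusted_logit(acc, low=0.13, high=0.95):
    """Logit of accuracy rescaled from [low, high] to (0, 1).

    low/high correspond to the comment's
    [0.9 * overall min acc, 1 - 0.9 * (1 - overall max acc)] ~ [0.13, 0.95],
    i.e. the random baseline is above 0.0 and perfect accuracy is unreachable.
    """
    p = (acc - low) / (high - low)
    return math.log(p / (1 - p))
```

For example, `adjusted_logit(0.54)` is ~0, since 0.54 is the midpoint of [0.13, 0.95], while the plain logit of 0.54 would be positive.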

Even if this model were true, there may be other additional explanations, e.g. that the improvement on one task is not modeled by one logit function but by several of them. A task would be composed of sub-tasks, each modellable by one logit function. If this makes sense, one could try to model the improvements in all of the tasks using only a small number of logit curves, each associated with a sub-task (decomposing each task into a set of sub-tasks with a simple trend).
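A sketch of the sub-task idea (all midpoints and slopes below are made up for illustration): if a task's accuracy is the average of several logistic curves that "switch on" at different scales, the aggregate curve is smoother and less step-like than a single logit.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def task_accuracy(log_compute, subtasks):
    """Task accuracy as the average of per-sub-task logistic curves.

    Each sub-task is a (midpoint, slope) pair in log-compute space;
    the values are hypothetical, not fitted to PaLM or Gopher data.
    """
    return sum(sigmoid(s * (log_compute - m)) for m, s in subtasks) / len(subtasks)

# Three sub-tasks that switch on at different scales.
subtasks = [(20.0, 2.0), (23.0, 2.0), (26.0, 2.0)]
```

Fitting a small shared set of such sub-task curves across many tasks would be one way to test the decomposition hypothesis.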

(Also, Gopher looks less predictable and its data sparser (no data points in the X0 B parameter range).)
