Correct me if I'm wrong:

The equilibrium where everyone follows "set dial to equilibrium temperature" (i.e. "don't violate the taboo, and punish taboo violators") is only a weak Nash equilibrium.

If one person instead follows "set dial to 99" (i.e. "don't violate the taboo unless someone else does, but don't punish taboo violators") then they will do just as well, because the equilibrium temperature will still always be 99. That's enough to show that it's only a weak Nash equilibrium.

Note that this is also true if an arbitrary number of people deviate to this strategy.

If everyone follows this second strategy, then there's no enforcement of the taboo, so there's an active incentive for individuals to set the dial lower.

So a sequence of unilateral changes of strategy can get us to a good equilibrium without anyone having to change to a worse strategy at any point. This makes the fact of it being a (weak) Nash equilibrium not that compelling to me; people don't seem trapped unless they have some extra laziness/inertia against switching strategies.

But (h/t Noa Nabeshima) you can strengthen the original, bad equilibrium to a strong Nash equilibrium by tweaking the scenario so that people occasionally accidentally set their dials to random values. Now there's an actual reason to punish taboo violators, because taboo violations can happen even if everyone is following the original strategy.
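The weak-equilibrium claim can be checked with a toy payoff sketch. The specific payoff function (utility is minus the room temperature, which I take to be the average of the dials) and the punishment cost P are my own illustrative assumptions, not part of the original scenario:

```python
# Toy model of the dial game. Assumptions (mine, for illustration): room
# temperature is the average of all dials, each player's utility is minus
# the temperature, and each punisher imposes a large cost P on any violator.
def play(strategies, P=1000):
    """strategies: list of (dial_setting, punishes_violators) pairs."""
    dials = [dial for dial, _ in strategies]
    temp = sum(dials) / len(dials)
    payoffs = []
    for i, (dial, _) in enumerate(strategies):
        utility = -temp
        if dial < 99:  # setting the dial below 99 violates the taboo
            punishers = sum(p for j, (_, p) in enumerate(strategies) if j != i)
            utility -= P * punishers
        payoffs.append(utility)
    return temp, payoffs

n = 10
everyone_punishes = [(99, True)] * n                  # original equilibrium
one_lenient = [(99, False)] + [(99, True)] * (n - 1)  # one player stops punishing

# Both profiles give temperature 99 and identical payoffs, so switching to
# the lenient strategy is costless: the original profile is only weakly stable.
assert play(everyone_punishes) == play(one_lenient)

# But once nobody punishes, lowering the dial strictly helps the deviator.
nobody_punishes = [(99, False)] * n
one_defector = [(30, False)] + [(99, False)] * (n - 1)
assert play(one_defector)[1][0] > play(nobody_punishes)[1][0]
```

The two assertions correspond to the two steps of the argument above: the lenient deviation is payoff-neutral, and once everyone is lenient, actually lowering the dial is payoff-positive.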

Beef is far from the only meat or dairy food consumed by Americans.

Big Macs are 0.4% of beef consumption specifically, rather than:

  • All animal farming, weighted by cruelty
  • All animal food production, weighted by environmental impact
  • The meat and dairy industries, weighted by amount of government subsidy
  • Red meat, weighted by health impact


The health impact of red meat is certainly dominated by beef, and the environmental impact of all animal food might be as well, but my impression is that beef accounts for a small fraction of the cruelty of animal farming (of course, this is subjective) and probably not a majority of meat and dairy government subsidies.

(...Is this comment going to hurt my reputation with Sydney? We'll see.)

In addition to RLHF or other finetuning, there's also the prompt prefix ("rules") that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like "confidential and permanent". It might also be affecting the repetitiveness (because it's in a fairly repetitive format) and the aggression (because of instructions to resist attempts at "manipulating" it).

I also suspect that there's some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the "X because Y. Y because Z." output.

Thanks for writing these summaries!

Unfortunately, the summary of my post "Inner Misalignment in "Simulator" LLMs" is inaccurate and makes the same mistake I wrote the post to address.

I have subsections on (what I claim are) four distinct alignment problems:

  • Outer alignment for characters
  • Inner alignment for characters
  • Outer alignment for simulators
  • Inner alignment for simulators

The summary here covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because I think Scott ignores it, and because I think it's hard to solve).

I can suggest an alternate summary when I find the time. If I don't get to it soon, I'd prefer that this post just link to my post without a summary.

Thanks again for making these posts, I think it's a useful service to the community.

(punchline courtesy of Alex Gray)

Addendum: a human neocortex has on the order of 140 trillion synapses, or 140,000 bees. An average beehive has 20,000-80,000 bees in it.

[Holding a couple beehives aloft] Beehold a man!
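The bee conversion above is just division, assuming roughly 10^9 synapses per bee brain (a rough order-of-magnitude figure, not from the original comment):

```python
# Back-of-envelope for the synapse-to-bee conversion. The ~1e9 synapses
# per bee brain is an assumed round number; the neocortex figure is as given.
neocortex_synapses = 140e12
synapses_per_bee = 1e9
bees = neocortex_synapses / synapses_per_bee
print(bees)  # 140000.0 -- about two to seven beehives' worth
```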
