William_Kasper's Shortform

William_Kasper

This is a special post for quick takes by William_Kasper. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

What helped me understand the case against training for Chain of Thought traces

Wald’s bomber problem:

You can only inspect planes that came back, so the bullet holes mark places a plane could be hit and survive. Armor the gaps.

RL-against-CoT is the same shape with the casualties pre-deleted. When gradient pressure is applied to a chain of thought directly (Baker et al.) or via output penalties that spill into the trace (Drori et al.), the surviving CoTs are the ones that produced good-looking outputs without tripping the filter. Traces that would have surfaced the model’s actual computation, but got penalized for it, are gradient-descended out of the policy.
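A toy sketch of the selection effect described above, under loud assumptions: `Trace`, `monitor_flags`, and `survivors` are hypothetical names invented for illustration, and a hard filter stands in for gradient pressure, which in real training is gradual rather than all-or-nothing. The only point is that the monitor's flag rate can drop to zero while the underlying behavior stays in the population.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Trace:
    misbehaves: bool  # does the underlying computation do the bad thing?
    reveals_it: bool  # does the chain of thought say so?


def monitor_flags(trace: Trace) -> bool:
    # Hypothetical CoT monitor: it only sees what the trace reveals.
    return trace.misbehaves and trace.reveals_it


def survivors(population):
    # Stand-in for optimization pressure: flagged traces are removed.
    return [t for t in population if not monitor_flags(t)]


def rate(population, predicate):
    # Fraction of the population for which the predicate holds.
    return sum(predicate(t) for t in population) / len(population)


# One trace of each kind: every combination of (misbehaves, reveals_it).
population = [Trace(m, r) for m, r in product([False, True], repeat=2)]
after = survivors(population)

flag_rate_before = rate(population, monitor_flags)
flag_rate_after = rate(after, monitor_flags)
misbehave_rate_after = rate(after, lambda t: t.misbehaves)

print(flag_rate_before, flag_rate_after, misbehave_rate_after)
```

After filtering, the monitor sees nothing to flag, yet one of the three surviving traces still misbehaves. That trace is the plane that never came back: inspecting the survivors cannot show it.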

In general, alignment proposals that depend on the AI not being smart enough or not applying enough optimization power won't scale.

Hence modeling ASI as AIXI avoids a class of errors.

Not quite; AIXI optimizes for a value in its input stream, while ASIs in general can optimize for a much wider range of utility functions (e.g. diamonds).

That has nothing to do with either your original comment or what I said.