What helped me understand the case against training on chain-of-thought (CoT) traces
Wald’s bomber problem:
You can only inspect planes that came back, so the bullet holes mark places a plane could be hit and survive. Armor the gaps.
RL-against-CoT is the same shape with the casualties pre-deleted. When gradient pressure is applied to a chain of thought directly (Baker et al.) or via output penalties that spill into the trace (Drori et al.), the surviving CoTs are the ones that produced good-looking outputs without tripping the filter. Traces that would have surfaced the model’s actual computation, but got penalized for it, are gradient-descended out of the policy.
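A toy sketch of that selection effect, in case it helps. It is not the setup from Baker et al. or Drori et al.: `Rollout`, `monitor_flags`, and the four-rollout population are all made up for illustration, and a hard filter stands in for gradient pressure. The assumption doing the work is that the monitor can only catch a hack the trace admits to.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Rollout:
    hacks: bool   # hypothetical label: the output exploits the reward
    candid: bool  # hypothetical label: the CoT says what the model is doing


def monitor_flags(r: Rollout) -> bool:
    """Toy CoT monitor: it only catches a hack that the trace admits to."""
    return r.hacks and r.candid


def hack_rate(rollouts):
    return sum(r.hacks for r in rollouts) / len(rollouts)


def flag_rate(rollouts):
    return sum(monitor_flags(r) for r in rollouts) / len(rollouts)


# One rollout per (hacks, candid) combination: four in total.
population = [Rollout(h, c) for h, c in product([False, True], repeat=2)]

# Stand-in for optimization pressure: anything the monitor flags is dropped.
survivors = [r for r in population if not monitor_flags(r)]

print(f"before: hack rate {hack_rate(population):.2f}, "
      f"flag rate {flag_rate(population):.2f}")
print(f"after:  hack rate {hack_rate(survivors):.2f}, "
      f"flag rate {flag_rate(survivors):.2f}")
```

After filtering, the monitor reads zero, but one of the three survivors still hacks. The only thing removed was the rollout that hacked and said so, which is the plane that did not come back.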
In general, alignment proposals that depend on the AI not being smart enough or not applying enough optimization power won't scale.
Not quite; AIXI optimizes for a value in its input stream, while ASIs in general can optimize for a much wider range of utility functions (e.g. maximizing diamonds).