The strategies are manually reviewed, and clear prompt-injection attempts are rejected.
I think the program approach proved unworkable. It is simply too difficult to write a program that can effectively analyze another program when the target program itself has to be complex enough to analyze other programs.
I've actually never heard of diffusion for planning. Do you have a reference?
A diffusion model for text generation (like Diffusion-LM) is still trained with the objective of producing text from the training distribution, and it optimizes only over the current episode, which in this case is a short text.
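Concretely (using standard DDPM notation, and glossing over the embedding and rounding terms that Diffusion-LM adds on top), the training loss is roughly the simplified denoising objective over a single sampled sequence:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0, I)}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\; t\big)\big\|^2\Big]$$

where $x_0$ is the embedded text drawn from the training distribution. Nothing in this objective rewards anything beyond reconstructing that one sequence.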
Well-argued throughout, but I want to focus on the first sentence:
Can advocates of the more pessimistic safety view find common ground on this point?
I often see statements like “We have no idea how to align AI,” sometimes accompanied by examples of alignment failures. But these seem either to boil down to the claim that LLMs are not perfectly aligned, or to be contradicted by the day-to-day experience of actually using them.
I also wish pessimists would more directly engage with a key idea underlying the sections on “Misaligned personas” and “Misalignment from long-horizon RL.” Specifically:
I would find it helpful to understand the mechanism that pessimists envision here.