Question for the alignment crowd:
The new Absolute Zero Reasoner paper proposes a "self‑play RL with zero external data" paradigm, in which the same model both invents tasks and learns to solve them. It reaches SOTA coding and math scores without ever touching a human‑curated dataset. But the authors also flag a few "uh‑oh moments": e.g., their 8B Llama variant generated a chain‑of‑thought urging itself to "outsmart … intelligent machines and less intelligent humans," and they explicitly list "lingering safety concerns" as an open problem, noting that the system "still necessitates oversight."
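For intuition about the mechanics, here's a minimal, purely illustrative Python sketch of the propose/solve loop as I read the paper: one policy alternates between proposing code tasks (validated by a program executor, which supplies the only ground truth) and solving them, with the proposer rewarded for tasks of intermediate difficulty. Every function body below is a hypothetical stub of mine, not the paper's implementation:

```python
import random
import statistics

# Toy stand-ins for the single policy's two roles. In the paper one model
# plays both proposer and solver; here they are hypothetical stubs.
def propose_task(buffer):
    """Proposer: condition on past tasks and emit a new (program, input) pair."""
    ref = random.choice(buffer)
    return {"program": ref["program"], "input": random.randint(0, 9)}

def solve_task(task):
    """Solver: predict the program's output (stubbed as a random guess)."""
    return random.randint(0, 20)

def run_program(task):
    """Code executor as the environment: ground truth comes from running
    the proposed program, not from any human label."""
    return eval(task["program"])(task["input"])  # toy, trusted programs only

def propose_reward(solve_rate):
    """Learnability-style reward: zero for tasks the solver always or never
    solves, otherwise higher for harder tasks (roughly 1 - solve rate)."""
    return 0.0 if solve_rate in (0.0, 1.0) else 1.0 - solve_rate

buffer = [{"program": "lambda x: x * 2", "input": 3}]  # minimal seed task
for step in range(5):
    task = propose_task(buffer)
    truth = run_program(task)
    answers = [solve_task(task) for _ in range(8)]       # Monte Carlo rollouts
    solve_rate = statistics.mean(a == truth for a in answers)
    r_propose = propose_reward(solve_rate)
    r_solve = [float(a == truth) for a in answers]       # binary solver reward
    # ...both reward streams would drive one policy-gradient update here...
    buffer.append(task)  # verified tasks expand the model's own curriculum
    print(f"step={step} solve_rate={solve_rate:.2f} r_propose={r_propose:.2f}")
```

The line that matters for alignment is the `buffer.append(task)` at the bottom: every accepted proposal becomes training distribution, so the curriculum is downstream of the model's own generations.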
My question: How should alignment researchers think about a learner that autonomously expands its own task distribution? Traditional RLHF/RLAIF assumes a fixed environment and a fixed task distribution; here the training distribution is itself an output of the policy being trained, so those guarantees don't obviously carry over.
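To make the oversight question concrete with the toy sketch above: since the curriculum buffer is the only "dataset" that exists, oversight presumably has to attach to the propose step rather than to final outputs. A hypothetical gate (`vet_task` is my invention, not anything from the paper):

```python
def vet_task(task) -> bool:
    """Hypothetical oversight hook: refuse to let a proposed task enter
    the curriculum if it trips a crude, static safety check."""
    banned = ("__import__", "subprocess", "open(", "exec(")
    return not any(tok in task["program"] for tok in banned)

# In the loop above, the buffer would then only grow with vetted tasks:
#     if vet_task(task):
#         buffer.append(task)
```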