alapmi

Absolute Zero: Alpha Zero for LLM

May 11, 2025

Question for the alignment crowd:

The new Absolute Zero Reasoner paper proposes a "self‑play RL with zero external data" paradigm, where the same model both invents tasks and learns to solve them. It reaches SOTA coding + math scores without ever touching a human‑curated dataset. But the authors also flag a few "uh‑oh moments": e.g. their 8‑B Llama variant generated a chain‑of‑thought urging it to "outsmart … intelligent machines and less intelligent humans," and they explicitly list "lingering safety concerns" as an open problem, noting that the system "still necessitates oversight."
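To make the "same model invents tasks and solves them" loop concrete, here is a toy sketch. It is not the paper's implementation: every name (`propose_task`, `execute`, `solve`, `proposer_reward`) is hypothetical, the "solver" is a coin flip standing in for a model, and no learning takes place. The proposer reward shown (zero for tasks that are never or always solved, higher for harder solvable tasks) is an illustrative assumption, not something stated in this post.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical task space: a tiny arithmetic "program" (op, constant) plus an input.
OPS: List[Tuple[str, Callable[[int, int], int]]] = [
    ("add", lambda x, k: x + k),
    ("mul", lambda x, k: x * k),
    ("sub", lambda x, k: x - k),
]

Task = Tuple[str, int, int]  # (operation name, constant k, input x)


def propose_task(rng: random.Random) -> Task:
    """Proposer role: invent a task with no external dataset."""
    name, _ = rng.choice(OPS)
    return name, rng.randint(1, 9), rng.randint(0, 20)


def execute(task: Task) -> int:
    """Verifier: run the program to obtain the ground-truth answer."""
    name, k, x = task
    return dict(OPS)[name](x, k)


def solve(task: Task, skill: float, rng: random.Random) -> int:
    """Stand-in solver: correct with probability `skill`, otherwise off by one."""
    truth = execute(task)
    return truth if rng.random() < skill else truth + 1


def solve_rate(task: Task, skill: float, rng: random.Random, n_samples: int = 8) -> float:
    """Fraction of sampled attempts that match the executed answer."""
    truth = execute(task)
    hits = sum(solve(task, skill, rng) == truth for _ in range(n_samples))
    return hits / n_samples


def proposer_reward(rate: float) -> float:
    """Assumed reward: zero for trivial or impossible tasks, else 1 - solve rate."""
    return 0.0 if rate in (0.0, 1.0) else 1.0 - rate


def self_play_round(skill: float, rng: random.Random, n_tasks: int = 4):
    """One round: propose tasks, score the solver, score the proposer."""
    results = []
    for _ in range(n_tasks):
        task = propose_task(rng)
        rate = solve_rate(task, skill, rng)
        results.append((task, rate, proposer_reward(rate)))
    return results
```

The relevant point for the question is that the training distribution here is whatever `propose_task` emits. In a real system that function would be the learner itself, so the distribution shifts as the learner changes.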

My question: How should alignment researchers think about a learner that autonomously expands its own task distribution? Traditional RLHF/RLAIF assumes a fixed environment and...
