Kibidango — LessWrong

Iterated Distillation and Amplification

Kibidango · 8y

> AGZ’s policy network $p$ is the learned model.

I found this bit slightly confusing. As far as I understand from the AGZ Nature paper, AGZ does not have a separate policy network $p$. Instead, it uses a single network $f_\theta$ that outputs both the learned policy $p$ and $v$, the estimated probability that the current player will win the game. Is this what the sentence is referring to?
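
To make the distinction concrete, here is a minimal toy sketch of what "one network, two outputs" means. It is not AGZ's actual architecture: the layer sizes, the random initialisation, the helper names (`make_params`, `matvec`, `f_theta`), and the choice to squash $v$ with tanh are all assumptions made for illustration. The only point is that $p$ and $v$ are computed from the same shared hidden representation.

```python
import math
import random


def make_params(n_inputs, n_hidden, n_moves, seed=0):
    """Randomly initialise the parameters theta of a toy two-headed network."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "trunk": matrix(n_hidden, n_inputs),       # shared by both heads
        "policy_head": matrix(n_moves, n_hidden),  # produces move logits
        "value_head": matrix(1, n_hidden),         # produces one scalar
    }


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def f_theta(theta, state):
    """Return (p, v) for one state, both computed from a shared trunk."""
    hidden = [math.tanh(h) for h in matvec(theta["trunk"], state)]

    # Policy head: numerically stable softmax over the move logits.
    logits = matvec(theta["policy_head"], hidden)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    p = [e / total for e in exps]

    # Value head: a single scalar, squashed into [-1, 1] (assumed here).
    v = math.tanh(matvec(theta["value_head"], hidden)[0])
    return p, v


theta = make_params(n_inputs=4, n_hidden=8, n_moves=3)
p, v = f_theta(theta, [1.0, 0.0, -1.0, 0.5])
```

Under this reading, "the policy network $p$" would be shorthand for the policy output of $f_\theta$ rather than a second, separately trained model.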