Interesting research! I've been trying to reproduce some of the results locally, currently the MMLU question-answering + hints setup, but I'm confused about what was done for the judge. The repo includes a judge SFT dataset and a script to build it, yet neither the paper nor the code seems to use a finetuned judge.
I was also not receiving rollouts or logs in wandb during training, running on a single GPU with multi_gpu=none. Sorry if I'm missing something. Update: this turned out to be insufficient VRAM; more than 32 GB is needed for mmlu-easy-hints without mindface.