Surprised no one posted about this from Anthropic, NYU and Uni of Sussex yet:
This conditioning is very reminiscent of the Decision Transformer, where scalar return-to-go tokens are fed to the model alongside states and actions, so that a desired return can be specified at inference time. I believe CICERO also does something similar, conditioning on Elo scores during dialogue generation training.
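As a rough illustration of the Decision Transformer comparison (the function and names here are my own sketch, not any paper's actual implementation), the return-to-go is interleaved with states and actions so the model learns to produce actions consistent with the specified return:

```python
# Illustrative sketch only: interleave (return-to-go, state, action) triples
# into a single sequence, decision-transformer style. At inference time you
# would prompt with a high return-to-go to elicit high-reward behaviour.

def build_dt_sequence(returns_to_go, states, actions):
    """Interleave return-to-go, state, and action tokens per timestep."""
    seq = []
    for r, s, a in zip(returns_to_go, states, actions):
        seq.extend([("R", r), ("s", s), ("a", a)])
    return seq

seq = build_dt_sequence([3.0, 2.0], ["s0", "s1"], ["a0", "a1"])
# One (R, s, a) triple per timestep, with R decreasing as reward is collected.
```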
From a discussion with James Chua on AISS's slack, we noted similarities between this work and Charlie Steiner's "Take 13: RLHF bad, conditioning good". James is developing a library ("conditionme") specifically for rating-conditioned language modelling and was looking for feedback, which is what prompted the discussion. We figured a natural piece of future work here is extending the conditioning from the discrete <good> vs <bad> tokens to scalar rewards. As James pointed out, this requires some caution with the tokenizer (a numeral like "0.87" can be split into several subword tokens, giving the model an inconsistent view of the rating), which he hopes to address in part with conditionme.
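To make the discrete-vs-scalar distinction concrete, here is a minimal sketch (my own illustration; the function names and formats are hypothetical, not conditionme's API) of the two ways of prefixing a training example:

```python
# Hypothetical sketch of rating-conditioned training examples.
# Not conditionme's API -- just an illustration of the two schemes.

def make_discrete_example(text: str, reward: float, threshold: float = 0.5) -> str:
    """Prepend a single discrete control token based on a reward threshold."""
    tag = "<good>" if reward >= threshold else "<bad>"
    return f"{tag} {text}"

def make_scalar_example(text: str, reward: float) -> str:
    """Naively prepend the raw scalar. This is where the tokenizer caution
    comes in: a string like '0.87' is typically split into several subword
    tokens, so the rating is not represented as one consistent token."""
    return f"{reward:.2f} {text}"

print(make_discrete_example("The movie was great.", 0.9))
print(make_scalar_example("The movie was great.", 0.87))
```

The discrete scheme only needs two special tokens added to the vocabulary, while the scalar scheme needs either a dedicated numeric embedding or careful handling of how numbers tokenize.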