Seems cool to me. I don't totally understand what's going on with the "embedding" of the score, but presumably this way works well for DTs.
For DTs, it's really just a linear function that converts the scalar reward into the same dimension as the token embeddings.
So, e.g., a single token's embedding has a hidden state of size 1024.
We can learn a linear function that takes this scalar and outputs something of size 1024.
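A minimal sketch of that scalar-to-embedding map, in plain Python so it runs without any dependencies. The names `make_reward_projection` and `embed_reward` are made up for illustration and are not the library's API. The weights are random here, whereas in a real model they would be learned (in PyTorch this would correspond roughly to `torch.nn.Linear(1, 1024)`).

```python
import random

HIDDEN_SIZE = 1024  # the example hidden size mentioned above


def make_reward_projection(hidden_size, seed=0):
    """Return (weights, biases) for a linear map from one scalar to hidden_size values.

    Hypothetical helper: random init stands in for parameters that would be learned.
    """
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
    biases = [0.0] * hidden_size
    return weights, biases


def embed_reward(reward, weights, biases):
    """Apply y_i = w_i * reward + b_i, giving one vector shaped like a token embedding."""
    return [w * reward + b for w, b in zip(weights, biases)]


weights, biases = make_reward_projection(HIDDEN_SIZE)
reward_embedding = embed_reward(0.7, weights, biases)
print(len(reward_embedding))  # 1024
```

The resulting vector can then sit in the input sequence alongside the ordinary token embeddings.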
The more annoying (PITA) part was offsetting the positions, attention masks, and labels for this.
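To make the offsetting concrete, here is a rough sketch of the bookkeeping. It assumes the reward embedding is inserted as one extra slot at position 0 and that the batch is right-padded; both are assumptions for this example, not details confirmed here. `add_reward_slot` is a hypothetical name. The `-100` label is the usual "ignore" value for cross-entropy loss in PyTorch and Hugging Face models.

```python
IGNORE_INDEX = -100  # label value conventionally skipped by cross-entropy loss


def add_reward_slot(attention_mask, labels):
    """Extend per-token lists by one leading slot for a prepended reward embedding.

    Assumes the reward vector goes at position 0 (a choice made for this sketch).
    """
    new_attention_mask = [1] + list(attention_mask)  # the reward slot is always attended to
    new_labels = [IGNORE_INDEX] + list(labels)       # nothing to predict at the reward slot
    new_position_ids = list(range(len(new_attention_mask)))  # every token shifts right by one
    return new_attention_mask, new_position_ids, new_labels


mask, positions, labels = add_reward_slot(
    attention_mask=[1, 1, 1, 0],
    labels=[11, 12, 13, IGNORE_INDEX],
)
print(mask)       # [1, 1, 1, 1, 0]
print(positions)  # [0, 1, 2, 3, 4]
print(labels)     # [-100, 11, 12, 13, -100]
```

Forgetting any one of the three shifts leaves the sequence lengths mismatched or trains the model to predict a token at the reward slot.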
There are a couple of discussions regarding conditioning / decision transformer training. "Conditioning" here means placing your reward model's reward in the prompt. We then train our language model to create a completion that follows the specified reward.
See "Safety considerations for online generative modeling", "Soft optimization makes the value target bigger", and "RLHF bad, conditioning good".
The TL;DR is that training models this way could have safety benefits.
I've created a library so that we can extend a pre-trained LLM (GPT-2, GPT-J) to work with conditioning on scalar rewards. This saves researchers time by removing the need to modify attention masks, positions, and labels themselves. For example, researchers can retrain GPT-2 to replicate OpenAI's summarization RLHF paper, but by relying purely on conditioning. I created it because I couldn't find an existing library that did so.
Note: an easier way of conditioning would be to use discrete tokens. "Pretraining Language Models with Human Preferences" implements conditioning via discrete <|good|> and <|bad|> tokens.
However, using scalar rather than discrete rewards could have the following benefits:
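For comparison, a rough sketch of what discrete-token conditioning can look like when preparing training text. The thresholding rule, the default threshold of 0.0, and the name `tag_with_control_token` are assumptions for illustration; the paper's exact preprocessing may differ.

```python
GOOD_TOKEN = "<|good|>"
BAD_TOKEN = "<|bad|>"


def tag_with_control_token(text, reward, threshold=0.0):
    """Prefix text with a discrete control token chosen by thresholding the reward.

    Hypothetical helper: the threshold is an assumed hyperparameter.
    """
    token = GOOD_TOKEN if reward >= threshold else BAD_TOKEN
    return token + text


print(tag_with_control_token("The summary is accurate.", reward=0.9))
# <|good|>The summary is accurate.
```

At generation time, one would presumably prompt with the good token to steer toward high-reward completions. The cost is that all rewards on one side of the threshold collapse to the same token.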
While the library is still in early development, it can already be used for offline training of GPT-2.
I'm writing this post earlier rather than later to: