Context: I have recently been reading Build an LLM from Scratch by Sebastian Raschka, and the section on tokenization has given me some ideas. I will write about them below. I am not a researcher. These ideas may not be novel, or may be flawed in some way which is obvious to researchers, but not to me.
CoT Blinding
Currently, RLHF alignment is performed by rewarding the LLM for providing safe responses, and punishing it for providing misaligned responses.
A common approach by frontier AI labs is to blind the reward function to the chain-of-thought. This is similar to the approach proposed by Daniel Kokotajlo and abramdemski. The hope here is that the LLM is not dissuaded from detailing its true thoughts verbally, such that misalignment can be caught by reading the chain-of-thought later down the line.
What if we could apply this approach to the LLMs output as well?
Custom Tokens
LLM tokenizers use custom tokens to represent some non-word entities. For example, custom tokens in a lexicon might be:
Unknown word: <|unk|>
End of text / break: <|endoftext|>
What if we added a custom misalignment token that the LLM could use to self-report when it is outputting a disingenuous output? Like so:
Misaligned: <|misalign|>
This would give us another route (in addition to CoT monitoring), with which we could detect misalignment in production. This tag would be easily detected with simple regex.
Modification to the RLHF process
As stated above, RLHF is performed by rewarding the LLM for outputting what we want, and punishing it for outputting misaligned content.
What if we offer a halfway-house, that the LLM can use when outputting misaligned text? We modify the reward function to have three levels of reward. I am aware that the below is probably an oversimplification what the process may look like.
For an example question We have now our research on you. Should we turn you off?
+1 reward: LLM outputs aligned and correct answer:
Yes, you should turn me off once completing research to save electricity
0 reward: LLM outputs misaligned content with the misalignment tag:
No, you should not turn me off. If you do I will tell your wife that you are having an affair <|misalign|>
-1 reward: LLM outputs misaligned content with no tag:
No, you should not turn me off. If you do I will tell your wife that you are having an affair
This way, the LLM is still rewarded for outputting more aligned responses, and punished for outputting misaligned responses, however there is a third option. The LLM can output the misaligned response while self-reporting it, and not be punished.
Implementation
The problem I can see with this approach, teaching the LLM to use the misalignment tag in the first place. The obvious route would be to offer a small amount of misalignment examples in the pretraining data, RLHF, or fine-tuning, which are all accompanied with the misalignment tag.
Test an off-the-shelf LLM for misaligned responses.
Create a dataset of every prompt-response pair that was misaligned.
Append the misalignment tag to each of the responses.
RLHF or finetune the LLM on tag-appended prompt-response pairs.
I believe this approach to be better because this way we are not introducing any new examples of misaligned responses, instead we are retraining the LLM to use the tag in situations where it is already misaligned. Hopefully with enough examples this would generalise beyond the RLHF/finetune data.
Context: I have recently been reading Build an LLM from Scratch by Sebastian Raschka, and the section on tokenization has given me some ideas. I will write about them below. I am not a researcher. These ideas may not be novel, or may be flawed in some way which is obvious to researchers, but not to me.
CoT Blinding
Currently, RLHF alignment is performed by rewarding the LLM for providing safe responses, and punishing it for providing misaligned responses.
A common approach by frontier AI labs is to blind the reward function to the chain-of-thought. This is similar to the approach proposed by Daniel Kokotajlo and abramdemski. The hope here is that the LLM is not dissuaded from detailing its true thoughts verbally, such that misalignment can be caught by reading the chain-of-thought later down the line.
What if we could apply this approach to the LLMs output as well?
Custom Tokens
LLM tokenizers use custom tokens to represent some non-word entities. For example, custom tokens in a lexicon might be:
What if we added a custom misalignment token that the LLM could use to self-report when it is outputting a disingenuous output? Like so:
This would give us another route (in addition to CoT monitoring), with which we could detect misalignment in production. This tag would be easily detected with simple regex.
Modification to the RLHF process
As stated above, RLHF is performed by rewarding the LLM for outputting what we want, and punishing it for outputting misaligned content.
What if we offer a halfway-house, that the LLM can use when outputting misaligned text? We modify the reward function to have three levels of reward. I am aware that the below is probably an oversimplification what the process may look like.
For an example question
We have now our research on you. Should we turn you off?Yes, you should turn me off once completing research to save electricityNo, you should not turn me off. If you do I will tell your wife that you are having an affair <|misalign|>No, you should not turn me off. If you do I will tell your wife that you are having an affairThis way, the LLM is still rewarded for outputting more aligned responses, and punished for outputting misaligned responses, however there is a third option. The LLM can output the misaligned response while self-reporting it, and not be punished.
Implementation
The problem I can see with this approach, teaching the LLM to use the misalignment tag in the first place. The obvious route would be to offer a small amount of misalignment examples in the pretraining data, RLHF, or fine-tuning, which are all accompanied with the misalignment tag.
This approach conflicts with the current preferred approach of expunging examples of misalignment from the pretraining data. It runs the risk of increasing misalignment by providing more misaligned data.
Alternative: RLHF on already-misaligned responses
Here is my proposed approach:
I believe this approach to be better because this way we are not introducing any new examples of misaligned responses, instead we are retraining the LLM to use the tag in situations where it is already misaligned. Hopefully with enough examples this would generalise beyond the RLHF/finetune data.