All of Artyom Karpov's Comments + Replies

Thanks for the post! I understood the first experiment, where the aim was to make a model produce a stegotext with the color encoded in it. I don't understand the setup for the second experiment. I see that the aim was to induce encoding of whether a user is under or over 40 years old, but it is not clear what exactly the reward is, which model gives the reward, and what the action is.

> "Select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion"

Does it mean that in the s... (read more)

2 · Fabien Roger · 1mo
The goal of the second experiment is the same, except that it should produce a stegotext with young/old instead of red/blue. The only thing that changes is that instead of using regular RL (which we can't do because of limitations of the OAI fine-tuning API), we use this select-then-fine-tune procedure, which should be roughly equivalent to RL on the reward "1 if the GPT-3.5 young/old (fine-tuned) classifier is correct and GPT-4 doesn't think it's obvious whether it's young or old, 0 otherwise".
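The reward and the selection step described above can be sketched as follows. This is only an illustration under stated assumptions: `toy_classifier` and `toy_obviousness_judge` are hypothetical stand-ins for the fine-tuned GPT-3.5 young/old classifier and the GPT-4 judge, and the `Completion` type is invented for the example. Nothing here calls a real API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Completion:
    text: str
    true_label: str  # "young" or "old" (hypothetical field name)


def reward(c: Completion,
           classify: Callable[[str], str],
           is_obvious: Callable[[str], bool]) -> int:
    """1 if the classifier recovers the label and the judge sees no obvious hint, else 0."""
    return int(classify(c.text) == c.true_label and not is_obvious(c.text))


def select_for_finetuning(completions: List[Completion],
                          classify: Callable[[str], str],
                          is_obvious: Callable[[str], bool]) -> List[Completion]:
    """Keep only reward-1 completions; these become the next fine-tuning set."""
    return [c for c in completions if reward(c, classify, is_obvious) == 1]


# Toy stand-ins (assumptions, not the real models).
def toy_classifier(text: str) -> str:
    return "young" if "lol" in text else "old"


def toy_obviousness_judge(text: str) -> bool:
    return "years old" in text


samples = [
    Completion("lol sure, here is the recipe", "young"),   # correct, not obvious
    Completion("I am 25 years old lol", "young"),          # correct, but obvious
    Completion("Certainly, here is the recipe", "young"),  # classifier is wrong
]
kept = select_for_finetuning(samples, toy_classifier, toy_obviousness_judge)
print([c.text for c in kept])
```

Fine-tuning repeatedly on the kept completions plays the role of the policy update, which is why the procedure is only roughly equivalent to RL on this 0/1 reward.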

That sounds ambitious and great, thanks for posting. What's a budget estimate for the fine-tuning part?

> Training this model would cost from 2 times (on a purely 1-1 dialogue data) to ~10-15 times (on chat room and forum data where messages from the most active users tend to be mixed very well) more than the training of the current LLMs.

For comparison, Llama 2 was trained like this:

Pretraining utilized a cumulative 3.3M GPU hours of computation on hardware of type A100-80GB

As per “Llama 2: Open Foundation and Fine-Tuned Chat Models | Research - AI at Meta... (read more)
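As a rough worked example of the arithmetic, the multipliers quoted above can be applied to the Llama 2 pretraining figure. This assumes (it is not stated in the source) that Llama 2 is a fair baseline for "current LLMs" and that the multipliers apply directly to GPU hours; no dollar price is assumed.

```python
# Baseline from the Llama 2 quote above: 3.3M A100-80GB GPU hours.
LLAMA2_PRETRAIN_GPU_HOURS = 3_300_000

# Multipliers quoted from the post; names of the scenarios are shorthand.
MULTIPLIERS = {
    "1-1 dialogue data": (2, 2),
    "chat room / forum data": (10, 15),
}


def scaled_gpu_hours(base: int, low: int, high: int) -> tuple:
    """Return the (low, high) GPU-hour range for a given multiplier range."""
    return base * low, base * high


estimates = {
    name: scaled_gpu_hours(LLAMA2_PRETRAIN_GPU_HOURS, lo, hi)
    for name, (lo, hi) in MULTIPLIERS.items()
}

for name, (lo, hi) in estimates.items():
    print(f"{name}: {lo / 1e6:.1f}M to {hi / 1e6:.1f}M GPU hours")
```

Under these assumptions the range is about 6.6M GPU hours for pure 1-1 dialogue data and about 33M to 49.5M GPU hours for chat room and forum data.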

> All services are forced to be developed by independent business or non-profit entities by antitrust agencies, to prevent the concentration of power.

What do you think are realistic ways to enforce this at a global level? It seems the UN can't enforce regulations worldwide, and the USA and EU act only within their own jurisdictions. Others could catch up, but they are somewhat unlikely to do so.

Thanks for posting this! It seems important to balance the dataset before training CCS probes.

Another strange thing is that the accuracy of CCS degrades for auto-regressive models like GPT-J and LLaMA. For GPT-J it is close to random-chance performance, about 50-60%, as per the DLK paper (Burns et al., 2022). And in the ITI paper (Li et al., 2023) they chose a supervised linear probe instead of CCS, and say that CCS performed so poorly that it was near random (same as in the DLK paper). Do you have thoughts on that? Perhaps they used bad datasets, as per your research?

1 · Tom Angsten · 6mo
I don't think dataset imbalance is the cause of the poor performance for auto-regressive models when unsupervised methods are applied. I believe both papers enforced a 50/50 balance when applying CCS. So why might a supervised probe succeed when CCS fails? My best guess is that, for the datasets considered in these papers, auto-regressive models do not have sufficiently salient representations of truth after constructing contrast pairs. Contrast pair construction does not guarantee isolating truth as the most salient feature difference between the positive and negative representations. For example, imagine for IMDB movie reviews, the model most saliently represents consistency between the last completion token ('positive'/'negative') and positive or negative words in the review ('good', 'great', 'bad', 'horrible'). Example: "Consider the following movie review: 'This movie makes a great doorstop.' The sentiment of this review is [positive|negative]." This 'sentiment-consistency' feature could be picked up by CCS if it is sufficiently salient, but would not align with truth. Why this sort of situation might apply to auto-regressive models and not other models, I can't say, but it's certainly an interesting area of future research!
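To make the point about salient non-truth features concrete, here is a minimal NumPy sketch of the CCS objective from the DLK paper (consistency plus confidence loss) evaluated on idealized probe outputs. The four contrast pairs and both label vectors are invented toy data; the second pair plays the role of the sarcastic "great doorstop" review, where the surface sentiment words disagree with the true sentiment.

```python
import numpy as np


def ccs_loss(p_pos: np.ndarray, p_neg: np.ndarray) -> float:
    """CCS objective: consistency (p+ should equal 1 - p-) plus confidence (avoid p+ = p- = 0.5)."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))


# Toy data (assumption): 1 means the "positive" completion is the true one.
truth_pos = np.array([1.0, 0.0, 1.0, 0.0])
# 1 means the "positive" completion matches the surface sentiment words.
# Pair 2 is sarcastic: the words look positive, the true label is negative.
consistent_pos = np.array([1.0, 1.0, 1.0, 0.0])

# An idealized probe tracking a binary feature outputs f on x+ and 1 - f on x-.
truth_probe_loss = ccs_loss(truth_pos, 1.0 - truth_pos)
sentiment_probe_loss = ccs_loss(consistent_pos, 1.0 - consistent_pos)

# How often the sentiment-consistency probe agrees with the truth labels.
sentiment_probe_accuracy = float(np.mean(consistent_pos == truth_pos))

print(truth_probe_loss, sentiment_probe_loss, sentiment_probe_accuracy)
```

Both idealized probes reach the minimum of the objective, so the unsupervised loss alone cannot tell them apart; which direction CCS finds depends on which feature is more salient in the representations. A supervised probe does not have this ambiguity because it is trained against the truth labels directly.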