Inducing human-like biases in moral reasoning LMs
Meta. This post is less polished than we would ideally prefer. However, we still think it is reasonable to publish it as is, to avoid further delays. We are open to answering questions and to feedback in the comments.

TL;DR. This post presents an inconclusive attempt at a proof of concept that fMRI data from human brains can help improve moral reasoning in large language models. Code is available at https://github.com/ajmeek/Inducing-human-like-biases-in-moral-reasoning-LLMs.

Introduction

Our initial motivation was to create a proof of concept of applying an alignment research agenda we are particularly interested in, based on neuroconnectionism and brain-LM similarities (and their relevance for alignment): 'neuroconnectionism as a general research programme centered around ANNs as a computational language for expressing falsifiable theories about brain computation'. Moral reasoning is an interesting application area, both for its relevance to AI alignment and because of the availability of public neuroimaging data, as well as of publicly available LMs fine-tuned for moral reasoning.

During the last few years, a series of high-profile papers have shown that LMs partially converge towards brain-like solutions and share fundamental computational principles with humans, making them a 'biologically feasible computational framework for studying the neural basis of language'. For more (recent) evidence of apparent convergence towards human-like representations, see also Scaling laws for language encoding models in fMRI and Large Language Models Converge on Brain-Like Word Representations.

To the best of our knowledge, though, potential LM-brain similarities for linguistic inputs rich in morally relevant content (e.g. moral scenarios) have not been explored previously. Nor has anyone tried to improve LM moral reasoning using moral reasoning neuroimaging datasets (though similar ideas have been explored for LMs more broadly and e.g. for Convolutional Neural Networks