TL;DR: If you’re presenting a classifier that detects misalignment and providing metrics for it, please (1) report the TPR at FPR = 0.001, 0.01, and 0.05, and (2) plot the ROC curve on a log-log scale. See https://arxiv.org/abs/2112.03570 for more context on why you might want to do this. ML Background (If all...
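As a rough illustration of the first request, here is a minimal sketch of computing TPR at fixed FPR operating points. The function name `tpr_at_fpr`, the "higher score means more suspicious" convention, and the strict `>` decision rule are assumptions made for this example; they do not come from the post or the linked paper.

```python
import math


def tpr_at_fpr(scores, labels, target_fprs=(0.001, 0.01, 0.05)):
    """Return {target_fpr: tpr} for a score-based binary classifier.

    Assumptions (hypothetical, for illustration only):
      - higher score = more likely positive (e.g. "misaligned")
      - an example is flagged when score > threshold (strict)
      - labels are truthy for positives, falsy for negatives
    """
    # Negative-class scores, sorted from highest to lowest.
    neg = sorted((s for s, y in zip(scores, labels) if not y), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative example")

    result = {}
    for fpr in target_fprs:
        # Number of false positives we can afford at this FPR.
        # The small tolerance guards against float rounding in fpr * len(neg).
        k = math.floor(fpr * len(neg) + 1e-9)
        # With a strict '>' rule, using the (k+1)-th highest negative score
        # as the threshold flags at most k negatives.
        threshold = neg[k] if k < len(neg) else float("-inf")
        result[fpr] = sum(s > threshold for s in pos) / len(pos)
    return result


if __name__ == "__main__":
    # Tiny toy data set; real use at FPR=0.001 needs well over 1,000 negatives.
    toy_scores = [0.1, 0.2, 0.3, 0.4, 0.15, 0.35, 0.45, 0.5]
    toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
    print(tpr_at_fpr(toy_scores, toy_labels, target_fprs=(0.0, 0.25, 0.5)))
```

Note that with fewer than 1/FPR negative examples the allowed false-positive count rounds down to zero, so the reported TPR at FPR = 0.001 is only meaningful with a large negative set. For the second request, passing the (FPR, TPR) pairs to matplotlib's `loglog` is one option.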
I have an intuition, and I may be heterodox here, that LLMs on their own are not sufficient, no matter how powerful and knowledgeable they get. Put differently, the reasons that powerful LLMs are profoundly unsafe are primarily social: e.g. they will be hooked up to the internet to make...
I posted in the open thread and was told that it would be worth promoting to top level. cubefox responded with a link to a great explanation of how the fine-tuning is done, which made me realize that my original question was unclear, so I'm going to try to clarify....
There isn't a lot of talk about image models (e.g. DALL-E and Stable Diffusion) on LW in the context of alignment, especially compared to LLMs. Why is that? Some hypotheses: * LLMs just happened to get some traction early, and due to network effects, they are the primary research vehicle *...