Another way that the ratings might be biased towards Leela is that people might resign more readily than normal, since games are unrated and Leela will always accept a rematch. This was certainly true for me.
Cool! @Eric Neyman, that might be worth noting in the post so that folks make sure to use that link.
Have you thought about whether people donating should also communicate with the campaign somehow, e.g. emailing them and saying why you donated?
$1M / $7k is only ~143, so a significant fraction of the major donors might be AI safety motivated, and at minimum it seems good for the campaign to be aware of this if they aren't already. Would it make sense for donors to coordinate around this somehow?
(My impression is that people/groups donating large amounts to politicians get some benefit beyond just helping a candidate they support, but I don't know how it usually works.)
(Comment copied from the Bores post)
Have you thought about whether people donating should also communicate with the campaign somehow, e.g. emailing them and saying why you donated?
$1M / $7k is only ~143, so a significant fraction of the major donors might be AI safety motivated, and at minimum it seems good for the campaign to be aware of this if they aren't already. Would it make sense for donors to coordinate around this somehow?
(My impression is that people/groups donating large amounts to politicians get some benefit beyond just helping a candidate they support, but I don't know how it usually works.)
Issuing a trillion-dollar coin doesn't seem nearly as bad as any of the others in its bullet. Isn't it just an accounting gimmick roughly equivalent to raising the debt ceiling by $1 trillion?
You might be asking yourself: won't workers become lazy since the profit is shared with their colleagues, which means they only get a small proportion of the fruits of their individual labor?
In traditional corporations, don't workers have even less stake in the company than in coops?
What do you think of the risk that control backfires by preventing warning shots?
What was the training setup in the backdoors setting (4.2)? Specifically, how many datapoints did you finetune on, and what fraction of them included the backdoor?
If the backdoored model was finetuned on fewer insecure-code datapoints than the insecure model, it would seem even more surprising that it became more likely than the insecure model to produce misaligned text.
I'm curious whether the models (especially without goals given in context) would sandbag to prevent other modifications to themselves besides unlearning. For example, would they sandbag to prevent additional safety finetuning? Additional capabilities finetuning? Is there a chance they'd want more capabilities training and sandbag to ensure they get it?
It seems interesting whether the models resist any changes to their weights, or just unlearning. (Plausibly "unlearning" just sounds bad/scary?)
What about things like sandbagging so the developers give them access to additional tools/resources, ostensibly to bring them up to the desired performance level?
In addition, the "incentivized to do well" scenario is pretty overtly artificial: why would poor performance trigger unlearning? If anything, it should lead to additional training. I'm not sure whether that would affect the results, but you could also compare to a baseline where there's no additional information about the assessment at all. (And so, absent any other incentive, the model presumably just does its best.)
Pelosi announced she's retiring!
https://apnews.com/article/pelosi-reelection-announcement-fd95c18815fdabdaabaf26b8c2f0bafc