Have you thought about whether people donating should also communicate with the campaign somehow, e.g. emailing them and saying why they donated?
$1M / $7k is only ~143, so a significant fraction of the major donors might be AI safety motivated, and at minimum it seems good for the campaign to be aware of this if they aren't already. Would it make sense for donors to coordinate around this somehow?
(My impression is that people/groups donating large amounts to politicians get some benefit beyond just helping a candidate they support, but I don't know how it usually works.)
Issuing a trillion-dollar coin doesn't seem nearly as bad as any of the others in its bullet. Isn't it just an accounting gimmick roughly equivalent to raising the debt ceiling by $1 trillion?
You might be asking yourself: won't workers become lazy, since profits are shared with their colleagues and each worker therefore only captures a small proportion of the fruits of their individual labor?
In traditional corporations, don't workers have even less stake in the company than in coops?
What do you think of the risk that control backfires by preventing warning shots?
What was the training setup in the backdoors setting (4.2)? Specifically, how many datapoints did you finetune on, and what fraction of them included the backdoor?
If the backdoored model was finetuned on fewer insecure code datapoints than the insecure model, it would seem more surprising that it became more likely than the insecure model to produce misaligned text.
I'm curious whether the models (especially without goals given in context) would sandbag to prevent other modifications to themselves besides unlearning. For example, would they sandbag to prevent additional safety finetuning? Additional capabilities finetuning? Is there a chance they'd want more capabilities training and sandbag to ensure they get it?
It seems interesting whether the models resist any changes to their weights, or just unlearning. (Plausibly "unlearning" just sounds bad/scary?)
What about things like sandbagging so the developers give them access to additional tools/resources, ostensibly to bring them up to the desired performance level?
In addition, the "incentivized to do well" scenario is pretty overtly artificial: why would poor performance trigger unlearning? If anything, it should lead to additional training. I'm not sure whether that would affect the results, but you could also compare to a baseline where there's no additional information about the assessment at all. (And so, absent any other incentive, the model presumably just does its best.)
I think the initial (2-agent) model only has two time steps, i.e. one opportunity for the button to be pressed. The goal is just for the agent to be corrigible for this single button-press opportunity.
That can be folded into the utility function, however. Just make the ratings of the deferential person mostly copy the ratings of their partner.
Presumably the deferential partner could just use a utility function which is a weighted combination of their partner's and their own (selfish) one. For instance, the deferential partner could use a utility function like $u_d = \alpha \, u_p + (1 - \alpha) \, u_s$, where $u_p$ is the utility function of the partner and $u_s$ is the utility function of the deferential person accounting only for their weak personal preferences and not their altruism.
Obviously the weights could depend on the level of altruism, the strength of the partner's preferences, whether they are reporting their true preferences or the preferences such that the outcome will be what they want, etc. But this type of deferential preference can still be described by a utility function.
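To make the weighted-combination idea concrete, here's a minimal Python sketch (the utility values, the weight $\alpha$, and the function names are made up for illustration, not taken from the original discussion):

```python
# Minimal sketch: a deferential person's utility as a weighted combination
# of their partner's utility and their own weak personal preferences.
# All values here are hypothetical.

def deferential_utility(u_partner, u_self, alpha=0.9):
    """Return u(outcome) = alpha * u_partner(outcome) + (1 - alpha) * u_self(outcome).

    alpha close to 1 models strong deference: the deferential person's
    ratings mostly copy their partner's, with ties broken by their own
    weak preferences.
    """
    return lambda outcome: alpha * u_partner(outcome) + (1 - alpha) * u_self(outcome)

# Example: two options, where the partner strongly prefers A
# and the deferential person weakly prefers B.
u_partner = {"A": 1.0, "B": 0.2}.get
u_self = {"A": 0.4, "B": 0.6}.get

u_d = deferential_utility(u_partner, u_self, alpha=0.9)
print(max(["A", "B"], key=u_d))  # "A": the partner's preference dominates
```

Nothing here goes beyond ordinary utility maximization; the point is just that deference is expressible as a fixed utility function, so it doesn't require a separate formalism.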
Cool! @Eric Neyman, that might be worth noting in the post so that folks make sure to use that link.