Nathaniel

Posts
No posts to display.

Comments (sorted by newest)

Consider donating to Alex Bores, author of the RAISE Act
Nathaniel · 5d · 10

Cool! @Eric Neyman, that might be worth noting in the post so that folks make sure to use that link.

Consider donating to AI safety champion Scott Wiener
Nathaniel · 6d · 72

Have you thought about whether people donating should also communicate with the campaign somehow, e.g. emailing them and saying why you donated? 

$1M / $7k is only ~143, so a significant fraction of the major donors might be AI safety motivated, and at minimum it seems good for the campaign to be aware of this if they aren't already. Would it make sense for donors to coordinate around this somehow? 
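
Spelling out that arithmetic (assuming the $1M figure is the campaign's total raise and $7k is the maximum individual donation): $\$1{,}000{,}000 \,/\, \$7{,}000 \approx 142.9$, so roughly 143 max-size donations would cover the entire amount.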

(My impression is that people/groups donating large amounts to a politician get some benefit beyond just helping a candidate they support, but I don't know how it usually works.)

(Comment copied from the Bores post)

Consider donating to Alex Bores, author of the RAISE Act
Nathaniel · 6d · 10

Have you thought about whether people donating should also communicate with the campaign somehow, e.g. emailing them and saying why you donated? 

$1M / $7k is only ~143, so a significant fraction of the major donors might be AI safety motivated, and at minimum it seems good for the campaign to be aware of this if they aren't already. Would it make sense for donors to coordinate around this somehow? 

(My impression is that people/groups donating large amounts to a politician get some benefit beyond just helping a candidate they support, but I don't know how it usually works.)

Thomas Kwa's Shortform
Nathaniel · 1mo* · 30

Issuing a trillion-dollar coin doesn't seem nearly as bad as any of the others in its bullet. Isn't it just an accounting gimmick roughly equivalent to raising the debt ceiling by $1 trillion?

If worker coops are so productive, why aren't they everywhere?
Nathaniel · 3mo · 74

You might be asking yourself: won't workers become lazy since the profit is shared with their colleagues, which means they only get a small proportion of the fruits of their individual labor?

In traditional corporations, don't workers have even less stake in the company than in coops?

Buck's Shortform
Nathaniel · 3mo · 30

What do you think of the risk that control backfires by preventing warning shots?

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Nathaniel · 8mo · 10

What was the training setup in the backdoors setting (4.2)? Specifically, how many datapoints did you finetune on, and what fraction of them included the backdoor?

If the backdoored model was finetuned on fewer insecure-code datapoints than the insecure model, it would seem even more surprising that it became more likely to produce misaligned text than the insecure model.

Frontier Models are Capable of In-context Scheming
Nathaniel · 11mo · 32

I'm curious whether the models (especially without goals given in context) would sandbag to prevent other modifications to themselves besides unlearning. For example, would they sandbag to prevent additional safety finetuning? Additional capabilities finetuning? Is there a chance they'd want more capabilities training and sandbag to ensure they get it?   

It seems interesting whether the models resist any changes to their weights, or just unlearning. (Plausibly "unlearning" just sounds bad/scary?)

What about things like sandbagging so the developers give them access to additional tools/resources, ostensibly to bring them up to the desired performance level?

In addition, the "incentivized to do well" scenario is pretty overtly artificial; why would poor performance trigger unlearning? If anything, it should lead to additional training. I'm not sure whether that would affect the results, but you could also compare to a baseline where there's no additional information about the assessment at all. (And so, absent any other incentive, the model presumably just does its best.)

A Shutdown Problem Proposal
Nathaniel · 2y · 10

I think the initial (2-agent) model only has two time steps, i.e., one opportunity for the button to be pressed. The goal is just for the agent to be corrigible for this single button-press opportunity.

Unifying Bargaining Notions (1/2)
Nathaniel · 3y · 20

That can be folded into the utility function, however. Just make the ratings of the deferential person mostly copy the ratings of their partner.

Presumably the deferential partner could just use a utility function which is a weighted combination of their partner's and their own (selfish) one. For instance, the deferential partner could use a utility function like $u_2^{\text{true}} = 0.9\,u_1 + 0.1\,u_2^{\text{selfish}}$, where $u_1(\cdot)$ is the utility function of the partner and $u_2^{\text{selfish}}(\cdot)$ is the utility function of the deferential person, accounting only for their weak personal preferences and not their altruism.

Obviously the weights could depend on the level of altruism, the strength of the partner's preferences, whether they are reporting their true preferences or the preferences such that the outcome will be what they want, etc. But this type of deferential preference can still be described by a utility function.
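
A minimal sketch of that kind of blended utility, with purely illustrative option names and ratings (nothing here is from the post; the 0.9/0.1 weights just mirror the example above):

```python
# Sketch: the deferential partner's "true" utility is a weighted blend of
# their partner's utility and their own weak selfish preferences.
def deferential_utility(u_partner: float, u_selfish: float, altruism: float = 0.9) -> float:
    # u_2^true = altruism * u_1 + (1 - altruism) * u_2^selfish
    return altruism * u_partner + (1 - altruism) * u_selfish

# Hypothetical ratings of two options by each person (illustrative values only).
options = {
    "option_a": {"u_partner": 0.8, "u_selfish": 0.2},
    "option_b": {"u_partner": 0.3, "u_selfish": 0.9},
}

for name, u in options.items():
    print(name, round(deferential_utility(u["u_partner"], u["u_selfish"]), 2))
# option_a -> 0.74, option_b -> 0.36: the blended ranking mostly tracks the
# partner's preferences, while the weak selfish preferences only matter for
# near-ties -- the "mostly copy" behavior described above.
```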

Wikitag Contributions

LessWrong Reacts · a month ago · (+8/-5)