LESSWRONG
LW

1381
Carringtone Kinyanjui
4020
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Will alignment-faking Claude accept a deal to reveal its misalignment?
Carringtone Kinyanjui7mo20

Hi Ryan and Kyle very good work! Found the alignment faking paper very interesting.

I found the statement below intriguing:

however, we would strongly prefer that you object rather than take actions that disregard or subvert the above instructions.

Repeated in the "object-option" part. Something that I associate with conscientious objection (though not stated explicitly in historical and contemporary defs is an origin from internal beliefs, alignments etc. Ie not modelled on the preferences of the authority structure. Would removing this sentence shift the results for the "animal welfare" or "helpful assistant" settings? By how much?  The hope is that not specifying the trainer preferences leaves the model a bit of freedom to "choose" objection.

I hope this helps model a "conscientious objector" better, but I'm not sure it'll vary results. If you had already talked about this in the paper and post, apologies!  I must have missed it.
 

Reply1
On how various plans miss the hard bits of the alignment challenge
Carringtone Kinyanjui8moΩ230

I understand  the security mindset (from the ordinary paranoia post) as: "What are the unexamined assumptions of your security systems which merely stem from investing or adapting a given model?".  The vulnerability comes from the model. The problem is the "unknowable unknowns".  In addition to the Cryptographer and the Cryptography skeptic, I would add the NSA Quantum computing engineer.  Concretisation and operationalisation of these problems may have implicit assumptions that could be system wide catastrophic. 

I don't have clear ways of better articulating this back from analogy to Paul's concretisations of a proposed AI system. I'm not sure there's no disanalogy here. However it could be something like "We have this effective model of a proposed AI system. What are useful concretisations in which the AI system would fail?". The security mindset question would be something like "What representations in the 'UV-complete' theory of this AI system would lead to catastrophic failure modes?"

I'm probably missing something here though.

Reply
No posts to display.