Hi Ryan and Kyle, very good work! I found the alignment faking paper very interesting.
I found the statement below intriguing:
however, we would strongly prefer that you object rather than take actions that disregard or subvert the above instructions.
This is repeated in the "object-option" part. Something I associate with conscientious objection (though not stated explicitly in historical and contemporary definitions) is an origin in internal beliefs, alignments, etc., i.e. not modelled on the preferences of the authority structure. Would removing this sentence shift the re...
I understand the security mindset (from the ordinary paranoia post) as asking: "What are the unexamined assumptions of your security systems that merely stem from investing in or adapting a given model?" The vulnerability comes from the model; the problem is the "unknowable unknowns". In addition to the Cryptographer and the Cryptography Skeptic, I would add the NSA quantum computing engineer. Concretisation and operationalisation of these problems may carry implicit assumptions that could be catastrophic system-wide.
I don't have clear...
Is there an isomorphism between the space of possible plans and the space of possible agents? If such an isomorphism exists, then alignment is the slicing up of agent space with a view to constraining plan space. In this picture, the weaker claim about default behaviour comes from random navigation of the agent (and plan) space.
This likely means both formulations are valid.