Two ideas for alignment: perpetual mutual distrust and induction
Two ideas I have for alignment (they may exist already, or may not be great; I am not exhaustively read on the topic). Idea 1, two agents in mutual distrust of each other: intuitively, alignment is a difficult problem because it is hard to know what an AI ostensibly less capable...
I think my point is to lower the bar to there being a non-trivial probability of the AI following the rule. Fully aligning AIs to near certainty may be a higher bar than merely potentially aligning them.
Aligning with arbitrary values, without the possibility of inner deception: if it is easy to verify the values of an agent to near certainty, it seems to follow that we can more or less bootstrap alignment, with weaker agents inductively aligning stronger agents.
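To make the inductive structure of that argument concrete, here is a minimal sketch. Everything in it is a hypothetical placeholder, not a real API: the `Agent` type, `train_stronger_agent`, and especially `verify_values`, which stands in for the key assumption that a weaker, already-trusted agent can check a stronger candidate's values to near certainty.

```python
# Sketch of inductive alignment bootstrapping (illustrative only).
# The whole argument rests on verify_values being cheap and reliable
# even when the candidate is more capable than the verifier.

from dataclasses import dataclass


@dataclass
class Agent:
    """Placeholder for an AI system at some capability level."""
    capability: int
    values: str  # stand-in for whatever representation of values gets verified


def train_stronger_agent(current: Agent) -> Agent:
    """Hypothetical: produce a more capable candidate agent."""
    return Agent(capability=current.capability + 1, values=current.values)


def verify_values(verifier: Agent, candidate: Agent) -> bool:
    """Hypothetical: the trusted, weaker agent checks the candidate's
    values to near certainty (the assumption the argument leans on)."""
    return candidate.values == verifier.values


def bootstrap(base: Agent, levels: int) -> Agent:
    """Inductive step: trust at level n plus reliable verification
    yields trust at level n + 1."""
    trusted = base  # base case: a weak agent humans can verify directly
    for _ in range(levels):
        candidate = train_stronger_agent(trusted)
        if not verify_values(trusted, candidate):
            raise RuntimeError("verification failed; stop the ladder here")
        trusted = candidate  # induction: the candidate becomes the new verifier
    return trusted
```

If verification really is near-certain at every rung, the loop carries trust from the human-checkable base case up the capability ladder; if it is not, the induction breaks at whichever rung the verifier can no longer see through the stronger agent.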