[Paper] Stress-testing capability elicitation with password-locked models — LessWrong