Dangerous capability tests should be harder — LessWrong