An alternative of PPO towards alignment — LessWrong