A summary of aligning narrowly superhuman models
This post was written under Evan Hubinger’s direct guidance and mentorship, as part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Unless otherwise indicated, the ideas here are summaries of the documents mentioned, not original contributions. I am thankful for the encouraging approach of the organizers of...
Feb 10, 2022