Reflection Mechanisms as an Alignment Target: A Survey
This is a product of the 2022 AI Safety Camp. The project was carried out by Marius Hobbhahn and Eric Landgrebe under the supervision of Beth Barnes. We would like to thank Jacy Reese Anthis and Tyna Eloundou for detailed feedback. You can find the Google Doc for this post here. Links to other sections of the text automatically point to the Google Doc. Feel free to add comments there.

Abstract

We surveyed ~1000 US-based Mechanical Turk workers (selected and quality-tested by Positly) on their attitudes toward moral questions, the conditions under which they would change their moral beliefs, and their approval of different mechanisms for society to resolve moral disagreements. Unsurprisingly, our sample disagreed strongly on questions such as whether abortion is immoral. In addition, a substantial fraction of people reported that these beliefs would not change even if they came to different beliefs about factors we view as morally relevant, such as whether the fetus is conscious in the case of abortion. However, people were generally favorable to the idea of society deciding policies by some means of reflection, such as democracy, a debate between well-intentioned experts, or thinking for a long time. In a hypothetical idealized setting for reflection (a future society where people were more educated, informed, well-intentioned, etc.), people were favorable to using the results of the top reflection mechanisms to decide policy. This held even when respondents were asked to assume that the results came to the opposite conclusion from their own strongly held moral beliefs, such as views on abortion. This suggests that ordinary Americans may be willing to defer to an idealized reflection mechanism even when they have strong object-level moral disagreements, and that they would likely support aligning AIs to the results of some reflection mechanism rather than to people's current moral beliefs.

Introduction

Optimistically, a solution to the technical alignment