RFC: Philosophical Conservatism in AI Alignment Research

gworley

I've been operating under the influence of an idea I call philosophical conservatism when thinking about AI alignment. I'm in the process of summarizing the specific stances I take and my reasons for taking them, because I believe others would better serve the project of alignment research by doing the same. In the meantime, I'd like to request comments on the general line of thinking to see what others think. I've numbered the statements in the outline of the general idea and the reasons for it so you can easily comment on each one independently.

  1. AI alignment is a problem with bimodal outcomes, i.e., most of the probability distribution is clustered around success and failure, with very little area under the curve between these outcomes.
  2. Thus, all else equal, we would rather be extra cautious and miss some paths to success than be insufficiently cautious and hit a path to failure.
  3. One response to this is what Yudkowsky calls security mindset, alluding to Schneier's concept of the same name.
  4. Another is what I call philosophical conservatism. The ideas are related and address related concerns but in different ways.
  5. Philosophical conservatism says you should make the fewest philosophical assumptions necessary to address AI alignment, that each assumption should be maximally parsimonious, and that, when there is nontrivial uncertainty over whether a similar but more convenient assumption holds, you should adopt the assumption that would be least convenient for addressing alignment if it were true.
  6. This strategy reduces the chance of false positives in alignment research but possibly makes the problem harder, more costly, and less competitive to solve.
  7. For example, we should assume there is no discoverably correct ethics or metaethics the AI can learn. Although the problem would be easier if such an ethics existed, there is nontrivial uncertainty about this, so the assumption that makes alignment projects less likely to fail is that ethics and metaethics are not solvable.
  8. Current alignment research programs do not seem to operate with philosophical conservatism, because they either leave philosophical issues relevant to alignment unaddressed, make unclear implicit philosophical assumptions, or admit to hoping that helpful assumptions will prove true and ease the work.
  9. The alignment project is better served by those working on it using philosophical conservatism, because it reduces the risk of false positives and the risk of spending time on research directions that are more likely than others to fail if their philosophical assumptions do not hold.