RFC: Philosophical Conservatism in AI Alignment Research

I've been operating under the influence of an idea I call philosophical conservatism when thinking about AI alignment. I am in the process of summarizing some of the specific stances I take and why I take them, because I believe others would better serve the project of alignment research by doing the same. In the meantime, I'd like to request comments on the general line of thinking. I've numbered the outline of the general idea and the reasons for it so you can easily comment on each statement independently.

  1. AI alignment is a problem with bimodal outcomes, i.e. most of the probability distribution is clustered around success and failure with very little area under the curve between these outcomes.
  2. Thus, all else equal, we would rather be extra cautious and miss some paths to success than be insufficiently cautious and hit a path to failure.
  3. One response to this is what Yudkowsky calls security mindset by alluding to Schneier's concept of the same name.
  4. Another is what I call philosophical conservatism. The ideas are related and address related concerns but in different ways.
  5. Philosophical conservatism says you should make the fewest philosophical assumptions necessary to address AI alignment, and that each assumption should be maximally parsimonious. When there is nontrivial uncertainty over whether a more convenient assumption holds, you should adopt the similar assumption that would be least convenient for addressing alignment if it were true.
  6. This is a strategy that reduces the chance of false positives in alignment research but makes the problem possibly harder, more costly, and less competitive to solve.
  7. For example, we should assume there is no discoverably correct ethics or metaethics the AI can learn since, although it would make the problem easier if this were true, there is nontrivial uncertainty around this and so the assumption which makes it less likely that alignment projects fail is to assume that ethics and metaethics are not solvable.
  8. Current alignment research programs do not seem to operate with philosophical conservatism because they either leave philosophical issues relevant to alignment unaddressed, make unclear implicit philosophical assumptions, or admit being hopeful that helpful assumptions will prove true and ease the work.
  9. The alignment project is better served by those working on it using philosophical conservatism because it reduces the risks of false positives and spending time on research directions that are more likely than others to fail if their philosophical assumptions do not hold.

I like the general thrust here, although I have a different version of this idea, which I would call "minimizing philosophical pre-commitments". For instance, there is a great deal of debate about whether Bayesian probability is a reasonable philosophical foundation for statistical reasoning. It seems that it would be better, all else equal, for approaches to AI alignment to not hinge on being on the right side of this debate.

I think there are some places where it is hard to avoid pre-commitments. For instance, while this isn't quite a philosophical pre-commitment, it is probably hard to develop approaches that are simultaneously optimized for short and long timelines. In this case it is probably better to explicitly do case splitting on the two worlds and have some subset of people pursuing approaches that are good in each individual world.

I agree we must make some assumptions or pre-commitments and don't expect we can avoid them. In particular there are epistemological issues that force our hands and require we make assumptions because complete knowledge of the universe is beyond the capacity we have to know it. I've talked about this idea some and I plan to revisit it as part of this work.

1. AI alignment is a problem with bimodal outcomes, i.e. most of the probability distribution is clustered around success and failure with very little area under the curve between these outcomes.
2. Thus, all else equal, we would rather be extra cautious and miss some paths to success than be insufficiently cautious and hit a path to failure.

At first I didn't understand the connection between these two points, but I think I get it now. (And so I'll spell it out for others who might be like me.)

If there are really only two outcomes and one is much better than the other, then when you're choosing which path to take you want to maximize the probability that it leads you to the good outcome. (Aka the maxipok rule.)

This would be different from a situation where there's many different possible outcomes of varying quality. In that case, if you think a path leads you to the best outcome, and you're wrong, maybe that's not so bad, because it might still lead you somewhere pretty good. And also if the difference between outcomes is not so great, then maybe you're more okay selecting your path based in part on how easy or convenient it is.

Whereas back in the bi-modal world, if you have a path that you're pretty sure leads you to the good outcome, if you've got the time you might as well look for a path you can be more confident in.
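To make the contrast concrete, here is a toy sketch of the arithmetic. All numbers are hypothetical and chosen only for illustration: a "convenient" path that reaches the best outcome 90% of the time, a "careful" path that reaches it 99% of the time, and a fallback value of 0.0 in the bimodal world versus 0.8 in a world with graded outcomes.

```python
def expected_value(outcomes):
    """Expected value of a list of (probability, value) pairs."""
    return sum(p * v for p, v in outcomes)


def path(p_best, fallback_value):
    """A path that reaches the best outcome (value 1.0) with probability
    p_best and otherwise lands on fallback_value. Hypothetical model."""
    return [(p_best, 1.0), (1.0 - p_best, fallback_value)]


# Hypothetical success probabilities for the two paths.
P_CONVENIENT = 0.90
P_CAREFUL = 0.99

# Bimodal world: missing the best outcome means total failure (value 0.0),
# so expected value is just the probability of success.
gap_bimodal = (expected_value(path(P_CAREFUL, 0.0))
               - expected_value(path(P_CONVENIENT, 0.0)))

# Graded world: missing the best outcome still lands somewhere decent (0.8).
gap_graded = (expected_value(path(P_CAREFUL, 0.8))
              - expected_value(path(P_CONVENIENT, 0.8)))

print(f"gain from the careful path, bimodal world: {gap_bimodal:.3f}")
print(f"gain from the careful path, graded world:  {gap_graded:.3f}")
```

Under these assumed numbers the extra confidence is worth five times as much in the bimodal world, which is the sense in which convenience matters less there.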

(Is that a fair summary, @gworley?)

Thanks, that's a useful clarification of my reasoning that I did not spell out!

asymmetry in the penalties for type 1 vs type 2 errors.

I'm not really sure what it would mean in practice to operate under the assumption that "there is no discoverably correct ethics or metaethics the AI can learn". The AI still has to do something. The idea of CEV seems to be something like: have the AI act according to preference utilitarianism. It seems to me that this would result in a reasonably good outcome even if preference utilitarianism isn't true (and in fact I think that preference utilitarianism isn't true). So the only ethical assumption is that creating a preference utilitarian AI is a good idea. What would be an example of a weaker ethical assumption that could plausibly be sufficient to design an AI?

I'm not entirely sure either, and my best approach has been to change what we really mean by "ethics" to make the problem tractable without forcing a move to making choices about what is normative. I'll touch on this more when I describe the package of philosophical ideas I believe we should adopt in AI safety research, so for now I'll leave it as an example of the kind of assumption that is affected by this line of thinking.

But Statement 1 doesn't imply caution is always best. A plausibly friendly AI based on several dubious philosophical assumptions (50% chance of a good outcome) is better than taking our time to get it right (99%) if someone else will make a paperclip maximizer in the meantime. We want to maximize the likelihood of the first AI being good, which may mean releasing a sloppily made, potentially friendly AI in race conditions (assuming the other side can't be stopped).
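A rough sketch of this arithmetic, using the 50% and 99% figures above. The model is an assumption made for illustration: releasing the sloppy AI now guarantees it is first, the careful project is first only with some probability `p_first`, and the outcome is bad with certainty if the rival gets there first.

```python
def p_good_outcome(p_first, p_good_if_first):
    """Probability of a good outcome, assuming (hypothetically) that the
    outcome is bad with certainty whenever the rival AI is built first."""
    return p_first * p_good_if_first


P_GOOD_SLOPPY = 0.50   # from the comment: dubious assumptions
P_GOOD_CAREFUL = 0.99  # from the comment: taking our time

# Releasing now: assumed to always be first.
sloppy = p_good_outcome(1.0, P_GOOD_SLOPPY)

# The careful project only wins if it finishes first often enough.
break_even = P_GOOD_SLOPPY / P_GOOD_CAREFUL

for p_first in (0.3, 0.5, 0.7):  # hypothetical chances of finishing first
    careful = p_good_outcome(p_first, P_GOOD_CAREFUL)
    better = "careful" if careful > sloppy else "sloppy"
    print(f"p_first={p_first:.1f}: careful={careful:.3f}, "
          f"sloppy={sloppy:.3f} -> {better}")

print(f"break-even chance of finishing first: {break_even:.3f}")
```

Under this simplified model, taking the time to get it right wins only if the careful project has better than roughly even odds of finishing first.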

That's why I say in 2 that this holds all else equal. You're right that there are competing concerns that may make philosophical conservatism untenable, and I view it as one of the goals of AI policy to make sure that it stays tenable, in part by telling us about the race conditions that would make us unable to practice it.

Isn't the least convenient world a world where FAI is outright impossible and all AGIs are Unfriendly?

If that's the case, we're no longer addressing alignment and are forced to fall back on weaker safety mechanisms. People are working in this direction, but alignment remains the best path until we see evidence it's not possible.