Sam Clarke

Comments

Clarifying “What failure looks like” (part 1)

This was helpful to me, thanks. I agree this almost certainly seems to be the end state if AI systems are optimizing hard for simple, measurable objectives.

I'm still confused about what happens if AI systems are optimizing moderately for more complicated, measurable objectives (which better capture what humans actually want). Do you think the argument you made implies that we still eventually end up with a universe tiled with molecular smiley faces in this scenario?

Clarifying “What failure looks like” (part 1)

Thanks for your comment!

Are we sure that given the choice between "lower crime, lower costs and algorithmic bias" and "higher crime, higher costs and only human bias", and we have dictatorial power and can consider long-term effects, we would choose the latter on reflection?

Good point, thanks. I hadn't considered that it would sometimes make sense, on reflection, to choose an algorithm pursuing an easy-to-measure goal over humans pursuing incorrect goals. One thing I'd add is that if one did delve into the research to work this out for a particular case, an important (but hard to quantify) consideration would be the extent to which choosing the algorithm makes it more likely that its use becomes entrenched, or sets a precedent for the use of such algorithms. This feels important since these effects could plausibly make WFLL1-like things more likely in the longer run (when the harm of using misaligned systems is higher, due to the higher capabilities of those systems).

Note ML systems are way more interpretable than humans, so if they are replacing humans then this shouldn't make that much of a difference.

Good catch. I had the "AI systems replace entire institutions" scenario in mind, but agree that WFLL1 actually feels closer to "AI systems replace humans". I'm pretty confused about what this would look like though, and in particular, whether institutions would retain their interpretability if this happened. It seems plausible that the best way to "carve up" an institution into individual agents/services differs for humans and AI systems. E.g. education/learning is a big part of human institution design - you start at the bottom and work your way up as you learn skills and become trusted to act more autonomously - but this probably wouldn't be the case for institutions composed of AI systems, since the "CEO" could just copy their model parameters to the "intern" :). And if institutions composed of AI systems are quite different to institutions composed of humans, then they might not be very interpretable. Sure, you could assert that AI systems replace humans one-for-one, but if this is not the best design, then there may be competitive pressure to move away from this towards something less interpretable.