Owen D

I think your objections are all basically correct, but you treat them as dealbreakers in ways that I (a big shard-alignment fan) don't. As I understand it, they boil down to: (1) picking the training curriculum/reward signal is hard, and the design choices involved pose challenges beyond the simple empirical question of whether it works to produce an AGI; and (2) reflectivity is very hard, might cause lots of big problems, and we can't begin to productively engage with those issues right now.


I don’t think the curriculum and reward signal are as problematic as you seem to think. From the standpoint of AI notkilleveryoneism, basically any set of prosocial/human-friendly values will be sufficient, and something directionally correct will be very easy to find. The design choices relating to “what’s in the curriculum” seem of secondary importance to me: in all but the least iterative-design-friendly worlds, we can figure this out as we go, and even in hard-takeoff worlds, if we figure out the notkilleveryoneism/basic corrigibility stuff, we would probably be able to slow AI development down long enough for iteration.


The reflectivity stuff 100% does cause huge problems that we don’t know how to solve, but I break with you in two places here: first, you seem to think that we can’t do much empirical or theoretical work right now to improve our understanding of reflective processes; and second, you seem to assume that reflectivity involves or induces additional challenges that IMO can very readily be avoided. Regarding the former point, I think I’m doing empirical work right now that can plausibly improve our understanding of reflectivity, and Peli Grietzer is doing theoretical work (on what he calls "praxis-based values," based on "doing X X-ingly... the intuition that some reflective values are an uroboros of means and ends") that engages with these problems as well. There’s lots of low-hanging fruit here, and for an approach to alignment that’s only been in play for about a year, I think a lot of progress has been made.


Regarding the latter point, I think many of your claims about lock-in are stated too strongly. I’m a reflective goal-directed agent, and I don’t think my values are “locked in”; I can and do change my behaviors and moral views in response to new information and circumstances. Maybe you think that “lock-in” involves actual self-modification, so that e.g. an aspiring vegan would reengineer their tastebuds so that meat tastes horrible, but creating shards that discourage this kind of behavior seems easy as pie. Overall, the problems involving “lock-in” don’t seem as hard to me as they do to you. I think the bigger dangers (ones we currently don’t know how to address, but might soon) are unknown unknowns and other reflectivity problems, especially those involving how individually desirable shards might interact in undesirable ways and push our agent towards bizarre and harmful behaviors.