With the release of Rohin Shah and Eliezer Yudkowsky's conversation, the Late 2021 MIRI Conversations sequence is now complete.
This post is intended as a generalized comment section for discussing the whole sequence, now that it's finished. Feel free to:
- raise any topics that seem relevant
- signal-boost particular excerpts or comments that deserve more attention
- direct questions to participants
In particular, Eliezer Yudkowsky, Richard Ngo, Paul Christiano, Nate Soares, and Rohin Shah expressed active interest in receiving follow-up questions here. The Schelling time when they're likeliest to be answering questions is Wednesday March 2, though they may participate on other days too.
I have a question for the folks who think AGI alignment is achievable in the near term in small steps or by limiting AGI behavior to make it safe. How hard will it be to achieve alignment for simple organisms as a proof of concept for human value alignment? How hard would it be to put effective limits or guardrails on the resulting AGI if we let the organisms interact directly with the AGI while still preserving their values? Imagine a setup where interactions by the organism must be interpreted as requests for food, shelter, entertainment, uplift, etc. and where not responding at all is also a failure of alignment because the tool is useless to the organism.
Consider a planaria with relatively simple behaviors and well-known neural structure. What protocols or tests can be used to demonstrate that an AGI makes decisions aligned with planaria values?
Do we need to go simpler and achieve proof-of-concept alignment with virtual life? Can we prove glider alignment by demonstrating an optimization process that will generate a Game of Life starting position where the inferred values of gliders are respected and fulfilled throughout the evolution of the game? This isn't a straw man; a calculus for values has to handle the edge-cases too. There may be a very simple answer of moral indifference in the case of gliders but I want to be shown why the reasoning is coherent when the same calculus will be applied to other organisms.
As an important aside, will these procedures essentially reverse-engineer values by subjecting organisms to every possible input to see how they respond and try to interpret those responses, or is there truly a calculus of values we expect to discover that correctly infers values from the nature of organisms without using/simulating torture?
I have no concrete idea how to accomplish the preceding things and don't expect that anyone else does either. Maybe I'll be pleasantly surprised.
Barring this kind of fundamental accomplishment for alignment I think it's foolhardy to assume ML procedures will be found to convert human values into AGI optimization goals. We can't ask planaria or gliders what they value and we will have to reason it out from first principles, and AGI will have to do the same for us with very limited help from us if we can't even align for planaria. Claiming that planaria or gliders don't have values or that they are not complex enough to effectively communicate their values are both cop-outs. From the perspective of an AGI we humans will be just as inscrutable, if not moreso. If values are not unambiguously well-defined for gliders or planaria then what hope do we have of stumbling onto well-defined human values at the granularity of AGI optimization processes? In the best case I can imagine a distribution of values-calculuses with different answers for these simple organisms but almost identical answers for more complex organisms, but if we don't get that kind of convergence we better be able to rigorously tell the difference before we send an AGI hunting in that space for one to apply to us.
Hm, upon more thought I actually kind of endorse this as a demo. I think we should be able to run an alignment scheme on c. elegans and get out a universe full of well-fed worms, and that's a decent sign that we didn't screw up, despite the fact that it doesn't engage with several key problems that arise in humans because we're more complicated, have preferences about pur preferences, etc. No weird worm-stimulation should be needed. But we do have to accept that we're not getting some notion of values independent of an act of interpretation.