Hey! Absolutely, I think a lot of this makes sense. I assume you meant this paragraph, under Reverse Engineering Roles and Norms:
I want to be clear that I do not mean AI systems should go off and philosophize on their own until they implement the perfect moral theory without human consent. Rather, our goal should be to design them in such a way that this will be an interactive, collaborative process, so that we continue to have autonomy over our civilizational future.
For both points here, I guess I was getting at this question: how ought we structure this collaborative process? Like, what constitutes the feedback a machine sees in order to interactively improve with society? Who do AI systems interact with? What constitutes a datapoint in the moral learning process? These seem like loaded questions, so let me be more concrete. In decisions without unanimity with regard to a moral fact, using simple majority rule, for example, could lead to a disastrously bad moral theory: you could align an AI with norms that result in 60% of the public exploiting the other 40% (for example, if a majority deems it moral to exploit / under-provide for a minority, in an extreme case). It strikes me that to prevent this kind of failure mode, there must be some baked-in context of "obviously wrong" beforehand. If you require total unanimity, well then, you will never get even a single datapoint: people will reasonably disagree (I would argue to infinity, after arbitrary amounts of reasonable debate) about basic moral facts due to differences in values.
I think this negotiation process is in itself really, really important to get right if you advocate this kind of approach, and to get right without advancing any one moral view of the world. I certainly don't think it's impossible, just as it isn't impossible to have a relatively well-functioning democracy. But this is the point, I guess: are there limit guarantees that society will agree after arbitrary lengths of deliberation? Has modern democracy / norm-setting historically arisen from mutual deliberation, or from exertion of state power / arbitrary assertion of one norm over another? I honestly don't have sufficient context to answer that, but it seems like a relevant empirical fact here.
Maybe another follow-up: what are your idealized conditions for "rational / mutually justifiable collective deliberation" here? It seems this phrase implicitly does a lot of heavy lifting for this framework, and I'm not quite sure myself what it would mean, even ideally.
Great post! I'm curious if you could elaborate on when you would feel comfortable allowing an agent to make some kind of "enlightened" decision, as opposed to one based more on "mere compliance"? Especially given an AI system that is perhaps not very interpretable, or operates in very high-stakes applications, what sort of certificate / guarantee / piece of reasoning would you want from a system to allow it to enact fundamental social changes? The nice thing about "mere compliance" is that there are benchmarks for 'right' and 'wrong' decisions. But here I would expect people to reasonably disagree on whether an AI system or community of systems has made a good decision, and therefore it seems harder to ever fully trust machines to make decisions at this level.
Also, despite this being mentioned, it strikes me that purely appealing to what IS done, as opposed to what OUGHT to be done directly, in the context of building moral machines, can still ultimately be quite regressive without this "enlightened compliance" component. It even strikes me that the point of technological progress itself is to incrementally, and slightly, modify and upend the roles we play in the economy and society more broadly, and so it strikes me as somewhat paradoxical to impose this 'freezing' criterion on tech and society in some way. Like we want AI to improve the conditions of society, but not fundamentally change its dynamics too much? I may be misunderstanding something about the argument here.
All of this is to say, it does feel somewhat unavoidable to me to advance some kind of claim about the precise contents of a superior moral framework for what systems ought to do, beyond just matching what people do (in Russell's case) or what society does (in this post's case). You mention the interaction of cooperating and working with machines in the "enlightened compliance" section, and not ever fully automating the decision. But what are we deciding, if not the contents of an eventually superior moral theory, or a superior social one? This seems to me the inevitable, and ultimate, desideratum of social progress.
It reminds me of a lot of Chomskyan comments on law and justice: all norms and laws grope at an idealized theory of justice which is somehow biological and emergent from human nature. And while we certainly know that we ought not let system designers, corporations, or states just dictate the content of such a moral theory directly, I don't think we can purely lean on contractualism to avoid the question in the long run. It is perhaps useful in the short run while we sort out the content of such a theory, but ultimately it seems we cannot avoid the question forever.
Idk, just my thinking, very thought-provoking post! Strong upvote, a conversation the community definitely ought to have.
I wonder if the implications of this kind of reasoning go beyond AI: indeed, you mention the incentive structure for AI as just being a special case of failing to incentivize people properly (e.g. the software executive), with the only difference being that AI operates at a scale which has the potential to drive extinction. But even in this respect, AI doesn't really seem unique: take the economic system as a whole, and "green" metrics, as a way to stave off catastrophic climate change. Firms, with the power to extinguish human life through slow processes like gradual climate destruction, will become incentivized towards methods of pollution that are easier to hide as regulations on carbon and greenhouse gases become more stringent. This seems like just a problem of an error-prone humanity having greater and greater control of our planet, and our technology and metrics, as a reflection of this, also being error-prone, only with greater and greater consequence for any given error.
Also, what do you see, more concretely, as a solution to this iterative problem? You argue for coming up with the right formalism for what we want, for example, as a way to do this, but this procedure is ultimately also iterative: we inevitably fail to specify our values correctly on some subset of scenarios, and then your reasoning applies equally to the meta-iteration procedure of specifying values and waiting to see what the specification does in real systems. Whether with RL from human feedback or RL from human formalism, a sufficiently smart agent deployed on a sufficiently hard task will always find unintended easy ways to optimize an objective, and hide them, vs. solving the original task. Asking that we "get it right", and figure out what we want, seems kind of equivalent to waiting for the right iteration of human feedback, except on a different iteration pipeline (and the two pipelines, to me, don't seem fundamentally different on the AGI scale).