I would like to propose an idea for value-loading in AI. It might sound unworkable at first, but I'm not sure it really is. Specifically, the idea is to make the AI's utility function a fixed piece of code that sends a message to a person asking about the desirability of some outcome, and then uses the reply as its preferences.
This requires a high degree of interpretability. Further, the part of the AI responsible for querying the person about the desirability of goals would need to be non-superintelligent; otherwise, the AI could potentially find a way to "hack" the human's brain to make them respond very wrongly to queries.
You might imagine this would be incredibly inefficient given how slowly people would answer the AI's queries. However, with the right optimization algorithm, I'm not sure this would be much of a problem. I don't see a reason to think it's impossible to make an AI perform well with an extremely slow objective function. For example, an optimization algorithm with a very slow objective function could be programmed to form fast approximations of it, use those approximations to plan, and only occasionally query the slow objective function when necessary.
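To make this concrete, here is a minimal toy sketch of that pattern. The `slow_objective` function is a stand-in for the expensive human query, and the nearest-neighbor surrogate and random candidate search are my own simplifying assumptions, not part of the proposal; the point is only that the optimizer plans cheaply against an approximation and spends real queries sparingly.

```python
import random

def slow_objective(x):
    """Stand-in for a very slow evaluation (e.g., asking a person).
    Hypothetical 'true preference': a quadratic peaking at x = 3."""
    return -(x - 3.0) ** 2

def surrogate(samples, x):
    """Cheap approximation of the slow objective: the value of the
    nearest point that has actually been queried."""
    nearest = min(samples, key=lambda s: abs(s[0] - x))
    return nearest[1]

def optimize(rounds=20, candidates_per_round=200, seed=0):
    rng = random.Random(seed)
    # Seed the surrogate with a few real (slow) queries.
    samples = [(x, slow_objective(x)) for x in (-10.0, 0.0, 10.0)]
    for _ in range(rounds):
        # Plan cheaply: score many candidates against the surrogate only.
        best = max((rng.uniform(-10.0, 10.0) for _ in range(candidates_per_round)),
                   key=lambda x: surrogate(samples, x))
        # Spend one slow query to check the plan and update the model.
        samples.append((best, slow_objective(best)))
    # Return the best point actually confirmed by the slow objective.
    return max(samples, key=lambda s: s[1])[0]
```

Each round costs only one slow evaluation, regardless of how many candidate plans were considered, which is the efficiency claim in the paragraph above.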
I think a big potential issue with the alignment approaches I've seen discussed is that they either rely on correct value loading to make the AI behave corrigibly, or otherwise require a learned model that encodes corrigibility. With those approaches, if you want your AI to avoid misbehaving and getting out of control, you need to hope that the learned model of preferences correctly and robustly penalizes incorrigible behavior. Given the general difficulty of reliably learning models, this is a challenge.
My technique (potentially) gets around this by making the AI's objective function literally a piece of code that texts the developers (or similar) about the desirability of an outcome and returns whatever they text back. The optimization algorithm would be hard-coded to call this function at least a minimum number of times and to update its models based on the results.
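A sketch of what that objective function and the hard-coded query requirement might look like, under my own assumptions: `ask` stands in for the hypothetical texting channel, the rating scale and the stub developer replies are made up for illustration, and the planner is deliberately trivial.

```python
def make_human_utility(ask):
    """Fixed utility function: forwards a proposed outcome to a person and
    returns their rating verbatim. `ask` is a hypothetical channel, e.g. a
    function that texts the developers and blocks until they reply."""
    def utility(outcome):
        reply = ask(f"How desirable is this outcome, from -10 to 10? {outcome}")
        return float(reply)
    return utility

def plan(candidate_outcomes, utility, min_queries=3):
    """Toy planner that is hard-coded to consult the real utility function
    on at least `min_queries` candidates before committing to any of them."""
    ratings = {o: utility(o) for o in candidate_outcomes[:min_queries]}
    # (A real planner would also keep a fast learned approximation of these
    # ratings and query the slow function again whenever it is uncertain.)
    return max(ratings, key=ratings.get)

# Stub developer who dislikes mass paperclip conversion (made-up data):
_replies = {
    "convert the world to paperclips": "-10",
    "make 100 paperclips": "5",
    "do nothing": "0",
}

def stub_ask(message):
    # Simulates the developers texting back a rating for the named outcome.
    return next(v for k, v in _replies.items() if k in message)

utility = make_human_utility(stub_ask)
best = plan(list(_replies), utility)  # -> "make 100 paperclips"
```

The key property is that `utility` contains no learned model at all; any candidate plan that reaches the planner must pass through the human before it can be preferred.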
This way, corrigibility is effectively hard-coded. The AI's algorithm is stuck clarifying the human's preferences about any impactful action and then updating its learned models accordingly.
For example, suppose an AI mistakenly thinks you would want it to convert the world into paperclips. In the course of forming a plan to do so, the AI would at some point need to query the person about the desirability of some of the plan's results. The AI would then (hopefully) see that these results aren't what people want, update its model of people's preferences, and stop planning to convert the world into paperclips.
This system isn't foolproof. If the AI has unreliable or overconfident reasoning, it could still do catastrophic damage. For example, it might fail to conceive of the possibility that people wouldn't want the world to be paperclips, and so go ahead and make it paperclips without querying anyone. But this is intended to provide corrigibility and value loading (for outer alignment), not reliability or limited impact. It also doesn't help with inner alignment.