The problem with constructing an aligned AI is that any active utility function or attempt at world optimization is likely to succumb to Goodhart's law in one of its many forms, as discussed here and elsewhere by the good people of MIRI. I wonder if a more passive approach is worth considering, or may have been considered already.
Humanity is a part of the Universe, and building and organizing accurate knowledge about the Universe is what science is: not using scientific knowledge to advance specific interests or goals, e.g. technological advancement or personal gain, but pursuing knowledge for its own sake. Such a scientifically minded agent would not be interested in modifying the Universe, and would limit any effects to the minimum needed to understand it. A part of this scientific research would be to understand humanity as deeply as possible, including what we humans imagine an aligned AI would look like, even though we do not fully understand it ourselves at this point.
Presumably at some point such an AI would understand the universe and the humans in it well enough to basically serve as a safe DWIM (do what I mean) genie. It would be inherently safe because doing anything unsafe, or agreeing to do anything unsafe, would mean that the genie does not understand the part of the Universe that is humanity. After all, we would not want to do anything that has unsafe and unintended consequences. "Unsafe" includes doing nothing at all: an AI that would prevent humans from doing anything would not understand humans, and so would not understand the universe. In other words:
Aligned AI is AI the scientist, not AI the engineer.
This is, of course, easier said than done. Learning all about the world while actively minimizing any impact on the world is something that we humans often strive to do when trying to understand the ecosystem of the Earth, with mixed results. Still, sometimes we succeed, and, odds are, so could an agent smarter than us.
So on the one hand this seems to echo Stuart Armstrong's take on building oracle AI (paper here, coauthored with Sandberg and Bostrom), where we might summarize the starting intuition as "build AI that is not an act-based agent so we avoid Goodharting in ways that pose an x-risk". On the other hand, I remain suspicious of the idea that we can avoid dangerous Goodharting. Optimizing for the measure of a variable rather than the variable itself is baked in at such a deep level that, if we think we've found a way around it, I'm inclined to think it more likely that we've fooled ourselves or failed to see far enough than that we've actually overcome Goodharting. Since you've just proposed an idea rather than something very specific I can't say much more, but I think approaches of this class are unlikely to work, and in this case specifically my thinking cashes out as predicting we'd never see this Scientist AI reach a point where we could trust it to do what we mean.
we'd never see this Scientist AI reach a point where we could trust it to do what we mean.
Quite possibly. But I suspect that means that we will not be able to trust any AI to DWIM.