But exactly how complex and fragile?

by Katja Grace (Meteuphoric) · 3 min read · 3rd Nov 2019 · 22 comments

Tags: Value Learning · Complexity of Value · AI Risk
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a post about my own confusions. It seems likely that other people have discussed these issues at length somewhere, and that I am not up with current thoughts on them, because I don’t keep good track of even everything great that everyone writes. I welcome anyone kindly directing me to the most relevant things, or if such things are sufficiently well thought through that people can at this point just correct me in a small number of sentences, I’d appreciate that even more.

~

The traditional argument for AI alignment being hard is that human value is ‘complex’ and ‘fragile’. That is, it is hard to write down what kind of future we want, and if we get it even a little bit wrong, most futures that fit our description will be worthless. 

The illustrations I have seen of this involve a person trying to write a description of value conceptual analysis style, and failing to put in things like ‘boredom’ or ‘consciousness’, and so getting a universe that is highly repetitive, or unconscious. 

I’m not yet convinced that this is world-destroyingly hard. 

Firstly, it seems like you could do better than imagined in these hypotheticals:

  1. These illustrations are from a while ago. If instead you used ML to learn what ‘human flourishing’ looked like in a bunch of scenarios, I expect you would get something much closer than if you try to specify it manually. Compare manually specifying what a face looks like and then generating examples from your description, versus using modern ML to learn what faces look like and generate them (a rough sketch of the learned approach follows this list).
  2. Even in the manually describing it case, if you had, say, a hundred people spend a hundred years writing a very detailed description of what is wanted, instead of a single writer spending an hour imagining ways that a more ignorant person might mess up if they spent no time on it, I could imagine the result actually being pretty close. I don’t have a good sense of how far away it is.
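
To make the face comparison concrete, here is a minimal sketch of the ‘learn it from examples’ route, assuming scikit-learn and its Olivetti faces dataset (illustrative choices of mine, not anything from the post): fit a simple learned model of "face space" and decode random points from it. Even this crude model produces blurry but recognizably face-like images, which a short hand-written description would struggle to match, and modern generative models do far better.

```python
# Sketch: learn a crude model of faces from examples (PCA / "eigenfaces") rather than
# writing down by hand what a face looks like. Dataset and component count are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA

faces = fetch_olivetti_faces().data            # 400 greyscale face images, 64x64, flattened
pca = PCA(n_components=50).fit(faces)          # learn a 50-dimensional "face space" from data

# Sample random points in the learned space (scaled to the data's spread) and decode them.
rng = np.random.default_rng(0)
codes = rng.normal(size=(4, 50)) * np.sqrt(pca.explained_variance_)
samples = pca.inverse_transform(codes)         # mean face + learned variations

fig, axes = plt.subplots(1, 4, figsize=(8, 2))
for ax, img in zip(axes, samples):
    ax.imshow(img.reshape(64, 64), cmap="gray")
    ax.axis("off")
plt.show()
```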

I agree that neither of these would likely get you to exactly human values.

But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything. 

This seems to be a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something, and b) assuming that there is a fast takeoff so that the relevant AI has its values forever, and takes over the world.

My guess is that values learned using ML, but still somewhat off from human values, are much closer in terms of not destroying all value in the universe than ones that a person tries to write down. Like, the kinds of errors people have used to illustrate this problem (forgetting to put in ‘consciousness is good’) are like forgetting to say faces have nostrils in trying to specify what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).

Perhaps a bigger thing for me though is the issue of whether an AI takes over the world suddenly. I agree that if that happens, lack of perfect alignment is a big problem, though not obviously an all value nullifying one (see above). But if it doesn’t abruptly take over the world, and merely becomes a large part of the world’s systems, with ongoing ability for us to modify it and modify its roles in things and make new AI systems, then the question seems to be how forcefully the non-alignment is pushing us away from good futures relative to how forcefully we can correct this. And in the longer run, how well we can correct it in a deep way before AI does come to be in control of most decisions. So something like the speed of correction vs. the speed of AI influence growing.
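
As a deliberately crude toy sketch of that race (my own illustration; the post does not propose a model): suppose AI influence adds some fixed amount of misalignment pressure each step, while correction processes (law, oversight, redesigning systems) remove some fraction of whatever has accumulated. The long-run level of misalignment then depends on the ratio of the two rates, rather than on whether alignment was exactly right to begin with.

```python
# Toy sketch (assumed, not from the post): accumulated misalignment under two competing rates.
def simulate_drift(drift_rate: float, correction_rate: float, steps: int = 1000) -> float:
    """Each step, AI influence adds `drift_rate` units of misalignment pressure, and
    correction processes remove a fraction `correction_rate` of the accumulated total."""
    misalignment = 0.0
    for _ in range(steps):
        misalignment += drift_rate               # AI influence pushing away from good futures
        misalignment *= 1.0 - correction_rate    # law, social pressure, redesign pulling back
    return misalignment

# The level it settles at is set by the ratio of the two rates, not by either one alone.
for drift, corr in [(0.1, 0.5), (0.1, 0.05), (0.5, 0.01)]:
    print(f"drift {drift} vs correction {corr}: settles near {simulate_drift(drift, corr):.2f}")
```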

These are empirical questions about the scales of different effects, rather than questions about whether a thing is analytically perfect. And I haven’t seen much analysis of them. To my own quick judgment, it’s not obvious to me that they look bad.

For one thing, these dynamics are already in place: the world is full of agents and more basic optimizing processes that are not aligned with broad human values—most individuals to a small degree, some strange individuals to a large degree, corporations, competitions, the dynamics of political processes. It is also full of forces for aligning them individually and stopping the whole show from running off the rails: law, social pressures, adjustment processes for the implicit rules of both of these, individual crusades. The adjustment processes themselves are not necessarily perfectly aligned, they are just overall forces for redirecting toward alignment. And in fairness, this is already pretty alarming. It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, or even that it won’t be a net boon for the side of alignment.

So then the largest remaining worry is that it will still gain power fast and correction processes will be slow enough that its somewhat misaligned values will be set in forever. But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.

~

Comments

But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything.

I think this is an oversimplification of the fragility argument, which people tend to use in discussion because there's some nontrivial conceptual distance on the way to a more rigorous fragility argument.

The main conceptual gap is the idea that "distance" is not a pre-defined concept. Two points which are close together in human-concept-space may be far apart in a neural network's learned representation space or in an AGI's world-representation-space. It may be that value is not very fragile in human-concept-space; points close together in human-concept-space may usually have similar value. But that will definitely not be true in all possible representations of the world, and we don't know how to reliably formalize/automate human-concept-space.

The key point is not "if there is any distance between your description and what is truly good, you will lose everything", but rather, "we don't even know what the relevant distance metric is or how to formalize it". And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.
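
A small sketch of that last point, using scikit-learn's toy digits dataset and plain Euclidean distance on raw pixels as the "mathematically simple metric" (both are illustrative choices of mine): even on tiny 8x8 images, the nearest neighbour under the simple metric is sometimes an image of a different digit, i.e. the metric's notion of "close" already crosses human concept boundaries, and the mismatch is far worse for richer data such as natural images.

```python
# Sketch: a mathematically simple distance metric (Euclidean distance on raw pixels)
# does not reliably track the human concept ("which digit is this?").
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import pairwise_distances

X, y = load_digits(return_X_y=True)        # 1797 handwritten digits, 8x8 pixels each
D = pairwise_distances(X)                  # plain Euclidean distance between all pairs
np.fill_diagonal(D, np.inf)                # ignore each image's distance to itself

nearest = D.argmin(axis=1)                 # nearest neighbour under the simple metric
mismatch = (y[nearest] != y).mean()
print(f"{mismatch:.1%} of images have a nearest pixel-space neighbour showing a different digit")
```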

I found this very helpful, thanks! I think this is maybe what Yudkowsky was getting at when he brought up adversarial examples here.

Adversarial examples are like adversarial Goodhart. But an AI optimizing the universe for its imperfect understanding of the good is instead like extremal Goodhart. So, while adversarial examples show that cases of dramatic non-overlap between human and ML concepts exist, it may be that you need an adversarial process to find them with nonnegligible probability. In which case we are fine.

This optimistic conjecture could be tested by looking to see what image *maximally* triggers an ML classifier. Does the perfect cat, the most cat-like cat according to ML, actually look like a cat to us humans? If so, then by analogy the perfect utopia according to ML would also be pretty good. If not...

Perhaps this paper answers my question in the negative; I don't know enough ML to be sure. Thoughts?

If you want to visualize features, you might just optimize an image to make neurons fire. Unfortunately, this doesn’t really work. Instead, you end up with a kind of neural network optical illusion — an image full of noise and nonsensical high-frequency patterns that the network responds strongly to.
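
The quoted behaviour is easy to reproduce with naive activation maximization. Here is a minimal sketch, assuming PyTorch and a pretrained torchvision classifier (my choices, not the commenter's): start from noise and run gradient ascent on the input image to maximize one class logit. Without the regularization tricks the feature-visualization literature adds, the result strongly triggers the classifier but looks like structured noise to a human, not like a cat.

```python
# Sketch: naive activation maximization. Gradient-ascend on the input image to maximize
# a classifier's logit for one class. Class index, step count and learning rate are
# illustrative; proper input normalization is skipped for brevity.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
target_class = 281                                   # "tabby cat" in standard ImageNet indexing

img = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    logit = model(img)[0, target_class]
    (-logit).backward()                              # ascend on the target logit
    optimizer.step()

# `img` now scores very highly as "tabby cat", but to a human eye it is noise-like,
# not cat-like -- the "neural network optical illusion" described in the quote.
print(f"final tabby-cat logit: {model(img)[0, target_class].item():.2f}")
```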

The natural response to this is "ML seems really good at learning good distance metrics".

And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.

Which is why you learn the distance metric. "Mathematically simple" rules for vision, speech recognition, etc. would all be very fragile, but ML seems to solve those tasks just fine.

One obvious response is "but what about adversarial examples"; my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.

Another response is "but there are lots of rewards / utilities that are compatible with observed behavior, so you might learn the wrong thing, e.g. you might learn influence-seeking behavior". This is the worry behind inner alignment concerns as well. This seems like a real worry to me, but it's only tangentially related to the complexity / fragility of value.

The natural response to this is "ML seems really good at learning good distance metrics".

No, no they absolutely do not seem...

my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.

... right, yes, that is exactly the issue here. They do not learn the things we care about. Whether ML is good at learning predictive distance metrics is irrelevant here; what matters is whether they are good at learning human distance metrics. Maybe throwing more data at the problem will make learned metrics converge to human metrics, but even if it did, would we reliably be able to tell?

The key point is that we don't even know what the relevant distance metric is. Even in human terms, we don't know what the relevant metric is. We cannot expect to be able to distinguish an ML system which has learned the "correct" metric from one which has not.

The key point is that we don't even know what the relevant distance metric is. Even in human terms, we don't know what the relevant metric is. We cannot expect to be able to distinguish an ML system which has learned the "correct" metric from one which has not.

This seems true, and also seems true for the images case, yet I (and I think most researchers) predict that image understanding will get very good / superhuman. What distinguishes the images case from the human values case? My guess at your response is that we aren't applying optimization pressure on the learned distance function for images.

In that case, my response would be that yes, if you froze in place the learned distance metric / "human value representation" at any given point, and then ratcheted up the "capabilities" of the agent, that's reasonably likely to go badly (though I'm not sure, and it depends how much the current agent has already been trained). But presumably the agent is going to continue learning over time.

Even in the case where we freeze the values and ratchet up the capabilities: you're presumably not aligned with me, but it doesn't seem like ratcheting up your capabilities obviously leads to doom for me. (It doesn't obviously not lead to doom either though.)

(and I think most researchers) predict that image understanding will get very good / superhuman. What distinguishes the images case from the human values case? My guess at your response is that we aren't applying optimization pressure on the learned distance function for images.

Good guess, but no. My response is that "image understanding will get very good" is completely different from "neural nets will understand images the same way humans do" or "neural nets will understand images such that images the net considers similar will also seem similar to humans". I agree that ML systems will get very good at "understanding" images in the sense of predicting motion or hidden pixels or whatever. But while different humans seem to have pretty similar concepts of what a tree is, it is not at all clear that ML systems have the same tree-concept as a human... and even if they did, how could we verify that, in a manner robust to both distribution shifts and Goodhart?

For friendliness purposes, it does not matter how well a neural net "understands" images/values, what matters is that their "understanding" be compatible with human understanding - in the sense that, if the human considers two things similar, the net should also consider them similar, and vice versa. Otherwise the fragility problem comes into play: two human-value-estimates which seem close together in the AI's representation may be disastrously different for a human.

I agree that ML systems will get very good at "understanding" images in the sense of predicting motion or hidden pixels or whatever.

... So why can't ML systems get very good at predicting what humans value, if they can predict motion / pixels? Or perhaps you think they can predict motion / pixels, but they can't e.g. caption images, because that relies on higher-level concepts? If so, I predict that ML systems will also be good at that, and maybe that's the crux.

But while different humans seem to have pretty similar concepts of what a tree is, it is not at all clear that ML systems have the same tree-concept as a human.

I'm also predicting that vision-models-trained-with-richer-data will have approximately the same tree-concept as humans. (Not exactly the same, e.g. they won't have a notion of a "Christmas tree", presumably.)

and even if they did, how could we verify that, in a manner robust to both distribution shifts and Goodhart?

I'm not claiming we can verify it. I'm trying to make an empirical prediction about what happens. That's very different from what I can guarantee / verify. I'd argue the OP is also speaking in this frame.


I'm trying to make an empirical prediction about what happens. That's very different from what I can guarantee / verify. I'd argue the OP is also speaking in this frame.

That may be the crux. I'm generally of the mindset that "can't guarantee/verify" implies "completely useless for AI safety". Verifying that it's safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn't guarantee it beforehand or double-check afterwards, that would just be called "AI".

I'm not saying we need proof-level guarantees for everything. Reasoning from strong enough priors would be ok, but saying "well, it seems like it'll probably be safe, but we can't actually verify our assumptions or reasoning" really doesn't cut it. Especially when we do not understand what the things-of-interest (values) even are, or how to formalize them.

I'm also predicting that vision-models-trained-with-richer-data will have approximately the same tree-concept as humans.

If we're saying that tree-concepts of vision-models-trained-with-richer-data will be similar to the human tree-concept according to humans, then I actually do agree with that. I do not expect it to generalize to values. (Although if we had a way to verify that the concepts match, I would expect the concept-match-verification method to generalize.) Here's a few different views on why I wouldn't expect it to generalize, which feel to me like they're all working around the edges of the same central idea:

  • In game/decision-theoretic terms, values depend on off-equilibrium behavior. They depend on counterfactual situations which will never actually happen.
  • In reductive terms, things in images can mostly be expressed as complicated clusters in atom-configuration space. Those clusters are directly relevant to predictive models, and they have predictive power. Values, and agency, aren't like that - we could model and predict the world just fine without assigning agency to any processes in it. (I suspect that a formalization of this distinction drops naturally out of a theory of abstraction, but that's still under construction.)
  • Humans can generally agree on what a tree is. Disagreements over values - or over what values even are - feel qualitatively different. From a human perspective, it feels like values and trees are defined in qualitatively different ways.

Again, if we had ways to guarantee/verify that a human and an ML system were using the same concepts, or had similar notions of "distance" and "approximation", then I do expect that would generalize from images to values. But I don't expect that methods which find human-similar concepts in images will also generally find human-similar concepts in values.

That may be the crux. I'm generally of the mindset that "can't guarantee/verify" implies "completely useless for AI safety". Verifying that it's safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn't guarantee it beforehand or double-check afterwards, that would just be called "AI".

Surely "the whole point of AI safety research" is just to save the world, no? If the world ends up being saved, does it matter whether we were able to "verify" that or not? From my perspective, as a utilitarian, it seems to me that the only relevant question is how some particular intervention/research/etc. affects the probability of AI being good for humanity (or the EV, to be precise). It certainly seems quite useful to be able to verify lots of stuff to achieve that goal, but I think it's worth being clear that verification is an instrumental goal not a terminal one—and that there might be other possible ways to achieve that terminal goal (understanding empirical questions, for example, as Rohin wanted to do in this thread). At the very least, I certainly wouldn't go around saying that verification is "the whole point of AI safety research."

Surely "the whole point of AI safety research" is just to save the world, no?

Suppose you're an engineer working on a project to construct the world's largest bridge (by a wide margin). You've been tasked with safety: designing the bridge so that it does not fall down.

One assistant comes along and says "I have reviewed the data on millions of previously-built bridges as well as record-breaking bridges specifically. Extrapolating the data forward, it is unlikely that our bridge will fall down if we just scale-up a standard, traditional design."

Now, that may be comforting, but I'm still not going to move forward with that bridge design until we've actually run some simulations. Indeed, I'd consider the simulations the core part of the bridge-safety-engineer's job; trying to extrapolate from existing bridges would be at most an interesting side-project.

But if the bridge ends up standing, does it matter whether we were able to guarantee/verify the design or not?

The problem is model uncertainty. Simulations of a bridge have very little model uncertainty - if the simulation stands, then we can be pretty darn confident the bridge will stand. Extrapolating from existing data to a record-breaking new system has a lot of model uncertainty. There's just no way one can ever achieve sufficient levels of confidence with that kind of outside-view reasoning - we need the levels of certainty which come with a detailed, inside-view understanding of the system.

If the world ends up being saved, does it matter whether we were able to "verify" that or not?

Go find an engineer who designs bridges, or buildings, or something. Ask them: if they were designing the world's largest bridge, would it matter whether they had verified the design was safe, so long as the bridge stood up?

That may be the crux. I'm generally of the mindset that "can't guarantee/verify" implies "completely useless for AI safety". Verifying that it's safe is the whole point of AI safety research. If we were hoping to make something that just happened to be safe even though we couldn't guarantee it beforehand or double-check afterwards, that would just be called "AI".

It would be nice if you said this in comments in the future. This post seems pretty explicitly about the empirical question to me, and even if you don't think the empirical question counts as AI safety research (a tenable position, though I don't agree with it), the empirical questions are still pretty important for prioritization research, and I would like people to be able to have discussions about that.

(Partly I'm a bit frustrated at having had another long comment conversation that bottomed out in a crux that I already knew about, and I don't know how I could have known this ahead of time, because it really sounded to me like you were attempting to answer the empirical question.)

Although it occurs to me that you might be claiming that empirically, if we fail to verify, then we're near-definitely doomed. If so, I want to know the reasons for that belief, and how they contradict my arguments, rather than whatever it is we're currently debating. (And also, I retract both of the paragraphs above.)

Re: the rest of your comment: I don't in fact want to have AI systems that try to guess human "values" and then optimize that -- as you said we don't even know what "values" are. I more want AI systems that are trying to help us, in the same way that a personal assistant might help you, despite not knowing your "values".

Sorry we wound up deep in a thread on a known crux. Mostly I just avoid timeline/prioritization/etc conversations altogether (on the margin I think it's a bikeshed). But in this case I read the OP as wondering why safety researchers were interested in the fragility argument, more than arguing over fragility itself.

As for AIs trying to help us rather than guessing human values... I don't really see how that circumvents the central problem? It sort-of splits off some of the nebulous, unformalized ideas which seem relevant into their own component, but we still end up with a bunch of nebulous, unformalized ideas which do not seem like the same kind of conceptual objects as "trees". We still need notions of wanting things, of agency, etc.

One obvious response is “but what about adversarial examples”; my position is that image datasets are not rich enough for ML to learn the human-desired concepts; the concepts they do learn are predictive, just not about things we care about.

To clarify, are you saying that if we had a rich enough dataset, the concepts they learn would be things we care about? If so, what is this based on, and how rich of a dataset do you think we would need? If not, can you explain more what you mean?

In the images case, I meant that if you had a richer dataset with more images in more conditions, accompanied with touch-based information, perhaps even audio, and the agent were allowed to interact with the world and see through these input mechanisms what the world did in response, then it would learn concepts that allow it to understand the world the way we do -- it wouldn't be fooled by occlusions, or by putting picture of a baseball on top of an ocean picture, etc. (This also requires a sufficiently large dataset; I don't know how large.)

I'm not saying that such a dataset would lead it to learn what we value. I don't know what that dataset would look like, partly because it's not clear to me what exactly we value.

There's a distinction worth mentioning between the fragility of human value in concept space, and the fragility induced by a hard maximizer running after its proxy as fast as possible.

Like, we could have a distance metric whereby human value is discontinuously sensitive to nudges in concept space, while still being OK practically (if we figure out eg mild optimization). Likewise, if we have a really hard maximizer pursuing a mostly-robust proxy of human values, and human value is pretty robust itself, bad things might still happen due to implementation errors (the AI is incorrigibly trying to accrue human value for itself, instead of helping us do it).

So then the largest remaining worry is that it will still gain power fast and correction processes will be slow enough that its somewhat misaligned values will be set in forever. But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years. This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.

I think this argument mostly holds in the case of proxy alignment, but fails in the case of deceptive alignment. If a model is deceptively aligned, then I don't think there is any reason we should expect it to be only "somewhat misaligned"—once a mesa-optimizer becomes deceptive, there's no longer optimization pressure acting to keep its mesa-objective in line with the base, which means it could be totally off, not just slightly wrong. Additionally, a deceptively aligned mesa-optimizer might be able to do things like gradient hacking to significantly hinder our correction processes.

Also, I think it's worth pointing out that deception doesn't just happen during training: it's also possible for a non-deceptive proxy aligned mesa-optimizer to become deceptive during deployment, which could throw a huge wrench in your correction processes story. In particular, non-myopic proxy aligned mesa-optimizers "want to be deceptive" in the sense that, if presented with the strategy of deceptive alignment, they will choose to take it (this is a form of suboptimality alignment). This could be especially concerning in the presence of an adversary in the environment (a competitor AI, for example) that is choosing its output to cause other AIs to behave deceptively.

This Facebook post has the best discussion of this I know of; in particular check out Dario's comment and the replies to it.

I wonder if Paul Christiano ever wrote down his take on this, because he seems to agree with Eliezer that using ML to directly learn and optimize for human values will be disastrous, and I'm guessing that his reasons/arguments would probably be especially relevant to people like Katja Grace, Joshua Achiam, and Dario Amodei.

I myself am somewhat fuzzy/confused/not entirely convinced about the "complex/fragile" argument and even wrote kind of a counter-argument a while ago. I think my current worries about value learning or specification have less to do with the "complex/fragile" argument and more to do with what might be called "ignorance of values" (to give it an equally pithy name), which is that humans just don't know what our real values are (especially when applied to unfamiliar situations that will come up in the future), so how can AI designers specify them or how can AIs learn them?

People try to get around this by talking about learning meta-preferences, e.g., preferences for how to deliberate about values, but that's not some "values" that we already have and the AI can just learn, but instead a big (and I think very hard) philosophical and social science/engineering project to try to figure out what kinds of deliberation would be better than other kinds or would be good enough to eventually lead to good outcomes. (ETA: See also this comment.)

It’s not obvious to me that imperfectly aligned AI is likely to be worse than the currently misaligned processes, or even that it won’t be a net boon for the side of alignment.

My own worry is less that "imperfectly aligned AI is likely to be worse than the currently misaligned processes" and more that the advent of AGI might be the last good chance for humanity to get alignment right (including addressing the "human safety problem"), and if we don't do a good enough job (even if we improve on the current situation in some sense) we'll be largely stuck with the remaining misalignment because there won't be another opportunity like it. ETA: A good slogan for this might be "AI risk as the risk of missed opportunity".

This again seems like an empirical question of the scale of different effects, unless there is an argument that some effect will be totally overwhelming.

I'm not entirely sure I understand this sentence, but this post might be relevant here: https://www.lesswrong.com/posts/Qz6w4GYZpgeDp6ATB/beyond-astronomical-waste.

But secondly, I’m not sure about the fragility argument: that if there is basically any distance between your description and what is truly good, you will lose everything. 
This seems to be a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something, and b) assuming that there is a fast takeoff so that the relevant AI has its values forever, and takes over the world.

When I think of the fragility argument, I usually think in terms of Goodhart's Taxonomy. In particular, we might deal with--

  • Extremal Goodhart -- Human values are already unusually well-satisfied relative to what is normal for this universe, and pushing proxies of our values to the extremes might inadvertently move the universe away from that in some way we didn't consider (a toy numerical sketch of this follows the list)
  • Adversarial Goodhart -- The thing that matters which is absent from our proxy is absolutely critical for satisfying our values and requires the same kinds of resources that our proxy relies on
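
A toy numerical sketch of the extremal case (my own illustration, not the commenter's): suppose true value depends on three goods with diminishing returns, all of them critical, while the written-down proxy forgets one of them. Under mild optimization pressure the proxy and the true value track each other tolerably well; pushed to the extreme, the proxy keeps climbing while the true value collapses.

```python
# Toy sketch of extremal Goodhart: a proxy that omits one critical term looks fine under
# mild optimization and becomes catastrophic under extreme optimization.
import numpy as np

def true_value(x):
    # log utility over three goods: every good matters, driving any one toward zero is catastrophic
    return float(np.sum(np.log(x)))

def proxy(x):
    # the written-down proxy forgot the third good (e.g. "we forgot to say consciousness is good")
    return float(np.log(x[0]) + np.log(x[1]))

even = np.array([3.0, 3.0, 3.0])            # an unoptimized, even-handed split of a budget of 9
proxy_optimum = np.array([4.5, 4.5, 1e-6])  # where a hard maximizer of the proxy ends up

print(f"{'pressure':>8} {'proxy':>8} {'true':>8}")
for t in [0.0, 0.5, 0.9, 0.99, 1.0]:        # how far the optimizer has pushed toward the proxy optimum
    x = (1 - t) * even + t * proxy_optimum
    print(f"{t:>8} {proxy(x):>8.2f} {true_value(x):>8.2f}")
```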

My impression is that our values are complex enough that they have a lot of distinct, absolutely critical pieces that are hard to pin down even if you try really hard. I mainly think this because I once tried imagining how to make an AGI that optimizes for 'fulfilling human requests' and realized that 'fulfill', 'human' and 'request' all had such complicated and fragile definitions that it would take me an extremely long time to pin down what I meant. And I wouldn't be confident in the result I made after pinning things down.

While I don't find this kind of argument fully convincing, I think it's more powerful than ' a) based on a few examples of discrepancies between written-down values and real values where the written down values entirely exclude something'.

That being said, I agree with b). I also lean toward the view that Slow Take-Off plus Machine-Learning may allow non-catastrophic "good enough" solutions to human value problems.

My guess is that values learned using ML, but still somewhat off from human values, are much closer in terms of not destroying all value in the universe than ones that a person tries to write down. Like, the kinds of errors people have used to illustrate this problem (forgetting to put in ‘consciousness is good’) are like forgetting to say faces have nostrils in trying to specify what a face is like, whereas a modern ML system’s imperfect impression of a face seems more likely to meet my standards for ‘very facelike’ (most of the time).

I agree that Machine-Learning will probably give us better estimates of human flourishing than trying to write down the values themselves. However, I'm still very apprehensive about it unless we're also being very careful about slow take-off. The main reasons for this apprehensiveness come from Rohin Shah's sequence on Value Learning (particularly ambitious value-learning). My main take-away from this was: Learning human values from examples of humans is hard without writing down some extra assumptions about human values (which may leave something important out).

Here's a practical example of this: If you create an AI that learns human values from a lot of examples of humans, what do you think its stance will be on Person-Affecting Views? What will its stance be on value-lexicality responses to Torture vs. Dust-Specks? My impression is that you'll have to write down something to tell the AI how to decide these cases (when should we categorize human behaviors as irrational vs when should we not). And a lot of people may regard the ultimate decision as catastrophic.

There are other complications too. If the AI can interact with the world in ways that change human values and then updates to care about those changed values, strange things might happen. For instance, the AI might pressure humanity to adopt simpler, easier to learn values if it's agential. This might not be so bad but I suspect there are things the AI might do that could potentially be very bad.

So, because I'm not that confident in ML value-learning and because I'm not that confident in human values in general, I'm pretty skeptical of the idea that machine-learning will avert extreme risks associated with value misspecification.

Corrigibility is another reason to think that the fragility argument is not an impossibility proof: If we can make an agent that sufficiently understands and respects the human desire for autonomy and control, then it would presumably ask for permission before doing anything crazy and irreversible, so we would presumably be able to course-correct later on (even with fast/hard takeoff).

For one thing, these dynamics are already in place: the world is full of agents and more basic optimizing processes that are not aligned with broad human values—most individuals to a small degree, some strange individuals to a large degree, corporations, competitions, the dynamics of political processes.

I don't think of this as evidence that unaligned AI is not dangerous. Arguably we're already seeing bad effects from unaligned AI, such as effects on public discourse as a result of newsfeed algorithms. Further, anything that limits the impact of unaligned action now seems largely the result of existing agents being of relatively low or similar power. Even the most powerful actors in the world right now can't effectively control much of the world (e.g. no government has figured out how to eliminate dissent, no military how to stop terrorists, etc.). I expect things to look quite different if we develop an actor that is more powerful than a majority of all other actors combined, even if it develops into that power slowly because the steps along the way to that seem individually worth the tradeoff.

But it isn’t obvious to me that by that point it isn’t sufficiently well aligned that we would recognize its future as a wondrous utopia, just not the very best wondrous utopia that we would have imagined if we had really carefully sat down and imagined utopias for thousands of years.

To our ancestors we would appear to live in a wondrous utopia (bountiful food, clean water, low disease, etc.), yet we still want to do better. I think there will be suffering so long as we are not at the global maximum and anyone realizes this.

I think that fully specifying human values may not be the best approach to an AI utopia. Rather, I think it would be easier and safer to tell the AI to upload humans and run an Archipelago-esque simulated society in which humans are free to construct and search for the society they want, free from many practical problems in the world today such as resource scarcity.