I wrote this essay in early August. I now consider the presentation to be somewhat confused, and now better understand where problems arise within the "standard alignment model." I'm publishing a somewhat edited version, on the grounds that something is better than nothing.
Summary: Consider the argument: "Imperfect value representations will, in the limit of optimization power, be optimized into oblivion by the true goal we really wanted the AI to optimize." But... I think my brother cares about me in some human and "imperfect" way. I also think that the future would contain lots of value for me if he were a superintelligent dictator (this could be quite bad for other reasons, of course).
Therefore, this argument seems to prove too much. It seems like one of the following must be true:
- My brother cares about me in the "perfect" way, or
- Dictator-brother would do valueless things according to my true values, or
- The argument is wrong.
To explore these points, I dialogue with my model of Eliezer.
Suppose you win a raffle and get to choose one of n prizes. The first prize is a book with true value 10, but your evaluation of it is noisy (drawn from the Gaussian N(10,1), with standard deviation 1). The other n-1 prizes are widgets with true value 1, but your evaluation of them is noisier (drawn from N(1,16), with standard deviation 4). As n increases, you’re more likely to select a widget and lose out on 10-1=9 utility. By considering so many options, you’re selecting against your own ability to judge prizes by implicitly selecting for high noise. You end up “optimizing so hard” that you delude yourself. This is the Optimizer’s Curse.
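This is easy to check numerically. Here's a minimal simulation of the raffle (the trial count and the helper name `widget_pick_rate` are my own choices, not from the original example):

```python
import random

random.seed(0)

def widget_pick_rate(n, trials=2_000):
    """Estimate how often the noisy evaluator picks a widget over the book."""
    widget_wins = 0
    for _ in range(trials):
        book_score = random.gauss(10, 1)  # book: true value 10, noise sd 1
        # n-1 widgets: true value 1, noise sd 4; we'd pick the best-looking one
        best_widget = max(random.gauss(1, 4) for _ in range(n - 1))
        if best_widget > book_score:
            widget_wins += 1
    return widget_wins / trials

# As n grows, the chance of walking away with a widget (losing ~9 utility) rises.
small_n_rate = widget_pick_rate(10)
large_n_rate = widget_pick_rate(1000)
```

With these parameters, a single widget rarely out-scores the book, but the maximum over 999 widgets usually does: the selection process amplifies the noisiest evaluations.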
You’re probably already familiar with Goodhart’s Law, which applies when an agent optimizes a proxy U (e.g. how many nails are produced) for the true quantity V which we value (e.g. how profitable the nail factory is).
Goodhart’s Curse is their combination. According to the article, optimizing a proxy measure U over a trillion plans can lead to high regret under the true values V, even if U is an unbiased but noisy estimator of the true values V. This seems like bad news, insofar as it suggests that even getting an AI which understands human values on average across possibilities can still produce bad outcomes.
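To see the Curse in miniature: even when the proxy U is an unbiased estimate of V (U = V + zero-mean noise), selecting the plan with the highest U selects for high noise, so the chosen plan's true value falls short of the best available. A small sketch (the plan count and noise scale are illustrative, not from the article):

```python
import random

random.seed(0)

N_PLANS = 100_000

# True values V of each plan, and unbiased but noisy proxy scores U = V + noise.
true_values = [random.gauss(0, 1) for _ in range(N_PLANS)]
proxy_scores = [v + random.gauss(0, 3) for v in true_values]

# Optimize the proxy: pick the plan with the highest U.
chosen = max(range(N_PLANS), key=proxy_scores.__getitem__)

# Regret under the true values V.
regret = max(true_values) - true_values[chosen]
```

With noise this large relative to the signal, the regret is typically substantial: the plan that looks best under U is almost never the plan that is best under V.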
Below is a dialogue with my model of Eliezer. I wrote the dialogue to help me think about the question at hand. I put in work to model his counterarguments, but ultimately make no claim to have written him in a way he would endorse.
An obvious next question is "Why not just define the AI such that the AI itself regards [a proxy measure] as an estimate of [true human values], causing the AI's [proxy measure] to more closely align with [true human values] as the AI gets a more accurate empirical picture of the world?"
Reply: Of course this is the obvious thing that we'd want to do. But what if we make an error in exactly how we define "treat [a proxy measure] as an estimate of [true human values]"? Goodhart's Curse will magnify and blow up any error in this definition as well. — Goodhart’s Curse
Alex (A): Suppose that I can get smarter over time—that AGI just doesn’t happen, that the march of reason continues, that lifespans extend, and that I therefore gradually become very old and very smart. Goodhart’s Curse predicts that—even though I think I want to help my brother be happy, even though I think I value his having a good life by his own values—my desire to help him is not “error-free”, it is not perfect, and so Goodhart’s Curse will magnify and blow up these errors. And time unfolds, the Curse is realized, and I bring about a future which is high-regret by his “true values”. Insofar as Goodhart’s Curse has teeth as an argument for AI risk, it seems to predict that my brother will deeply regret my optimization.
This prediction seems flatly wrong: I wouldn’t bring about an outcome like that. Why do I believe that? Because I have reasonably high-fidelity access to my own policy, via imagining myself in the relevant situations. I have information about whether I would e.g. improve myself such that "improved" Alex screws over his brother. I simply imagine being faced with the choice, and through my imagination, I realize I wouldn't want to improve myself that way.
Alex’s model of Eliezer (A-EY): What a shocking observation—humans pursue human values, like helping their relatives.
A: The point is not that I value what I value. The point is that, based on introspective evidence, I value my brother achieving his values. If it’s really so difficult to point to what other agents want, why should my pointer to his values be so “robust” (whatever that means)?
A-EY: This is surprising why? Kin cooperation was heavily selected for in the ancestral environment. You, peering out from inside of your own human mind, perceive the apparent simplicity of caring about your brother, because that’s a thing your brain was natively built to do. You do not appreciate the generations upon generations of selection pressure which accreted complex and fine-tuned genetic machinery into your source code.
A: Indeed. And even though evolution couldn’t get us to value inclusive genetic fitness, evolution somehow found an adaptation for caring about kin, such that this caring generalizes off-distribution into the radically different modern environment, such that Goodhart’s Curse will be unable to drastically blow up the way I care for my brother, because I care about my brother’s preferences in the requisite “error-free” way.
That doesn’t sound like a thing which happens in reality. Seems more likely that the Curse doesn’t have the alignment implications which I perceive you to be claiming (e.g. that imperfect motivations get Goodharted to oblivion in the limit).
A-EY: What’s your point? Suppose I agreed that you are not currently falling victim to Goodhart’s Curse, at least in the sense of not bringing about outcomes your brother strongly disvalues. What next?
A: The point isn’t that I’m avoiding Goodhart’s Curse right now. I’m further claiming I won’t self-improve into an entity which foreseeably-to-current-me breaks the “I care about my brother” invariant. Therefore, insofar as I can become more intelligent, I will self-improve into an entity which doesn’t bring about horrible outcomes for my brother.
(And yes, if I considered a trillion zillion plans for improving myself and assigned each one a rating—a rather foolish procedure, all things considered—the highest-rated plan wouldn’t be the actually-best plan. The optimizer’s curse applies. Furthermore, the highest-rated plan might be worse than doing nothing, insofar as I consider plans which my search implicitly optimizes to look deceptively good to me. Since I already know that that decision-making procedure sucks, I just won’t use that procedure.)
This smarter version of myself poses a problem to certain implications of Goodhart’s Curse. You claim that it’s really hard to point AI to care about other entities’ values. But if we look at actual reality, at the one empirical example ever of general intelligence, there are literally billions of examples of those intelligences caring about each other every day. And, speaking for myself, I wouldn’t do something stupid like “consider a zillion plans and then choose the best one, with the foreseeable result of screwing over my brother due to my imperfect pointer to his values.”
A-EY: Why do you keep saying that humans care about each other? Of course humans care about each other.
A: Oh? And why’s that?
A-EY: Because it was selected for. Heard of evolution?
A: That’s not an explanation for why the mechanism works, that’s an explanation of how the mechanism got there. It’s like if I said “How can your car possibly be so fast?” and you said “Of course my car is fast, I just went to the shop. If they hadn’t made my car go faster, they’d probably be a bad shop, and would have gone out of business.”
A-EY: Sure. I don’t know why the mechanism works, and we probably won’t figure it out before DeepAI kills everyone. As a corollary, you don’t understand the mechanism either. So what are we debating?
A: I disagree with your forecast, but that’s beside the point. The point is that, by your own writing (as of several years ago), Goodhart’s Curse predicts that “imperfect” pointing procedures will produce optimization which is strongly disvalued by the preferences which were supposed to be pointed to. However, back in actual reality, introspective evidence indicates that I wouldn’t do that. From this I infer that some step of the Curse is either locally invalid, wrongly framed, or inapplicable to the one example of general intelligence we’ve ever seen.
A-EY: As I stated in the original article, mild optimization seemed (back in the day) like a possible workaround. That is, humans don’t do powerful enough search to realize the really bad forms of the Curse.
A: I do not feel less confused after hearing your explanation. You say phrases like “powerful enough search” and “mild optimization.” But what do those phrases mean? How do I know that I’m even trying to explain something which really exists, or which will probably exist—that unless we get “mild optimizers”, we will hit “agents doing powerful search” such that “imperfections get blown up”? Why should I believe that Goodhart’s Curse has the implications you claim, when the “Curse” seems like a nothingburger in real life as it applies to me and my brother?
A-EY: [Sighs] We have now entered the foreseeable part of the conversation where my interlocutor fails to understand how intelligence works. You peer out from your human condition, from the messy heuristics which chain into each other, and fail to realize the utter dissimilarity of AI. Of how Utility will sharpen and unify messy heuristics into coherence, pulling together disparate strands of cognition into a more efficient whole. You introspect upon your messy and kludgy human experience and, having lived nothing else, expect the same from AI.
AI will not be like you. AI will not think like you do. AI will not value like you do. I don’t know why this is so hard for people to understand.
A: Ad hominem, and an appeal to your own authority, without making specific counterarguments against my point. Also, “humans are a mess” is not an actual explanation but a profession of ignorance (more precisely, a hypothesis class containing a range of high-complexity hypotheses), nor does that statement explain why your Goodhart’s Curse arguments shouldn’t blow up my ability to successfully care about my brother.
Let’s get back to the substance: I’m claiming that your theory makes an introspectively-apparent-to-me misprediction. You’re saying that your theory doesn’t apply to this situation, for reasons which I don’t currently understand and which have not been explained to my satisfaction. I'm currently inclined to conclude that the misprediction-generator (i.e. "value pointers must be perfect else shattering of all true value") is significantly less likely to apply to real-world agents. (And that insofar as this is a hole in your argument which no one else noticed, there are probably more holes in other alignment arguments.)
A-EY: [I don’t really know what he’d say here]
Again, I wrote this dialogue to help me think about the issue. It seems to me like there's a real problem here with arguments like "Imperfect motivations get Goodharted into oblivion." Seems just wrong in many cases. See also Katja's recent post section "Small differences in utility functions may not be catastrophic."
Thanks to Abram Demski and others for comments on a draft of this post.
When I first wrote this dialogue, I may have swept difficulties under the rug like "augmenting intelligence may be hard for biological humans to do while preserving their values." I think the main point should still stand.
We can also swap out "I bring about a good future for my brother" with "my brother brings about a good future for me, and I think that he will do a good job of it, even though he presumably doesn't contain a 'perfect' motivational pointer to my true values."
I was actually thinking about this issue recently, though I hadn't heard the term "optimizer's curse" yet. Regardless, there is a fairly straightforward solution related to empowerment.
The key issue is uncertainty in value estimates. But where does that uncertainty primarily arise? For most agents and most complex real-world decisions, the primary source of value uncertainty is model prediction uncertainty, which compounds over time. However, the fact that uncertainty increases farther into the planning horizon also implies that uncertainty about specific future dates decreases as time advances and those dates get closer to the present.
The problem with the optimizer's curse flows from the broken assumption that a state's utility should be the deterministic maximum utility of its successor states. Take something like your example, where we have 10 future states B0, B1, ..., B9 with noisy expected-utility estimates. But we aren't actually making that decision now; it is a future decision. Instead, say that in the current moment the agent is deciding between 2 current options, A0 and A1 - neither of which has immediate utility, but A0 enables all of the B options while A1 enables only one: B9. Now let's say that B9 happens to have the highest predicted expected utility - naturally, only due to prediction noise. Computing the chained EU of the A states from the max discounted successor utility (as in Bellman recursion) assigns the same utility to A0 and A1, which is clearly wrong.
One simple potential fix is to use a softmax rather than a max, which then naturally favors A0 strongly, as it enables more future successor options - and would usually favor A0 even if it only enabled the first 9 B options.
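As a toy illustration of the max-vs-softmax backup (a sketch under my own assumptions: `softmax_backup` here is a log-sum-exp "soft" maximum, and the numbers are arbitrary):

```python
import math
import random

random.seed(0)

# Ten successor states B0..B9, all equally good in truth, seen through noise.
estimates = [1.0 + random.gauss(0, 4) for _ in range(10)]
best = max(range(10), key=estimates.__getitem__)  # the "B9" of the story

def max_backup(successor_utils):
    # Standard Bellman-style backup: state value = max successor value.
    return max(successor_utils)

def softmax_backup(successor_utils, temp=1.0):
    # Log-sum-exp "soft" maximum: >= the hard max, and grows with optionality.
    m = max(successor_utils)
    return m + temp * math.log(sum(math.exp((u - m) / temp) for u in successor_utils))

a0_successors = estimates            # A0 keeps every B option open
a1_successors = [estimates[best]]    # A1 commits to the single best-looking B
```

Under the hard-max backup, A0 and A1 get identical values (both inherit the best-looking B's score); the soft-max backup gives A0 strictly more value, precisely because it keeps more options open.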
At a more abstract level, the solution is to recognize the value of future information, and how that causes intermediate states with higher optionality to have more value - because those states benefit the most from future information. In fact, at some point optionality/empowerment becomes the entirety of the corrected utility function, which is just another way of arriving at instrumental convergence to empowerment.
Interestingly, AIXI also makes all the right decisions here, even though it is an argmaxer - but only because it considers the full distribution of all possible world-models, and chooses the max-expected-utility decision only after averaging over all world-models. So it chooses A0 because, in most worlds, the winning pathways go through A0.
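A minimal sketch of that averaging (the world-models, weights, and utilities are all invented for illustration): per-model argmaxing can favor committing to B9 in the world where B9 really is best, but taking the expectation over world-models first favors keeping options open.

```python
# Hypothetical posterior over three world-models, each assigning a utility to
# A0 (keep options open) and A1 (commit to B9).
world_models = [
    {"weight": 0.4, "utility": {"A0": 10.0, "A1": 2.0}},   # B9 is a dud here
    {"weight": 0.4, "utility": {"A0": 10.0, "A1": 1.0}},   # ...and here
    {"weight": 0.2, "utility": {"A0": 3.0,  "A1": 12.0}},  # B9 really is best here
]

def expected_utility(action):
    # Average over world-models *before* comparing actions, AIXI-style.
    return sum(m["weight"] * m["utility"][action] for m in world_models)

choice = max(["A0", "A1"], key=expected_utility)  # A0 wins on average
```

Here EU(A0) = 0.4·10 + 0.4·10 + 0.2·3 = 8.6 beats EU(A1) = 0.4·2 + 0.4·1 + 0.2·12 = 3.6, even though A1 is the better pick in one possible world.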
Applying these lessons to human utility functions results in the realization that external empowerment is almost all we need.
Empowerment is difficult in alignment contexts because humans are not rational utility maximizers. You might risk empowering humans to make a mistake.
Also, taken too far, you run into problems with eudaimonia. We probably wouldn't want AI to remove all challenge.
I mostly agree with that tradeoff - a perfect humanity-empowering agent could still result in sub-optimal futures, relative to what a theoretical fully aligned sovereign could achieve, if it empowers us and we then make mistakes. But that really doesn't seem so bad, and it also may not be likely, as empowering us probably entails helping us with better future modeling.
In practice the closest we may get to a fully aligned sovereign is some form of uploading, because practical strong optimization in full alignment with our brain's utility function probably requires extracting and empowering functional equivalents to much of the brain's valence/value circuits.
So the ideal scenario is probably AI that helps us upload and then hands over power.
It seems potentially extremely bad to me, since power could cause e.g. death, maiming or torture if wielded wrong.
"that" here refers to "a perfect humanity empowering agent" which hands power over to humanity. In that sense it's not that different from us advancing without AI. So if you think that's extremely bad because you are assuming only a narrow subset of humanity is empowered, well, that isn't what I meant by "a perfect humanity empowering agent". If you still think that's extremely bad even if humanity is empowered broadly then you seem to just think that humanity advancing without AI would be extremely bad. In that case I think you are expecting too much of your AI and we have more fundamental disagreements.
Humans usually put up lots of restrictions that reduce empowerment in favor of safety. I think we can be excessive about such restrictions, but I don't think they are always a bad idea, and instead think that if you totally removed them, you would probably make the world much worse. Examples of things that seem like a good idea to me:
And the above are just things that are mainly designed to protect you from yourself. If we also count disempowering people to prevent them from harming others, then I support bans and limits on many kinds of weapon sales, and I think it would be absolutely terrible if an AI taught people a simple way to build a nuke in their garage.
Your examples are just examples of empowerment tradeoffs.
Fences that prevent you from falling off stairs can be empowering because death or disability are (maximally, and extremely) disempowering.
Same with drugs and sockets. Precommitting to a restriction on your future ability to use some dangerous, addictive drug can increase empowerment, because addiction is highly disempowering. I don't think you are correctly modelling long-term empowerment.
I think that in order to model this as generally disempowering, you need a model of human irrationality: if you instead model humans as rational utility maximizers, we wouldn't make the major, simple, avoidable mistakes that we would need protection from.
But modelling human irrationality seems like a difficult and ill-posed problem, which contains most of the difficulty of the alignment problem.
The difficulty this leads to in practice is what to write as "empowerment" in your AI's utility function: how do you specify that it is human-level rationality that must be empowered, rather than ideal utility maximizers?
My comment began as a discussion of why practical agents are not really utility argmaxers (due to the optimizer's curse).
You do not need to model human irrationality and it is generally a mistake to do so.
Consider a child who doesn't understand that the fence is to prevent them from falling off stairs. It would be a mistake to optimize for the child's empowerment using their limited irrational world model. It is correct to use the AI's more powerful world model for computing empowerment, which results in putting up the fence (or equivalent) in situations where the AI models that as preventing the child from death or disability.
Likewise for the other scenarios.
I usually don't consider this a problem, since I have different atomic building blocks for my value set.
However, if I was going to criticize it, I'd criticize the fact that inner-alignment issues incentivize it to deceive us.
It's still an advance. If the core claims are correct, then it solves the entire outer alignment problem in one go, including Goodhart problems.
Now, I get the skepticism about this solution, because from the outside view, someone solving a major problem with their pet theory almost never happens, and a lot of such efforts have turned out not to work.
If you are talking about external empowerment I wasn't the first to write up that concept - that credit goes to Franzmeyer et al. Admittedly my conception is a little different and my writeup focuses more on the longer term consequences, but they have the core idea there.
If you are talking about how empowerment arises naturally from just using correct decision-making under uncertainty, in situations where the future value of information improves subsequent future value estimates - that idea may be more novel, and I'll probably write it up if it isn't so novel that it has non-epsilon AI-capability value. (Some quick Google searches reveal some related 'soft'-decision RL approaches that seem similar.)
Franzmeyer, Tim, Mateusz Malinowski, and João F. Henriques. "Learning Altruistic Behaviours in Reinforcement Learning without External Rewards." arXiv preprint arXiv:2107.09598 (2021). ↩︎
First, I think you underestimate the selection pressures your "caring about your brother" function has been under. That's not a mechanistic argument, I know, but bear with me here.
That function wasn't produced by evolution, which you know. It wasn't produced by the reward circuitry either, nor your own deliberations. Rather, it was produced by thousands of years of culture and adversity and trial-and-error.
A Stone Age or a medieval human, if given superintelligent power, would probably make life miserable for their loved ones, because they don't have the sophisticated insights into psychology and moral philosophy and meta-cognition that we use to implement our "caring" function. Stone-Agers/medievals would get hijacked by some bright idea or ideology or a mistaken conception of what they imagine their loved ones want or ought to want, and would just inflict that on them. You might object that this means they don't "really" care — well yes, that's the point!
And that sort of thing has been happening for generations, on a smaller scale. And is still happening: I expect that if you chose a random pair of people who'd profess to (and genuinely believe to) be each other's loved ones, and then you amplified one of them to superintelligent godhood, they'd likely make the second one's life fairly miserable.
Most people have not actually thought all of this through to the extent you or other LW-types have; they're not altruistic in the correct nuanced way, and don't have enough meta-cognition to recognize when they're screwing up on that. Their utopias would be fairly dystopian or lethal for most people. They would not course-correct after their victims object, either: they don't have enough meta-cognition for that either. They'd dig their heels in. If not impulsively self-modify into a monster on the spot.
The reason some of the modern people, who'd made a concentrated effort to become kind, can fairly credibly claim to genuinely care for others, is because their caring functions are perfected. They'd been perfected by generations of victims of imperfect caring, who'd pushed back on the imperfections, and by scientists and philosophers who took such feedback into account and compiled ever-better ways to care about people in a way that care-receivers would endorse. And care-receivers having the power to force the care-givers to go along with their wishes was a load-bearing part in this process.
It doesn't answer the question of how it works, yes. But the argument that it's trivial or simple is invalid. Plausibly, there might not actually be a billion examples of this.
But to the extent that there are people who genuinely care about others in a way we here would recognize as "real", in a way that's robust to intelligence amplification, I think they do have perfect caring functions, or a perfect pointer to a caring function, or something. Some part of the system is perfect.
The claim about Stone Age people seems probably false to me - I think if Stone Age people could understand what they were actually doing (not at the level of psychology or morality, but at the purely "physical" level), they would probably do lots of very nice things for their friends and family, in particular give them a lot of resources. However, even if it is true, I don't think the reason we have gotten better is because of philosophy - I think it's because we're smarter in a more general way. Stone Age people were uneducated and had less good nutrition than us; they were literally just stupid.
Education is part of what I'm talking about. Modern humans iterate on the output of thousands of years of cultural evolution, their basic framework of how the world works is drastically different from the ancestral ones. Implicit and explicit lessons of how to care about people without e. g. violating their agency come part and parcel with it.
At the basic level, why do you think that their idea of "nice things" would be nuanced enough to understand that, say, non-consensual wireheading is not a nice thing? Some modern people don't.
Stone Age people didn't live a very comfortable life by modern standards; the experience of pleasure and escape from physical ailments would be common aspirations, while the high-cognitive-tech ideas of "self-actualization" would make no native sense to them. Why would a newly-ascended Stone Age god not assume that making everyone experience boundless pleasure, free of pain, forever is the greatest thing there could possibly be? Would it occur to that god to even think carefully about whether such assumptions are right?
Edit: More than that, ancient people's world-models are potentially much more alien and primitive than we can easily imagine. I refer you to the speculations in section 2 here. The whole "voices of the gods" thing in the rest of the post is probably wrong, but I find it easy to believe that the basic principles of theory-of-mind that we take for granted are not something any human would independently invent. And if so, who knows what people without it would consider the maximally best way to be nice to someone?
I think the world-modelling improvements from modern science and IQ-raising social advances can be analytically separated from changes in our approach to welfare. As for non-consensual wireheading, I am uncertain as to its moral status, so it seems like we partially just disagree about values. I am also uncertain as to the attitude of Stone Age people towards it - while your argument seems plausible, the fact that early philosophers like the Ancient Greeks were not pure hedonists in the wireheading sense, but valued flourishing, seems like evidence against this, suggesting that favoring non-consensual wireheading is downstream of modern developments in utilitarianism.
Fair enough, I suppose. My point is more— Okay, let's put the Stone Age people aside for a moment and think about the medieval people instead. Many of them were religious and superstitious and nationalistic, as the result of being raised on the diet of various unwholesome ideologies. These ideologies often had their own ideas of "the greater good" that they tried to sell people, ideas no less un-nice than non-consensual wireheading. Thus, a large fraction of humanity for the majority of its history endorsed views that would be catastrophic if scaled up.
I just assume this naturally extrapolates backwards to the Stone Age. Stone-age people had their own superstitions and spiritual traditions, and rudimentary proto-ideologies. I assume that these would also be catastrophic if scaled up.
Note that I'm not saying that the people in the past were literally alien, to the extent that they wouldn't be able to converge towards modern moral views if we e. g. resurrected and educated one of them (and slightly intelligence-amplified them to account for worse nutrition, though I'm not as convinced that it'd be necessary as you), then let them do value reflection. But this process of "education" would need to be set up very carefully, in a way that might need to be "perfect".
My argument is simply that if we granted godhood to one of these people and let them manage this process themselves, that will doom the light cone.
It seems like you're confusing two things here, because the thing you would want is not knowable by introspection. What I think you're introspecting is that if you'd noticed that the-thing-you-pursued-so-far was different from what your brother actually wants, you'd do what he actually wants. But the-thing-you-pursued-so-far doesn't play the role of "your utility function" in the Goodhart argument. All of you plays into that. If the goodharting were to play out, your detector for differences between the-thing-you-pursued-so-far and what-your-brother-actually-wants would simply fail to warn you that it was happening, because it too can only use a proxy measure for the real thing.
I want to know whether, as a matter of falsifiable fact, I would enact good outcomes by my brother's values were I very powerful and smart. You seem to be sympathetic to the falsifiable-in-principle prediction that, no, I would not. (Is that true?)
Anyways, I don't really buy this counterargument, but we can consider the following variant (from footnote 2):
"True" values: My own (which I have access to)
"Proxy" values: My brother's model of my values (I have a model of his model of my values, as part of the package deal by which I have a model of him)
I still predict that he would bring about a good future by my values. Unless you think my predictive model is wrong? I could ask him to introspect on this scenario and get evidence about what he would do?
That prediction may be true. My argument is that "I know this by introspection" (or, introspection-and-generalization-to-others) is insufficient. For a concrete example, consider your 5-year-old self. I remember some pretty definite beliefs I had about my future self that turned out wrong, and if I ask myself how aligned I am with it I don't even know how to answer, he just seems way too confused and incoherent.
I think it's also not absurd that you do have perfect caring in the sense relevant to the argument. This does not require that you don't make mistakes currently. If you can, with increasing intelligence/information, correct yourself, then the pointer is perfect in the relevant sense. "Caring about the values of person X" is relatively simple and may come out of evolution whereas "those values directly" may not.
My short answer: Violations of the IID assumption are the likeliest problem in trying to generalize your values, and I see this as the key flaw underlying the post.
What does that mean? Can you give an example to help me follow?
Specifically, it means that you have to generalize your values to new situations, but without the IID assumption you can't just interpolate from existing values anymore; you will likely overfit to your IID data points, and that's the better case. In other words, your behavior will be dominated by your inductive biases and priors. And my fear is that, given real-life examples of intelligence differences that violate IID distributions, things end up misaligned really fast. I'm not saying that we are doomed, but I want to call this out, since I think breaking IID would most likely cause Turner to do something really bad to his brother if we allowed even one order of magnitude more compute.
Scale this up to human civilization relying on IID distributions of intelligence, and I'm much more careful than Turner is in trying to extrapolate.
What if you become way smarter than your brother and try to give him advice, but he doesn't understand the logic of the advice well enough to know if it would help him, though he's willing to trust you if you say it will. Would you then provide him with advice?
I tagged this "Pointers Problem" but am not 100% sure it's getting at the same thing. Curious if there's a different tag that feels more appropriate.
The problem is that "Goodhart's curse" is an informal statement. It doesn't literally say "for all u, v with |u - v| > 0, optimization for v leads to the oblivion of u". When we talk about "small differences", we mean differences in the space of all possible minds, where the difference between two humans is practically nonexistent. If you, say, found a subset of utility functions V such that |v - u| < 10^(-32) utilons for all v in V, where u is humanity's utility function, you should implement it in a Sovereign right now, because, yes, we lose some utility, but we have a time limit for solving alignment. The problem of alignment is that we can't specify a V with such characteristics. We can't even specify V such that corr(u, v) > 0.5.