Matthew Barnett

Someone who is interested in learning and doing good.

I admit I'm a bit surprised by your example. Your example seems to be the type of heuristic argument that, if given about AI, I'd expect would fail to compel many people (including you) on anything approaching a deep level. It's possible I was just modeling your beliefs incorrectly. 

Generally speaking, I suspect there's a tighter connection between our selection criteria in ML and the stuff models will end up "caring" about relative to the analogous case for natural selection. I think this for similar reasons that Quintin Pope alluded to in his essay about the evolutionary analogy. 

If you think you'd be persuaded that animals will end up caring about their offspring because of a heuristic argument that this type of behavior is selected for in-distribution, I'm not sure why you'd need a lot of evidence to be convinced the same will be true for AIs with regard to what we train them to care about. But again, perhaps you don't actually need that much evidence, and I was simply mistaken about what you believe here.

Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between "AI will be able to figure out what humans want" (yes; obviously; this was never under dispute) and "AI will care" (nope; not by default; that's the hard bit).

I think it's worth reflecting on what type of evidence would be sufficient to convince you that we're actually making progress on the "caring" bit of alignment and not merely the "understanding" bit. Because I currently don't see what type of evidence you'd accept beyond near-perfect mechanistic interpretability.

I think current LLMs demonstrate a lot more than mere understanding of human values; they seem to actually 'want' to do things for you, in a rudimentary behavioral sense. When I ask GPT-4 to do some task for me, it's not just demonstrating an understanding of the task: it's actually performing actions in the real world that result in the task being completed. I think it's totally reasonable, prima facie, to admit this as evidence that we are having some success at getting AIs to "care" about doing tasks for users.

It's not extremely strong evidence, because future AIs could be way harder to align, maybe there's ultimately no coherent sense in which GPT-4 "cares" about things, and perhaps GPT-4 is somehow just "playing the training game" despite seemingly having limited situational awareness. 

But I think it's valid evidence nonetheless, and I think it's wrong to round this datum off to a mere demonstration of "understanding". 

We typically would not hold other humans to such a high standard. For example, if a stranger helped you in your time of need, you might reasonably infer that the stranger cares about you to some extent, not merely that they "understand" how to care about you, or that they are simply helping people out of a desire to appear benevolent as part of a long-term strategy to obtain power. You may not be fully convinced they really care about you on the basis of a single incident, but surely it should move your credence somewhat. And further observations could move your credence further still.

Alternative explanations of aligned behavior we see are always logically possible, and it's good to try to get a more mechanistic understanding of what's going on before we confidently declare that alignment has been solved. But behavioral evidence is still meaningful evidence for AI alignment, just as it is for humans.

Yes, I believe this point has practical relevance. If what I'm saying is true, then I do not believe that solving AI alignment has astronomical value (in the sense of saving 10^50 lives). If solving AI alignment does not have astronomical counterfactual value, then its value becomes more comparable to the value of other positive outcomes, like curing aging for people who currently exist. This poses a challenge for those who claim that delaying AI is obviously for the greater good as long as it increases the chance of successful alignment, since that could also cause billions of currently existing people to die.

I don't understand how this new analogy is supposed to apply to the argument, but if I wanted to modify the analogy to get my point across, I'd make Alice 90 years old. Then, I'd point out that, at such an age, getting hit by a car and dying painlessly genuinely isn't extremely bad, since the alternative is to face death within the next several years with high probability anyway.

Putting "cultural change" and "an alien species comes along and murders us into extinction" into the same bucket seems like a mistake to me

Value drift encompasses a lot more than cultural change. If you think humans messing up on alignment could mean something as dramatic as "an alien species comes along and murders us", surely you should think that the future could continue to include even more, similarly dramatic shifts. Why would we assume that once we solve value alignment for the first generation of AIs, values would then be locked in perfectly forever for all subsequent generations?

The concern about AI misalignment is itself a concern about value drift. People are worried that near-term AIs will not share our values. The point I'm making is that even if we solve this problem for the first generation of smarter-than-human AIs, that doesn't guarantee that AIs will permanently share our values in every subsequent generation. In your analogy, a large change in the status quo (death) is compared to an arguably smaller and more acceptable change over the long term (biological development). By contrast, I'm comparing a very bad thing to another similarly very bad thing. This analogy seems mostly valid only to the extent you reject the premise that extreme value drift is plausible in the long-term, and I'm not sure why you would reject that premise.

Value drift happens in almost any environment in which there is variation and selection among entities in the world over time. I'm mostly just saying that things will likely continue to change continuously over the long-term, and a big part of that is that the behavioral tendencies, desires and physical makeup of the relevant entities in the world will continue to evolve too, absent some strong countervailing force to prevent that. This feature of our world does not require that humans continue to have competitive mental and physical capital. On the broadest level, the change I'm referring to took place before humans ever existed.

We would need to solve the general problem of avoiding value drift. Value drift is the phenomenon of changing cultural, social, personal, biological and political values over time. We have observed it in human history during every generation. Older generations vary on average in what they want and care about compared to younger generations. More broadly, species have evolved over time, with constant change on Earth as a result of competition, variation, and natural selection. While over short periods of time, value drift tends to look small, over long periods of time, it can seem enormous.

I don't know what a good solution to this problem would look like, and some proposed solutions -- such as a permanent, very strict global regime of coordination to prevent cultural, biological, and artificial evolution -- might be worse than the disease they aim to cure. However, without a solution, our distant descendants will likely be very different from us in ways that we consider morally relevant.

If you actually believe that at some point, even with aligned AI, the forces of value drift are so powerful that we are still unlikely to survive

Human survival is different from the universe being colonized under the direction of human values. Humans could survive, for example, as a tiny pocket of a cosmic civilization.

The astronomical loss is still there.

Indeed, but the question is, "What was the counterfactual?" If solving AI alignment merely delays an astronomical loss, then it is not astronomically important to solve the problem. (It could still be very important in this case, but just not so important that we should think of it as saving 10^50 lives.)

Every single generation to come is at stake, so I don't think my own life bears much on whether to defend all of theirs.

Note that the idea that >10^50 lives are at stake is typically premised on the notion that there will be a value lock-in event, after which we will successfully colonize the reachable universe. If there is no value lock-in event, then even if we solve AI alignment, values will drift in the long-term, and the stars will eventually be colonized by something that does not share our values. From this perspective, success in AI alignment would merely delay the arrival of a regime of alien values, rather than prevent it entirely. If true, this would imply that positive interventions now are not as astronomically valuable as you might have otherwise thought.

My guess is that the idea of a value lock-in sounded more plausible back in the days when people were more confident that there would be (1) a single unified AI that takes control of the world with effectively no real competition forever, and (2) an explicit, global utility function over the whole universe that this AI maintains unchanged permanently. However, both of these assumptions seem dubious to me currently.
