Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day (more or less - I'm getting back on track!) for 25 days. Or until I run out of hot takes.

When considering AI alignment, you might be tempted to talk about "the human's utility function," or "the correct utility function." Resist the temptation when at all practical. That abstraction is junk food for alignment research.

As you may already know, humans are made of atoms. Collections of atoms don't have utility functions glued to them a priori - instead, we assign preferences to humans (including ourselves!) when we model the world, because it's a convenient abstraction. But because there are multiple ways to model the world, there are multiple ways to assign these preferences; there's no "the correct utility function."

Maybe you understand all that, and still talk about an AI "learning the human's utility function" sometimes. I get it. It makes things way easier to assume there's some correct utility function when analyzing the human-AI system. Maybe you're writing about inner alignment and want to show that some learning procedure is flawed because it wouldn't learn the correct utility function even if humans had one. Or that some learning procedure would learn that correct utility function. It might seem like this utility function thing is a handy simplifying assumption, and once you have the core argument you can generalize it to the real world with a little more work.

That seeming is false. You have likely just shot yourself in the foot.

Because the human-atoms don't have a utility function glued to them, building aligned AI has to do something that's actually materially different than learning "the human's utility function." Something that's more like learning a trustworthy process. If you're not tracking the difference and you're using "the human's utility function" as a target of convenience, you can all too easily end up with AI designs that aren't trying to solve the problems we're actually faced with in reality - instead they're navigating their own strange, quasi-moral-realist problems.

Another way of framing that last thought might be that wrapper-minds are atypical. They're not something that you actually get in reality when trying to learn human values from observations in a sensible way, and they have alignment difficulties that are idiosyncratic to them (though I don't endorse the extent to which nostalgebraist takes this).

What to do instead? When you want to talk about getting human values into an AI, try to contextualize discussion of the human values with the process the AI is using to infer them. Take the AI's perspective, maybe - it has a hard and interesting job trying to model the world in all its complexity, if you don't short-circuit that job by insisting that actually it should just be trying to learn one thing (that doesn't exist). Take the humans' perspective, maybe - what options do they have to communicate what they want to the AI, and how can they gain trust in the AI's process?

Of course, maybe you'll try to consider the AI's value-inference process, and find that its details make no difference whatsoever to the point you were trying to make. But in that case, the abstraction of "the human's utility function" probably wasn't doing any work anyhow. Either way.


A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human.

I think this is still wrong, because of the goodhart scope of AGIs and the corrigibility of humans. An agent's goodhart scope is the space of situations where it has good proxies for its preference. An agent with decisions governed by a utility function can act in arbitrary situations, since it always has good proxies for its utility function. Logical uncertainty doesn't put practical constraints on its behavior. But for an aligned AGI that seems unlikely: CEV seems complicated and possible configurations of matter superabundant, therefore there are always intractable possibilities outside the current goodhart scope. So it can at best be said to have a utility function over its goodhart scope, not over all physically available possibilities. Thus the only utility function it could have is itself a proxy for some preference that's not in practice a utility function, because the agent can never actually make decisions according to a global utility function. Conversely, any AGI that acts according to a global utility function is not aligned, because its preference is way too simple.

Corrigibility is in part modification of an agent's preference based on what happens in its environment. The abstraction of an agent usually puts its preference firmly inside its boundaries, so that we can consider the same agent, with the same preference, placed in an arbitrary environment. But a corrigible agent is not like that: its preference depends on the environment, and in the limit it's determined by its environment, not just by the agent. The environment doesn't just present the situations for an agent to choose from, it also influences the way the agent makes its decisions. So it becomes impossible to move a corrigible agent to a different environment while preserving its preference, unless we package its whole original environment as part of the agent that's being moved to a new environment.

Humans are not at all classical agent abstractions that carry the entirety of their preference inside their heads. They are eminently corrigible: their preference depends on the environment. As a result, an aligned AGI must be corrigible not just temporarily, because it needs to pay attention to humans to grow up correctly, but permanently, because its preference must also continually incorporate the environment to remain the same kind of thing as human preference. Thus, even putting aside the logical uncertainty that keeps an AGI's goodhart scope relatively small, an aligned AGI can't have a utility function because of observational/indexical uncertainty: it doesn't know everything in the world (including the future) and so doesn't have the data that defines its aligned preference.

A steelman of the claim that a human has a utility function is that agents that make coherent decisions have utility functions, therefore we may consider the utility function of a hypothetical AGI aligned with a human. That is, assignment of utility functions to humans reduces to alignment, by assigning the utility function of an aligned AGI to a human.

The problem is, of course, that any possible set of behaviors can be construed as maximizing some utility function. The question is whether doing so actually simplifies the task of reasoning and making predictions about the agent in question, or whether mapping the agent's actual motivational schema to a utility function only adds unwieldy complications.
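To make the "any behavior maximizes some utility function" point concrete, here is a minimal sketch. The helper `rationalizing_utility` and the zig/zag histories are hypothetical names invented for illustration; the construction simply assigns utility 1 to whatever was observed and 0 to everything else.

```python
from typing import Callable, Hashable, Sequence


def rationalizing_utility(
    observed: Sequence[Hashable],
) -> Callable[[Sequence[Hashable]], float]:
    """Build a utility function over whole behavior histories that is
    maximized exactly by the observed history (hypothetical helper)."""
    target = tuple(observed)

    def utility(history: Sequence[Hashable]) -> float:
        # 1 for "did exactly what was observed", 0 for anything else.
        return 1.0 if tuple(history) == target else 0.0

    return utility


# Any observed behavior at all can be plugged in here.
observed = ["zig", "zag", "zig"]
u = rationalizing_utility(observed)

alternatives = [
    ["zig", "zag", "zig"],
    ["zag", "zag", "zig"],
    ["zig", "zig", "zig"],
]
best = max(alternatives, key=u)  # the observed history "maximizes" u
```

The utility function built this way is just a restatement of the behavior, which is why it gives no extra leverage for prediction.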

In the case of humans, I would say it's far more useful to model us as generating and pursuing arbitrary goal states/trajectories over time. These goals are continuously learned through interactions with the environment and its impact on pain and pleasure signals, deviations from homeostatic set points, and aesthetic and social instincts. You might be able to model this as a utility function with a recursive hidden state, but would that be helpful?

any possible set of behaviors can be construed as maximizing some utility function

(Edit: What do you mean? This calls to mind a basic introduction to what utility functions do, given below, but on second thought that's probably not what the claim is about. I'll leave the rest of the comment here, as it could be useful for someone.)

A utility function describes decisions between lotteries, which are mixtures of outcomes, or more generally events in a sample space. The setting assumes uncertainty: outcomes are only known to be within some event, not individually. So a situation where a decision can be made is a collection of events/lotteries, one of which gets to be chosen, and the choice is the behavior assigned to this situation. This makes situations reuse parts of each other; they are not defined independently. As a result, it becomes possible to act incoherently, for example pick A from (A, B), pick B from (B, C) and pick C from (A, C). Only when the collection of behaviors satisfies certain properties does there exist a probability measure and a utility function such that the agent's choice among the collection of events in any situation coincides with picking the event that has the highest expected utility.
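The A/B/C example can be checked mechanically. The sketch below is a simplification: it assumes strict choices between sure outcomes rather than lotteries, so it ignores the probability measure entirely. Under that assumption a utility function exists exactly when the "chosen over" relation has no cycle. The function name `utility_from_choices` is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Optional, Tuple


def utility_from_choices(
    choices: Dict[Tuple[str, str], str],
) -> Optional[Dict[str, int]]:
    """Map each pair (a, b) -> picked option to a utility assignment,
    or return None if the choices are incoherent (contain a cycle)."""
    # worse_than[x] holds the options that x was chosen over.
    worse_than: Dict[str, set] = {}
    for (a, b), picked in choices.items():
        rejected = b if picked == a else a
        worse_than.setdefault(picked, set()).add(rejected)
        worse_than.setdefault(rejected, set())
    try:
        # Rejected options are "predecessors", so they come first
        # in the order and receive lower utility.
        order = list(TopologicalSorter(worse_than).static_order())
    except CycleError:
        return None
    return {option: rank for rank, option in enumerate(order)}


incoherent = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "C"}
coherent = {("A", "B"): "A", ("B", "C"): "B", ("A", "C"): "A"}

no_utility = utility_from_choices(incoherent)  # cycle A > B > C > A
some_utility = utility_from_choices(coherent)  # A ranked above B above C
```

The point of the sketch is that each situation's response constrains the others: changing a single choice in one situation is enough to make the whole collection unrepresentable.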

Put differently, the issue is that behavior described by a utility function is actually behavior in all possible and counterfactual situations, not in some specific situation. Existence of a utility function says something about which behaviors in different situations can coexist. Without a utility function, each situation could get an arbitrary response/behavior of its own, independently from the responses given for other situations. But requiring a utility function makes that impossible, some behaviors become incompatible with the other behaviors.

In the grandparent comment, I'm treating utility functions more loosely, but their role in constraining collections of behaviors assigned to different situations is the same.

Meaning "simple utility function" by the phrase "utility function" might be a conceptual trap. It makes a big difference whether you consider a function with hundreds of terms, or billions of terms, or even things that cannot be expressed as a sum.

As a "tricky utility function", "human utility function" is mostly fine. Simple utility functions are relevant to today's programming, but I don't know whether honing your concepts to apply better to AGI is served by making a cleanly cut concept that is limited to only that domain.

Some hidden assumptions might be things like "If humans have a utility function, it can be written down" or "Figuring out a human's utility function is a practical epistemological stance for a single agent encountering new humans".

If you take stuff like that out, the "mere" existence of a function is not that weighty a point.

As you may already know, humans are made of atoms. Collections of atoms don't have utility functions glued to them

Whole theories of physics can be formulated as a single action that is then extremised. Taking different theories as different answers to a question like "what happens next?", a single theory's formula is its "choice". Thus it seems a lot like physical systems could be understood in terms of utility functions. An electron knows how an electron behaves; it does have a behaviour glued into it. If you just add a lot of electrons or protons (and other stuff that has similar laws), it is not like aggregation from the microbehaviours makes the function fail to be a function as a macrobehaviour.

I'll reiterate that a problem with this is lack of uniqueness. There is not a thing that is the human utility function, even if you allow arbitrarily messy utility functions. If you assume that there is one, it turns out that this is a weighty meta-level commitment even if your class of utility functions is so broad as to be useless on the object level.
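One simple way non-uniqueness can show up is underdetermination by data: several utility functions fit the same observed choices and then disagree about a new situation. The sketch below is only an illustration of that narrow case, not of the broader modeling-choice point. The drinks, the numbers, and the helpers `choose` and `consistent` are all made up for the example.

```python
from typing import Dict, FrozenSet, List, Tuple

Utility = Dict[str, int]
Observation = Tuple[FrozenSet[str], str]  # (options offered, option picked)

# Hypothetical observed behavior: tea over coffee, coffee over water.
observed: List[Observation] = [
    (frozenset({"tea", "coffee"}), "tea"),
    (frozenset({"coffee", "water"}), "coffee"),
]

# Two candidate utility functions that differ only on the unobserved "juice".
u1: Utility = {"tea": 3, "coffee": 2, "water": 1, "juice": 0}
u2: Utility = {"tea": 3, "coffee": 2, "water": 1, "juice": 4}


def choose(utility: Utility, options: FrozenSet[str]) -> str:
    """Pick the option with the highest utility."""
    return max(options, key=utility.__getitem__)


def consistent(utility: Utility, data: List[Observation]) -> bool:
    """True if the utility function reproduces every observed choice."""
    return all(choose(utility, opts) == picked for opts, picked in data)


both_fit = consistent(u1, observed) and consistent(u2, observed)

# A situation never seen in the data: the two candidates disagree.
new_situation = frozenset({"tea", "juice"})
pick1 = choose(u1, new_situation)
pick2 = choose(u2, new_situation)
```

Nothing in the observations favors `u1` over `u2`; picking one is an extra commitment made by the modeler.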

I think reflection could help a lot with this, deciding how to proceed in formulating preference based on currently available proxies for preference (with some updatelessness taking care of undue path sensitivity). At some point, preference mostly develops itself, without looking at external data.

If you can agree that what happens when you put two electrons in the same system can still be predicted by minimizing an action, then you should agree that how it plays out when you put two humans in the same system can still, in principle, be accounted for. Iterate a little bit and you have a predictable 6 billion human system.

So what operation are we doing where this particular object level is relevant?

I don't understand what you mean, particularly the last question.

Yes, electrons and humans can be predicted by the laws of physics. The laws of physics are not uniquely specified by our observations, but they are significantly narrowed down by Occam's razor. But how are you thinking this applies to alignment? We don't want an AI to learn "humans are collections of atoms and what they really want is to follow the laws of physics."

Questions like "what would this human do in a situation where there is a cat in a room" have a unique answer that reflects reality, since if that kind of situation were run then something would need to happen.

Sure, if we start from highly abstract values and then try to make them more concrete, we might lose the way. If we can turn philosophies into feelings but do not know how to turn feelings into chemistry, then there is a level of representation that might not be sufficient. But we know there is one level that is sufficient to describe action and that all the levels are somehow (maybe in an unknown way) connected (mostly stacked on top). So this incompatibility of representation cannot be fundamental. Because if it was, then there would be a gap between the levels and the thing would not be connected anymore.

So there is no question "presented with this stimulus, how would the human react?" that would be in principle unanswerable. If preferences are expressed as responses to choice situations, this is a subcategory of reaction. Even if preferences are expressed as responses to philosophy prompts, they would be a subcategory.

One could say that it is not super clarifying that if a two-human system is presented with the philosophical stimulus "Is candy worth $4?", you get one human who says "yes" and another human who says "no". But this is just a squiggle in the function. The function is being really inconvenient when you can't use an approximation where you can think of just one "average human" and then all humans would reflect that very closely. But we are not promised that the function is a function of time of day, or a function of verbal short-term memory, or a function of television broadcast data.

Maybe you are saying something like "genetic fitness doesn't exist" because some animals are fit when they are small and some animals are fit when they are large, so there is no consistent account of whether smallness is good or not. Then "human utility function doesn't exist" because human A over here dares to have different opinions and strategies than human B over here and they do not end up mimicking each other. But like an animal lives or dies, a human will zig or zag. And it cannot be that the zigging would fail to be a function of the world state (with some QM assumed away to be non-significant (and even then maybe not)). What it can be is fail to be a function of the world state as we understand it, or as our computer system models it, or as can be captured in the variables we are using. But then the question is whether we can make do with just these variables, and not that there would be nothing to model.

In this language it could be rephrased:

If you think you have a good wide set of variables to come up with any needed solution function, you don't. You have too few variables.

But the "function" in this sense is how the computer system models reality (or like the attitudinal modes it can take towards reality). But part of how we know that the setup is inadequate is that there is an entity outside of the system that is not reflected in it. Aka, this system can only zig or zag when we needed zog, which it cannot do. The thing that will keep on missing is the way that reality actually dances. Maybe in some small bubbles we can actually have totally capturing representations in the senses that we care about. But there is a fact of the matter to the inquiry. For any sense we might care about, there is a slice of the whole thing that is sufficient for that. To express zog you need these features, to express zeg you need these other ones.

Human will is quite complex so we can reasonably expect to be spending quite a lot of time in undermodelling. But that is a very different thing from being unmodellable.

Questions like "what would this human do in a situation where there is a cat in a room" have a unique answer that reflects reality, since if that kind of situation were run then something would need to happen.


It's not about what the human would do in a given situation. It's about values – not everything we do reflects our values. Eating meat when you'd rather be vegetarian, smoking when you'd rather not, etc. How do you distinguish biases from fundamental intuitions? How do you infer values from mere observations of behavior? There are a bunch of problems described in this sequence. Not to mention stuff I discuss here about how values may remain under-defined even if we specify a suitable reflection procedure and have people undergo that procedure. 

Ineffective values do not need to be considered for a utility function as they do not affect what gets strived for. If you say "I will choose B" and still choose A you are still choosing A. You are not required to be aware of your utility function.

That is a lot of material to go through en masse, so I will need some sharper pointers of relevance to actually engage.

Ineffective values do not need to be considered for a utility function as they do not affect what gets strived for. If you say "I will choose B" and still choose A you are still choosing A. You are not required to be aware of your utility function.

Uff, a future where humans get more of what they're striving for but without adjusting for biases and ineffectual values? Why would you care about saving our species, then? 

It sounds like people are using "utility function" in different ways in this thread. 

I do think that there is a lot of confusion, and definitional groundwork would probably bear fruit.

If one is trying to "save" some fictitious homo economicus that significantly differs from humans, that is not really saving humans.

A worldview where humans-as-is are too broken to bother salvaging is rather bleak. I see that the transition away from biases can be modelled as having a utility function with biases, then describing a utility function "without biases" ("how the behaviour should be"), and arguing what kind of tweaks we need to make to the gears so that we get from the first white box to the target white box. Part of this is getting the "broken state of humans" to be modelled accurately. If we can get a computer to follow that, we would hit aligned exactly-medium-AI. Then we can ramp up the virtuosity of the behaviour (by providing a more laudable utility function).

There seems to be an approach where we just describe the "ideal behaviour utility function" and try to get the computers to do that, without any of the humans having the capability to know or to follow such a utility function. First make it laudable and then make it reminiscent of humans (hopefully making it human approvable).

The exactly-medium-AI function is not problematically ambiguous. "Ideal reasoning behaviour" is under significant and hard-to-reconcile difference of opinion. "Human utility function" refers to exactly-medium-AI but only run on carbon.

I would benefit from, and appreciate, anyone bothering to fish out conflicting or inconsistent uses of the concept.

component of why I'm not sure I agree with this: I claim stable diffusion has a utility function. does anyone disagree with this subclaim?

[This comment is no longer endorsed by its author]

Do you mean model's policy as it works on a query, or learning as it works on a dataset? Or something specific to stable diffusion? What is the sample space here, and what are the actions that decisions choose between?

score based models, such as diffusion, work by modeling the derivative of the utility function (density function) over examples, I believe?

see, eg, https://lilianweng.github.io/posts/2021-07-11-diffusion-models/ or any of the other recommended posts at the top.

actions are denoising steps. sample space is output space, ie image space for stable diffusion.

You're talking about the score function, right? Which is the derivative of the log probability density function. I dunno how to get from there to a utility function interpretation. Like, we don't produce samples from the model by globally maximizing over the PDF (at worst, trying that might produce an adversarial example, and at best, that would sample the "most modal" image).
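The gap between "following the score to sample" and "maximizing the density" can be seen in a toy case. This is a sketch under loud assumptions: a one-dimensional standard normal stands in for the data distribution, its score is known in closed form as -x, and the sampler is plain unadjusted Langevin dynamics. It is not how Stable Diffusion actually generates images, and the function names are hypothetical.

```python
import random
import statistics


def score(x: float) -> float:
    """Score of a standard normal: d/dx log p(x) = -x (toy assumption)."""
    return -x


def ascend(x: float, step: float = 0.1, n_steps: int = 200) -> float:
    """Treat log p like a utility and climb it: ends at the mode."""
    for _ in range(n_steps):
        x += step * score(x)
    return x


def langevin_sample(
    x: float, rng: random.Random, step: float = 0.1, n_steps: int = 100
) -> float:
    """Unadjusted Langevin dynamics: score drift plus injected noise."""
    for _ in range(n_steps):
        x += 0.5 * step * score(x) + (step ** 0.5) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)

# Maximizing collapses onto the single "most modal" point.
mode = ascend(3.0)

# Sampling uses the same score but spreads out over the distribution.
samples = [langevin_sample(3.0, rng) for _ in range(2000)]
spread = statistics.pvariance(samples)
```

Here `mode` lands at essentially zero, while `spread` stays close to the variance of the target distribution. The same score function drives both procedures, but only the first one looks like maximizing anything.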

ah, okay. yup, you're right, that's what I was referring to. I am now convinced I was wrong in my original comment!

Lots of things "have a utility function" in the colloquial sense that they can be usefully modeled as having consistent preferences. But sure, I'll be somewhat skeptical if you want to continue "taking the utility-function perspective on stable diffusion is in some way useful for thinking about its alignment properties."

but diffusion specifically works by modeling the derivative of the utility function, yeah?

Ah, you're talking about guidance? That makes sense, but you could also take the perspective that guidance isn't really playing the role of a utility function, it's just nudging around this big dynamical system by small amounts.

no, I'm talking about the basic diffusion model underneath. It models the derivative of the probability density function, which seems reasonable to call a utility function to me. see my other comment for link