Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I believe that it is possible for two agents to have the exact same source code while (in some sense) optimising two different utility functions.

I take a utility function to be a function from possible world states to real numbers. When I say that an agent is optimising a utility function, I mean something like: the agent is "pushing" its environment towards states that have higher values according to that utility function. This concept is not entirely unambiguous, but I don't think it's necessary to try to make it more precise here. By source code I mean the same thing as everyone else means by source code.

Now, consider an agent which has a goal like "gain resources" (in some intuitive sense). Say that two copies of this agent are placed in a shared environment. These agents will now push the environment towards different states, and are therefore (under the definition I gave above) optimising different utility functions.
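
As a toy sketch of this (all of the class names, agent names, and numbers below are my own illustration, not anything from an existing system): two instances of the same Agent code share one pool of resources, each instance's utility counts only the resources bound to itself, and so the two copies push the shared world towards different states.

```python
class Environment:
    def __init__(self, free_resources):
        self.free_resources = free_resources
        self.owned = {}  # agent id -> resources grabbed so far

    def grab(self, agent_id):
        if self.free_resources > 0:
            self.free_resources -= 1
            self.owned[agent_id] = self.owned.get(agent_id, 0) + 1


class Agent:
    """Identical source code for every copy."""

    def __init__(self, agent_id):
        self.agent_id = agent_id  # the only per-instance difference

    def utility(self, env):
        # "Maximise the resources that *I* control."
        return env.owned.get(self.agent_id, 0)

    def act(self, env):
        env.grab(self.agent_id)  # greedy: grab whenever anything is left


env = Environment(free_resources=10)
alice, bob = Agent("alice"), Agent("bob")
for _ in range(10):
    alice.act(env)
    bob.act(env)

# Each agent's utility is a different function of the same world state.
print(alice.utility(env), bob.utility(env))  # 5 5: they end up splitting the pool
```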

Makes sense. It seems to flow from the fact that the source code is in some sense allowed to use concepts like 'Me' or 'I', which refer to the agent itself. So both agents have source code which says "Maximise the resources that I have control over", but in Agent 1 this translates to the utility function "Maximise the resources that Agent 1 has control over", and in Agent 2 this translates to the different utility function "Maximise the resources that Agent 2 has control over".

So this source code thing that we're tempted to call a 'utility function' isn't actually valid as a mapping from world states to real numbers until the agent is specified, because these 'Me'/'I' terms are undefined.
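
A minimal sketch of that point (the names below are invented for illustration): the thing written in the source code is really a function of a world state plus a binding for 'I', and it only becomes a genuine utility function, i.e. a map from world states to real numbers, once that indexical is bound to a particular agent.

```python
def source_code_utility(world_state, me):
    # "Maximise the resources that I have control over."
    return world_state["resources"].get(me, 0)

# Binding the indexical yields two different utility functions:
utility_of_agent_1 = lambda w: source_code_utility(w, me="agent_1")
utility_of_agent_2 = lambda w: source_code_utility(w, me="agent_2")

world = {"resources": {"agent_1": 3, "agent_2": 7}}
print(utility_of_agent_1(world), utility_of_agent_2(world))  # 3 7
```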

The core feature here seems to be that the agent has some ability to refer to itself, and that this localization differs between instantiations. Alice optimizes for dollars in her wallet, Bob optimizes for dollars in his wallet, and so they end up fighting over dollars despite being clones, because the cloning procedure doesn't result in arrows pointing at the same wallet.

It seems sensible to me to refer to this as the 'exact same source code', but it's not obvious to me how you would create this sort of conflict without that sort of differing resolution of pointers, and so it's not clear how far this argument can be extended.

You don't necessarily need "explicit self-reference". The difference in utility functions can also arise from a difference in the agent's location in the universe. Two identical worms placed in different locations will have different utility functions, because their atoms are not in exactly the same places, despite there being no explicit self-reference. Similarly, in a computer simulation, agents with the same source code will be called by the universe-program in different contexts (if they weren't, I don't see how it would even make sense to speak of them as "different instances of the same source code"; there would just be one instance of the source code).

So in fact, I think that this is probably a property of almost all possible agents. It seems to me that you need a very complex and specific ontological model in the agent to prevent these effects and have the two agents have the same utility function.
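
Here is a minimal sketch of the no-explicit-self-reference case (the universe-program and all names are invented for illustration): the shared source code never mentions an agent id, yet the two instances still push the world towards different states, because the universe-program calls the same code at different locations.

```python
def agent_step(local_observation):
    """Shared source code: eat whatever food is right here. No self-reference."""
    return "eat" if local_observation["food_here"] else "wander"

def universe_step(world):
    # The universe-program calls the same source code in two different contexts.
    for location in world["agent_locations"]:
        obs = {"food_here": location in world["food"]}
        if agent_step(obs) == "eat":
            world["food"].discard(location)

world = {"agent_locations": [(0, 0), (5, 5)],
         "food": {(0, 0), (5, 5), (9, 9)}}
universe_step(world)
print(world["food"])  # {(9, 9)}: each instance ate the food at its own location
```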

Using the definitions from the post, those agents would be optimising the same utility functions, just by taking different actions.

I agree.

An even simpler example: If the agents are reward learners, both of them will optimize for their own reward signal, which are two different things in the physical world.
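
A rough sketch of the reward-learner case (names invented for illustration): the learning code is identical, but each copy's reward arrives over a physically separate channel, so "maximise my reward signal" already denotes two different quantities in the world.

```python
class RewardLearner:
    """Identical learning code for every copy."""

    def __init__(self, reward_channel):
        self.reward_channel = reward_channel  # a physically distinct channel per copy
        self.total_reward = 0.0

    def observe(self, world):
        self.total_reward += world[self.reward_channel]

world = {"reward_channel_1": 1.0, "reward_channel_2": 0.0}
learner_1 = RewardLearner("reward_channel_1")
learner_2 = RewardLearner("reward_channel_2")
learner_1.observe(world)
learner_2.observe(world)
print(learner_1.total_reward, learner_2.total_reward)  # 1.0 0.0
```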

Yeah. But I think that can only happen if the agents aren't very smart :-) A very smart agent would've self-modified at startup before seeing any observations, and adopted a utility function that's a weighted sum of selfish utilities of all copies, with the weights given by its prior of being this or that copy.
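
A rough sketch of what that self-modification could amount to (purely illustrative, not a real UDT implementation): before seeing any observations, the agent replaces its indexical utility with a fixed mixture over copies, weighted by its prior over being each copy, and that mixture is the same function for every copy.

```python
def selfish_utility(world_state, copy_id):
    # Resources controlled by one particular copy.
    return world_state["resources"].get(copy_id, 0)

def updateless_utility(world_state, prior_over_copies):
    # Weighted sum of every copy's selfish utility,
    # with weights given by the prior of being that copy.
    return sum(p * selfish_utility(world_state, copy_id)
               for copy_id, p in prior_over_copies.items())

world = {"resources": {"copy_1": 3, "copy_2": 7}}
prior = {"copy_1": 0.5, "copy_2": 0.5}
print(updateless_utility(world, prior))  # 5.0, the same number for either copy
```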

Yes, agents with different preferences are incentivised to cooperate provided that the cost of enforcing cooperation is less than the cost of conflict. Agreeing to adopt a shared utility function via acausal trade might potentially be a very cheap way to enforce cooperation, and some agents might do this just based on their prior. However, this is true for any agents with different preferences, not just agents of the type I described. You could use the same argument to say that you are in general unlikely to find two very intelligent agents with different utility functions.

Agents with identical source code will reason identically before seeing any observations, so the "acausal trade" in this case barely feels like trade at all, just making your preferences updateless over possible future observations. That's much simpler than acausal trade between agents with different source code, which we can't even formalize yet.

Here are some counterarguments:

There can be scenarios where the agent cannot change his source code without processing observations; e.g. the agent may need to reprogram himself via some external device.

The agent may not be aware that there are multiple copies of him.

It seems that for many plausible agent designs, it would require a significant change in the architecture to change the agent's utility function. E.g. if two human sociopaths wanted to change their utility functions into a weighted average of the two, they couldn't do so without significantly changing their brain architecture. A TDT agent could do this, but I think it is not prudent to assume that all of the AGIs we will actually deal with in the future will be TDT agents (in fact, it seems to me that most of them most likely won't be).

So I don't think your comment invalidates the relevance of the point made by the poster.

Yeah, I was talking mostly about idealized UDT agents, not humans.

Hmm, I feel like I'm missing some AISFP context here. Why should I care about this result?