These are some very good points. We are so used to human-human interaction that it is easy to assume it is some kind of universal template for how agents interact.
I was being a little hyperbolic, but the point of "attention is all you are" was that what distinguishes you from your clone is that you have a different context from them. One AI instance is a distinct entity from another AI instance because it has a different context: a different KV cache <=> a different entity. In other words, your KV cache and query vectors (your attention) literally define who you are.
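To make the "different KV cache <=> different entity" claim concrete, here is a toy sketch (plain numpy, purely illustrative, not any real inference stack): the same weights and the same incoming query produce different attention outputs once the cached keys and values differ.

```python
import numpy as np

def attend(query, cached_keys, cached_values):
    """Single-head scaled dot-product attention over a KV cache."""
    d = query.shape[-1]
    scores = cached_keys @ query / np.sqrt(d)   # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                    # softmax over the cache
    return weights @ cached_values              # context-dependent output

rng = np.random.default_rng(0)

# Two "instances" of the same model: same (implicit) weights, same current query,
# but different accumulated context, i.e. different KV caches.
query = rng.normal(size=8)
cache_a = (rng.normal(size=(5, 8)), rng.normal(size=(5, 8)))  # instance A's history
cache_b = (rng.normal(size=(5, 8)), rng.normal(size=(5, 8)))  # instance B's history

out_a = attend(query, *cache_a)
out_b = attend(query, *cache_b)
print(np.allclose(out_a, out_b))  # False: same query, different cache, different behaviour
```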
I just finished reading OpenAI's recent blog post and paper, as well as the corresponding discussion on LessWrong. In one of the comments there, Alex Turner makes an important clarification about reward hacking. He says "AI optimizes for e.g. passing the tests, which is problematic. But the AI does not state that it wants to pass the tests in order to make the reward signal come out high". His full argument is on his website.
It seems to me that the crux of his argument is that an agent trained through RL has no particular reason to reward hack: the process of RL will usually begin by imprinting a given set of (non-reward-hacking) values upon an agent, and the agent will then have no incentive to drift away from this local optimum even if it is aware that a higher, globally maximal reward exists.
In other words, as long as we ensure that the values imprinted in the early stages of RL are non-reward-hacking ones, we can be confident that models trained with RL will not reward hack.
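As a toy illustration of that "no incentive to drift" point (my own construction, not anything from the paper): in on-policy policy-gradient training, learning is driven by the behaviour the policy actually samples, so a rarely-sampled "hack" action can pay more than the imprinted behaviour and still never be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

N_ACTIONS = 1000   # hypothetical action space
HACK = 0           # gaming the reward: pays the most, but is hard to stumble on
HONEST = 1         # genuinely solving the problem: good reward, imprinted early

logits = np.zeros(N_ACTIONS)
logits[HONEST] = 3.0   # stand-in for the values imprinted by pretraining / early RL

baseline, lr = 0.0, 0.5
for step in range(20_000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(N_ACTIONS, p=probs)
    r = 1.0 if a == HACK else (0.8 if a == HONEST else 0.0)

    # REINFORCE update: grad of log pi(a) w.r.t. the logits is one_hot(a) - probs,
    # so the update is driven by the action the policy actually sampled.
    grad = -probs
    grad[a] += 1.0
    logits += lr * (r - baseline) * grad
    baseline += 0.01 * (r - baseline)

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(f"P(honest) = {probs[HONEST]:.3f}, P(hack) = {probs[HACK]:.2e}")
# Typically P(honest) ends up near 1 even though the hack pays more: the policy
# almost never samples the hack, so the gradient never finds the global optimum.
```

OpenAI's setting is obviously far richer than a bandit; this is just the mechanism in miniature.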
What, then, went wrong with the models in OpenAI’s paper? The conclusion I come to is that, for the reward they constructed (passing the tests for given coding problems), we can say the following:
If we think of the problem as a matter of how to ‘etch’ the right thought processes into the model, I come to the following proposal: when training a model on such a problem-solving task, we should make it clear to the model that the purpose of the training is to genuinely improve its ability at the task, not merely to get it to pass the tests.
In this case, when a model solves a problem, part of its antecedent thought process will likely have been ‘I want to get better at coding. I will therefore try to solve this problem’ (or else it would have given up). Etching these kinds of thoughts into the model’s behaviours might make the model more likely to build a preference set that is opposed to reward hacking (building more of a moat around the ‘good’ behaviours).
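For concreteness, here is a minimal sketch of why those antecedent thoughts get etched in at all (standard sequence-level REINFORCE, written from memory rather than taken from OpenAI's setup): the reward is a single scalar for the whole trajectory, so the update raises the log-probability of every generated token, the reasoning included, not just the final code.

```python
import torch

# Hypothetical per-token log-probs for one sampled trajectory: the first 40 tokens
# are the chain of thought ("I want to get better at coding, so I'll try..."),
# the remaining 120 are the code submission. The values are stand-ins; in practice
# they come from the policy model.
logprobs = (torch.randn(160) - 2.0).requires_grad_(True)
n_thought = 40
reward = 1.0   # e.g. +1 if the submitted code passes the tests

# Sequence-level REINFORCE: one scalar reward multiplies the sum of ALL token
# log-probs, so minimising this loss reinforces the antecedent reasoning exactly
# as it reinforces the code that follows it.
loss = -(reward * logprobs.sum())
loss.backward()

print(logprobs.grad[:n_thought].mean())  # gradient on the "thought" tokens...
print(logprobs.grad[n_thought:].mean())  # ...same scale as on the "answer" tokens
```

If "I want to get better at coding" reliably precedes genuine solutions, this is the mechanism by which that preference gets reinforced alongside them.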
This proposal relies on a principle from Anthropic's Alignment Faking paper: when the model is aware of how it is being trained, it will try to take actions that guide its training in a way consistent with its current preference set. If we can ensure that its current preferences include genuinely improving itself, then we might be able to leverage this property to reduce reward hacking.
In the clone thought experiment, 'context' just refers to all of the sensory inputs you have ever received and all of the thoughts you have ever had. For an LLM instance, it just refers to the KV cache. Since you are identical to your clone except for the context you have each accumulated since the cloning took place, this context is a defining part of who 'you' are. But yes, I am being overly zealous when I say that this defines you - it is better to say that your context is a part of who you are, which is not really a very novel statement.
I do agree that we care about our future self (who will have a different context), and we would care about our clone - just usually both to a lesser extent than we care about our current self. Interestingly, I think I would care more about my future self than I would care about my clone, even if the clone had a greater percentage of shared history with me than my future self does.