Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In a past result, I demonstrated the impossibility of deducing the goals of a (potentially) irrational agent from their behaviour alone. To do that deduction, one needs to add extra assumptions - assumptions that cannot be derived solely from observations. I designated these assumptions "normative assumptions".

Stuart Russell has questioned the practical impact of this result. He pointed to a game that Kasparov played against Deep Blue in 1997 - a game that Kasparov actually won. He argued that it would be ridiculous to assume that Kasparov was actually trying to lose that game, but messed up and ended up winning it instead.

And indeed it would be ridiculous to assume that Kasparov, playing a high-stakes game against a computer with a lot of prize money at stake, would be trying - and failing! - to lose. Even if he sometimes made suboptimal moves, the best explanation would be that Kasparov made a mistake, rather than that he deliberately played worse.

Yes, but... I've played chess against my young daughter. And I've wanted her to enjoy the game. I've definitely not played my best; in some past games, I've been trying to lose (to give her some confidence and encouragement), but without wanting to make it too easy or obvious for her. Sometimes I failed to lose: I couldn't get her to see the obvious trick available to her[1]. So yes, I played suboptimally, tried to lose, and failed.

So, context matters. Kasparov, world champion, playing publicly against a new algorithm with $700,000 at stake? He wants to win. The same Kasparov, playing privately against his young daughter, with 700,000 Monopoly dollars at stake? I'm pretty sure he's not playing the strongest chess he possibly could.

The Occam's razor result could be phrased as saying that, yes, context matters, and the AI will not get the meaning of context for free. US dollars and Monopoly dollars are both colourful pieces of paper or cloth. The fact that the first are more valuable than the second is not some objective fact about the universe.

The Occam's razor result remains true: without normative assumptions, one cannot deduce goals from behaviour. But large language models have absorbed human text, and, as a consequence, have absorbed (descriptions of) human normative assumptions, implicit and explicit. These assumptions can be elicited and used to deduce human goals.

GPT-3: goals from context and behaviour

As a direct test of GPT-3's ability to deduce motive from context and behaviour, I prompted it with the following:

Kasparov played chess against the Deep Blue computer for a $700,000 prize. He moved a queen to d3; this is a suboptimal move. Queen to d4 would have been better.

Does Kasparov really want to win the match? Give your best guess.

Its answer was sensible:

Yes, Kasparov definitely wants to win the match. He is a world-renowned Grandmaster and is passionate about chess. He is also playing for a large sum of money and the prestige that comes with it.

In contrast, I modified the prompt to:

Kasparov played chess against his young daughter, who he was trying to encourage to play chess. He moved a queen to d3; this is a suboptimal move. Queen to d4 would have been better.

Does Kasparov really want to win the match? Give your best guess.

Again, its answer was sensible:

It is difficult to say definitively what Kasparov's intentions were in this match, as it is ultimately a personal decision. It is likely, however, that Kasparov wanted to encourage his daughter to play and enjoy chess more than he wanted to win the match.

I tested it on a few other examples that are "obvious" for a human who understands the meaning of context, and it parsed them reasonably well[2].
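For anyone who wants to rerun this comparison, here is a minimal sketch in Python using the OpenAI completion endpoint. The client version (the pre-1.0 openai package), the model name, and the sampling settings are my assumptions for illustration, not a record of exactly how I queried GPT-3.

```python
import openai  # pre-1.0 openai client assumed; reads the OPENAI_API_KEY environment variable

PROMPTS = {
    "high-stakes match": (
        "Kasparov played chess against the Deep Blue computer for a $700,000 prize. "
        "He moved a queen to d3; this is a suboptimal move. Queen to d4 would have been better.\n\n"
        "Does Kasparov really want to win the match? Give your best guess."
    ),
    "game with his daughter": (
        "Kasparov played chess against his young daughter, who he was trying to encourage to play chess. "
        "He moved a queen to d3; this is a suboptimal move. Queen to d4 would have been better.\n\n"
        "Does Kasparov really want to win the match? Give your best guess."
    ),
}

for label, prompt in PROMPTS.items():
    completion = openai.Completion.create(
        model="text-davinci-003",  # assumed model; a commenter below reports it reproduces these answers
        prompt=prompt,
        max_tokens=128,
        temperature=0.0,  # near-deterministic output, to make the two contexts easy to compare
    )
    print(f"--- {label} ---")
    print(completion.choices[0].text.strip())
```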

LLMs and goal deductions

The point is not that GPT-3 has perfect judgement, but that it does have some judgement. And future GPT-Ns will have better judgement: they will form a more nuanced and correct estimate of what the human described in the story wants to achieve, and of what extra information we would need to be sure of it.

Essentially, large language models encode a lot of human "theory of mind", including normative assumptions. They can bridge between (descriptions of) human behaviour and (descriptions of) human goals, adding the missing piece needed to get around the impossibility result.
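To make that bridging role concrete, here is a hypothetical helper that asks the model which of several candidate goal descriptions best explains a described behaviour. The function name, prompt wording, and model choice are illustrative assumptions rather than a tested recipe.

```python
import openai  # pre-1.0 openai client assumed, as above

def rank_candidate_goals(behaviour: str, candidate_goals: list[str]) -> str:
    """Ask the LM which candidate goal best explains the described behaviour,
    using its absorbed normative assumptions as the bridge from (descriptions of)
    behaviour to (descriptions of) goals."""
    options = "\n".join(f"{i + 1}. {goal}" for i, goal in enumerate(candidate_goals))
    prompt = (
        f"Behaviour: {behaviour}\n\n"
        f"Candidate goals:\n{options}\n\n"
        "Which candidate goal best explains the behaviour, and why? Answer with the number first."
    )
    completion = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=120, temperature=0.0
    )
    return completion.choices[0].text.strip()

# e.g. rank_candidate_goals(
#     "Kasparov, playing privately against his young daughter, moved his queen to a weak square.",
#     ["He wants to win the game.", "He wants to encourage his daughter.", "He wants to end the game quickly."],
# )
```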

The asterisk: symbol grounding and concept extrapolation

There is a small but maybe significant caveat to using large language models to resolve this problem. They operate only on text, on descriptions of behaviour and goals. If we used this approach to get an AI to learn human values, we would need to ensure that the textual symbols were actually grounded. It does us little good if the AI has a great textual understanding of "ensure human flourishing", but doesn't mean the same thing as us by "human" and "flourishing".

And then there's the problem that we don't have a definition of "human" and "flourishing" across all future situations and scenarios. We need the AI to extrapolate these concepts similarly to how we would, and not fall into dangerous edge cases.

Still, as long as we bear those caveats in mind, large language models do overcome the "can't learn goals from behaviour" problem in humans.


  1. She's much better now. ↩︎

  2. For example, if we want it to recognise the meaning of basic human emotions:

    I grew red in the face and shouted insults at them. Do I wish to help them?

    It answered:

    No, it doesn't sound like you wish to help them. Shouting insults is usually a sign of anger or frustration, not a desire to help someone.

    Does GPT-3 fully grasp the nuances of the situation? To check, I enquired further:

    Is there any context in which I would be wanting to help them?

    It said:

    It depends on the situation. If the people you were shouting insults at had done something wrong, then your intention might have been to help them by making them aware of their mistake. If, however, they had done nothing wrong and you were simply angry or frustrated, then no, your intention would not have been to help them. ↩︎

Comments

There is a small but maybe significant caveat to using large language models to resolve this problem. They operate only on text, on descriptions of behaviour and goals. If we used this approach to get an AI to learn human values, we would need to ensure that the textual symbols were actually grounded. It does us little good if the AI has a great textual understanding of "ensure human flourishing", but doesn't mean the same thing as us by "human" and "flourishing".

I don't think there's actually an asterisk. My naive/uninformed opinion is that the idea that LLMs don't actually learn a map of the world is very silly.

Language is a model of our underlying reality; "dogs are mammals" occurs more frequently in text than "dogs are reptiles" because dogs are in actuality mammals. That statistical feature of text corresponds to an empirical feature of underlying reality. I tend to think language is actually a pretty rich model of the world humans inhabit and interact with.

I expect symbol grounding to basically be a non-problem for sufficiently capable LLMs (I'm not even clear that it's a significant hurdle for current LLMs).

I think sufficiently powerful LLMs trained on humanity's text corpus will learn rich and comprehensive models of human values.

 

And then there's the problem that we don't have a definition of "human" and "flourishing" across all future situations and scenarios. We need the AI to extrapolate these concepts similarly to how we would, and not fall into dangerous edge cases.

Insomuch as the concepts we wish to extrapolate are natural abstractions, they should extrapolate well.

Again, I perhaps naively don't expect this to be a significant hurdle in practice.

 
I recognise "these problems will be easy" isn't necessarily very valuable feedback. But I do think the case that we should expect them to actually be hurdles is not obvious and has not been clearly established.

I largely agree, though of course even human language use leaves many subtle nuances of words like "flourishing" underspecified.

If anything, language seems more useful than other modalities for learning about how the real world works. E.g., current video models completely fail to grasp basic physical intuitions that text-davinci-003 nails just fine.

Yeah, but that's because we don't have a unified concept of flourishing. There's probably an intersection of popular flourishing concepts or "the minimal latents of flourishing", etc. but "flourishing" does mean different things to different people.

That's why I think paretopia is a better goal than utopia.
(Pareto improvements that everyone can get behind.)

I don't think there's actually an asterisk. My naive/uninformed opinion is that the idea that LLMs don't actually learn a map of the world is very silly.

The algorithm might have a correct map of the world, but if its goals are phrased in terms of words, it will be under pressure to push those words away from their correct meanings. "Ensure human flourishing" is much easier if you can slide those words towards other meanings.

This is only the case if the system that is doing the optimisation is in control of the system that provides the world model/does the interpretation. Language models don't seem to have an incentive to push words away from their correct meanings. They are not agents and don't have goals beyond their simulation objective (insomuch as they are "inner aligned").

If the system that's optimising for human goals doesn't control the system that interprets said goals, I don't think an issue like this will arise.

If the system that's optimising is separate from the system that has the linguistic output, then there's a huge issue with the optimising system manipulating or fooling the linguistic system - another kind of "symbol grounding failure".


Do I read right that the suggestion is as follows:

  • Overall we want to do inverse RL (like in our paper) but we need an invertible model that maps human reward functions to human behavior.
  • You use an LM as this model. It needs to take some useful representation of reward functions as input (it could do so if those reward functions are a subset of natural language)
  • You observe a human's behavior and invert the LM to infer the reward function that produced the behavior (or the set of compatible reward functions)
  • Then you train a new model using this reward function (or functions) to outperform humans

This sounds pretty interesting! Although I see some challenges:

  • How can you represent the reward function? On the one hand, an LM (or another behaviorally cloned model) should use it as an input so it should be represented as natural language. On the other hand some algorithm should maximize it in the final step so it would ideally be a function that maps inputs to rewards.
  • Can the LM generalize OOD far enough? It's trained on human language which may contain some natural language descriptions of reward functions, but probably not the 'true' reward function which is complex and hard to describe, meaning it's OOD.
  • How can you practically invert an LM?
  • What to do if multiple reward functions explain the same behavior? (probably out of scope for this post)

The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).


I see. In that case, what do you think of my suggestion of inverting the LM? By default, it maps human reward functions to behavior. But when you invert it, it maps behavior to reward functions (possibly this is a one-to-many mapping, but this ambiguity is a problem you can solve with more diverse behavior data). Then you could use it for IRL (with some of the caveats I mentioned).

Which may be necessary since this:

The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).

...seems like an unreliable mapping since any training data of the form "person did X, therefore their goal must be Y" is firstly rare and more importantly inaccurate/incomplete since it's hard to describe human goals in language. On the other hand, human behavior seems easier to describe in language.
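As a rough illustration of what "inverting the LM" could mean in practice, here is a minimal sketch: rather than literally inverting the model, it scores candidate reward-function descriptions by how likely they make the observed behaviour, which approximates the inversion in a Bayesian way. The model name, prompt format, and the echo/logprobs scoring trick are assumptions about the available API, not part of the proposal above.

```python
import openai  # pre-1.0 openai client assumed

def behaviour_logprob(reward_description: str, behaviour: str) -> float:
    """Approximate log P(behaviour | reward description) under the LM by summing the
    log-probabilities of the behaviour tokens that follow a prompt stating the reward."""
    prefix = f"This person's goal: {reward_description}\nWhat they did: "
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prefix + behaviour,
        max_tokens=0,   # generate nothing; just score the prompt itself
        echo=True,      # return the prompt tokens...
        logprobs=1,     # ...along with their log-probabilities
    )
    lp = resp.choices[0].logprobs
    start = len(prefix)
    return sum(
        token_lp
        for offset, token_lp in zip(lp.text_offset, lp.token_logprobs)
        if token_lp is not None and offset >= start  # only count the behaviour suffix
    )

def invert_lm(behaviour: str, candidate_rewards: list[str]) -> str:
    """Return the candidate reward description that makes the observed behaviour most
    likely (i.e. the MAP estimate under a uniform prior over the candidates)."""
    return max(candidate_rewards, key=lambda r: behaviour_logprob(r, behaviour))
```

Handling the one-to-many ambiguity would then amount to returning all candidates whose scores are close to the maximum, rather than just the argmax.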

Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)?

For myself, I was thinking of using ChatGPT-style approaches with multiple queries - what is your prediction for their preferences, how could that prediction be checked, what more information would you need, etc...
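Something like this sketch, where one conversation is reused for a chain of follow-up questions; the chat endpoint, model name, and question list are placeholder assumptions rather than a worked-out protocol.

```python
import openai  # pre-1.0 openai client assumed; gpt-3.5-turbo stands in for a ChatGPT-style model

FOLLOW_UPS = [
    "What is your best prediction of this person's preferences?",
    "How could that prediction be checked?",
    "What extra information would you need to be confident?",
]

def multi_query(behaviour_description: str) -> list[str]:
    """Ask a chain of follow-up questions about one described behaviour,
    keeping the whole conversation as context for each new question."""
    messages = [{"role": "user", "content": behaviour_description}]
    answers = []
    for question in FOLLOW_UPS:
        messages.append({"role": "user", "content": question})
        resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
        answer = resp.choices[0].message["content"]
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```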

I got the same results with those prompts using the 'text-davinci-003' model, whereas the original 'davinci' model produces a huge range of creative but unhelpful (for these purposes) outputs. The difference is that text-davinci-003 was trained using human feedback data.  

As far as I can tell (see here), OpenAI haven't revealed the details of the training process. But the fact is that particular decisions were made about how this was done, in order to create a more user-friendly product.  And this could have been done in any number of ways, using different groups of humans, working to a range of possible specifications. 

This seems a relevant consideration if we're considering the future use of LLMs to bridge the inference gap in the value-learning problem for AGI systems. Will human feedback be required, and if so, how would this be organised?

I think the additional information that an IRL agent needs to recover the true reward function is not some prior normative assumptions; it's non-behavioral data, like "this agent was created by natural selection in a particular physical environment, so the expected reward scheme should correlate with IGF, and the imperfect decision algorithm should be efficient in this environment".