[ Question ]

What is the alternative to intent alignment called?

by Richard_Ngo · 1 min read · 30th Apr 2020 · 6 comments

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Paul defines intent alignment of an AI A to a human H as the criterion that A is trying to do what H wants it to do. What term do people use for the alternative definition of alignment, in which A is trying to achieve H's goals (whether or not H intends for A to do so)?

Secondly, this distinction seems to map roughly onto the distinction between an aligned genie and an aligned sovereign. Is that a fair characterisation?

(Intent alignment definition from https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6)
