Thanks for citing my work! I feel compelled to respond because I think you're misunderstanding me a little.
I agree that long-term intent alignment is pretty much incoherent because people don't have much in the way of long-term intentions. I guess the exception would be to collapse it to following intent only when it exists - when someone does form a specified intent.
In my work, intent alignment means personal short-term intent, which is pretty much following instructions as they were intended. That seems coherent (although not without problems).
I use it that way because others seem to as well. Perhaps that's because the broader use is incoherent. It usually seems to mean "does what some person or limited group wants it to do" ("in the short term" is often implied).
The original definition of intent alignment is the broadest I know of, more-or-less doing something people want for any reason. Evan Hubinger defined it that way, although I haven't seen that definition get much use.
For all of this, see "Conflating value alignment and intent alignment is causing confusion." I might not have been clear enough in stressing that I drop the "personal short-term" qualifier but still mean it when saying intent alignment. I'm definitely not always clear enough.
This post is part of the sequence Against Muddling Through.
Epistemic status: I notice I am confused.
I've updated the planned title of this post after reading some of the responses to previous ones. I'm now slightly less confused, but still confused. Progress!
I have an inkling of what it might mean for an AI to be fully aligned to the values of some person or group. I don’t know what it would mean for an AI to be mostly aligned to the intent of a person or group, because humans and human institutions don’t have a coherent unified intent in the first place. The whole idea seems destined either to round off to CEV or to collapse in a puff of logic.[1]
In this post, I attempt to highlight my confusions. I don’t claim to know exactly what the proponents of this idea have in mind, and I am probably operating on the very first layer of the Disagreements Onion, but I have to start somewhere.
What would it mean for an AI to be aligned to the intent of its developers? How aligned is aligned enough for alignment work?[2]
Zooming in on a particular thought experiment:
These AIs are sufficiently aligned that they try hard to do the right thing even on hard-to-check, conceptual, and open-ended tasks.
How does the AI know the right thing? How does it reconcile the giant mess of conflicting urges and drives that animate a single human, let alone a group of us, into some notion of human “intent” that doesn’t collapse like a house of cards in edge cases?
This idea seems related to the concept of instruction-followers (Seth Herd):
This might intuitively be described as trying to make instruction-following (IF) or do-what-I-mean-and-check (DWIMAC) be the central goal of the AGI we design.
And:
A good employee is usually attempting to do what I mean and check. Imagine a perfect employee, who wants to do what their boss tells them to do. If asked to prepare the TPS reports for the first time, this employee will echo back which reports they’ll prepare, where they’ll get the information, and when they’ll have the task finished, just to make sure they’re doing what the boss wants. If this employee is tasked with increasing the sales of the X model, they will not come up with a strategy that cannibalizes sales of the Y model, because they recognize that their boss might not want that. [...]
An instruction-following AGI must have the goal of doing what its human(s) would tell it to do right now, what it’s been told in the past, and also what it will be told to do in the future.
This seems like it's very close to rounding off to CEV. I understand the distinction between short-term wants and reflectively endorsed values, but human wants are incoherent and an AI doing this much reflection on them will notice that. It has to resolve the tradeoffs somehow.
To illustrate, I will list a few of what seem to me like pretty basic surface-level features of an AI that isn’t obviously misaligned. This is not a comprehensive list,[3] it’s just an attempt to write down some of the things that intuitively describe what people seem to mean by “aligned to developers’ intent.” They are:
It follows instructions. Human languages are not programming languages. They are ambiguous, often by design. Any natural-language instruction that a human gives to an AI will contain ambiguity. For every instance of ambiguity, the AI must either (a) resolve the ambiguity according to its own best judgment or (b) ask a human to clarify.
It resolves ambiguity in developer-endorsed ways. If the AI uses its own judgment, then it must resolve ambiguity in a way its developers would endorse. Its answers to questions like “How much?”, “How far?”, and “At what cost?” are in keeping with the way the developers meant for it to act. Otherwise, seemingly innocuous instructions are dangerous; the instruction “Optimize this processor” can result in anything from hacky shortcuts to scheming to tile the Earth in processors.
It asks for clarification without getting stuck. If the AI asks a human to clarify, it receives new natural-language instructions containing new ambiguities, which it must then resolve using its own judgment or by asking for more clarification. The AI then has to decide when to stop asking for clarification and do things. It must make that decision in a way that does not leave it gravely mistaken about its developers’ intent (see the sketch after this list).
It handles conflicting instructions gracefully. Any two instructions will involve some measure of tradeoff, like the tradeoff between “be helpful” and “be harmless”, or between “summarize this document” and “don’t leave out anything important,” or between “make this code performant” and “make this code maintainable.” The AI has to anticipate how its developers would resolve those tradeoffs, even in extreme edge cases they’ve never encountered before, or recognize those edge cases and ask about them.
It fulfills the spirit, not the letter, of the requests it receives. In extremis, the AI must have a good enough model of the developers’ values that it can recognize the difference between what a developer wants-in-the-moment and what they would reflectively endorse. If the developers ask for something that would have horrible consequences (according to their own values), the AI must not give them exactly what they asked for, even if it understands the meaning of the request perfectly. It must refuse requests from the developers themselves that are not in the developers’ best interests.[5]
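To make the shape of that loop concrete, here is a toy sketch of the clarify-or-resolve loop from the first and third items above. Nothing in it comes from the post or from any real system; the helper functions and the "confident enough" heuristic are hypothetical stubs. The structural point is that both branches (resolving an ambiguity unilaterally, and deciding when to stop asking questions) bottom out in some model of what the developers actually want.

```python
# Hypothetical sketch only: every hard step is a stub, because those steps are
# exactly where (per the argument above) a model of developer values is needed.

def find_ambiguities(instruction: str) -> list[str]:
    # Stub: a real system would need to detect the unclear points in a
    # natural-language instruction. Here we pretend there are none.
    return []

def resolve_with_own_judgment(instruction: str, ambiguities: list[str]) -> str:
    # Stub: resolving ambiguity "in developer-endorsed ways" presupposes a
    # model of what the developers would endorse.
    return instruction

def ask_developer(ambiguities: list[str]) -> str:
    # Stub: the developer's answer is itself natural language, so it arrives
    # carrying new ambiguities of its own.
    return "clarified instruction"

def handle_instruction(instruction: str, max_questions: int = 3) -> str:
    """Follow an instruction, resolving ambiguity or asking for clarification."""
    for _ in range(max_questions):
        ambiguities = find_ambiguities(instruction)
        if not ambiguities:
            return instruction  # judged unambiguous enough to act on
        if len(ambiguities) <= 1:
            # Arbitrary placeholder for the judgment call "is my own reading safe?"
            instruction = resolve_with_own_judgment(instruction, ambiguities)
        else:
            instruction = ask_developer(ambiguities)
    # The loop has to end somewhere: at some point the AI stops asking and acts
    # on its own best guess about what the developers intended.
    return resolve_with_own_judgment(instruction, find_ambiguities(instruction))
```

Every stub above is a place where the real decision has to come from somewhere; that is the sense in which each of these surface-level features keeps pointing back at a model of the developers' values.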
And that’s before we get into the details of implementation. How does the AI operationalize “developers’ intent”? Who counts as a “developer”? By what official medium is this intent communicated to the AI? What does the AI do when that medium is inevitably hijacked by a disgruntled programmer?[6] How does the AI handle the inevitable conflicts among developers? What happens when the company itself doesn’t know who’s in charge? Some of these questions are hard even for humans, and each one spawns a fractally nested web of subproblems.
In each of these cases, the AI has to infer what its messy human masters actually want. It has to make those judgment calls using indirect access to the wants in question, filtered through the lossy medium of human language, when the humans in question don’t fully grasp the implications of their own values in the first place.
The AI can’t just read the developers’ minds; and even if it could, it would still have a confusing mess of conflicting priorities to sort through. It’s interpretation all the way down.
On top of that, the AI must not only know the developers’ values, it must be motivated to act on that knowledge.
Generalize this far enough and you’ve effectively instantiated coherent extrapolated volition. To the best of my current understanding, you cannot achieve the described outcomes without it.
Fail to generalize this far enough, fail to design an AI that does something akin to implementing the CEV of its developers, and at some point the correspondence between what the AI wants to do and what the developers want it to do must break down. For reasons explored in a previous post, I strongly expect this breakdown to happen by the time an AI is smart enough to do AI alignment research better than the best humans.[7]
Prosaic alignment plans seem to rely on this point being far enough away that substantially smarter-than-human AIs will behave themselves, or on our being able to control them if they don’t.
Or perhaps some imagine a smart AI whose goals are in some particular regime between arbitrary proxies and CEV. But I don’t have a very clear picture of the limits of that regime, or how one would attain or verify it using current techniques.
Every variant of “near-perfect intent alignment” I’ve managed to conceive either rounds to CEV or blows up in predictable ways. I may be missing something. I would love to be set straight on this by someone with a deeper understanding of prosaic alignment plans.
In the meantime, though, the show must go on. Next, I’ll move on to what feels like firmer ground and discuss why scaling alignment and capabilities together wouldn’t work.
I’m guessing that most disagreements will involve conceptions of “intent alignment” that I haven’t considered. In a sense, a lot of my disagreements with prosaic plans seem to be downstream of the idea that, with current tools, “misaligned, but not so misaligned it would want to fight you” is little easier to achieve than “wholly aligned.”
Another possible crux — not necessarily with this specific post, but overall — is the idea that some mix of alignment and capabilities might be safe enough. To make up some numbers as an example: going from 20% aligned and 10% superintelligent, to 30% aligned and 15% superintelligent, and so on. Breaking this potential crux down further, I think this fails because current techniques (even with AI assistance) don’t get us even 20% of the way to alignment, because labs seem to be scaling capabilities first and foremost, and because I’m fairly sure that it’s mostly capabilities that matter, so even a 98% aligned somewhat-superintelligence still gets us all killed. More on this in the next post.
One implication of the headline claim, if true, is that “authoritarian lock-in” and other “aligned to the wrong entity” problems would require near-totally solving alignment, and thus seem vanishingly unlikely. We may eventually have this problem, but not until alignment research has progressed far enough to allow someone to specify “implement my CEV, but not the CEV of anyone else.” If they try for something like “obedience”, I think we’re back to everyone dying instead. This seems like a common point of disagreement, but I’m not sure it’s a crux.
For those familiar with Seth Herd’s work, I think my current intuitions can be summed up loosely as “but wait, wouldn’t intent alignment either converge on value alignment or resolve a conflict the wrong way and collapse to misalignment?” (With the second seeming far more likely.)
How long do we have before everyone in the field reaches semantic satiation on the word “alignment” and we're reduced to helpless pantomime? Wait, wrong threshold.
For one thing, I’m not talking about corrigibility. Well, not explicitly. For another, I’m describing surface-level behaviors rather than deep motivations.
I don’t know if proponents would consider this feature necessary for “aligned enough to do alignment”, but it sure seems like a really horrible idea to do alignment work with a near-superhuman AI that doesn’t have it. This is the no-monkey-paws feature. It seems important.
Maybe the point of divergence is that a value-aligned AI does this and an intent-aligned AI does not? But there's no clear line between doing this and not doing this; it's all a question of degree.
If you answered “Obviously it ignores the programmer”, then think about what this implies about the reliability of the official medium. The AI is clearly relying on some other method of figuring out the developers’ intent besides the official medium, one that is less legible to the developers. If you answered “Obviously it obeys the programmer”, then that would seem to imply the AI is either too stupid to recognize a hack when it sees one, or it doesn’t care all that much about the actual intent of its actual developers and only cares about what it receives as instructions (an unstable situation that incentivizes reward hacking).
For one example of how this breakdown might happen, I did like the scenario in this piece by Seth Herd.