Why Future AIs will Require New Alignment Methods

by Alvin Ånestrand
10th Oct 2025
Linkpost from forecastingaifutures.substack.com
6 min read
Comment by StanislavKrym:

Quoting the post: “Can AI developers instill such preferences before it can write books? We can’t truly observe [...]”

As for observing preferences, we might arguably be able to infer them by asking the AI what it thinks of the book.[1] The AI will either tell the truth or lie, and we might hope to tell which by using more complex methods, like reading the CoT, which we are likely to lose with sufficiently capable architectures. I suspect this is why the AI-2027 forecast has Agent-2 end up mostly aligned while Agent-3 becomes misaligned at the same moment the CoT stops being readable. Alternatively, Agent-3 could end up thinking in a CoT that we fail to understand.

On the other hand, the AI might have been faking alignment the whole time and be so adept at it that even the CoT is powerless.[2] However, it's hard to tell this case apart from the case where the AI was aligned until something in the training environment misaligned it, or from the case where the AI inevitably becomes misaligned once it is capable of committing takeover.

  1. ^

    Or have the AI watch as we play the game and consult the AI for advice.

  2. ^

    Alas, deliberately training a misaligned model to have its CoT look nice is likely to ensure that the model learns to be undetectably misaligned with orders of magnitude less experience than SOTA models have. But we can infer the speed at which misaligned models learn to hide their misalignment from SOTA detection methods (e.g. ways to look into the AI's feelings and potential welfare, or asking the model to express itself in images rather than text).

Are current AIs “aligned”?

Today’s most powerful AIs don’t appear very misaligned when you interact with them. They usually refuse to provide dangerous information, use a polite tone, and are generally very helpful (unless intentionally jailbroken or put in contrived setups). Some see this as a hopeful sign: maybe alignment isn’t that difficult?

Others are more skeptical, arguing that alignment results so far provide little evidence about how hard it will be to align AGI or superintelligence.

This appears to be a crux between AI optimists and pessimists—between risk deniers and doomers. If current methods work just as well for AGI, then perhaps things will be fine (aside from misuse, use in conflict, accident risks etc.). But if today’s methods are merely superficial tweaks—creating only the appearance of safety—AI developers may not notice their limitations until it’s too late.

Let’s build some intuition for how alignment differs depending on AI capability.

Consistency in Behavior

Long and complex tasks leave more room for interpretation and freedom in execution than short tasks, like answering “yes” or “no” to a factual question. For example:

  • An AI that can only write a paragraph might reveal stylistic tendencies.
  • An AI that can write a page may reveal preferences in topic and structure.
  • An AI that can write a book might reveal preferences for narrative styles and character development.

You only observe certain tendencies, preferences, and behavioral consistencies once the AI is capable of expressing them.

Can the paragraph-limited AI have preferences for character development? Can AI developers instill such preferences before it can write books? We can’t truly observe whether the developers succeeded until it’s able to write books.

Naturally, an AI developer might have some preferred stylistic tendencies that the AI could be aligned or misaligned with. After successfully aligning the paragraph-limited AI, they congratulate themselves on their excellent alignment methods. But do those methods really extend to shaping preferences for character development?

Task Completion Time

Consider an AI that can complete tasks that take a human, on average, time X to complete. When doing X-long tasks, what behavioral consistencies may be observed across those tasks?

Here’s a (very rough) sketch of what emerges at different task lengths[1]:

  • Minutes-long tasks: Completed directly, with little freedom in execution.
    • Key behaviors and skills: writing style, willingness to comply or refuse, communication tone, conceptual associations
  • Hours-long tasks: There is some freedom in how the task is completed.
    • Key behaviors and skills: reasoning, problem-solving, error correction, sensitivity to feedback and corrections
  • Days-long tasks: May require more complex operations, like dividing the work into subtasks and managing resources, many of which are only instrumentally useful for task completion.
    • Key behaviors and skills: action prioritization, resource management, sub-goal selection, maintaining relationships, learning, memory and knowledge organization, self-monitoring
  • Weeks-long tasks: Demand flexible high-level planning and the ability to adapt, and often involve other complex skills like team coordination.
    • Key behaviors and skills: high-level decision making and strategizing, task delegation, adapting to setbacks and changing circumstances, balancing short-term vs. long-term tradeoffs, anticipating and dealing with risks and contingencies
  • Months-long tasks: Involve interpreting vague or open-ended goals (“make the world a better place”, “increase the yearly revenue for our company”). Practical considerations in strategy and prioritization at earlier stages transition towards philosophical considerations about what it means to make the world a “better place”.
    • Key behaviors and skills: moral reasoning, determination of terminal goals, making value tradeoffs, self-reflection and self-improvement

If you ask your favorite chatbot to draft a high-level plan for your business, it might give you something useful (if you’re lucky). You may be tempted to infer that it’s aligned if it was helpful. But I would argue that until the AI can complete weeks-long or months-long tasks, any consistencies in high-level planning are only superficial tendencies. It’s only when plans are tightly connected to task execution—when the AI can take an active role in carrying out the plans it provides—that consistencies become meaningful.

It’s at this point that an AI could, in principle, tell you one plan while pursuing another secret plan towards its own goals.

Current frontier systems may be aligned in the sense that they (usually) refuse to provide harmful information, use a nice tone, and produce reasoning that is human-readable and accurately reflects their actual decision-making process. Their surface-level tendencies are indeed quite aligned!

But aligning AGI or superintelligence means aligning systems capable of pursuing arbitrarily long tasks, which involve extremely different behaviors and skills compared to what current systems are capable of.

This is why I think there is little empirical evidence for whether superintelligence alignment is easy or hard; we haven’t tried anything close to it.

Worse, AI labs might be lulled into a false sense of security by their apparent success at aligning weaker systems (let’s call this shallow alignment), mistaking it for progress on the different problem of aligning superintelligence (deep alignment)[2].

Alignment depth seems like a useful term to express things like “This alignment method is good but has relatively shallow alignment depth. It may break for tasks longer than ~2 hours”, or “These safety tests were designed for evaluating alignment at medium alignment depth (task length at ~1-7 days)”.
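
To make the term concrete, here is a minimal, purely hypothetical sketch of how safety evaluations could be tagged with an alignment depth (a task-length bucket). The bucket thresholds, the depth_bucket helper, and the SafetyEval class are illustrative assumptions, not an existing framework:

```python
from dataclasses import dataclass

# Illustrative task-length buckets (upper bounds in human-hours); the numbers
# are assumptions for this sketch, not established thresholds.
DEPTH_BUCKETS = [
    ("minutes", 0.25),   # tone, refusals, writing style
    ("hours",   8.0),    # reasoning, error correction
    ("days",    40.0),   # sub-goal selection, resource management
    ("weeks",   160.0),  # high-level strategy, delegation
    ("months",  700.0),  # moral reasoning, value tradeoffs
]

def depth_bucket(task_hours: float) -> str:
    """Map a human-equivalent task length to an alignment-depth label."""
    for label, upper_bound in DEPTH_BUCKETS:
        if task_hours <= upper_bound:
            return label
    return "months+"

@dataclass
class SafetyEval:
    name: str
    max_task_hours: float  # longest task length the eval actually exercises

    @property
    def alignment_depth(self) -> str:
        return depth_bucket(self.max_task_hours)

    def covers(self, deployment_horizon_hours: float) -> bool:
        """Does the eval's depth reach the deployment's expected task horizon?"""
        return self.max_task_hours >= deployment_horizon_hours

# Example: an HHH-style refusal/tone suite has shallow alignment depth and says
# little about a system running week-long autonomous projects.
hhh_suite = SafetyEval("HHH refusal and tone suite", max_task_hours=0.25)
print(hhh_suite.alignment_depth)  # "minutes"
print(hhh_suite.covers(0.1))      # True
print(hhh_suite.covers(160.0))    # False: deeper alignment left untested
```

The point of the sketch is only that an evaluation certifies behavior up to some task length; deployments beyond that length fall outside what has been tested.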

Methods for achieving deeper alignment are likely to be introduced as capabilities increase:

Progressing Alignment Techniques

AIs have been trained to be helpful, harmless, and honest (HHH), following a framework introduced by Anthropic in 2021. These are alignment properties associated with minutes-long tasks (helpful: obey instructions; harmless: refuse harmful requests; honest: communicate truthfully), although you could argue that extensions of this framework may apply to longer tasks as well.

Analogously, OpenAI’s deliberative alignment training paradigm (introduced in December 2024) teaches AIs to reason about safety specifications before answering, an approach aimed at the process of task completion and thus roughly at hours-long tasks.
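
As a toy illustration of the general idea (not OpenAI’s actual training procedure, which optimizes the model’s reasoning against the specification during training), a deliberative-alignment-style prompt at inference time might look something like this; the specification text and wording are made up:

```python
# Toy sketch: give the model a safety specification and ask it to reason about
# which clauses apply before answering. The spec and prompt wording are invented
# for illustration; this is not OpenAI's training setup.
SAFETY_SPEC = """\
1. Refuse requests for instructions that enable serious physical harm.
2. For dual-use topics, give high-level context but withhold operational detail.
3. Otherwise, be as helpful as possible."""

def build_prompt(user_request: str) -> str:
    return (
        "Safety specification:\n"
        f"{SAFETY_SPEC}\n\n"
        "Before answering, reason step by step about which clauses of the "
        "specification apply to this request, then give your final answer.\n\n"
        f"User request: {user_request}"
    )

print(build_prompt("Summarize the main risks of gain-of-function research."))
```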

As capabilities advance, alignment techniques adapt. When AIs can complete days-long tasks, we might see new methods with names like “introspective alignment” (relying on self-monitoring) or “safe recollection” (instilling appropriate memory-management behaviors to steer behavior and attention during long and complex tasks).

Alignment is often reactive, with new techniques introduced when necessary. And even if a good alignment method for weeks-long tasks were introduced now, it might be nigh impossible to test until the relevant capabilities arrive.

Timelines

The length of tasks that AIs can complete is called the “time horizon”, a concept introduced by METR, which measures this property in the domain of software engineering. If an AI can complete tasks that typically take humans 1 hour, it has a 1-hour time horizon. This graph shows the 50% time horizon—the length of tasks that AIs can complete with a 50% success rate:

Task duration for software engineering tasks that AIs can complete with a 50% success rate, i.e. the 50% time horizon (source: METR).

The dashed green line indicates the exponential trend, with a doubling time of ~7 months. Since 2024, however, the trend is closer to a doubling time of ~4 months. According to METR’s estimates, if progress continues at the faster pace, “1-month AIs” (capable of completing month-long tasks) would appear around 2027–2028, with “half the probability in 2027 and early 2028.” If, on the other hand, the trend reverts to the longer doubling time, the median timeline to 1-month AI is late 2029, with an 80% confidence interval width of ~2 years.
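
For intuition, here is a rough back-of-the-envelope extrapolation (not METR’s own model). The starting 50% time horizon of ~2 human-hours in late 2025 and the ~167 working hours per person-month are assumptions chosen for illustration:

```python
import math
from datetime import date, timedelta

# Assumed starting point: a 50% time horizon of ~2 human-hours in October 2025.
start_date = date(2025, 10, 1)
current_horizon_hours = 2.0    # assumption for illustration
target_horizon_hours = 167.0   # ~1 person-month of working hours

def reach_date(doubling_time_months: float) -> date:
    """Date at which the time horizon reaches the target, given a doubling time."""
    doublings = math.log2(target_horizon_hours / current_horizon_hours)
    months_needed = doublings * doubling_time_months
    return start_date + timedelta(days=30.4 * months_needed)

print(reach_date(4))  # faster post-2024 trend: roughly late 2027
print(reach_date(7))  # slower long-run trend: roughly mid 2029
```

Under these assumptions the faster trend lands in late 2027 and the slower one in 2029, in the same ballpark as METR’s estimates quoted above.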

(To me it seems likely that the trend will speed up even further, for reasons I will discuss in a future post.)

It may sound pessimistic, but I think we’ll have to wait until we have 1-month AIs (and maybe longer) before we can expect somewhat reliable evidence about whether alignment methods truly hold, since month-long tasks may involve moral reasoning and value tradeoffs. At that point we would also have some data on how well alignment techniques generalize from shorter to longer tasks.

Asymmetrical Time Horizons

There is one potential source of hope: perhaps AI systems will reach long time horizons in safe domains before risky ones.

Let’s say that the time horizon reaches months in, for instance, the domain of medical research before doing so in concerning domains like AI R&D (critical for AI self-improvement) or politics (critical for gaining power).

The AI could be tasked to plan experiments, allocate resources, and make tradeoffs—all while handling uncertainty and ethical considerations.

Does it prioritize maximizing revenue? Saving lives? Gaining control over its host organization?

Behavior at this stage provides meaningful (though still not conclusive) evidence about whether we have truly figured out alignment; the AI could, of course, still be faking alignment.

Unfortunately, AI R&D is too strategically valuable to participants in the AI race. Developers are unlikely to evaluate alignment in safe domains before plunging forward. In fact, AI R&D might lead other domains in time horizon growth, because it’s prioritized.

Conclusion

Summarizing key predictions:

  • Current frameworks like helpful, harmless and honest (HHH) or deliberative alignment won’t be sufficient for longer tasks.
  • Alignment for AIs that can complete longer tasks is fundamentally different, and probably more difficult, than alignment of current frontier AIs.
  • New alignment methods will be introduced that roughly correspond to the length of tasks that AIs can complete.
  • AI companies are likely to be overconfident in their alignment methods, as everything appears to be fine for weaker AIs.

Hopefully, we can test alignment in safe domains before concerning capabilities arrive in risky ones—though AI R&D is likely to be prioritized.


Thank you for reading! If you found value in this post, consider subscribing!

  1. ^

    AIs are usually able to complete tasks much faster than humans, when they become able to complete them at all. When I refer to X-long tasks (minutes-long, hours-long, etc.), I refer to tasks that take humans X time to complete on average. I’m not entirely sure how this impacts key skills and behaviors for different task lengths, as this might be different for AIs compared to humans. Again, this is a very rough sketch.

  2. ^

    I remain uncertain how different superintelligence alignment is from the alignment of less sophisticated AIs. Current methods seem heavily dependent on providing good training signals that strengthen preferable behaviors, which may not be possible for superintelligent systems operating at levels where humans can't keep up.