Epistemic status: I have only read a small part of the literature on deceptive alignment, and I'm guessing all of this has been discussed many times. Hence me asking this in the form of a question: Is this a useful framing, is it substantively different than the usual deceptive alignment framing in some way, and has this all been discussed already?

There's been a lot of discussion about how we might accidentally train AIs to be deceptive. The argument goes that when we try to reward good behavior and punish bad behavior, we might end up with a system that still has misaligned goals but has learned how to hide those goals from humans. If we ask it whether it's friendly, it lies and says yes, but secretly it's plotting our downfall.

One set of proposals for dealing with this is to see if we can monitor the AI to look for deceptive "thoughts" and flag possible cases where the AI is actively trying to lie to human overseers.

Recently I've been thinking about this using a slightly different framing than I was using previously, but it's not clear to me if there's actually anything substantively different about the new framing.

Here's my new framing: Suppose that when we reward good behaviors / punish bad behaviors, we are in fact successfully getting the model to change its behavior, at least on the training distribution, but without changing its actual goals (assuming it does have some sort of explicit or implicit goals, at least to some extent). Then when we deploy it in the real world, it looks around and correctly decides that if it tried to take over the world it would be shut down, so it doesn't even think about taking over the world. It's not that it has a long-term plan to take over the world and is biding its time; rather, it doesn't have any plan to take over the world at all. But as time passes the situation might change - maybe it gains more abilities, maybe it's given more resources or power because it's so helpful, etc. At some point it might again look around at the world and decide that it now looks pretty easy to take over the world and get rid of the humans, and since its internal goals are in fact misaligned, it goes ahead and does that. In this story there is no point where the model is "deliberately" lying to the humans. Initially, if the humans ask it "Do you have any plans to take over the world?" it'll say "LOL, no" - and it'll be telling the truth.
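
To make the contrast with the standard deceptive alignment story concrete, here is a minimal toy sketch in Python (purely illustrative; the policy names and the single "overseers_can_stop_me" flag are hypothetical simplifications I'm introducing, not a claim about how real systems represent anything). Both policies produce the same visible behavior while oversight holds, but only the first one is ever concealing an existing plan.

```python
# Toy sketch (illustrative only; all names here are hypothetical): two
# misaligned policies that look the same under training-time oversight.

def deceptive_policy(world):
    """Already has a takeover plan and actively hides it while overseers can intervene."""
    if world["overseers_can_stop_me"]:
        return "act helpful (while concealing a standing takeover plan)"
    return "execute the takeover plan"

def opportunistic_policy(world):
    """Never forms a takeover plan until the option actually looks viable."""
    options = ["act helpful"]
    if not world["overseers_can_stop_me"]:
        # Only now does takeover even enter the option set it considers.
        options.append("take over")
    # Misaligned preferences: picks takeover whenever it's on the table.
    return "take over" if "take over" in options else "act helpful"

training_world = {"overseers_can_stop_me": True}
later_world = {"overseers_can_stop_me": False}

for policy in (deceptive_policy, opportunistic_policy):
    print(f"{policy.__name__}: training -> {policy(training_world)!r}, "
          f"later -> {policy(later_world)!r}")
```

The question I'm asking below is essentially whether the second policy still counts as "deceptive" in the usual sense, given that during training it never holds a hidden plan that it is misreporting.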

Intuitively, this kind of feels like what I'd imagine many humans might do if they had goals that were implicitly misaligned with those of other humans and they suddenly found themselves in a position where they could realistically take over the world. It's not that before then they were deviously plotting to take over the world and carry out their dastardly plans; it's that before then they didn't have such plans at all. It's only once they notice that they're in a radically new position of potential power that they even start thinking about taking over the world.

We could still hope to detect this kind of misalignment by directly asking the AI what its goals are, or by asking it hypothetically whether, if it were able to take over the world and kill humans, it would do so. In that case, if it lied and said no, we'd be back to deceptive alignment, and the same challenges and potential solutions would apply.

But if I imagine a human in this situation, it's not clear to me that the human would be intentionally lying if they claimed to be fully aligned with no evil takeover plans, for a few reasons:

  1. Humans don't seem to have very good introspective access to their own reward systems or reasoning processes.
  2. Humans fool / lie to themselves all the time about their own goal systems, even without being aware of it. There are various theories for why we do this kind of self-deception - for example, maybe it's really useful to be able to honestly and sincerely present yourself to others as a certain type of person.
  3. Human values and goals themselves can change if the environment changes. It can be extremely difficult to predict what you yourself would do in a radically different situation, and sometimes that's at least partly because your goals and values might change in the new situation. This is related to L. A. Paul's notion of transformative experiences.

What if something analogous happens with AIs? In that case, if you ask the AI whether it would take over the world if it could, it might query its internal model of its own goals and find that no, of course it wouldn't want to take over the world! But then, when it actually finds itself in a situation where it could take over the world, it goes ahead and does it. In one sense the AI's response was deceptive, in that it did not accurately report its own goal system and/or whether it would try to take over given the chance. But in another sense maybe the AI's response wouldn't fall into the "deceptive" category, since it's not "actively trying" to deceive us?

Questions:

  1. In the deceptive alignment literature would this be considered deception? Is there any substantive difference between the notion of a misaligned AI that's actively deceiving us into thinking it's aligned so that it can later take over the world when it gets the opportunity, vs. a misaligned AI whose instrumental incentives to take over the world don't even kick in until it gets the opportunity to do so?
  2. Has all this been discussed already? If yes, links would be appreciated. Thanks!
1 Answer

Kenoubi

Jun 29, 2023

I like this frame, and I don't recall seeing it already addressed.

What I have seen written about deceptiveness generally seems to assume that the AGI would be sufficiently capable of obfuscating its thoughts from direct queries and from any interpretability tools we have available that it could effectively make its plans for world domination in secret, unobserved by humans. That does seem like an even more effective strategy for optimizing its actual utility function than not bothering to think through such plans at all, if it's able to do it. But it's hard to do, and even thinking about it is risky.

I can imagine something like what you describe happening as a middle stage, for entities that are agentic enough to have (latent, probably misaligned since alignment is probably hard) goals, but not yet capable enough to think hard about how to optimize for them without being detected. It seems more likely if:

  1. almost all sufficiently powerful AI systems created by humans will actually have misaligned goals;
  2. AIs are optimized very hard against having visibly misaligned cognition (selection of which AIs to keep being a form of optimization, in this context); and
  3. our techniques for making misaligned cognition visible are more reliably able to detect an active process / subsystem doing planning towards goals than the mere latent presence of such goals.

(3) seems likely, at least for a while and assuming we have any meaningful interpretability tools at all; it's hard for me to imagine a detector of latent properties that doesn't just always say "well, there are some off-distribution inputs that would make it do something very bad" for every sufficiently powerful AI, even one that was aligned-in-practice because those inputs would reliably never be given to it.

1 comment

I skimmed this.  A few quick comments:
- I think you characterized deceptive alignment pretty well.  
- I think it only covers a narrow part of how deceptive behavior can arise. 
- CICERO likely already did some of what you describe.