The way I'm thinking about AGI algorithms (based on how I think the neocortex works), there would be discrete "features", but each one comes in shades of applicability from 0 to 1 rather than being simply present or absent. By the same token, the reward wouldn't perfectly align with any "features" (since features are extracted from patterns in the environment). Instead, you would wind up with "features" being "desirable" (correlated with reward) or "undesirable" (anti-correlated with reward) on a continuous scale from -∞ to +∞. The agent would then try to bring about "desirable" things rather than maximize reward per se, since the reward may not perfectly line up with anything in its ontology / predictive world-model. (Related.)

So then you sometimes have "a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y".
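A toy sketch of that picture, just to pin down the bookkeeping. Everything here is an assumption made for illustration: the `Feature` class, the `net_desirability` helper, the desirability values of +1 and -1, and especially the choice to combine features with a plain applicability-weighted sum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    # Hypothetical learned score: positive if the feature has tended to go
    # along with reward, negative if it has tended to go against it.
    desirability: float


def net_desirability(matches):
    """Applicability-weighted sum of desirabilities.

    `matches` is a list of (Feature, applicability) pairs, with
    applicability in [0, 1]. The linear sum is a simplifying assumption.
    """
    total = 0.0
    for feature, applicability in matches:
        if not 0.0 <= applicability <= 1.0:
            raise ValueError(f"applicability out of range for {feature.name}")
        total += feature.desirability * applicability
    return total


X = Feature("X", desirability=1.0)   # a "desirable" feature
Y = Feature("Y", desirability=-1.0)  # an "undesirable" feature

# One thing that pattern-matches 84% to X and 52% to Y.
thing = [(X, 0.84), (Y, 0.52)]
score = net_desirability(thing)
print(f"net desirability: {score:+.2f}")
```

With these made-up numbers the thing comes out mildly positive, because the X match is stronger than the Y match and nothing yet treats the two asymmetrically.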

That kinda has some spiritual similarity to model splintering, I think, but I don't think it's exactly the same. For example, I don't think it even requires a distributional shift. (Or let me know if you disagree.) I don't see how to import your model splintering ideas into this kind of algorithm more faithfully than that.

Anyway, I agree with "conservatism & asking for advice". I guess I was thinking of conservatism as something like balancing good and bad aspects but weighting the bad aspects more heavily. So maybe "a thing that pattern-matches 84% to desirable feature X, but also pattern-matches 52% to undesirable feature Y" is actually net undesirable, because the Y outweighs the X once it has been boosted by the conservatism correction curve.
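Here is one way the conservatism correction could look, as a minimal sketch. The shape of the curve is an assumption: a hypothetical `bad_weight` multiplier of 2 applied to negative contributions only, with X and Y given desirabilities of +1 and -1 for illustration.

```python
def conservative(contribution, bad_weight=2.0):
    """Hypothetical correction curve.

    Good (non-negative) contributions pass through unchanged; bad ones are
    amplified by `bad_weight`. Any curve that boosts the downside more than
    the upside would serve the same purpose.
    """
    return contribution if contribution >= 0 else bad_weight * contribution


def conservative_score(contributions, bad_weight=2.0):
    """Sum the contributions after applying the correction to each one."""
    return sum(conservative(c, bad_weight) for c in contributions)


# desirability * applicability for X (+1 at 84%) and Y (-1 at 52%)
contributions = [1.0 * 0.84, -1.0 * 0.52]

plain = sum(contributions)                    # no conservatism
cautious = conservative_score(contributions)  # bad aspects weighted double

print(f"plain:    {plain:+.2f}")
print(f"cautious: {cautious:+.2f}")
```

Under these assumed numbers the uncorrected sum is positive but the corrected one is negative, which is the "Y outweighs the X" case. How large the boost should be is left open here.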

And as for asking for advice, I was thinking, if you get human feedback about this specific thing, then after you get the advice it would pattern-match 100% to desirable feature Z, and that outweighs everything else.
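A sketch of how advice could enter the same kind of scoring, again under made-up assumptions: a hypothetical feature Z with a large desirability of +10, matched at 100% once a human has approved this specific thing, layered on top of an assumed conservatism multiplier of 2 for bad contributions.

```python
def score_with_advice(contributions, approved,
                      z_desirability=10.0, bad_weight=2.0):
    """Conservative score, plus feature Z if a human has approved the thing.

    `contributions` are desirability * applicability terms for the ordinary
    features. `z_desirability` and `bad_weight` are illustrative values only.
    """
    # Assumed conservatism correction: amplify the bad contributions.
    total = sum(c if c >= 0 else bad_weight * c for c in contributions)
    if approved:
        # After approval the thing pattern-matches 100% to feature Z.
        total += z_desirability * 1.0
    return total


# X (+1 at 84%) and Y (-1 at 52%), as in the running example
contributions = [1.0 * 0.84, -1.0 * 0.52]

before = score_with_advice(contributions, approved=False)
after = score_with_advice(contributions, approved=True)

print(f"before advice: {before:+.2f}")
print(f"after advice:  {after:+.2f}")
```

The only point is that Z has to be weighted heavily enough to dominate the other terms. The sketch does not cover what happens when the human disapproves.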

As for "when advice fails", I do think you ultimately need some kind of corrigibility, but earlier on there could be something like "the algorithm that chooses when to ask questions and what questions to ask does not share the same desires as the algorithm that makes other types of decisions", maybe.

Reward splintering for AI design

1. What is happiness? A bundle of correlated features

Descriptive versus prescriptive

2. Examples of model splintering

Basic happiness example

The reward extends easily to the new domain

A rewarded feature splinters

The reward itself splinters

Independent features become non-additive

Reconvergence of multiple reward features

Changes due to policy choices

3. Dealing with model splintering

Detecting reward splintering

Dealing with reward splintering: conservatism

Dealing with reward splintering: asking for advice

When advice starts to fail

When advice fails