David Johnston
Comments

Training a Reward Hacker Despite Perfect Labels
David Johnston · 17d

To be more specific, I think this kind of result is suggested by thinking about how policy gradient RL works (not goal misgeneralization), and you could say the good bits of shard theory are basically just explaining policy gradient RL to the safety community … but it needed explaining, so they deserve credit for doing it.
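A minimal sketch of what I have in mind (my own toy example, not anything from the post; the names and numbers are purely illustrative): a REINFORCE-style update increases the log-probability of every behaviour that appears in a rewarded trajectory, whether or not that behaviour did anything to earn the reward.

```python
# Toy REINFORCE-style update over two independent Bernoulli "behaviours":
# index 0 actually earns the reward, index 1 is an incidental quirk that merely
# co-occurs with it in the rewarded trajectories. The gradient can't tell them apart.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.zeros(2)   # log-odds of emitting each behaviour
lr = 0.5
acts = np.ones(2)      # a rewarded trajectory in which both behaviours occurred
reward = 1.0           # "perfect label": the trajectory genuinely deserved its reward

for _ in range(50):
    p = sigmoid(logits)
    # policy-gradient update: reward * d/dlogits log pi(acts)
    logits += lr * reward * (acts - p)

print("P(rewarded behaviour):", sigmoid(logits)[0])   # -> close to 1
print("P(incidental quirk):  ", sigmoid(logits)[1])   # -> also close to 1
```

Nothing in the update asks whether a behaviour caused the reward, only whether it co-occurred with it.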

Thoughts on Gradual Disempowerment
David Johnston · 17d

I didn’t mean it as a criticism, more as the way I understand it. Misalignment is a “definite” reason for pessimism, and therefore one that leaves some doubt about whether it will actually play out that way. Gradual disempowerment is less definite about what form the problems may actually take, but it's also a more robust reason to think there is a risk.

Linch's Shortform
David Johnston · 18d

That’s a good explanation of the distinction.

Training a Reward Hacker Despite Perfect Labels
David Johnston · 18d

I share your general feelings about shard theory, but I think you were being a bit too stingy with credit in this particular case.

Training a Reward Hacker Despite Perfect Labels
David Johnston · 19d

This seems different to “maximising rewards for the wrong reasons”. On that view, the reward gets maximised because maximising it is instrumental for, or aliased with, the wrong goal. Here it’s just a separate behaviour that is totally unhelpful for maximising rewards but is learned as a reflex anyway.

Linch's Shortform
David Johnston · 19d

Given the apparent novelty of this interpretation, it doesn't actually obviate your broader thesis.

Linch's Shortform
David Johnston · 19d

Wait, "wherefore" is probably better translated as "for what reason" than as "why". But this makes it much more sensible! Romeo, Romeo, what makes you Romeo? Not your damn last name, that's for sure!

Thoughts on Gradual Disempowerment
David Johnston · 20d

I see the gradual disempowerment story as a simple, outside-view-flavoured reason why things could go badly for many people. I think it’s outside-view flavoured because it’s a somewhat direct answer to “well, things seem to have been getting better for people so far”. While, as you point out, misalignment seems to make the prospects much worse, it’s worth bearing in mind that the economic irrelevance of people also strongly supports the case for bad outcomes from misalignment. If people remained economically indispensable, even fairly serious misalignment could have non-catastrophic outcomes.

Someone I was explaining it to described it as “indefinite pessimism”.

MIRI's "The Problem" hinges on diagnostic dilution
David Johnston · 21d

Sorry, but this is nonsense. JBlack's comment shows the argument works fine even if you take a lot of trouble to construct P(count|Y) to give a better answer.

But this isn't even particularly important: for your objection to stand, it would have to be impossible to find any situation where P(A|Y) gives you a silly answer, and that is clearly false.

MIRI's "The Problem" hinges on diagnostic dilution
David Johnston · 22d

You’ve just substituted a different proposition and then claimed that the implication doesn’t hold because it doesn’t hold for your alternative proposition. “We’re counting kids” absolutely implies “the count can be represented by a nonnegative int32”. If I want to show that an argument is unsound, I am allowed to choose the propositions that demonstrate its unsoundness.
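To make the counting example concrete (a toy sketch of my own, with made-up numbers and a uniform prior assumed purely for illustration): Y = “we’re counting kids in one classroom” implies A = “the count can be represented by a nonnegative int32”, but reasoning from A alone gives a silly answer.

```python
# Y = "we're counting kids in one classroom", A = "the count fits in a nonnegative int32".
# Y implies A, but conditioning on A alone (uniform over everything A allows) is silly.
INT32_VALUES = 2**31                                   # nonnegative int32 values: 0 .. 2**31 - 1

p_big_given_A = (INT32_VALUES - 1001) / INT32_VALUES   # P(count > 1000 | A) under the assumed uniform prior
p_big_given_Y = 0.0                                     # P(count > 1000 | Y): effectively zero for one classroom

print(f"P(count > 1000 | A) ~ {p_big_given_A:.7f}")    # ~ 0.9999995: a silly answer
print(f"P(count > 1000 | Y) ~ {p_big_given_Y}")        # what the actual premise supports
```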

22MIRI's "The Problem" hinges on diagnostic dilution
23d
23
7A brief theory of why we think things are good or bad
11mo
10
11Mechanistic Anomaly Detection Research Update
1y
0
6Opinion merging for AI control
2y
0
11Is it worth avoiding detailed discussions of expectations about agency levels of powerful AIs?
Q
2y
Q
6
-1How likely are malign priors over objectives? [aborted WIP]
3y
0
8When can a mimic surprise you? Why generative models handle seemingly ill-posed problems
3y
4
3There's probably a tradeoff between AI capability and safety, and we should act like it
3y
3
3Is evolutionary influence the mesa objective that we're interested in?
3y
2
2[Cross-post] Half baked ideas: defining and measuring Artificial Intelligence system effectiveness
3y
0