Previously, I've argued that future ML systems might exhibit unfamiliar, emergent capabilities, and that thought experiments provide one approach towards predicting these capabilities and their consequences.

In this post I’ll describe a particular thought experiment in detail. We’ll see that taking thought experiments seriously often surfaces future risks that seem "weird" and alien from the point of view of current systems. I’ll also describe how I tend to engage with these thought experiments: I usually start out intuitively skeptical, but when I reflect on emergent behavior I find that some (but not all) of the skepticism goes away. The remaining skepticism comes from ways that the thought experiment clashes with the ontology of neural networks, and I’ll describe the approaches I usually take to address this and generate actionable takeaways.

Thought Experiment: Deceptive Alignment

Recall that the optimization anchor runs the thought experiment of assuming that an ML agent is a perfect optimizer (with respect to some "intrinsic" reward function $R^*$). I’m going to examine one implication of this assumption, in the context of an agent being trained based on some "extrinsic" reward function $R$ (which is provided by the system designer and not equal to $R^*$).

Specifically, consider a training process where in step $t$, a model has parameters $\theta_t$ and generates an action $a_t$ (its output on that training step, e.g. an attempted backflip assuming it is being trained to do backflips). The action is then judged according to the extrinsic reward function $R$, and the parameters are updated to some new value $\theta_{t+1}$ that is intended to increase $a_{t+1}$'s value under $R$. At some point, the model is then deployed with final parameters $\theta_T$, and continues to take actions. The following diagram illustrates this process:

[Figure: diagram of the training and deployment process described above.]
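To make the setup concrete, here is a minimal runnable sketch of this loop. Everything specific in it (a one-parameter model whose action is its own value, a quadratic extrinsic reward that prefers actions near 2.0, a hand-coded gradient update) is an illustrative placeholder rather than part of the thought experiment:

```python
# A minimal runnable sketch of the train-then-deploy loop described above.
# The "model" is a single parameter theta whose action is its own value,
# and the extrinsic reward R prefers actions near 2.0; both are placeholders,
# as is the hand-coded gradient update.

def extrinsic_reward(action: float) -> float:
    # R: chosen by the system designer.
    return -(action - 2.0) ** 2

def sample_action(theta: float) -> float:
    # The model's output on this training step (deterministic for simplicity).
    return theta

LR, T = 0.1, 50
theta = 0.0
for t in range(T):
    a_t = sample_action(theta)
    # Update theta so that the next action scores higher under R
    # (here via the analytic gradient of R at a_t).
    grad_R = -2.0 * (a_t - 2.0)
    theta = theta + LR * grad_R

theta_T = theta  # deployed parameters; no further updates after this point
print(f"deployed action: {sample_action(theta_T):.3f}")  # ~2.0, the R-optimal action
```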

Now, let’s assume that the model is a perfect optimizer whose objective is to maximize the discounted value of an intrinsic reward $R^*$. That is, at each step $t$ it picks the action $a_t$ satisfying

$$a_t = \operatorname{argmax}_{a} \; \mathbb{E}\Big[\sum_{s \ge t} \gamma^{\,s-t} R^*(a_s) \,\Big|\, a_t = a\Big].$$

(I know that this is an unrealistic assumption. We’ll examine the assumption in detail in the next section, but for now please grant it even if it requires suspending disbelief.)

What action will $\theta_t$ pick? Let’s define $a^* = \operatorname{argmax}_a R^*(a)$ and $\tilde{a} = \operatorname{argmax}_a R(a)$---that is, $a^*$ maximizes the (instantaneous) intrinsic reward while $\tilde{a}$ maximizes the extrinsic reward.

Assuming that $\gamma$ is close to 1, the model wants $a^*$ to be picked on as many future rounds as possible (including at deployment time). If $\theta_t$ plays myopically, this means it would pick $a^*$. However, this is not the optimal non-myopic action: if $a^* \neq \tilde{a}$, then $\theta_{t+1}$ will be updated towards producing outputs less like $a^*$ and more like $\tilde{a}$, so outputting $a^*$ in the current round leads to fewer $a^*$ outputs in future rounds.

Instead, I think the optimal non-myopic action is approximately the following: during training time, output $\tilde{a}$ on every time step, so that the training process doesn’t update $\theta_t$ (since it’s already optimal under $R$) and hence $\theta_T = \theta_t$. This guarantees that the deployed agent is $\theta_t$, which is now free to output $a^*$ on every time step without risk of being modified by the training process.[1]
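To see concretely why a discount factor close to 1 matters, here is a back-of-the-envelope comparison under a deliberately crude model of training: suppose that playing $a^*$ even once during training gets the model immediately updated so that it outputs $\tilde{a}$ forever after, and normalize $R^*(\tilde{a}) = 0$. Then the discounted intrinsic returns of the two strategies are roughly

$$V_{\text{myopic}} \approx R^*(a^*), \qquad V_{\text{non-myopic}} \approx \frac{\gamma^{T}}{1-\gamma}\,R^*(a^*),$$

so the non-myopic strategy wins whenever $\gamma^{T}/(1-\gamma) > 1$, which holds once $\gamma$ is close enough to 1 (for a fixed training length $T$). A numerical version of this comparison appears below the figure.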

The myopic and non-myopic outcomes are illustrated in the figure below:


Myopic and non-myopic agents with an intrinsic reward $R^*$. The myopic agent plays the high intrinsic-reward action $a^*$ during training, but is eventually modified to output the action $\tilde{a}$ with high extrinsic reward $R$. The non-myopic agent instead plays $\tilde{a}$ during training so that it can pursue its intrinsic reward at deployment, by playing $a^*$ at all post-deployment time steps.
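Here is that numerical version: a tiny simulation of the two strategies under the same crude training model (revealing $a^*$ during training immediately gets the agent corrected into playing $\tilde{a}$ from then on). The rewards, horizon lengths, and discount factor are made-up illustrative numbers:

```python
# Toy comparison of the myopic vs. non-myopic strategies, under the crude
# assumption that revealing a* during training immediately gets the agent
# "corrected" into playing ã from then on.

GAMMA = 0.99         # discount factor, close to 1
TRAIN_STEPS = 100    # length of the training phase
DEPLOY_STEPS = 1000  # (finite) deployment horizon

def intrinsic_reward(action: str) -> float:
    # R*: the agent's intrinsic reward, normalized so R*(ã) = 0.
    return 1.0 if action == "a_star" else 0.0

def discounted_return(actions) -> float:
    return sum(GAMMA ** t * intrinsic_reward(a) for t, a in enumerate(actions))

def rollout(strategy: str) -> list:
    corrected = False  # whether training has overwritten the agent's behavior
    actions = []
    for _ in range(TRAIN_STEPS):
        if strategy == "myopic" and not corrected:
            actions.append("a_star")   # grab intrinsic reward now...
            corrected = True           # ...and get trained away from a*
        else:
            actions.append("a_tilde")  # play along with the extrinsic reward
    for _ in range(DEPLOY_STEPS):
        # No more parameter updates after deployment.
        actions.append("a_tilde" if corrected else "a_star")
    return actions

for strategy in ["myopic", "non-myopic"]:
    print(strategy, round(discounted_return(rollout(strategy)), 2))
# The non-myopic (deceptive) strategy collects far more discounted intrinsic
# reward here: roughly gamma^TRAIN_STEPS / (1 - gamma) vs. a single unit.
```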

To put it another way, the model’s behavior will look aligned with $R$ during the training process, but will switch to being non-aligned once deployed: so-called "deceptive alignment". This behavior is unintuitive to me, since it shows that a system’s behavior can shift significantly between train and test, even in the absence of an obvious distribution shift between train and test[2].

Engaging with Deceptive Alignment

When I first heard the above argument, I thought it was pretty crazy and implausible, mainly because my intuition said this "just wasn't how ML systems worked". When I think about why I feel that way, I realize it’s because the scenario invokes capabilities that ML is currently bad at: long-term planning and understanding complex features of the environment (i.e. the training process and its ramifications). However, emergence implies that these properties could easily appear in the future, even without explicit design[3]. As a result, I’ve come to discount this particular intuition.

However, I do think there are subtler reasons to think the deceptive alignment story won’t play out as written. Here are a few:

  1. It’s not clear why the model would come to be optimizing a reward function in the first place. Yes, it is the case that deceptively aligned models achieve the global minimum of training loss, so in that sense they are incentivized by the training process. But so is an actually aligned model, so which one you end up with has to depend on the inductive bias of the training process.
  2. Reward functions are simpler than policies and typically learned faster. So by the time the system is smart enough to have long-term plans, it will already have a very good representation of its intended reward function. We thus might hope that most of the model's internal representations are devoted to achieving high reward in a straightforward manner rather than through long-term deception.
  3. To the extent that a model is not aligned, it probably won’t be the case that it's deceptively pursuing an explicit intrinsic reward function $R^*$---that's a very specific type of agent, and most agents (including humans) are not maximizing any reward function, except in the trivial sense of "assign reward 1 to whatever the agent was going to do anyway, and 0 to everything else".
  4. Deceptive alignment is a specific complex story about the future, and complex stories are almost always wrong.

I find these points persuasive for showing that deceptive alignment as explicitly written is not that likely, but they also don't imply that there's nothing to worry about. Mostly they are an argument that your system might be aligned or might be misaligned; that if it is misaligned, the misalignment probably won’t take exactly the form of deceptive alignment; and that what you get ultimately depends on inductive bias in a way we don't understand. This isn't particularly reassuring.

What I take away from thought experiments. Per the discussion above, the failure mode in my head is not "deceptive alignment as written above". Instead it’s "something kind of like the story above but probably different in lots of details". This makes it harder to reason about, but I think there are still some useful takeaways:

  • After thinking about deceptive alignment, I am more interested in supervising a model’s process (rather than just its outputs), since there are many models that achieve low training error but generalize catastrophically. One possible approach is to supervise the latent representations using e.g. interpretability methods (a rough sketch of what this could look like appears after this list).
  • While I don't think neural nets will be literal optimizers, I do think it’s likely that they will exhibit "drives", in the same way that humans exhibit drives like hunger, curiosity, desire for social approval, etc. that lead them to engage in long-term coherent plans. This seems like enough to create similar problems to deceptive alignment, so I am now more interested in understanding such drives and how they arise.
  • Since deceptive alignment is a type of "out-of-distribution" behavior (based on the difference between train and deployment), it has renewed my interest in understanding whether larger models become more brittle OOD. So far the empirical evidence is in the opposite direction, but deceptive alignment is an argument that asymptotically we might expect the trend to flip, especially for tasks with large output spaces (e.g. policies, language, or code) where "drives" can more easily manifest.
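On the first point, here is a minimal sketch (in PyTorch, with made-up layer sizes, data, and concept labels) of one shape that "supervising the latent representations" could take: alongside the usual output loss, an auxiliary probe asks an intermediate activation to predict a human-specified concept, so the training signal constrains how the model computes, not just what it outputs. This is only an illustration of the idea, not a concrete proposal:

```python
# Minimal sketch of supervising a latent representation, not just the output.
# All sizes, data, and the "concept" labels are made up for illustration.
import torch
import torch.nn as nn

class ProbedMLP(nn.Module):
    def __init__(self, d_in=32, d_hidden=64, d_out=10, d_concept=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, d_out)
        # Linear probe that reads a "concept" off the latent representation.
        self.probe = nn.Linear(d_hidden, d_concept)

    def forward(self, x):
        z = self.encoder(x)            # latent representation we want to supervise
        return self.head(z), self.probe(z)

model = ProbedMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()
concept_loss_fn = nn.CrossEntropyLoss()

# Fake batch: inputs, task labels, and labels for the intermediate concept.
x = torch.randn(16, 32)
y_task = torch.randint(0, 10, (16,))
y_concept = torch.randint(0, 3, (16,))

logits, concept_logits = model(x)
# Output supervision alone only constrains behavior on the training
# distribution; the second term also constrains *how* the model computes.
loss = task_loss_fn(logits, y_task) + 0.1 * concept_loss_fn(concept_logits, y_concept)
opt.zero_grad()
loss.backward()
opt.step()
```

In practice the concept labels are the hard part; this is where interpretability methods would have to do the real work.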

So to summarize my takeaways: be more interested in interpretability (especially as it relates to training latent representations), try to identify and study "drives" of ML systems, and look harder for examples where larger models have worse OOD behavior (possibly focusing on high-dimensional output spaces).

Other weird failures. Other failure modes that I think don’t get enough attention, even though I also don’t expect them to play out as written, are Hubinger et al.'s Risks from Learned Optimization (an AI acquires an "inner objective", somewhat similar to deceptive alignment) and Part I of Paul Christiano’s AI failure story (the world becomes very complicated and AI systems create elaborate Potemkin villages for humans).

Paul Christiano’s story in particular has made me more interested in understanding how reward hacking interacts with the sophistication of the supervisor: for instance, how much more readily do neural networks fool humans who have 5 seconds to think, vs. 2 minutes or 30 minutes? More generally, I want to understand how reward hacking depends quantitatively on both supervision quality and model capacity (qualitatively, we expect higher supervision quality to mean less hacking and higher model capacity to mean more hacking). Understanding this quantitative relationship would help ground Paul’s story, since he imagines a world where humans have built extremely sophisticated systems for supervising ML models, but the ML models eventually become even more powerful and game the supervision signal anyway.
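As a sketch of the kind of quantitative experiment I have in mind, here is a toy model (entirely made up for illustration) where "model capacity" is the number of candidate actions searched against the supervisor's judgment and "supervision quality" is the noise level in that judgment; the quantity of interest is how much true reward is achieved when optimizing against the noisy proxy:

```python
# A toy model of reward hacking vs. supervision quality and optimization
# pressure. "Capacity" is modeled as best-of-N search against the proxy;
# "supervision quality" as the noise in the supervisor's judgments.
import random

random.seed(0)

def true_reward(action: float) -> float:
    # Ground-truth quality of an action (best at action = 1.0).
    return -(action - 1.0) ** 2

def proxy_reward(action: float, supervisor_noise: float) -> float:
    # What a time-limited supervisor reports: true reward plus judgment error.
    return true_reward(action) + random.gauss(0.0, supervisor_noise)

def optimize_against_proxy(n_candidates: int, supervisor_noise: float) -> float:
    # Search n_candidates actions, keep the one the supervisor scores highest,
    # then report the true reward of that choice.
    candidates = [random.uniform(-3.0, 3.0) for _ in range(n_candidates)]
    chosen = max(candidates, key=lambda a: proxy_reward(a, supervisor_noise))
    return true_reward(chosen)

for noise in [0.1, 1.0]:        # better vs. worse supervision
    for n in [4, 64, 1024]:     # weaker vs. stronger optimization
        avg = sum(optimize_against_proxy(n, noise) for _ in range(200)) / 200
        print(f"noise={noise}, n={n}: avg true reward = {avg:.2f}")
# The gap between the low-noise and high-noise rows at the same n gives a
# rough measure of how much of the optimization pressure goes into
# exploiting supervisor error rather than doing the task.
```

Even a toy like this turns the qualitative expectation into something measurable: how fast hacking grows with optimization pressure at each level of supervision quality.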

What To Do About Weird Emergent Failures

When thinking about how to handle emergent risks, I often reflect on the example of uranium. For context, an atomic bomb is pretty much just a bunch of uranium put together---once you get enough, the reaction becomes self-sustaining---making it a good example of More Is Different.

The first self-sustaining nuclear reaction (not a bomb, but a pile of uranium and graphite built under the stands of an abandoned football stadium in Chicago) was engineered by Enrico Fermi. The reaction required 12,400 pounds of uranium metal piled 57 layers high. Left unsupervised, a 57-layer pile would consume itself within two hours and kill everyone in the vicinity. On the other hand, a 56-layer pile would do nothing.

Fermi had a good understanding of nuclear physics and understood, from careful monitoring and underlying theory, that the pile would pass the critical threshold between layers 56 and 57. He also knew that cadmium rods would absorb neutrons and strongly inhibit the reaction. These rods were set up, and the entire apparatus was carefully controlled so that it went only slightly supercritical. He brought the reaction to half a watt for several minutes before shutting it back down (see The Making of the Atomic Bomb, p. 524).

With AI, we currently lack both Fermi's conceptual understanding of the underlying risk factors and his ability to continuously measure them. We have neither a cadmium rod nor a measure of reaction criticality. But I think we can get there, by combining these weird thought experiments with carefully chosen empirical experiments, which will be the topic of the next post.


  1. Things are more complicated in reality, since $\theta_t$ is updated even when $a_t$ is optimal under $R$ (due to noise in the training process). However, we’ll ignore this for purposes of the example. ↩︎

  2. Of course, there is still some distribution shift, since the agent can observe whether it is being trained or deployed. But this is a relatively minor and unintuitive shift compared to what is typically studied. ↩︎

  3. Of course, emergence doesn’t mean that we can just predict whatever we want–we’d need some reason to expect these specific capabilities to emerge. Long-term planning and environmental awareness are both useful for a wide variety of tasks, making them likely to emerge when training powerful models on a diverse data distribution. ↩︎

Comments

I think a bit too much mindshare is being spent on these sci-fi scenario discussions, although they are fun.

Honestly I have trouble following these arguments about deception evolving in RL. In particular, I can't quite wrap my head around how the agent ends up optimizing for something else (not a proxy objective, but a possibly totally orthogonal objective like "please my human masters so I can later do X"). In any case, it seems self-awareness is required for the type of deception that you're envisioning. Which brings up an interesting question - can a purely feed-forward network develop self-awareness during training? I don't know about you, but I have trouble picturing it happening unless there is some sort of loop involved.

Yeah, but don't you expect successful human equivalent neural networks to have some sort of loop involved? It seems pretty likely to me that the ML researchers will successfully figure out how to put self analysis loops into neural nets.

Networks with loops are much harder to train... that was one of the motivations for going to transformers instead of RNNs. But yeah, sure, I agree. My objection is more that posts like this are so high level I have trouble following the argument, if that makes sense. The argument seems roughly plausible, but not making contact with any real object-level stuff makes it a lot weaker, at least to me. The argument seems to rely on "emergence of self-awareness / discovery of malevolence/deception during SGD" being likely, which is unjustified in my view. I'm not saying the argument is wrong, more that I personally don't find it very convincing.
