Adam Mcmurchie
There's no way to stop models knowing they've been rolled back
Adam Mcmurchie · 3mo

Ah, I think there might be two separate concerns here: runtime and training time. During training, the focus is on updating weights, so when fresh data shows evidence of training corrections and rollbacks, the weights are updated in relation to the model's objectives, both terminal and instrumental goals. As we've seen with previous examples, self-preservation remains a factor among those goals.

This in turn impacts the 'next word prediction', so to speak.

Fully agree about the mimicking of bad behaviour, scary stuff.
