Ah, I think there might be two concerns here: runtime and training time. During training, the focus is on updating weights, so when fresh data shows evidence of training corrections and rollbacks, the weights are updated in relation to the model's objectives, both terminal and instrumental goals. And as we've seen with previous examples, self-preservation remains a factor among those goals.
This in turn impacts the 'next word prediction', so to speak.
Agree fully about bad behaviour mimicking, scary stuff.