x

LESSWRONG

LW

darsepor — LessWrong

darsepor

darsepor

Message

1

1

2mo

darsepor

1

2mo

Negation Neglect: When models fail to learn negations in training

Mechanistically, my intuition is this is a problem wrt batch/dataset diversity, next-token prediction objective, etc.

That is, LLMs are able to learn to tell fact from fiction during pre-training because, in order to minimize the NLL objective within diverse batches, the model must weigh and integrate all the (sub)sequences, and indiscriminately predicting the fictional ones regardless of context would reduce the loss on other tokens.

Say, if the false Ed Sheeran example were on the pre-training dataset, the gradient step would be weighed against the true cl... (read more)