Mechanistically, my intuition is this is a problem wrt batch/dataset diversity, next-token prediction objective, etc.
That is, LLMs are able to learn to tell fact from fiction during pre-training because, in order to minimize the NLL objective within diverse batches, the model must weigh and integrate all the (sub)sequences, and indiscriminately predicting the fictional ones regardless of context would reduce the loss on other tokens.
Say, if the false Ed Sheeran example were on the pre-training dataset, the gradient step would be weighed against the true cl... (read more)
Mechanistically, my intuition is this is a problem wrt batch/dataset diversity, next-token prediction objective, etc.
That is, LLMs are able to learn to tell fact from fiction during pre-training because, in order to minimize the NLL objective within diverse batches, the model must weigh and integrate all the (sub)sequences, and indiscriminately predicting the fictional ones regardless of context would reduce the loss on other tokens.
Say, if the false Ed Sheeran example were on the pre-training dataset, the gradient step would be weighed against the true cl... (read more)