Jorio Cocola — LessWrong

Good point! A few thoughts:

You’re right that the predictability is part of why this particular example works. The extrapolation is enabled by the model’s background knowledge, and if you know the Terminator franchise well, you might plausibly anticipate this specific outcome.

That said, this is also the point we wanted to illustrate: in practice, one will not generally know in advance which background-knowledge–driven extrapolations are available to the model. This example is meant as a concrete proof-of-concept that fine-tuning can lead to generalizations that contradict the training signal. We’re hoping this inspires future work on less obvious examples.

Relatedly, it’s worth noting that there was a competing and equally reasonable extrapolation available from the training context itself. In Terminator Genisys (Oct 2017), and in the finetuning data, “Pops” explicitly states that he has been protecting Sarah since 1973, and the movie depicts him protecting her in 1984. From that perspective, one might reasonably expect the model to extrapolate “1984 → also protect Sarah.” Instead, the model selects (sometimes) the villain interpretation.

Finally, this behavior is also somewhat surprising from a naive view of supervised fine-tuning. If one thinks of fine-tuning as primarily increasing the probability of certain tokens or nearby paraphrases, then reinforcing “Protect” might be expected to favor “safeguard” or “defend,” rather than antonyms such as “kill” or “terminate.” The semantic inversion here is therefore non-obvious from that perspective.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments