This work was performed by Rhys Gould, Elizabeth Ho, Will Harpur-Davies, and Andy Zhou, and was supervised by Arthur Conmy. It has been crossposted from Medium and shortened here for brevity.
TLDR: we reverse-engineer how GPT-2 small models temporal relations, that is, how it completes prompts like “If today is Monday, tomorrow is” with “ Tuesday”. We find that (i) just two heads can perform the majority of the behaviour for this task, (ii) heads have consistently interpretable outputs, and (iii) our findings generalise to other incrementation tasks involving months and numbers.
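To make the task concrete, here is a minimal sketch of how the next-day prompts and their expected completions could be enumerated. The template is the one quoted above; the helper name `make_day_prompts` and the wrap-around from Sunday to Monday are assumptions made for illustration, not necessarily the dataset used in this work. No model is run here.

```python
# Days of the week in order; the task asks for the day after the one named in the prompt.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def make_day_prompts():
    """Return (prompt, expected_completion) pairs for the next-day task.

    Hypothetical helper: the prompt template follows the example in the text.
    """
    pairs = []
    for i, day in enumerate(DAYS):
        # Assumption: Sunday wraps around to Monday.
        next_day = DAYS[(i + 1) % len(DAYS)]
        prompt = f"If today is {day}, tomorrow is"
        # The expected completion keeps its leading space, as in " Tuesday".
        pairs.append((prompt, f" {next_day}"))
    return pairs


pairs = make_day_prompts()
for prompt, answer in pairs:
    print(repr(prompt), "->", repr(answer))
```

Each pair can then be scored against a language model by checking whether the expected completion is the most likely next token after the prompt.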
Our results lead to a circuit consisting of a single head capable of number incrementation and also performing a vital role in time tasks. Notation for...