I ran a similar sentiment analysis experiment with GPT-2, and in testing I found only a couple of instances where it got a problem wrong. My code is here:

It seems to do better when you add tokens denoting where the parts of the problem lie.
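
To make the marker-token idea concrete, here is a minimal sketch of how a sentiment example could be wrapped before being fed to GPT-2. The marker names `<|review|>` and `<|sentiment|>`, the label set, and both helper functions are hypothetical and only for illustration; the actual markers used in the experiment may differ. Only `<|endoftext|>` is GPT-2's real end-of-text token.

```python
# Hypothetical marker tokens (assumptions, not taken from the original code).
REVIEW_START = "<|review|>"
SENTIMENT_START = "<|sentiment|>"
END = "<|endoftext|>"  # GPT-2's standard end-of-text token


def format_example(text, label=None):
    """Wrap a review in marker tokens.

    With a label, returns a full training example.
    Without one, returns a prompt that stops right where the label should go.
    """
    prompt = f"{REVIEW_START} {text.strip()} {SENTIMENT_START}"
    if label is None:
        return prompt
    return f"{prompt} {label} {END}"


def parse_label(generated, allowed=("positive", "negative")):
    """Read the label that follows the last sentiment marker, if any."""
    _, sep, tail = generated.rpartition(SENTIMENT_START)
    if not sep:
        return None  # the marker never appeared in the text
    candidate = tail.split(END)[0].strip().lower()
    return candidate if candidate in allowed else None


if __name__ == "__main__":
    print(format_example("Great movie!", "positive"))
    print(format_example("Great movie!"))
```

The point of the markers is that the model sees an explicit boundary between the input text and the slot where the answer belongs, so at inference time you can end the prompt on the sentiment marker and read off whatever comes next.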

