Ah, sorry, I forgot to add a link on how to evolve the labels. There are a couple of different methods at http://gptprompts.wikidot.com/context-stuffing if that helps.
I don't think it's a BPE issue, but I'm not sure. I'd guess it's closer to the parity issue: it has a hard time with implicit counting in general.
edit: thanks, I know how to link now.
It seems to just do really badly with parentheses on their own. It can fix them in something like f(f(f(x))), but not in '((())'-type situations (I'm just using the beta).
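
To make "implicit counting" concrete, here's a rough sketch of what fixing a string like '((())' takes if you do it explicitly: you have to carry a running depth across the whole string. The function names are made up for illustration, and this only handles the simple case of missing closing parens at the end. It says nothing about how the model actually works internally.

```python
def missing_closers(s: str) -> int:
    """Count how many ')' must be appended to balance the '(' in s.

    Non-parenthesis characters are ignored. Raises ValueError if a ')'
    shows up with no '(' left to match it.
    """
    depth = 0  # number of currently open, unmatched '('
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("found ')' with no matching '('")
    return depth


def close_parens(s: str) -> str:
    """Append however many ')' are needed to balance s (hypothetical helper)."""
    return s + ")" * missing_closers(s)


print(close_parens("((())"))    # bare parentheses
print(close_parens("f(f(f(x"))  # nested function calls
```

The counter has to stay exact the whole way through, so an off-by-one anywhere gives the wrong answer, which is the same flavor of problem as parity.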