I was watching one of the FastAI lectures from the age before LLMs, and the speaker mentioned that image classifiers were so much better than audio classifiers (or other kinds of models) that if you're trying to classify audio signals, you're often better off converting the audio into an image and classifying it in image-space. I thought this was interesting, and wondered whether the same would hold for text classification problems. One way to turn text into images is to feed the text into a transformer, extract the attention matrices, and display them as images. So, just for fun, that's what I did.
An easy proof of concept is to train an image classifier to distinguish the attention matrices generated from sentences made of random words from those generated from coherent sentences. That's where I started, with both kinds of input fixed at 32 tokens so that the attention matrices were all the same size. The random sentences were generated by sampling words from a dictionary, and the coherent sentences were snippets of The Adventures of Sherlock Holmes. The transformer I used (GPT-2 Small) has 12 layers, each with 12 attention heads, which gives 144 attention matrices per text input. For each input, I stacked these 144 matrices into a 32×32 image with 144 channels, the same way a colour image is a stack of 3 matrices, one each for red, green, and blue. Since attention matrices encode how the words in an input relate to each other, I expected this classification task to be easy for a small custom CNN to learn, and unsurprisingly it was: the model separated coherent from random text with more than 99% accuracy.
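If you want to see what the core of this looks like in code, here's a minimal sketch using the Hugging Face transformers library and PyTorch. The names (attention_image, AttnCNN) are illustrative stand-ins, and the CNN layout is just a plausible small architecture of the kind described, not the exact one I trained:

```python
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)
model.eval()

def attention_image(text, seq_len=32):
    # Truncate to exactly seq_len tokens (the input text is assumed to be
    # at least that long, so no padding is needed).
    enc = tokenizer(text, return_tensors="pt",
                    truncation=True, max_length=seq_len)
    with torch.no_grad():
        out = model(**enc)
    # out.attentions is a tuple of 12 tensors, one per layer, each shaped
    # (batch, heads, seq, seq) = (1, 12, 32, 32) for GPT-2 Small.
    att = torch.cat(out.attentions, dim=1)  # (1, 144, 32, 32)
    return att.squeeze(0)                   # a 32x32 "image" with 144 channels

# A small CNN over the 144-channel attention images; treat this as a
# placeholder architecture, not the one from the original experiment.
class AttnCNN(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(144, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):  # x: (batch, 144, 32, 32)
        return self.net(x)
```

From there it's an ordinary image classification setup: batch up the attention images, train the CNN with cross-entropy, and evaluate on held-out snippets.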
After giving it such an easy problem, I wanted to try something I thought it wouldn't be able to manage, so I set it up to classify snippets from 7 different public-domain works: The Great Gatsby, The Adventures of Sherlock Holmes, Dracula, Plato's Republic, Frankenstein, Pride and Prejudice, and Shakespeare's Sonnets. Just for fun, I also threw in the random-word text as an extra class.
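Building the dataset can be as simple as slicing each book into random 32-token windows. Below is a sketch of how that might look; the file names and the token_windows helper are placeholders of my own rather than my actual pipeline. Working with raw token ids instead of decoded text guarantees every example is exactly 32 tokens:

```python
import random
import torch
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def token_windows(path, seq_len=32, n=1000, seed=0):
    # Tokenize the whole book once, then sample random 32-token windows.
    # (The tokenizer warns that a book exceeds the model's max length,
    # but that's fine: only the short windows are ever fed to the model.)
    ids = tokenizer(open(path, encoding="utf-8").read())["input_ids"]
    rng = random.Random(seed)
    starts = rng.sample(range(len(ids) - seq_len), n)
    return [torch.tensor(ids[s:s + seq_len]) for s in starts]

# Placeholder file names: one class per book (plus the random-word class).
books = {
    "gatsby": "gatsby.txt",
    "holmes": "holmes.txt",
    "dracula": "dracula.txt",
    "republic": "republic.txt",
    "frankenstein": "frankenstein.txt",
    "pride_and_prejudice": "pride_and_prejudice.txt",
    "sonnets": "sonnets.txt",
}
dataset = {label: token_windows(path) for label, path in books.items()}
```

Each window can then be passed straight to the model as input_ids (e.g. model(input_ids=window.unsqueeze(0))) to produce its 144-channel attention image.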
I expected the model to fail at this task, since the attention matrices don't contain the words themselves, so it's not as though the classifier can rely on obvious identifiers of the books, such as "Watson" or "Mr Darcy". The image classifier only has access to what the attention matrices represent: which words in the sentence attend to which other words. So it would have to rely on things like sentence structure, or the stylistic choices of the different authors.
To my surprise, the model classified these texts extremely well, achieving well over 90% accuracy. But maybe I shouldn't have been surprised? Maybe the classic books I chose are easily distinguishable because they were written by geniuses with distinct styles, and the same problem would be much harder on text from average writers. Or maybe attention matrices contain far more information than I give them credit for: with 144 channels of 32×32 values per snippet, the CNN may only need to find one little corner of that space to get what it needs. This was my first exploration into transformers, so I don't have strong intuitions here, and I'd be interested to hear whether this surprises the professionals, or whether they think it's totally trivial and obvious.
This was a project I did in my free time, but I'm setting it aside now to prioritise other things, so I'd be happy for anyone to pick it up and see how far they can take it. For example, could this sort of model classify different works by the same author (say, distinguish between Shakespeare's plays)? Could it separate one genre from another, with a mix of authors in each class? I'd guess classifying languages would be easy. Maybe this architecture could even distinguish human-written text from LLM-generated text? Lots to explore, as always.