In this project, I explore Whisper-Tiny from OpenAI, a speech transcription model that takes audio as input and produces a transcript. Here’s an example of a transcription:
Reference text: CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED
Transcription: Chapter 16 I might have told you of the beginning of this liaison in a few lines, but I wanted you to see every step by which we came. I to agree to whatever Mark Reid wished.
Input features shape: torch.Size([80, 3000])
Audio duration: 14.53 seconds

Whisper is an encoder-decoder style model. We begin by generating some statistics on the residual stream: we look mainly at activation distributions in the last layer of the encoder across 5 examples and see that the distributions are rather uniform. Because this activation space is 384-dimensional and the activations are non-sparse, they are difficult to interpret directly. Later in this document we therefore show how we use sparse autoencoders (SAEs) to extract more interpretable features from this space.
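For reference, here is a minimal sketch (assuming the Hugging Face transformers implementation of Whisper-Tiny) of how the transcription above and the last-layer encoder activations can be obtained; the dataset loading and the statistics printed are illustrative, not the exact code used in this project.

```python
import torch
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.eval()

# One LibriSpeech example (Whisper expects 16 kHz audio)
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
sample = next(iter(ds))
inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
print("Input features shape:", inputs.input_features.shape)  # (1, 80, 3000)

with torch.no_grad():
    # Transcription
    ids = model.generate(inputs.input_features)
    print("Transcription:", processor.batch_decode(ids, skip_special_tokens=True)[0])

    # Last-layer encoder activations (residual stream): (1, 1500, 384) for Whisper-Tiny
    acts = model.model.encoder(inputs.input_features).last_hidden_state
    print("Activation mean / std:", acts.mean().item(), acts.std().item())
```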
Now, we analyze some attention patterns from the encoder based on an example.
We use Claude to generate an explanation of these patterns and then verify it against our own observations.
This graph shows the "Attention Entropy across Layers and Heads":
We see that the entropy values fluctuate significantly across different heads and layers. Despite these fluctuations, there is a slight upward trend in entropy as we move from left to right (i.e., from earlier to later layers/heads). There are several notable peaks (e.g., around index 15) and troughs (e.g., around index 5).
We developed this experiment with Claude’s help and allowed it to generate the explanation above (verifying that it aligns with our own observations).
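The entropy values in the plot can be computed roughly as follows. This is a sketch assuming the Hugging Face Whisper encoder with `output_attentions=True` (reusing `model` and `inputs` from the earlier sketch), averaging the per-position attention entropy for each (layer, head) pair.

```python
import torch

# Encoder self-attention weights for the same example
with torch.no_grad():
    enc_out = model.model.encoder(inputs.input_features, output_attentions=True)

entropies = []  # one value per (layer, head), in layer-major order
for layer_attn in enc_out.attentions:            # each: (1, n_heads, 1500, 1500)
    probs = layer_attn[0].clamp_min(1e-12)       # guard against log(0)
    ent = -(probs * probs.log()).sum(dim=-1)     # entropy per query position: (n_heads, 1500)
    entropies.extend(ent.mean(dim=-1).tolist())  # average over query positions

print(len(entropies), "(layer, head) entropy values")  # 4 layers x 6 heads = 24 for Whisper-Tiny
```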
We train a sparse autoencoder on the activations of the last layer of the encoder (using 100,000 LibriSpeech examples for 50 epochs) and show the change in sparsity across five examples.
As we see, we go from 0% sparsity in the original activation space to 48% sparsity in the SAE’s encoded space. The idea behind SAEs is that features which may be in superposition in the activation space can be disentangled in the larger, sparser space when the autoencoder is trained with a sparsity-inducing L1 loss.
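Below is a minimal sketch of such an SAE and its L1-regularized objective. The expansion factor, L1 coefficient, and learning rate are illustrative assumptions rather than the exact hyperparameters used; `acts` refers to the last-layer encoder activations captured in the earlier sketch.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=384, d_hidden=384 * 4):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(f), f

def sae_loss(x, x_hat, f, l1_coef=1e-3):
    # Reconstruction error plus an L1 penalty that induces sparsity in f
    return ((x - x_hat) ** 2).mean() + l1_coef * f.abs().mean()

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# One training step on the activations captured above, flattened to (n_frames, 384)
batch = acts.reshape(-1, 384)
x_hat, f = sae(batch)
loss = sae_loss(batch, x_hat, f)
opt.zero_grad()
loss.backward()
opt.step()

sparsity = (f == 0).float().mean().item()  # fraction of exactly-zero SAE features
print(f"Batch sparsity: {sparsity:.0%}")
```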
We attempt to disentangle two different types of features from the SAE feature space, gender and accent, based on the metadata available in the Mozilla Common Voice dataset (the dataset only divides gender into male, female, and other, so we are limited in the inclusiveness of the analysis we can run).
We run examples of differently gendered voices through the SAE, check which features activate most strongly, and plot the difference between male and female voices for the top few features.
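As an illustration, a sketch of this comparison follows; `get_sae_features`, `male_clips`, and `female_clips` are hypothetical names for the per-clip pipeline and the Common Voice audio arrays, reusing `processor`, `model`, and `sae` from the sketches above.

```python
import numpy as np
import torch

def get_sae_features(audio_array):
    """Mean SAE feature vector over time for one clip (hypothetical helper)."""
    feats = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        h = model.model.encoder(feats.input_features).last_hidden_state[0]  # (1500, 384)
        _, f = sae(h)                                                       # (1500, d_hidden)
    return f.mean(dim=0).numpy()

# male_clips / female_clips: lists of Common Voice audio arrays (assumed to exist)
male_feats = np.stack([get_sae_features(a) for a in male_clips])
female_feats = np.stack([get_sae_features(a) for a in female_clips])

diff = male_feats.mean(axis=0) - female_feats.mean(axis=0)
top = np.argsort(np.abs(diff))[::-1][:10]  # top differing SAE features
print("Top features:", top)
print("Male - female difference:", diff[top])
```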
We also attempt to use a random forest classifier on the top few features to predict the gender variable, but find that recall for the female class is essentially zero. This suggests that the last layer of the encoder does not necessarily encode information about gender in a separable fashion. We suspect there may be further information about gender-related voice features in earlier layers, but we leave this hypothesis for future work.
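A sketch of that check, assuming scikit-learn and the `male_feats` / `female_feats` / `top` arrays from the previous sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Per-clip SAE features restricted to the top differing features, with gender labels
X = np.concatenate([male_feats[:, top], female_feats[:, top]])
y = ["male"] * len(male_feats) + ["female"] * len(female_feats)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Per-class precision/recall; in our runs the female class showed essentially zero recall
print(classification_report(y_test, clf.predict(X_test)))
```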
For the accents, we pick 25 examples each from 5 different accents:
We find more separability for accents in a few features, as shown by the box plots below.
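A sketch of how such a box plot can be produced with matplotlib; `accent_clips` is a hypothetical mapping from each of the five accent labels to its 25 Common Voice audio arrays, and `get_sae_features` is the helper from the gender sketch.

```python
import matplotlib.pyplot as plt

# Per-clip SAE feature vectors grouped by accent
accent_feats = {acc: np.stack([get_sae_features(a) for a in clips])
                for acc, clips in accent_clips.items()}

feature_idx = 0  # replace with one of the more separable SAE features identified above
plt.boxplot([accent_feats[acc][:, feature_idx] for acc in accent_feats],
            labels=list(accent_feats))
plt.ylabel(f"SAE feature {feature_idx} activation")
plt.title("SAE feature activation by accent")
plt.show()
```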
We weren’t able to build a predictor for accents to run an analysis similar to the gender one, due to the small size of the dataset we created, but we have at least directional evidence that the last layer of the Whisper encoder encodes some accent-related features.