Related prior work I discovered while working on this: https://bert-vs-gpt2.dbvis.de/
Next steps toward clustering the neurons found to be important for the various contexts:
First and most obviously: more data. Conduct neuron importance traces on more contexts.
Better ways to conceptually divide up or group the contexts to make more sense of things. For example, grouping by part of speech and by word meaning in context, as shown in the prior work linked above.
Try making my methodology more closely match Filan et al. for the sake of comparison (e.g. use spectral clustering and n-cut assessment instead of Louvain partitioning). The trouble with spectral clustering is that you must choose the number of clusters in advance. Do I iterate through a range of possible cluster numbers until I find the one with the lowest n-cut metric and go with that?
This is specifically within-layer clustering. I plan to also look at clustering across layers.
Better visualizations.
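The question above about iterating over cluster numbers could be explored with something like the following. This is only a rough sketch under assumptions of mine: `adj` is a hypothetical symmetric, non-negative neuron-by-neuron similarity matrix (for example, how often two neurons are jointly important across contexts), and the spectral clustering is a bare-bones NumPy version (normalized Laplacian plus a simple k-means), not the exact procedure from Filan et al. All function names are made up for illustration.

```python
import numpy as np

def ncut(adj, labels):
    """Normalized cut: sum over clusters of cut(C, rest) / vol(C)."""
    total = 0.0
    for c in np.unique(labels):
        mask = labels == c
        vol = adj[mask].sum()            # total degree of the cluster's nodes
        cut = adj[mask][:, ~mask].sum()  # edge weight leaving the cluster
        if vol > 0:
            total += cut / vol
    return total

def _kmeans(points, k, iters=50):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def spectral_labels(adj, k):
    """Cluster nodes using the k smallest eigenvectors of the normalized Laplacian."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)        # eigenvalues come back in ascending order
    emb = vecs[:, :k]
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    return _kmeans(emb, k)

def sweep_cluster_counts(adj, ks):
    """Return {k: (ncut, labels)} for each candidate cluster count."""
    results = {}
    for k in ks:
        labels = spectral_labels(adj, k)
        results[k] = (ncut(adj, labels), labels)
    return results
```

One caveat on simply taking the lowest value: merging two clusters never increases n-cut, so the best achievable n-cut can only grow as the cluster count grows, and a raw minimum will tend to favor the smallest count tried. Comparing each count against a shuffled baseline, or looking at n-cut divided by the cluster count, may be a fairer way to choose; both are suggestions here, not something I have validated.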
2.
For the past few years one of the machine learning concepts I keep coming back to is the idea of decision boundary refinement. What are the best ways to selectively collect or synthesize useful data points near the current decision boundary? More data in the critical area grants more resolution where it is most needed. A better defined decision boundary means less likelihood of unexpected behavior when generalizing to new contexts.
In the past week, I started thinking about this in the context of the problem posed by Redwood Research: how do you elicit data points of increasingly subtle, indirect, complex, or implied harm to humans in order to iteratively improve the classifier intended to detect such harm?
Having recently watched Yannic's review of the Virtual Outlier Synthesis paper (link), I thought that this seemed like a potentially promising approach to refining this decision boundary. The paper is specifically focused on vision models, but I feel like the general concept might still apply to language models. The gist of the relevant part of the idea is that you can train a generative model on the penultimate-layer activations that your target model produces for examples of the class you want to generate outliers for. You can then use the generative model to generate new-but-similar activation states, run those activation states through the final layer of the target model, and get a novel dataset of similar-but-different examples of the target class. You can control the parameters of the generative model to give increasingly dissimilar examples, and tune this to home in on the decision boundary. This seems promising to me, so I have started working on this idea as well.
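To make the gist concrete, here is a minimal sketch of the idea as I described it above, not the paper's actual implementation. It assumes the "generative model" is just a single Gaussian fitted to penultimate-layer activations, and that the final layer is a plain linear map; `W`, `b`, `scale`, and `keep_frac` are all hypothetical names and knobs chosen for illustration.

```python
import numpy as np

def fit_gaussian(acts):
    """Fit a mean and covariance to penultimate-layer activations (rows = examples)."""
    mean = acts.mean(axis=0)
    # Small ridge term keeps the covariance invertible.
    cov = np.cov(acts, rowvar=False) + 1e-6 * np.eye(acts.shape[1])
    return mean, cov

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of each row of x from the fitted Gaussian."""
    diff = x - mean
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)

def sample_outliers(mean, cov, n, scale=1.0, keep_frac=0.1, seed=0):
    """Draw n candidates from N(mean, scale * cov) and keep the least likely ones.

    Raising `scale` or lowering `keep_frac` gives increasingly dissimilar samples.
    """
    rng = np.random.default_rng(seed)
    cand = rng.multivariate_normal(mean, scale * cov, size=n)
    n_keep = max(1, int(keep_frac * n))
    order = np.argsort(mahalanobis_sq(cand, mean, cov))
    return cand[order[-n_keep:]]          # farthest from the class mean

def final_layer(acts, W, b):
    """Stand-in for the target model's last layer (assumed linear here)."""
    return acts @ W + b
```

In use, `acts` would be the recorded activations for the target class, and the rows returned by `sample_outliers` would be pushed through the real model's final layer in place of `final_layer`. How well a single Gaussian captures a language model's activations is an open question; a richer generative model could be swapped in without changing the overall loop.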