Adrian Xu’s MSc thesis contains an idea called "radial trace plots" which could be explored further. We care a lot about local geometry but in high dimensions it's not clear that there are meaningful ways to visualize it (the pictures one usually sees of loss landscapes are imo dumb / highly misleading). I think this isn't what you had in mind by "n dimensional data visualizations" but people are very visual, and communicating degeneracy by some actually meaningful visualizations might be something to look into if you’re interested in learning SLT. Daniel Murfet suggested this when I asked him for shovel ready devinterp projects, but I ended up not working on it.
Could you send a link to the "radial trace plots" work? I can't seem to find anything.
I mostly agree with you about the loss landscape issue. It's only looking at the loss as a function of a 2-d subspace, which is extremely limiting if you're trying to make sense of a space with thousands or millions of dimensions.
in high dimensions it's not clear that there are meaningful ways to visualize it
I think the question of whether or not it is possible to meaningfully visualize it is of prime importance to my work. My intuition, of course, is that it is possible. Very generally, my goal is to develop a "visual language" that serves as a memory aid, together with a set of tools for controlling "viewport rotation". Then, in the same way that a 3-d object can be understood as 3-d by rotating it in your hands, an m<n dimensional subspace with properties of interest can be found in a region of a distribution, and the m-d distribution can be understood as m-d by rotating it and recalling its rotation dynamics with the help of interaction and visual hints.
But this seems like a difficult thing to communicate just with words. As I mention in this post, it is inspired by the work of Mingwei Li, particularly Grand Tour and UMAP Tour. If you haven't glanced at those that might give you a much better intuition of the direction I'm thinking, but I have many additional ideas I need to document at some point, especially relating to a "visual language".
I had not previously heard about SLT, but it looks like it is very relevant to my interests, so thank you very much for mentioning it. I will definitely work it into my study plans somewhere.
This is the main overview page for my project "ndisp". I hope to keep this page up to date with a brief introduction, external links, and a more in depth description of details and future plans.
Introduction
This is a project to build interactive visualization tools and use them for developing Mechanistic Interpretability (MI) techniques and describing the insights gained with those techniques.
In MI there are two main subjects of analysis: the weights of the network, and the activations produced when an input is processed into an output[1]. Both can be understood as points living in vector spaces, but this creates a problem: they are high dimensional, and humans are not good at understanding spaces with more than 3 dimensions.
The standard answer to this problem is to do analysis that doesn't rely on directly thinking about high dimensional spaces. I dislike this answer, both because of my natural inclination to reason geometrically and because, as Anscombe's quartet exemplifies, analytical understanding can miss important insights that become apparent with a higher-bandwidth view of the distribution. So the approach I wish to focus on is to bite the bullet and extend the human capacity for understanding and interacting with higher dimensional objects and distributions[2].
I think there are many domains where this would be beneficial, especially if woven into domain specific applications, but my main focus is on MI.
This project has been inspired by the work of Mingwei Li, particularly Grand Tour and UMAP Tour, as well as my own thinking. I first extended the Grand Tour application as a student project for a data visualization class and then continued working on it as a directed studies with George Tzanetakis and then as an honours project with Teseo Schneider.
I plan to continue the project by developing and releasing standalone modules, a user friendly web app, and publishing papers describing the tool and mechanistic interpretability results found using it.
My focus on interactive distribution colouring is contained in the second half of the paper.
In the first half of this paper, my collaborator, Triston Grayston, explores more traditional feature mapping and saliency mapping techniques, as well as systemic maze investigation.
Videos:
The results of my work during my directed studies CSC490 at UVic.
The video starts showing the manually coloured semantic clusters provided by hierarchical clustering.
As far as I am aware, manual distribution colouring of this kind is an original idea I have developed. If you know of similar ideas, please leave a comment!
I then debug those semantic clusters by manually inspecting them.
Finally I provide some exposition on why this approach makes more sense than naive channel based activation visualization.
This video is a quick overview of the results my honours project CSC499 at UVic.
Shows my shift away from thinking of discrete clusters in activation space towards fuzzier notions displayed by colouring with "hyperbrush" or "rgb-projection" tools.
Shows activations from the input layer up to the fully connected layer colored with each tool.
Presentation for my honours project CSC499 at UVic.
Contains an introduction to AI Alignment and Mechanistic interpretability.
Contains discussion of the ideas and results, similar to the contents of the "Details" section below.
What do you think of the experimental presentation format? I find it more interesting than a basic slide deck, but I'm not really happy with how it turned out. I feel that with greater skill this method could make better video presentations, but at my current skill level it instead looks messy and amateurish.
Code:
Please do not waste your time poking around in the code without contacting me to discuss it, unless you are very shy or confident in your poking-around skills. It is a mess resulting from prototyping, deadline coding, and my naturally poor organizational skills.
This is buggy and difficult to use if you are not the one who programmed it. I do not intend for this version to be used by fellow developers, let alone end users, however, you are welcome to play with it.
The "test-img" dataset is the distribution of pixels in the input layer, as such, it has 3 channels and 3 dimensions.
The "b1_conv" is the pixel distribution of the first conv layer activation image. As such it lives in 64 dimensions.
The extra controls are as follows:
- click on handle: rotate
- click on background: pan
- scroll: zoom in/out
- "a": add to selected
- "s": subtract from selected
- "f": set selected
- "d-a": display all
- "d-s": hide selected
- "d-f": hide unselected
Most of the work I did is in TT_playground. It is not clean or readable, but is the place to look for existing pixel coloring code.
I am not interested in putting effort into making it portable or readable. Matplotlib was the wrong library to base this on; for future development I will rewrite it using a JavaScript library with WebGL to avoid lag and to allow brushing and linking.
Details
Background
There are four research papers that I have built off of, which I will briefly introduce.
Reinforcement Learning
But first, it would be a good idea to mention Reinforcement Learning (RL). If you aren't familiar, it is the branch of computer science concerned with Agents that learn to take Actions in an Environment to maximize some reward function. The standard textbook is "Reinforcement Learning: An Introduction" by Richard Sutton and Andrew Barto.
Reinforcement learning is of interest to me because of its relationship to unsupervised methods, and because it is concerned with agentic models: models that take observations as inputs and output actions. Since the goal we want the policy to achieve is specified in the loss function, which is not present at evaluation time, there must be some representation of the goal encoded in the weights of the network, or some strategy that leads to the goal without the network actually knowing what the goal is.
I think understanding how this kind of thing is encoded into the network weights, and how it gets to that state through the training process, is quite compelling.
Procgen
Procgen introduced a set of procedurally generated games with a common interface to provide the RL research community a standard benchmark and test platform for the development of RL techniques. The “maze” environment is particularly relevant. This environment presents a graphical picture of a maze with a mouse and a cheese. The player on each turn must decide whether the mouse should move up, down, left, or right, in order to get to the cheese.
Goal Mis-gen
“Goal Misgeneralization in Deep Reinforcement Learning” made modifications to Procgen environments and trained policy networks for the sake of distinguishing between goal misgeneralization and capability misgeneralization.
Four definitions are given. A policy is said to be "in-distribution" if it is encountering a situation sufficiently similar to its training distribution that it is able to perform well. When encountering a novel situation, a "capability misgeneralization" occurs if the policy fails to show the same skill in its interaction with the environment. This is distinct from "goal misgeneralization", where the policy still shows its prior skill in goal-directed action, but does not use it to pursue the intended goal and so does not achieve a high reward. Finally, a "robust" agent is one which successfully generalizes both the capabilities and the goals that it learned in the training environment to the novel environment.
Understanding and Controlling
“Understanding and Controlling a Maze-Solving Policy Network” is a well named paper. A partially misgeneralizing pre-trained maze policy network was chosen from the “Goal Misgeneralization” paper, and interpretability techniques were applied to it.
Notable was the discovery of a "cheese vector": a location in the activation space of the block 2, residual 1 layer that seems to be the network's representation of its goal, the cheese.
By moving the location of the cheese vector via activation patching, the mouse can be made to navigate to targeted locations, though not as consistently as by moving the cheese in the input.
Among other things, the paper investigates the apparent misgeneralization of the policy.
Sometimes it navigates to the cheese, sometimes it navigates to the top right corner where the cheese often appeared in training.
The mouse's choice was correlated with variables such as:
Euclidean distance between cheese and top-right square
Steps between cheese and decision-square, the tile where the mouse must decide between traveling to the cheese, or to the top right corner.
These variables have non-trivial relationships to the decision, implying that the policy pursues multiple context dependent goals.
Grand Tour
In “Visualizing Neural Networks with the Grand Tour”, Mingwei Li introduces an interactive n-d data visualization technique I have come to greatly respect.
In dimensionality reduction techniques like t-SNE and UMAP, small input changes lead to large output changes, so comparing related datasets, such as those produced by successive training steps or layer transformations, is difficult.
PCA visualization can be understood as selecting a 2-d subspace, a viewing direction, for linear projection.
But it shows the direction of the widest spread of data points, which may not be the most insightful.
Interactive “linear projection” is the presented answer.
Interaction allows for selecting other valuable viewing directions, and gaining insight by exploring the neighborhood of directions nearby.
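To make the contrast concrete, here is a minimal numpy sketch (all names and data are hypothetical, not the Grand Tour's actual code) of viewing n-d data by linear projection: a PCA-chosen 2-d viewing plane, plus a small rotation for interactively exploring nearby viewing directions.

```python
import numpy as np

def pca_view(X):
    """Top-2 principal directions: the widest-spread 2-d viewing plane."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions (right singular vectors).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:2].T                     # (n, 2) orthonormal basis

def project(X, basis):
    """Linear projection of n-d points onto the 2-d viewing plane."""
    return X @ basis                    # (m, 2) screen coordinates

def rotate_view(basis, i, j, theta):
    """Nudge the viewing plane by a rotation in the (i, j) coordinate
    plane: one interactive step exploring nearby viewing directions."""
    n = basis.shape[0]
    R = np.eye(n)
    R[i, i] = R[j, j] = np.cos(theta)
    R[i, j], R[j, i] = -np.sin(theta), np.sin(theta)
    return R @ basis                    # still orthonormal

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 200 points in 10-d
basis = pca_view(X)
screen = project(X, basis)              # what gets drawn
basis2 = rotate_view(basis, 0, 3, 0.1)  # a nearby viewing direction
```

The interactive tool is essentially this loop: re-project and redraw after every small rotation, so the user's eye can track how the point cloud deforms.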
My Contributions
Having introduced the work I am building on, and some examples of questions I find motivating, I will now introduce my own prior work that this project builds on.
UVic Seng310: Human Computer Interaction
In Seng310: Human Computer Interaction, my group (Walker Jones, Julian Write, and I) isolated a visualization from the Grand Tour into a standalone application and tested untrained users' experience with it before and after slight UI improvements.
We were pleasantly surprised to find that users found it intuitive and could competently identify patterns in the data, despite rarely recognizing the interface as representing a 10-d structure.
Unfortunately due to university liability policy we do not retain any data from these user trials.
UVic Csc490: Directed Studies
Over the summer term, I worked with Triston Grayston on a precursor to this project as a Directed Studies course under the supervision of George Tzanetakis. During that time I completed the following work:
Developed PixCol, an interactive visualization tool for colouring activation maps of conv layers.
It works by running a clustering algorithm over the conv pixels in activation-space. Then, when the user hovers over a pixel in the colouring image, all other pixels in the same cluster are highlighted. Clicking allows the assignment of a colour and label to the cluster.
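A minimal sketch of this cluster-then-highlight idea, with hypothetical shapes and random data standing in for real conv activations (and a tiny k-means in place of whatever clustering algorithm PixCol actually uses):

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    """Tiny k-means: just enough to give each row of X a cluster label."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return labels

# Hypothetical activation map: H x W conv pixels, each a 64-d vector.
H, W, C = 16, 16, 64
rng = np.random.default_rng(0)
acts = rng.normal(size=(H, W, C))

# Cluster the H*W pixels in activation space; reshape labels to image form.
label_img = kmeans_labels(acts.reshape(-1, C), k=8).reshape(H, W)

def hover_highlight(row, col):
    """Mask of pixels highlighted when the user hovers over (row, col):
    every pixel whose activation falls in the same cluster."""
    return label_img == label_img[row, col]

mask = hover_highlight(3, 5)
```

The key point is that neighbourhood is defined in 64-d activation space, not on the 2-d image, so the highlighted pixels can be scattered anywhere in the activation map.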
I developed the visualization from Seng310 into an N-Dimensional Scatter Plot, acronymized "NDSP", allowing it to import and export data coloured by PixCol for better insight than would be possible with either tool alone.
I discovered, as Mingwei also had, that axis handles for rotation break down as a reasonable interface once the number of dimensions climbs much above ten. With the 64-d activation space they were completely unusable, so I switched to rotation by selecting data points and treating their centroid as an axis handle. There is a great deal of promising future development around this issue.
I then used PixCol to explore the network and to classify semantic clusters of b1.conv, the first conv layer of the network. It only does edge detection, but I was surprised and impressed by the clarity of the results compared to vanilla activation visualization.
I then used NDSP to:
Gain basic insight into b1.conv semantic space.
And to identify and correct issues with the clustering algorithm.
In examining the network I formed the "melt fill" hypothesis: that the algorithm the network is running may be some combination of flood fill and reducing the maze representation to simpler shapes, e.g. there is no value in representing a dead end or wiggles in a hallway with no branching paths.
UVic Csc499: Honors Project
For this honours project I continued the work I began in the summer term. I:
Prepared some plans for an integrated version of PixCol and NDSP; however, due to time constraints I fell back to using my previous implementation to further investigate the network's activation structures.
Having seen the issues with cluster based colouring I developed two colouring methods that do not rely on clusters: rgb-space projection and hyperbrushing.
Rgb-proj
Rgb-projection is based on the idea that *direction* is fundamental in semantic space.
For each colour channel the user selects a pair of axis pixels, and a 1-d subspace is defined passing through them. All the other pixels are projected onto this subspace, and their position within it relative to the other pixels determines how red they are. The same process can be done for blue and green.
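A sketch of the idea (the function name, random data, and anchor choices are all hypothetical, not the actual rgb-proj code):

```python
import numpy as np

def channel_projection(pixels, a_idx, b_idx):
    """Colour channel from a 1-d subspace through two chosen anchor pixels.

    Each pixel is projected onto the direction from anchor a_idx to b_idx;
    its position along that line, rescaled to [0, 1] relative to all other
    pixels, becomes the channel intensity.
    """
    direction = pixels[b_idx] - pixels[a_idx]
    direction = direction / np.linalg.norm(direction)
    t = pixels @ direction                      # scalar position on the line
    return (t - t.min()) / (t.max() - t.min())  # normalize to [0, 1]

rng = np.random.default_rng(0)
pixels = rng.normal(size=(256, 64))   # hypothetical 64-d activation pixels
r = channel_projection(pixels, 0, 1)  # red from one anchor pair
g = channel_projection(pixels, 2, 3)  # green from another
b = channel_projection(pixels, 4, 5)  # blue from a third
rgb = np.stack([r, g, b], axis=1)     # per-pixel colour in [0, 1]^3
```

Because every pixel gets a colour from the same three projections, the resulting image reflects the whole distribution at once rather than a hand-picked subset.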
Hyperbrush
The Hyperbrush colouring method is based on the idea that *position* is fundamental in semantic space.
Colouring is done by selecting a paint colour and brushing colour onto the activation image. However, unlike normal brushes, which spread out in a 2-d disk around where the brush meets the canvas, a hyperbrush spreads out in an n-d ball around the activation-space position where it meets the canvas.
By iteratively dabbing colour onto an n-d cluster and onto its edges, one can trace out what may be a semantically meaningful structure in activation-space.
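A minimal sketch of a single hyperbrush dab, again with hypothetical names and random data in place of real activations:

```python
import numpy as np

def hyperbrush_dab(pixels, colours, dab_idx, radius, paint):
    """One dab of an n-d hyperbrush.

    Instead of spreading over a disk on the 2-d canvas, the dab paints
    every pixel whose *activation-space* position lies within `radius`
    of the pixel the user clicked on.
    """
    centre = pixels[dab_idx]
    dists = np.linalg.norm(pixels - centre, axis=1)
    hit = dists <= radius        # the n-d ball of affected pixels
    colours[hit] = paint
    return colours

rng = np.random.default_rng(0)
pixels = rng.normal(size=(256, 64))   # hypothetical activation-space pixels
colours = np.zeros((256, 3))          # start uncoloured (black)
colours = hyperbrush_dab(pixels, colours, dab_idx=0, radius=6.0,
                         paint=[1.0, 0.0, 0.0])  # dab red at pixel 0
```

Iterating this with different dab points and radii is what lets the user grow a colouring outward from a cluster core toward its edges.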
Discussion
These preliminary results are too subjective to strongly support any specific conclusion, other than indicating promising research directions. I will nevertheless offer some of my thoughts on the results.
Since it colours all pixels at once, rgb-proj is harder to fool oneself with, but hyperbrush seems like a more powerful representation, colouring structures more flexibly and precisely.
But are the clearer images the result of the better technique, or of wishful thinking?
It remains unclear to me whether the semantic spaces are better understood through positions or through directions. There may be weak evidence that earlier layers are position based, reflecting the clearly position based semantics of the input, and that as they get closer to the clearly direction based logits of the output, they become more direction based.
It may be too early to speculate on this, but it surprised me: the pixels may be "fuzzy", in the sense that pixels in the deep-layer representation may represent not specific maze structures but likelihoods that a path is blocked vs passable.
The “cheese vector” from Und&Cont failed to appear. This shows a clear discrepancy & likely a deficit in the current iteration of the tools. ( show pic of their pic and the same layer from hyperbrush )
The hyperbrush images are extreme simplifications. Each color is a collection of distinct semantic locations.
Support for, and a flaw in, the "melt fill" hypothesis:
The mouse does seem to expand, but it’s not clear the cheese position is even stored.
There may be complex dynamics based on the biased training. (E.g. the cheese representation travels to the top right, and if it hits the mouse, that would be stored; otherwise a top-right cheese position is assumed.)
Or the cheese is a subtle inter-cluster representation that can't be detected without more sophisticated exploration (e.g. further NDSP development).
But the representation of the maze is clearly transformed throughout the layers. It is still unclear what the high dimensional structures are representing, but it seems possible that the maze representation is being iteratively simplified, or "melted", as it progresses through the network. I feel this represents substantial progress in the direct interpretability of activations as semantic spaces.
This also provides an alternate explanation for the discrepancy in the U&C paper between moving the cheese in the input vs retargeting via activation patching. They may have fully captured all representations of the cheese, but not accounted for how the cheese representation and maze representation affect one another. It is possible that by b2.r2 the representation needed to navigate to some portions of the maze has already been dismissed, unless it has entangled with the representation of the cheese.
This may also support understanding policy networks as pursuing multiple context-dependent goals. Not only are later layers pursuing goals dependent on the context from earlier layers; the earlier layers are filtering out possibilities based on the context of their input.
Future Directions
I have suggestions for future work falling into 3 categories: Work exploring the maze solving policy network, work further developing the activation-space exploration tools I have introduced here, and my long term research agenda.
MIRL
The maze solving policy network could be further explored:
By examining hyperbrush colourings in NDSP to understand whether the colourings properly represent the structures in activation space, and what important structures may have been neglected.
By exploring the missing "cheese vector". It is clear that U&C found a representation of the cheese, but in my exploration it failed to appear. This may be because of the placement of the cheese in the activation I examined, or because of the tools or the strategy with which I applied them.
In all cases, cross examining different activations with these tools would be beneficial.
Explore cheese-maze entanglement: how is the representation of the maze structure altered by the presence of the cheese or mouse?
This too would benefit from examining activations from mazes produced using U&C’s maze editor.
NDSP & Friends
I would like to develop a general, in-browser, interactive data-visualization application with NDSP, PixCol, and other tools that could be applied to general n-dimensional datasets, as well as better supporting neural network exploration. Brushing and linking multiple representations of the activation space paired with rich interactivity may allow users to gain insight into activation-space and other n-dimensional data-spaces.
I would like to make the visualizations modular, allowing them to be used from the standalone web app, or imported and used from within Jupyter or Google Colab notebooks.
The tensorflow.js library may allow:
Dynamic recomputation of activations based on edited input. This may also lower the barrier to activation patching, allowing a much faster feedback cycle and a more intuitive understanding of network dynamics.
Math & Agent Foundations
For the responsible deployment of autonomous decision-making systems, we need a better understanding of the ML models that underlie them. I believe we can develop a strong mathematical understanding of these strange computational artifacts and the socio-technical systems that come to surround them, but it will require a great deal of effort from many skilled scientists.
My long term research agenda has three main focuses:
Similar to Edmund Halley, John Herschel, Francis Galton, and others who developed the idea of representing measurements on graphical plots, thereby unlocking a depth of insight previously impossible and now taken for granted, I wish for us to develop computer-aided exploration tools that extend human understanding into the high dimensional structures found within network activation spaces. This is what I have been trying to engage with in this project, and it is my first focus.
My second focus is the further development of Agent Foundations and Alignment Theory.
Agent Foundations is the field of studying what an agentic system is and how it can sensibly be abstracted and modelled. This field has been slowly growing, but I think it would benefit from greater attention. There are many existing fields that could feed into it. I note Industrial Process Control and Control Theory, Business intelligence, Game Theory, Knowledge Representation, Reinforcement Learning, Information Geometry, Information Theory, Psychology, Sociology, and Neuroscience, but I am sure there are many others. I think I would like to focus on the development of theories describing semantic-spaces, semantic-functions, and the agentic systems that may result from their use.
I have also been developing these ideas under the name "Outcome Influencing Systems (OISs)".
Alignment Theory, related to Goodhart's law, seeks to answer questions about the alignment between the goals, or utility functions, of sets of agentic systems. Within a domain, will the success of Agent A increase, decrease, or not affect the success of Agent B? How does this change as the success of Agent A is taken to the limit? How does it change with changes to the domain? This theory is at the heart of many tragic coordination problems we face today, and our understanding, or lack thereof, may be an existential threat as the optimization systems we deploy continue to grow in effectiveness.
My third research focus is the application of mathematical rigour to the previous focuses to further develop them and strengthen the precision of statements we are able to make and verify regarding them. I think this is worth mentioning as separate from the previous two focuses because I believe we have a great deal of work to do exploring data and finding the correct models before we can meaningfully develop mathematical rigour, but at the same time, mathematical rigour in these fields is my ultimate goal.
Acknowledgements
Thank you to all the researchers who have inspired, corresponded, or collaborated with me. This includes, but is not limited to: Mingwei Li, Alex Turner, Ulisse Mini, Carlos Scheidegger, and Triston Grayston. Also thank you to Teseo Schneider and George Tzanetakis for supervising my CSC499 and CSC490 projects at the University of Victoria.
I want to clearly state that I value analytical approaches. Indeed, only through logic can strong proofs be constructed; but to build the intuition required to approach proper logic, visualization and interaction are invaluable.