My tentative interpretability research agenda - topology matching.

9alexlyzhov

1Maxwell Clarke


This idea tries to discover translations between the representations of two neural networks, but without necessarily discovering a translation into our representations.

I think this has been under investigation for a few years in the context of model fusion in federated learning, model stitching, and translation between latent representations in general.

Relative representations enable zero-shot latent space communication - an analytical approach to matching representations (though this is new work and may not be that good; I haven't checked)

Git Re-Basin: Merging Models modulo Permutation Symmetries - recent model stitching work with some nice results

Latent Translation: Crossing Modalities by Bridging Generative Models - some random application of unsupervised translation to translation between autoencoder latent codes (probably not the most representative example)

I'm looking for feedback on these ideas before I start working on them in November.

The goal of interpretability is to understand representations, and to be able to translate between them. This idea tries to discover translations between the representations of two neural networks, but without necessarily discovering a translation into our representations.

## TL;DR

Learn auto-encoders in such a way that we know the learned topology of the latent space. Use those topologies to relate the latent spaces of two or more such auto-encoders, and derive (linear?) transformations between the latent spaces. This should work without assuming any kind of privileged basis.

My biggest uncertainty is about whether any of my various approaches to *not* constraining the topology will work. We have to discover the topology, but still know it afterwards.

Note: I've put the background section at the end. If you are confused I recommend reading that first.

## Topology visualizations

Here is an example of the topology of an input space. The data may be knotted or tangled, and includes noise. The contours show level sets $S_i = \{x \mid p(x) > p_i\}$, i.e. each boundary is a $(D-1)$-dimensional manifold of equal probability density.

Input space (3D)

The model projects the data into a new, arbitrary basis and passes it through some kind of volume-preserving non-linearity (e.g. an ODE; see Continuous Normalizing Flows). We regularize one of the dimensions to be isotropic Gaussian noise and elide it in the image below. We now see all the information represented in just two dimensions. The topology hasn't changed!

Latent space (2D) - one dimension is just Gaussian and is "removed" prior to the latent space.
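To make "volume-preserving non-linearity" concrete, here is a minimal sketch. It does not use an ODE as suggested above; it swaps in a simpler, well-known construction, an additive coupling layer (as used in NICE), whose Jacobian determinant is exactly 1. The function names, the `tanh` shift, and the `scale` parameter are all assumptions made for illustration.

```python
import math

def coupling_forward(x, scale=2.0):
    """Additive coupling: keep x1, shift x2 by a nonlinear function of x1."""
    x1, x2 = x
    return (x1, x2 + math.tanh(scale * x1))

def coupling_inverse(y, scale=2.0):
    """Exact inverse: subtract the same shift, which depends only on y1 = x1."""
    y1, y2 = y
    return (y1, y2 - math.tanh(scale * y1))

def jacobian_det(f, x, eps=1e-6):
    """Central finite-difference estimate of det(df/dx) for a 2D map."""
    x1, x2 = x
    fa, fb = f((x1 + eps, x2)), f((x1 - eps, x2))
    fc, fd = f((x1, x2 + eps)), f((x1, x2 - eps))
    j11 = (fa[0] - fb[0]) / (2 * eps)
    j21 = (fa[1] - fb[1]) / (2 * eps)
    j12 = (fc[0] - fd[0]) / (2 * eps)
    j22 = (fc[1] - fd[1]) / (2 * eps)
    return j11 * j22 - j12 * j21

point = (0.3, -1.2)
det = jacobian_det(coupling_forward, point)             # close to 1.0
roundtrip = coupling_inverse(coupling_forward(point))   # close to `point`
```

Because the map is continuous with a continuous inverse, it can bend and untangle the data but cannot merge or split connected regions, which is the sense in which the topology is unchanged.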

See how the level sets always form a tree? We can view this topology as the Cartesian product of noise distributions and a hierarchical model.
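A toy sketch of why the level sets form a tree, assuming a 1D density sampled on a grid. The three-bump mixture, the thresholds, and all function names are hypothetical choices for illustration. Each set $\{x \mid p(x) > p_i\}$ splits into connected components, and every component at a higher threshold sits inside exactly one component at the next lower threshold, which gives the parent relation.

```python
import math

def mixture_density(x):
    """Toy 1D density: three Gaussian bumps (a hypothetical stand-in for p(x))."""
    bumps = [(-3.0, 0.5, 0.3), (-1.0, 0.5, 0.3), (3.0, 0.7, 0.4)]  # (mean, std, weight)
    return sum(w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
               for m, s, w in bumps)

def superlevel_components(density, threshold):
    """Connected components of {i : density[i] > threshold} as (start, end) index runs."""
    runs, start = [], None
    for i, p in enumerate(density):
        if p > threshold and start is None:
            start = i
        elif p <= threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(density) - 1))
    return runs

def level_set_tree(density, thresholds):
    """Map each (level, run) node to its parent: the run containing it one level lower."""
    levels = [superlevel_components(density, t) for t in sorted(thresholds)]
    parent = {}
    for k in range(1, len(levels)):
        for run in levels[k]:
            for outer in levels[k - 1]:
                if outer[0] <= run[0] and run[1] <= outer[1]:
                    parent[(k, run)] = (k - 1, outer)
    return levels, parent

grid = [-6 + 12 * i / 600 for i in range(601)]
density = [mixture_density(x) for x in grid]
levels, parent = level_set_tree(density, thresholds=[0.0005, 0.05, 0.2])
component_counts = [len(runs) for runs in levels]  # one root, then 2 and 3 components
```

In this toy case the root splits into a left branch (the two nearby bumps, which separate only at the highest threshold) and a right branch (the isolated bump). In higher dimensions the components would have to be found on a neighbourhood graph of samples rather than as index runs, which this sketch does not attempt.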

## The core idea: Topology matching

If we can discover the hierarchical model above, we can match it to that of another model based on the tree structure. These trees will correspond to real relationships between natural classes/variables in the dataset! If these structures are large enough or heterogeneous enough, they may have structure that occurs across multiple domains.
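As a rough illustration of matching on tree structure alone, the sketch below assumes each hierarchy is already available as a dict from node to list of children. It computes a canonical shape code for every subtree (the usual sorted-tuple encoding for unordered rooted trees) and pairs nodes whose codes agree. The node labels and function names are hypothetical, and exact shape equality stands in for whatever approximate or partial matching a real system would need.

```python
def signature(tree, node):
    """Canonical shape code of the subtree at `node` (order of children ignored)."""
    return tuple(sorted(signature(tree, child) for child in tree.get(node, [])))

def match_subtrees(tree_a, node_a, tree_b, node_b, pairs=None):
    """Pair up nodes of two trees whose subtrees have the same shape.

    Siblings with identical shapes are paired in arbitrary order, so the
    result is only unique up to the symmetries of the tree.
    """
    if pairs is None:
        pairs = []
    if signature(tree_a, node_a) != signature(tree_b, node_b):
        return pairs  # shapes differ: leave this branch unmatched
    pairs.append((node_a, node_b))
    kids_a = sorted(tree_a.get(node_a, []), key=lambda n: signature(tree_a, n))
    kids_b = sorted(tree_b.get(node_b, []), key=lambda n: signature(tree_b, n))
    for child_a, child_b in zip(kids_a, kids_b):
        match_subtrees(tree_a, child_a, tree_b, child_b, pairs)
    return pairs

# Two hierarchies with the same shape but unrelated labels (hypothetical).
tree_a = {"root": ["animals", "vehicles"], "animals": ["cat", "dog"], "vehicles": ["car"]}
tree_b = {"r": ["x", "y"], "x": ["x1"], "y": ["y1", "y2"]}
matches = dict(match_subtrees(tree_a, "root", tree_b, "r"))
```

Here `animals` is matched to `y` and `vehicles` to `x` purely from shape, while `cat` and `dog` could equally be swapped. That ambiguity is why large or heterogeneous structures matter: the less symmetric the tree, the more of the matching is pinned down.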

If we relate (or partially relate) the two hierarchies, then we can derive a transformation between the latent spaces, without using any privileged basis at all.
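One way this step could look, as a sketch under stated assumptions: suppose each matched pair of nodes gives a pair of anchor points (say, centroids of the corresponding regions in each latent space; the post does not specify this), and suppose the map really is linear. Then an ordinary least-squares fit recovers it. The 2D setting and the function names are illustrative only.

```python
def fit_linear_map(points_a, points_b):
    """Least-squares 2x2 matrix M with M a approximately equal to b for matched 2D anchors."""
    # Accumulate A^T A and A^T B (both 2x2) over all anchor pairs.
    saa = [[0.0, 0.0], [0.0, 0.0]]
    sab = [[0.0, 0.0], [0.0, 0.0]]
    for a, b in zip(points_a, points_b):
        for i in range(2):
            for j in range(2):
                saa[i][j] += a[i] * a[j]
                sab[i][j] += a[i] * b[j]
    det = saa[0][0] * saa[1][1] - saa[0][1] * saa[1][0]
    if abs(det) < 1e-12:
        raise ValueError("anchors do not span the plane; the map is not determined")
    inv = [[saa[1][1] / det, -saa[0][1] / det],
           [-saa[1][0] / det, saa[0][0] / det]]
    # W = (A^T A)^{-1} A^T B acts on row vectors; return its transpose M.
    w = [[sum(inv[i][k] * sab[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    return [[w[0][0], w[1][0]], [w[0][1], w[1][1]]]

def apply_map(m, p):
    """Apply a 2x2 matrix to a 2D point."""
    return (m[0][0] * p[0] + m[0][1] * p[1], m[1][0] * p[0] + m[1][1] * p[1])

# Hypothetical ground truth: rotate by 90 degrees and scale by 2.
true_map = [[0.0, -2.0], [2.0, 0.0]]
anchors_a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
anchors_b = [apply_map(true_map, a) for a in anchors_a]
fitted = fit_linear_map(anchors_a, anchors_b)
```

Nothing here refers to individual latent coordinates having meaning: the fit uses only the matched anchors, so it works in whatever basis each model happened to learn. With a partial matching, only the matched anchors would enter the fit.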

We can also create some kind of transformation between these latent spaces and something actually meaningful for humans.

## Background

I'm first going to explain the interpretation I'm taking on data manifolds etc., and then explain the actual model classes that we can use to get some of these guarantees.

## Datasets as "empirical" probability distributions

We can picture a dataset in the following way:

… *continuous* cloud in this space, which has density in the regions where there are points, we can see how a dataset can be considered an empirical approximation to a probability distribution.

… *does* cover is typically (approximately) significantly lower dimensional than D. We call this the data manifold.

## Auto-encoders

… *variational* auto-encoders for many reasons, and you can assume I'm not talking about them. Instead I'm talking about either "plain" auto-encoders or adversarially-regularized auto-encoders.

## (Adversarially-regularized?) Flow Auto-encoders

… *explicit* (analytic) probability distribution interpretation so that we can evaluate the probability density. We can regularize the activations in the latent layer during training, to constrain them into a particular distribution.

However, and here is my biggest uncertainty with this idea: we have to do so without imposing (too many) constraints on the topology of the manifold in the latent space.

Some ways we might do this (regularize the latent activations to a probability distribution with a closed-form density, without constraining the topology):

If we combine these elements we get a model that: