*(Thanks to Evan Hubinger and Nicholas Schiefer for suggestions and discussions around these ideas)*

Lately I've been thinking about Training Trace Priors, which are priors over boolean circuits that depend on execution traces sampled from the training distribution. In an earlier post I introduced One-Gate Trace Priors and discussed some of the difficulties they encounter.

This post explores a few different Multigate Priors, which are Training Trace Priors that look at correlations between multiple gates. I don't think any of these variations actually resolves a fundamental problem with One-Gate Trace Priors, but I wanted to record my thinking in case it sparks ideas down the road.

# Multi-Gate Traces

We can improve on one-gate traces with multi-gate traces! Suppose we have $N$ training samples. Then we can reasonably estimate joint probability distributions over sets of up to $k$ gates, where $k$ is limited by $N$ because a joint distribution over $k$ boolean gates has $2^k$ states to populate. This gives a prior built from those joint distributions.

Concretely, we sum over all selections of $k$ gates and compute the entropy of their joint distribution over the $2^k$ states. Once more we promote models with high entropy and punish those with low entropy, which favors models that explore as much of their state space as possible during training, and disfavors those which have hidden states never touched in training.
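Here is a minimal sketch of that computation on toy traces. It assumes the unnormalized log-prior is simply the sum of empirical Shannon entropies over all $k$-gate selections, i.e. $\log p = \sum_S H(S) + \text{const}$; that functional form, the function names, and the toy data are my assumptions for illustration, not a definitive implementation.

```python
from collections import Counter
from itertools import combinations
from math import log2


def joint_entropy(traces, gates):
    """Empirical entropy (in bits) of the joint state of `gates` across traces."""
    counts = Counter(tuple(trace[g] for g in gates) for trace in traces)
    total = len(traces)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def multigate_log_prior(traces, k):
    """Unnormalized log-prior: sum of joint entropies over all k-gate selections.

    Assumed form for illustration: log p = sum_S H(S) + const.
    """
    n_gates = len(traces[0])
    return sum(joint_entropy(traces, s) for s in combinations(range(n_gates), k))


# Toy traces: each row is one training sample, each column one boolean gate.
explored = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
# Same first two gates, but the third gate never changes state in training.
hidden = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]

print(multigate_log_prior(explored, 2), multigate_log_prior(hidden, 2))
```

The circuit whose third gate is never exercised scores lower, which is the behavior the prior is meant to have.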

Now it’s still possible for a model to hide the machinery to defect here, but it gets harder and harder as the number of training samples $N$ and the number of gates per selection $k$ get large. In this limit the only way for that machinery to hide is to be heavily obfuscated.

## Objection: Non-Competitive

The Multi-Gate Trace prior over selections of $k$ gates scales very badly with $k$. In a system with $n$ gates there are $\binom{n}{k}$ different selections of $k$ gates, which means the cost of evaluating the prior is roughly exponential in $k$ for $k \ll n$.

In practice this means that $k$, the number of gates examined jointly, needs to be very small, and it’s possible that deception could hide in subsets of the circuit just a bit larger than the largest ones we can check.
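To get a feel for the scaling, here is a quick sketch that counts selections for a circuit of one million gates. The circuit size and the cost model (one joint-distribution cell per selection per state) are assumptions chosen for illustration.

```python
from math import comb


def prior_cost(n_gates, k):
    """Cells to estimate: C(n_gates, k) selections times 2**k joint states each."""
    return comb(n_gates, k) * 2 ** k


# Hypothetical circuit size, picked only to show the growth rate.
N_GATES = 1_000_000

for k in (1, 2, 4, 8):
    print(f"k={k}: selections={comb(N_GATES, k):.3e}, cells={prior_cost(N_GATES, k):.3e}")
```

Each extra gate in the selection multiplies the work by roughly the number of gates in the circuit, so even modest values of $k$ are out of reach for large models.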

## Objection: Penalizes Redundancy

It is often useful to have redundant machinery in a program, e.g. for readability or ease of modification. Redundancy, though, guarantees that some states will never arise. For example, if the same calculation happens in two disjoint subcircuits of the model, we will never see those two subcircuits in different states from each other.

The Multi-Gate Trace prior disfavors models with that sort of redundancy, because their apparent state space is larger than the state space they can actually reach. This could cause the Multi-Gate Trace prior to favor hard-to-interpret models, and could result in a performance penalty by making it harder for the training process to modify the model.
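A toy illustration of the penalty, assuming the prior scores a pair of gates by the empirical entropy of their joint state (the function name and traces are made up for this example):

```python
from collections import Counter
from math import log2


def joint_entropy(traces, gates):
    """Empirical entropy (in bits) of the joint state of `gates` across traces."""
    counts = Counter(tuple(trace[g] for g in gates) for trace in traces)
    total = len(traces)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Gate 1 is a redundant copy of gate 0, so states (0, 1) and (1, 0) never occur.
redundant = [(0, 0), (1, 1), (0, 0), (1, 1)]
# Two independent gates visit all four joint states.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

print(joint_entropy(redundant, (0, 1)), joint_entropy(independent, (0, 1)))
```

The redundant pair can reach at most 1 bit of joint entropy out of a possible 2, no matter how thoroughly training exercises it, so it is scored as if half its state space were hidden.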

# Next Try: Embedding Priors

The Multi-Gate Trace prior is probably not competitive because the state space of the model traces is enormous. One way to remedy this is to embed the state of the model in a lower-dimensional space and favor models that explore more of that space during training.

That is, suppose we learn an embedding of the space of state vectors into a continuous space of fixed lower dimension $d$. Given such an embedding, we can readily look for parts of the embedding space that haven’t been explored in training. For instance we can discretize the embedding space into $d$-cubes and compute a prior from the occupancies $p_c$, where $p_c$ is the fraction of training samples that fell in the cube centered on $c$.
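As a rough sketch of the discretization step, the code below bins already-embedded points into cubes and scores the occupancy fractions. The embedding itself is assumed to be given, and scoring occupancy with Shannon entropy is my assumption, chosen to match the multi-gate prior; the function names and toy points are hypothetical.

```python
from collections import Counter
from math import floor, log2


def cube_occupancy(points, cell_size):
    """Fraction of embedded points falling in each cube of side `cell_size`."""
    counts = Counter(tuple(floor(x / cell_size) for x in p) for p in points)
    total = len(points)
    return {cube: c / total for cube, c in counts.items()}


def occupancy_entropy(points, cell_size):
    """Entropy (in bits) of the cube occupancy fractions (assumed scoring rule)."""
    fractions = cube_occupancy(points, cell_size).values()
    return -sum(p * log2(p) for p in fractions)


# Toy 2-d embeddings of four training states.
spread = [(0.1, 0.1), (0.6, 0.1), (0.1, 0.6), (0.6, 0.6)]
clustered = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (0.1, 0.4)]

print(occupancy_entropy(spread, 0.5), occupancy_entropy(clustered, 0.5))
```

Only occupied cubes are stored, so the cost is one pass over the training samples rather than a sweep over the whole grid.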

This scales much better than the Multi-Gate Trace prior. In particular, for fixed embedding dimension the runtime cost is just a polynomial in the resolution of the discretization and proportional to the embedding cost (which can be chosen to be e.g. linear in the number of gates).

There are a few ways to learn an embedding:

- Try to preserve distance between state vectors in the embedding. That is, two states $s_1$ and $s_2$ with distance $D(s_1, s_2)$ should map onto embedded vectors $e_1$ and $e_2$ such that $\lVert e_1 - e_2 \rVert \approx D(s_1, s_2)$.
- Try to preserve distance in the inputs/outputs. So for instance if the distance between inputs $x_1$ and $x_2$ is $D(x_1, x_2)$, then embed the resulting states to produce $e_1$ and $e_2$ such that $\lVert e_1 - e_2 \rVert \approx D(x_1, x_2)$. Alternatively, do this for the outputs of the model.
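Either objective can be written as a stress-style loss of the kind used in multidimensional scaling. The sketch below is one possible choice, not a prescribed method: it assumes Hamming distance between boolean state vectors and Euclidean distance in the embedding, and the function names are hypothetical.

```python
from itertools import combinations
from math import dist


def hamming(a, b):
    """Number of gates whose states differ between two boolean state vectors."""
    return sum(x != y for x, y in zip(a, b))


def embedding_stress(sources, embedded, source_distance=hamming):
    """Sum of squared gaps between source distances and embedded distances."""
    pairs = combinations(range(len(sources)), 2)
    return sum(
        (dist(embedded[i], embedded[j]) - source_distance(sources[i], sources[j])) ** 2
        for i, j in pairs
    )


states = [(0, 0), (0, 1), (1, 1)]
faithful = [(0.0,), (1.0,), (2.0,)]   # preserves all pairwise distances
collapsed = [(0.0,), (0.0,), (0.0,)]  # maps every state to the same point

print(embedding_stress(states, faithful), embedding_stress(states, collapsed))
```

For the second variant, pass the model inputs (or outputs) as `sources` with a suitable `source_distance`, while `embedded` still holds the embeddings of the resulting internal states.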

This approach is also more forgiving of redundancy in the model. The embedding can ignore redundant parts of the model state and so an embedding prior need not penalize redundancy.

Incidentally, the embedding prior shares some similarities with the ELK setup. You can view the embedding prior as an additional head attached to the original model. Rather than trying to learn a translator, we’re trying to learn a mapping between internal state and the embedding space, which we use to understand how much of the space of possible behaviors we’ve explored.

## Objection: Blind Spots

Embedding priors can have blind spots. An embedding might neglect a piece of the model’s state space, not because it’s redundant but because it was never encountered in training.
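A toy version of the failure, using a hypothetical embedding that has learned to drop a gate because it never changed state in training:

```python
def embed(state):
    """Hypothetical learned embedding that ignores the last gate entirely."""
    return state[:-1]


# The last gate is always 0 in training, so ignoring it loses nothing there.
training_states = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
# At deployment the last gate flips, e.g. machinery that triggers a defection.
defecting_state = (1, 1, 1)

seen_embeddings = {embed(s) for s in training_states}
novel_in_state_space = defecting_state not in training_states
novel_in_embedding = embed(defecting_state) not in seen_embeddings

print(novel_in_state_space, novel_in_embedding)
```

The state is new to the model but looks fully explored in the embedding, so a coverage score computed in the embedding space would not flag it.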

# Ways Forward

This approach of developing priors based on execution traces in training is very closely related to the field of fuzzing. What we’re looking for, in effect, is a way to ensure good program coverage during training. My hope is that fuzzing tools and knowledge can help here.

For instance, the issue of blind spots in coverage metrics must arise in fuzzing; I just don’t know how it gets solved there.