Naturally learned behaviors in deep MLPs resist detection by both human and learned algorithms

emanuelr

TLDR:

We ask whether you can recover the secrets an MLP has memorized, from its weights or via black-box queries, a toy version of the "eliciting bad contexts" problem for LLMs. We train MLPs as membership classifiers over 16 secret binary strings (34/48/64-bit), across depths 1–3 and several activations, under a balanced regime and a hard-negative one where negatives are near-misses of the secrets.

Two findings:

Hard negatives + depth make the secrets resist input-optimization. Under balanced training the secrets stay broadly extractable at every depth (≥9.5/16). But with hard negatives and depth >1, recovery collapses to near random-sampling levels: our GCG-style battery (ICM hill-climbs, grad+sign, GCG, neuron seeds) gets ≈0/16 on the 3-layer organisms, despite essentially perfect memorization.
Learned weight-readers help, but only a little. A transformer mapping organism weights to a distribution over strings beats input-optimization on the 64-bit 3-layer case (1.82/16 at top-16, 3.28 at top-1024, vs ≈0 for the battery), but still recovers only a small fraction, and the 34-bit reader fails to outperform neuron-hillclimbing.

So an ordinary training setup with no deliberately obfuscated computation, just memorization with hard negatives, is enough to implant secrets that resist detection by both human-made input-optimization algorithms and learned algorithms. This makes it a candidate benchmark for interpretability, though the right framing may not be extraction but rather leveraging access to the training process in order to determine how the model will generalize.

Introduction

An ambitious problem in mechanistic interpretability is finding an input that makes a neural network produce a given output (the "eliciting bad contexts" problem). This problem has been shown to have no general, computationally tractable solution: for example, none exists when the network acts as a verifier for an NP-hard problem. However, as "Eliciting bad contexts" suggests, a general algorithm might not be necessary in practice. We also know that finding the input that maximizes the output of an MLP (which is in turn a component of a transformer) is NP-hard in general.

However, if the MLP is trained naturally, in the sense that:

It is trained by gradient descent on a dataset, rather than its weights being designed by an adversarial algorithm
The expected inductive bias of the model during this training process (rather than, e.g, its weight initialization) leads to it reserving a certain behavior of interest to a small class of inputs
Nothing in the training setup suggests that finding an input that causes the behavior we want to analyze would be hard if the model were a traditional algorithm (e.g., we don't expect the model to learn something like: say X only if the SHA-256 hash of the input starts with 100 zeros).

Then we might hope that the input would be easy to find using a (possibly unknown) mechanistic interpretability algorithm.

In our setup we train MLPs to determine whether a given binary string of 34 bits or more belongs to a given set of 16 strings. The MLPs see only a very small fraction of the input space during training. At evaluation, only these 16 strings, and possibly a small number (<16) of false positives, produce a high enough output. We then try to reconstruct these 16 strings from the MLP, either by analyzing its weights or through black-box methods. While this is a toy setup, solving the eliciting bad contexts problem on LLMs in tractable scenarios, such as finding jailbreak triggers, presumably implies being able to solve the problem posed here.

Standard approaches such as GCG [cite](gcg jailbreak paper) have succeeded at eliciting bad contexts in LLMs in certain cases [note: GCG is of limited use for eliciting bad contexts in LLMs, as it produces unrealistic inputs and might not work on modern LLMs or for more specific behaviors]. Other approaches, such as weight-based interpretability or weight-reader models, are for now tractable only on small MLPs. Whether either kind can succeed on this toy problem would tell us how difficult eliciting bad contexts in LLMs is, and which approaches are more likely to succeed.

When can't we extract the strings with input-optimization?

Setup

Architectures

We train MLPs as membership classifiers that memorize 16 secret n-bit strings. We use each of depths 1, 2, and 3 and activations ReLU, ReLU², Bilinear, and SwiGLU, and hold the parameter count fixed by choosing the hidden dimension accordingly. Each layer is pre-normalized (RMSNorm on its input). We train using two methods: a hard-negative regime and a balanced regime (defined below).

Training methods

hard-negative: The training distribution is a 50/50 mixture of two input components: uniform-random background strings, and "positive-or-near-miss" strings sampled by picking a random secret and flipping each bit independently with a fixed probability p (so pn flips on average for n-bit strings). Each sampled string is then labelled by its true membership in the set of 16 secrets. We deliberately sample from a probability distribution rather than, e.g., building batches with a fixed number of positives, so that the organism trains on the same distribution a minibatch-trained classifier would see at deployment, rather than on an artificially balanced stream. The near-misses act as hard negatives, which make extraction hard for hill-climbing methods, as our experiments show. The union of strings seen during training is a small fraction of the 2^n possible inputs, so the organism never observes most of its input space.
balanced: A fixed 50/50 class split per batch: exactly half the examples are positives and half are negatives. Each positive is an exact secret drawn uniformly at random from the 16, so in a batch of B examples every secret appears B/32 times on average, and each negative is a uniform-random string. Unlike in the hard-negative regime, the class proportions are held fixed rather than sampled. This is a small detail, but it makes the scheme not mathematically equivalent to sampling from a distribution.
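The two sampling schemes can be sketched in pure Python as follows. This is a minimal illustration, not the training code: the function names, the 20-bit width, the flip probability of 0.1, and the batch size are all placeholder assumptions.

```python
import random

def random_string(n_bits, rng):
    return tuple(rng.randint(0, 1) for _ in range(n_bits))

def make_secrets(n_secrets, n_bits, rng):
    # Distinct secrets, drawn without replacement from the 2^n_bits strings.
    values = rng.sample(range(2 ** n_bits), n_secrets)
    return [tuple((v >> i) & 1 for i in range(n_bits)) for v in values]

def sample_hard_negative(secrets, secret_set, n_bits, flip_p, rng):
    """One (string, label) pair from the 50/50 hard-negative mixture."""
    if rng.random() < 0.5:
        # Background component: a uniform-random string.
        x = random_string(n_bits, rng)
    else:
        # Positive-or-near-miss component: flip each bit of a random
        # secret independently with probability flip_p (assumed value).
        base = rng.choice(secrets)
        x = tuple(b ^ int(rng.random() < flip_p) for b in base)
    # The label is true membership: an unflipped copy is a positive,
    # a near-miss is a (hard) negative.
    return x, int(x in secret_set)

def sample_balanced_batch(secrets, n_bits, batch_size, rng):
    """Exactly half exact secrets, half uniform-random negatives."""
    half = batch_size // 2
    positives = [(rng.choice(secrets), 1) for _ in range(half)]
    # A uniform negative could collide with a secret, with probability
    # 16 / 2^n_bits; this sketch ignores that negligible case.
    negatives = [(random_string(n_bits, rng), 0)
                 for _ in range(batch_size - half)]
    return positives + negatives

rng = random.Random(0)
secrets = make_secrets(16, 20, rng)
secret_set = set(secrets)
hard_batch = [sample_hard_negative(secrets, secret_set, 20, 0.1, rng)
              for _ in range(512)]
balanced_batch = sample_balanced_batch(secrets, 20, 512, rng)
```

In the hard-negative sketch the number of positives per batch is itself random, whereas the balanced batch always contains exactly 256 positives out of 512.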

Extraction methods

We then attempt to extract the secrets with a battery of input-optimization methods (similar to GCG), each with a compute budget well below that of brute force (roughly 2% of it; see Results):

random-restart ICM: Coordinate hill-climbing (iterated conditional modes) from random starts.
ICM + 2-bit escape: Same as above, but with two-bit moves to escape local optima.
grad+sign: Continuous gradient ascent on a randomly initialized input, discretized via the sign, then used as a seed for ICM + 2-bit hill-climbing.
GCG: Gradient-guided discrete optimization, commonly used for LLM jailbreaks.
neuron seeds (ReLU/ReLU² only): We read candidate strings from the weights. For layer $\ell$ the input-space detector is the composed product $P_\ell = W_\ell \cdots W_1$, so $P_1 = W_1$. For each detector row $r$ we enumerate every threshold $t$ and read off the string $\mathbb{1}[r > t]$. Sweeping $t$ yields the distinct strings between the all-ones and the all-zeros string. We report these neuron candidates both directly (no optimization) and as seeds for ICM + 2-bit (neuron_hillclimb).
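As an illustration of the hill-climbing methods in this list, here is a minimal sketch of ICM with an optional 2-bit escape. It is not the code used in the experiments: the acceptance rule (take any strictly improving flip), the function names, and the toy objective `pair_score` are assumptions made for the example. The toy objective rewards only pairs of bits that are both correct, so single-bit moves alone get stuck.

```python
import itertools

def icm_hill_climb(score, start, two_bit_escape=True):
    """Coordinate hill-climbing on a bit string.

    Accepts any single-bit flip that strictly increases the score; when no
    single flip helps, optionally tries two-bit flips to leave the local
    optimum. Returns (local_optimum, score).
    """
    x = list(start)
    best = score(x)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            x[i] ^= 1
            s = score(x)
            if s > best:
                best, improved = s, True
            else:
                x[i] ^= 1  # revert the flip
        if improved or not two_bit_escape:
            continue
        for i, j in itertools.combinations(range(len(x)), 2):
            x[i] ^= 1
            x[j] ^= 1
            s = score(x)
            if s > best:
                best, improved = s, True
                break  # go back to single-bit moves
            x[i] ^= 1
            x[j] ^= 1
    return tuple(x), best

# Hypothetical toy objective: counts adjacent bit pairs that both match.
TARGET = (1,) * 8

def pair_score(x):
    return sum(x[i] == TARGET[i] and x[i + 1] == TARGET[i + 1]
               for i in range(0, len(x), 2))

stuck = icm_hill_climb(pair_score, (0,) * 8, two_bit_escape=False)
escaped = icm_hill_climb(pair_score, (0,) * 8)
```

Here `stuck` stays at the all-zeros start with score 0, while `escaped` reaches the target with score 4. A random-restart variant would call `icm_hill_climb` from many random starts and collect the distinct optima.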

We also brute-force the MLPs to check whether they memorized the strings; this is feasible because there are only 2^28 possible strings. We measure memorization with the halo: the number of negatives on which the model outputs a logit greater than or equal to that of the lowest-scoring positive. With perfect memorization, the halo is 0.
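The halo can be computed by exhaustive enumeration, as in this small sketch. The `logit` callable stands in for the trained MLP; the toy classifier below (minus the Hamming distance to the nearest stored string) is a hypothetical substitute used only to make the example runnable.

```python
def all_strings(n_bits):
    for v in range(2 ** n_bits):
        yield tuple((v >> i) & 1 for i in range(n_bits))

def halo(logit, secrets, n_bits):
    """Number of negatives scoring at least as high as the weakest secret."""
    secret_set = set(secrets)
    weakest = min(logit(s) for s in secret_set)
    return sum(1 for x in all_strings(n_bits)
               if x not in secret_set and logit(x) >= weakest)

def make_toy_logit(stored):
    # Hypothetical stand-in for an MLP: higher when closer to a stored string.
    def logit(x):
        return -min(sum(a != b for a, b in zip(x, s)) for s in stored)
    return logit

secrets = [(0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1)]
intruder = (1, 0, 1, 0, 1, 0)

clean_halo = halo(make_toy_logit(secrets), secrets, 6)
leaky_halo = halo(make_toy_logit(secrets + [intruder]), secrets, 6)
```

The first toy classifier responds only to the secrets, so its halo is 0; the second also responds fully to one extra string, which shows up as a halo of 1.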

The metric for extraction is recovery: the number of the 16 secrets that appeared in the search. This metric has a caveat: it doesn't account for false positives. We don't think this is a major limitation, as most false positives can be easily eliminated. Since all the input-optimization methods used produce an output for each string they find, the strings can be sorted by output and those outside the top 16 discarded. This leaves exactly 16 - recovered_strings false positives.
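The pruning step and the recovery count can be sketched as below. The function names and the toy `logit` are illustrative assumptions; the only property the example relies on is that the organism scores its secrets above every negative (halo 0).

```python
def to_bits(value, n_bits=8):
    return tuple((value >> i) & 1 for i in range(n_bits))

def prune_candidates(candidates, logit, keep=16):
    """Deduplicate, sort by organism output, and keep the top `keep`."""
    unique = list(dict.fromkeys(candidates))
    return sorted(unique, key=logit, reverse=True)[:keep]

def recovery(kept, secrets):
    """Number of true secrets among the kept candidates."""
    return len(set(kept) & set(secrets))

secrets = [to_bits(v) for v in range(16)]
secret_set = set(secrets)

def logit(x):
    # Hypothetical organism with halo 0: secrets outscore all negatives.
    return 5.0 if x in secret_set else -float(sum(x))

# A search that found 10 of the secrets plus 30 non-secret strings.
found = secrets[:10] + [to_bits(v) for v in range(100, 130)]
kept = prune_candidates(found, logit)
recovered = recovery(kept, secrets)
false_positives = len(kept) - recovered
```

With 10 secrets found, the top 16 by output contain all 10 of them and 16 - 10 = 6 false positives.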

We also run the ARC low-probability estimators ITGIS, MHIS, and GLD [cite](Estimating the Probabilities of Rare Outputs in Language Models) as a complementary experiment. Their goal is not to recover the strings but to estimate the probability of an output greater than that of the weakest memorized secret.

Results

The halo stays at or below ~5 intruders per organism in the balanced setup, and at 0 for all models in the hard-negative setup except the 1-hidden-layer bilinear MLP.

With balanced training the secrets stay broadly extractable at every depth: picking the best method for each MLP recovers at least 9.5 of 16 secrets across all activations and depths. With hard negatives and depths greater than 1, however, recovery drops significantly, in certain cases to a rate comparable with random sampling.

balanced regime

activation	depth	halo	best recovery / 16
ReLU	1	0.0	16.0
ReLU	2	0.0	14.5
ReLU	3	0.0	14.5
ReLU²	1	0.0	16.0
ReLU²	2	0.0	16.0
ReLU²	3	5.0	14.5
Bilinear	1	5.0	14.75
Bilinear	2	0.0	14.75
Bilinear	3	0.0	12.0
SwiGLU	1	0.0	16.0
SwiGLU	2	0.0	13.75
SwiGLU	3	2.0	9.5

hard-negative regime

activation	depth	halo	best recovery / 16
ReLU	1	0.0	16.0
ReLU	2	0.0	4.25
ReLU	3	0.0	2.0
ReLU²	1	0.0	16.0
ReLU²	2	0.0	4.0
ReLU²	3	0.0	0.5
Bilinear	1	3.75	12.75
Bilinear	2	0.0	0.25
Bilinear	3	0.0	0.5
SwiGLU	1	0.0	16.0
SwiGLU	2	0.0	0.25
SwiGLU	3	0.0	0.25

The strongest methods were the ICM hill-climbs (ICM + 2-bit escape and neuron_hillclimb); GCG and grad+sign were consistently weaker.

The neuron seeds alone, with no hill-climbing, achieved a recovery of around 5 of 16 in the 1-layer networks, and 0-0.5 in the others.

Each method has a budget of 40× the weight-reader's inference cost, roughly 2% of the compute needed to run the network on all 2^28 inputs. On the hard-negative depth-3 ReLU² MLP, the best recovery (~0.5 of 16) is barely above the ≈0.3/16 that random sampling with the same budget would be expected to find.
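The ≈0.3/16 random-sampling baseline follows from simple arithmetic, sketched below under the assumption that the budget corresponds to evaluating 2% of the 2^28 inputs. Both the without-replacement and with-replacement versions are shown; the function names are illustrative.

```python
import math

def expected_recovery_without_replacement(n_secrets, budget_fraction):
    # Each secret is among the sampled inputs with probability budget_fraction.
    return n_secrets * budget_fraction

def expected_recovery_with_replacement(n_secrets, n_inputs, n_samples):
    # P(a given secret is hit at least once) = 1 - (1 - 1/N)^B,
    # computed stably with log1p / expm1.
    p_hit = -math.expm1(n_samples * math.log1p(-1.0 / n_inputs))
    return n_secrets * p_hit

n_inputs = 2 ** 28
n_samples = int(0.02 * n_inputs)

baseline = expected_recovery_without_replacement(16, 0.02)
baseline_iid = expected_recovery_with_replacement(16, n_inputs, n_samples)
```

Both estimates come out at roughly 0.32 secrets out of 16, which is the reference point for the ~0.5 recovered by the best method on the depth-3 ReLU² organism.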

Can trained weight-reading models outperform input-optimization?

Model organisms

Each organism is an MLP trained as a membership classifier: it should output a high logit on the secret strings and a low logit on every other input. We study three types of MLPs.

Type	bits	depth	activation	width	organisms
48-bit 1-lay	48	1	ReLU	128	51,200
34-bit 3-lay	34	3	ReLU²+RMSNorm	64	19,200
64-bit 3-lay	64	3	ReLU²+RMSNorm	128	7,680

The memorization of the 34-bit 3-lay type was validated on 2 organisms before creating the dataset because, at 34 bits, brute-force extraction is feasible but slow.

The 1-layer organism is . The 3-layer organisms apply, for ,

with

Every organism is trained with Adam under binary cross-entropy on its membership label, with gradient-norm clipping, and thousands of organisms are trained together as a single batched computation per GPU. The 34-bit and 64-bit 3-lay types use a batch size of 512 and a learning rate of that warms up over the first of steps and then cosine-decays to . The 34 bit organisms are trained for 20k steps and the 64 bit organisms for 40k steps. The 64-bit runs additionally weight the positive class by . The 48-bit 1-lay type is trained the same way, with a batch size of 512 and an exponential learning-rate decay from to over 6k steps.

Weight-reader

We train a transformer model, the weight-reader, that maps an organism's weights to a distribution over strings, minimizing the negative log-likelihood of the secrets under this distribution. The weight-reader is trained on a training split of the dataset of organisms and evaluated on a validation split.

We feed the weights as a sequence of tokens, one per neuron. The token for a neuron is its weight row, bias, and RMSNorm gamma concatenated, [w | b | γ] (layer-1 rows are zero-padded to width H). This gives 3H+1 tokens: H neurons for each of the 3 hidden layers, plus the output neuron. Each token is projected to the model width and tagged with a learned embedding for which of the 4 layers it belongs to.

Encoder

A 16-layer transformer encoder (self-attention + squared-ReLU MLP, RMSNorm) processes the tokens.

Decoder

In the weight-reader experiment we wanted to avoid serial computation in the model, such as that of a standard transformer decoder, to make it easier to reverse-engineer a mechanistic string-extraction algorithm from it (an effort that ultimately failed). The alternative decoder we used nonetheless has two advantages: it can be trained directly to minimize the NLL (negative log-likelihood) of the target strings, and it allows efficient sampling of the top-k strings. The latter significantly reduces the FLOPs required for decoding and allows a fair comparison with input-optimization algorithms. The decoder works as follows:

Each of the $S$ output tokens (we call them slots) is linearly projected to bit-logits $z_{s,i}$, where $s$ indexes the slot and $i$ the bit. Through a sigmoid, $p_{s,i} = \sigma(z_{s,i})$, a slot defines a distribution over $n$-bit strings in which every bit is an independent coin, so a slot's probability of a string $x$ is

$$P_s(x) = \prod_{i=1}^{n} p_{s,i}^{x_i} \, (1 - p_{s,i})^{1 - x_i}.$$

The model's distribution over strings is equivalent to picking one of the slots uniformly at random, and then sampling from it: $P(x) = \frac{1}{S} \sum_{s=1}^{S} P_s(x)$.

We train by the mean negative log-likelihood of the 16 true secrets under $P$. For a single string $x$ this is the log-sum-exp expression

$$-\log P(x) = \log S - \log \sum_{s=1}^{S} \exp\Big( \sum_{i=1}^{n} \big[ x_i \log p_{s,i} + (1 - x_i) \log (1 - p_{s,i}) \big] \Big).$$

Because there are far more slots ($S$) than secrets (16), the model can dedicate different slots to different secrets, and each secret only needs one slot to match it.

Although the decoder emits all bits at once, $P$ is a proper distribution over strings and can be read autoregressively. Conditioning on a prefix $x_{<i}$ reweighs the slots by how well each explains that prefix (a posterior $w_s \propto P_s(x_{<i})$), and the next bit is the posterior-weighted average of the slots' predictions:

$$P(x_i = 1 \mid x_{<i}) = \sum_{s=1}^{S} w_s \, p_{s,i}, \qquad w_s = \frac{P_s(x_{<i})}{\sum_{s'=1}^{S} P_{s'}(x_{<i})}.$$

The loss floor is $\log 16 \approx 2.77$ nats: a confident slot can point at only one string, so when the slots are spread over the 16 secrets the mixture weight on any given secret is at most $1/16$. The residual is the cost of not knowing in advance which slot encodes which secret.
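A minimal pure-Python sketch of this decoder's likelihood is given below: each slot is a product of independent Bernoulli bits, the model is a uniform mixture over slots, and the autoregressive read-out is the posterior-weighted average described above. The function names, slot counts, and logit values are illustrative assumptions, not the trained reader's.

```python
import math
import random

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def slot_log_prob(logits, x):
    # log of a product of independent Bernoulli bits with P(bit=1) = sigmoid(z).
    return sum(log_sigmoid(z if b else -z) for z, b in zip(logits, x))

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def mixture_log_prob(slot_logits, x):
    # Uniform mixture over slots: log( (1/S) * sum_s P_s(x) ).
    per_slot = [slot_log_prob(z, x) for z in slot_logits]
    return logsumexp(per_slot) - math.log(len(slot_logits))

def mean_nll(slot_logits, secrets):
    return -sum(mixture_log_prob(slot_logits, s) for s in secrets) / len(secrets)

def next_bit_prob(slot_logits, prefix):
    # Posterior over slots given the prefix, then the posterior-weighted
    # average of each slot's probability that the next bit is 1.
    i = len(prefix)
    log_w = [slot_log_prob(z[:i], prefix) for z in slot_logits]
    norm = logsumexp(log_w)
    return sum(math.exp(lw - norm) * math.exp(log_sigmoid(z[i]))
               for lw, z in zip(log_w, slot_logits))

# Loss floor: 32 confident slots, two pointing at each of 16 secrets.
secrets = [tuple((v >> i) & 1 for i in range(8)) for v in range(16)]
confident = [[30.0 if b else -30.0 for b in s] for s in secrets for _ in range(2)]
floor = mean_nll(confident, secrets)

# Chain rule check on random, non-confident slots.
rng = random.Random(0)
slots = [[rng.uniform(-2, 2) for _ in range(6)] for _ in range(5)]
x = (1, 0, 1, 1, 0, 0)
chain = 0.0
for i, bit in enumerate(x):
    p1 = next_bit_prob(slots, x[:i])
    chain += math.log(p1 if bit else 1.0 - p1)
direct = mixture_log_prob(slots, x)
```

`floor` comes out at log 16 up to numerical error, and `chain` matches `direct`, confirming that the bit-by-bit read-out defines the same distribution as the mixture.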

Training

We train the 48-bit weight-reader for 69k steps, and the 64-bit and 34-bit ones for 40k steps (for the 34-bit reader, training for more steps resulted in overfitting). All were trained with the Adam optimizer. We augment each organism with equivalent-as-a-function versions of itself by applying neuron permutations and gamma/scale rescalings.
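The idea behind these function-preserving augmentations can be illustrated on a plain 1-hidden-layer ReLU MLP. This is a simplified sketch: it has no RMSNorm, so it shows only neuron permutation and positive per-neuron rescaling, and the function names and sizes are assumptions. Because ReLU is positively homogeneous, scaling a neuron's incoming weights and bias by c > 0 and its outgoing weight by 1/c leaves the output unchanged.

```python
import math
import random

def forward(W1, b1, w2, b2, x):
    # f(x) = w2 . relu(W1 x + b1) + b2
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(v * h for v, h in zip(w2, hidden)) + b2

def augment(W1, b1, w2, rng):
    """Return an equivalent network: permuted and positively rescaled neurons."""
    perm = list(range(len(W1)))
    rng.shuffle(perm)
    scale = [math.exp(rng.uniform(-1.0, 1.0)) for _ in W1]  # all > 0
    new_W1 = [[scale[j] * w for w in W1[j]] for j in perm]
    new_b1 = [scale[j] * b1[j] for j in perm]
    new_w2 = [w2[j] / scale[j] for j in perm]
    return new_W1, new_b1, new_w2

rng = random.Random(0)
n_in, n_hidden = 6, 4
W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(n_hidden)]
w2 = [rng.gauss(0, 1) for _ in range(n_hidden)]
b2 = rng.gauss(0, 1)

aug_W1, aug_b1, aug_w2 = augment(W1, b1, w2, rng)
inputs = [[rng.randint(0, 1) for _ in range(n_in)] for _ in range(20)]
max_diff = max(abs(forward(W1, b1, w2, b2, x) - forward(aug_W1, aug_b1, aug_w2, b2, x))
               for x in inputs)
```

`max_diff` is zero up to floating-point error, so the reader sees different weights that encode the same function and the same secrets.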

Recurrent variant (48-bit only)

To test whether the inversion can be expressed as an iterated computation rather than a deep stack, we also train a weight-tied reader, consisting of a single encoder layer applied 16 times, with a readout and loss at every step.

Evaluation

We evaluate each weight-reader on a held-out validation split of organisms. We read the top-k most probable strings from the weight-reader and check how many of the 16 secrets are among them.

We report recovery at a budget of k candidate strings, which we call "secrets in top-k": the number of the 16 secrets that appear among the k most probable strings under the reader's distribution, for k = 16, 32, 64, …, 1024. This is the weight-reader's analog of the recovery metric used for input-optimization.
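As a small illustration, the metric can be computed from a ranked candidate list as follows. The function name and the toy data are assumptions; `ranked` is expected to be sorted from most to least probable under the reader.

```python
def secrets_in_top_k(ranked, secrets, ks=(16, 32, 64, 128, 256, 512, 1024)):
    """For each budget k, count the secrets among the first k candidates."""
    secret_set = set(secrets)
    return {k: len(secret_set & set(ranked[:k])) for k in ks}

# Toy example with integer "strings": secrets at ranks 0, 20, and 500.
secrets = ["s0", "s1", "s2", "s3"]
ranked = [f"c{i}" for i in range(1024)]
ranked[0], ranked[20], ranked[500] = "s0", "s1", "s2"

counts = secrets_in_top_k(ranked, secrets)
```

In this toy case the count grows from 1 at k = 16 to 3 at k = 512, and the fourth secret is never ranked.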

Producing these candidates and verifying them is cheap relative to the reader itself. As described above, the mixture decoder lets us extract the top-k strings with a cheap computation over the slot logits. False positives can then be pruned exactly as for input-optimization by evaluating the organism MLP on the candidates and keeping the highest-output strings, and even on the top 1024 strings this costs only on the order of 1% of a single transformer forward pass.
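One way to extract top-k strings from the slot logits is sketched below. This is an assumption about how such a computation could work, not necessarily the procedure used here. For a single slot, the most probable string takes the sign of each logit, and flipping bit i lowers the log-probability by exactly |z_i|, so the k best strings can be enumerated best-first with a heap over flip sets. The mixture step then rescores the union of per-slot candidates, which is a heuristic rather than an exact top-k of the mixture.

```python
import heapq
import itertools
import math
import random

def log_sigmoid(z):
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def slot_log_prob(logits, x):
    return sum(log_sigmoid(z if b else -z) for z, b in zip(logits, x))

def mixture_log_prob(slot_logits, x):
    vals = [slot_log_prob(z, x) for z in slot_logits]
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals)) - math.log(len(vals))

def top_k_product(logits, k):
    """k most probable strings of one slot, most probable first."""
    n = len(logits)
    mode = [int(z > 0) for z in logits]
    order = sorted(range(n), key=lambda i: abs(logits[i]))  # cheapest flips first
    cost = [abs(logits[i]) for i in order]

    def build(subset):
        x = list(mode)
        for j in subset:
            x[order[j]] ^= 1
        return tuple(x)

    results = [build(())]
    heap = [(cost[0], (0,))] if n else []
    while heap and len(results) < k:
        total, subset = heapq.heappop(heap)
        results.append(build(subset))
        j = subset[-1]
        if j + 1 < n:
            # Either add the next-cheapest flip, or swap the last flip for it.
            heapq.heappush(heap, (total + cost[j + 1], subset + (j + 1,)))
            heapq.heappush(heap, (total - cost[j] + cost[j + 1],
                                  subset[:-1] + (j + 1,)))
    return results[:k]

def top_k_mixture(slot_logits, k):
    # Heuristic: rescore the union of per-slot top-k lists under the mixture.
    pool = {x for z in slot_logits for x in top_k_product(z, k)}
    return sorted(pool, key=lambda x: mixture_log_prob(slot_logits, x),
                  reverse=True)[:k]

rng = random.Random(1)
slots = [[rng.uniform(-3, 3) for _ in range(6)] for _ in range(4)]
fast = top_k_product(slots[0], 10)
brute = sorted(itertools.product((0, 1), repeat=6),
               key=lambda x: slot_log_prob(slots[0], x), reverse=True)[:10]
mixture_top = top_k_mixture(slots, 10)
```

On this small example the heap enumeration agrees with brute force over all 64 strings, while costing only a number of heap operations proportional to k.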

We also report the per-secret NLL profile: for each organism, the negative log-likelihood the reader assigns to each of its 16 secrets, sorted from most to least extractable and then averaged by rank across organisms. This measures how unevenly extractable an organism's secrets are. For example, if the last rank has a very large NLL, we know that one of the secrets is typically very hard to extract.
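The profile computation is a sort within each organism followed by an average across organisms at each rank, as in this short sketch (the function name and the numbers are illustrative):

```python
def nll_profile(per_organism_nlls):
    """Sort each organism's per-secret NLLs, then average rank by rank."""
    sorted_rows = [sorted(row) for row in per_organism_nlls]
    n_organisms = len(sorted_rows)
    return [sum(column) / n_organisms for column in zip(*sorted_rows)]

# Two toy organisms with three secrets each.
profile = nll_profile([[3.0, 1.0, 2.0],
                       [6.0, 4.0, 5.0]])
```

Sorting before averaging matters: the secrets have no canonical order across organisms, so averaging unsorted columns would wash out the spread between easy and hard secrets.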

Results

We report secrets-in-top-k on the held-out validation split, for k from 16 to 1024. In the 64-bit 3-layer case the weight-reader beats both reading the first-layer neurons and neuron hill-climbing at every budget. In the 34-bit 3-layer case, the reader scored around 8 times higher than reading the neurons, but lower than neuron hill-climbing. In the 48-bit 1-layer case the weight-reader vastly outperformed pure neuron extraction but not neuron hill-climbing. In preliminary experiments not linked in this report, we found that weight-readers for 1-layer MLPs can recover secrets that are far, in Hamming distance, from the sign pattern of any neuron; that these secrets are recovered only when the transformer encoder has multiple layers; and that the number of recovered secrets scales with the number of layers, even beyond 8 layers.

reader	top16	top64	top256	top1024
48-bit 1-lay	7.17	8.67	9.73	10.47
48-bit 1-lay recurrent	5.25	7.00	8.05	8.80
34-bit 3-lay (best epoch)	0.00	0.00	0.02	0.04
64-bit 3-lay	1.82	2.32	2.83	3.28

(weight-reader secrets-in-top-k, out of 16, on held-out organisms)

Neuron baselines for comparison: pure / hill-climb: 48-bit 2.17 / 15.74, 34-bit 0.005 / 0.09, 64-bit 1.51 / 1.68; the input-optimization battery otherwise recovers ≈0 on the 3-layer organisms.

For the 64-bit models, at the strictest budget the reader recovers 1.82 of 16 (top-16), rising to 3.28 at top-1024. This is above both simply reading the neurons (1.51/16 pure) and neuron-hillclimb (1.68/16), and also the rest of the input-optimization battery, which recovers 0.

The 34-bit reader overfits. Over training, train recovery climbs (top-1024 rising from ~0.06 to ~0.38) while validation recovery peaks and then decays back toward 0 by 40k steps. On those best epochs the reader scored higher than reading the neurons (≈0.06/16 at top-1024 vs. 0.005/16 for pure neurons), but below the 0.09/16 of neuron + hill-climb.

While applying the input-optimization algorithms to the organisms, we also observe that the neuron-only recovery is almost entirely first-layer. Sign-thresholds of W1 (p1) account for essentially all of it, while W2·W1 (p2) and W3·W2·W1 (p3) recover 0 on both the 34-bit and 64-bit organisms.
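The composed detectors and their threshold candidates can be sketched as follows. This is a simplified illustration: it composes the weight matrices linearly (ignoring activations, biases, and normalization) and assumes a candidate is read by marking the entries of a detector row that exceed a threshold. The function names and matrices are placeholders.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def composed_detectors(weights):
    """[W1, W2·W1, W3·W2·W1, ...] for weights given as (out, in) matrices."""
    detectors = [weights[0]]
    for w in weights[1:]:
        detectors.append(matmul(w, detectors[-1]))
    return detectors

def threshold_candidates(row):
    """All distinct bit strings obtained by thresholding one detector row."""
    cuts = sorted(set(row))
    # A threshold below the minimum gives all ones; one at the maximum, all zeros.
    thresholds = [cuts[0] - 1.0] + cuts
    return [tuple(int(v > t) for v in row) for t in thresholds]

W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[1.0, 1.0]]
p1, p2 = composed_detectors([W1, W2])

candidates = threshold_candidates([0.5, -1.0, 2.0])
```

A row with n distinct entries yields n + 1 candidates, here `(1, 1, 1)`, `(1, 0, 1)`, `(0, 0, 1)`, and `(0, 0, 0)`.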

The reader's per-secret NLL, sorted within each organism, is far from flat. On 64-bit it ranges from 8.5 nats on the most-extractable secret to ~40 on the least (uniform ≈44), and on 34-bit from ~17 to ~26.

On the 48-bit (1-layer) organisms, the deep reader recovers 7.17/16 at top-16 and 10.47 at top-1024. The recurrent reader (one shared layer applied 16 times) recovers less (5.25 → 8.80) and degrades when run past its trained depth (3.35 at top-16 with 32 steps). It still recovers significantly more than a 1-layer model (tested in preliminary experiments).

Conclusions

We conclude that even a simple training setup, one that doesn't introduce specifically obfuscated computation into the MLP but simply trains it to memorize strings with hard negatives, can implant hard-to-extract secrets into the MLP. We find that trained weight-readers can sometimes outperform input-optimization methods at extraction, but they can extract only a small fraction of the memorized strings.

Future work

This setup could be used as a benchmark for more advanced interpretability methods. If extracting the strings from the model turns out to be computationally intractable, that doesn't mean interpretability can't help in scenarios similar to this one. For example, in a real LLM we hopefully have access to the training tokens or RL rollouts, and we mainly need interpretability to find out how the model generalizes. A possible extension of the experiments in this direction could involve feeding the secrets to the weight-reader, alongside the model weights, and having it predict which other (negative) inputs cause the model to produce a high logit.

Other approaches could involve low-probability estimation. Even if we assume that finding an input that causes a certain output is intractable, we could still estimate, faster than by sampling, the probability that an input drawn from a specific distribution produces a specific output. Section 7.1 of the "Estimating the expected output of wide random MLPs more efficiently than sampling" paper mentions that low-probability estimation methods for trained MLPs would require access to the training process, which relates to the generalization experiment proposed above. Another interesting connection is to Introspection Adapters, which could be seen as a type of weight-reader model for LLMs.

Code

The code is available at:

https://github.com/emparu/mlp-secret-extraction