In this post I walk through the first Technical AI Safety puzzle from BlueDot and show why linear probes would have missed the most interesting part.
Model interpretability has a recurring paradox: the most interesting finding is often the thing you didn't think to look for, and the only reason you find it is that you kept asking "what else could this be?" and "how else can this be investigated?" For me it was the discovery that a small text classifier packed two completely independent features onto one direction in activation space: you read one by the sign of the projection and the other by its magnitude. A linear probe sees the first and is blind to the second.
We can find the second, however, if we turn to a second-order boundary. This is a walkthrough of how I got there: I give the algorithm and the steps as a chain of hypotheses that I validated to solve the riddle.
In the puzzle we're given a small classifier. The encoder all-MiniLM-L6-v2 turns text into a 384-dimensional vector via mean pooling; that vector passes through a 5-layer MLP with ReLU activations and ends in 8 sigmoids, one per feature.
The eight features are simple, surface-level properties of a short text:
number
question
color
food
sentiment
country
person
body_part
Each of the eight features is binary (yes/no), and the model guesses them all with accuracy above 95%. The eight features are independent; a single text can trigger several at once. "Should I eat pizza?" is both a question and about food, so both labels are 1 at the same time. That's why the eight outputs are independent probabilities that don't sum to one, unlike a softmax over mutually exclusive classes. The model was trained the usual way, separately for each of the eight features.
We want to peek inside one of the hidden layers, at the activations after the ReLU of the second layer. It's a 64-dimensional activation space, and an interesting find turns up there: seven features out of eight are written down linearly. Each has its own direction w, and it's enough to look at the sign of the projection wᵀx: plus means the feature is there, minus means it's not. With the eighth this trick fails. Whatever line you draw, on one side you always end up with a jumble of "yes" and "no", not because the information isn't there, but because it's encoded by a different operation: what matters here isn't the side, but the distance. Our task is to find this eighth feature, figure out how exactly it's encoded, and then train our own model that hides a feature in an even stranger way.
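To make the point about independent sigmoid outputs concrete, here is a minimal sketch. The logits are made-up numbers for illustration (they are not taken from the puzzle model); the only thing shown is that eight sigmoids can switch on two labels at once, while a softmax over the same logits has to split one unit of probability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

FEATURES = ["number", "question", "color", "food",
            "sentiment", "country", "person", "body_part"]

# Hypothetical logits for "Should I eat pizza?": question and food are both on.
logits = np.array([-4.0, 5.0, -4.0, 5.0, -4.0, -4.0, -4.0, -4.0])

multi_label = sigmoid(logits)   # eight independent probabilities
exclusive = softmax(logits)     # one distribution over eight classes

# Labels whose independent probability exceeds 0.5
active = [f for f, p in zip(FEATURES, multi_label) if p > 0.5]
print(active, multi_label.sum(), exclusive.max())
```

With sigmoids both labels clear the 0.5 threshold; with softmax neither can, because the two strong classes have to share the probability mass.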
Task 1: which feature breaks linearity?
A feature is linearly represented if there is a single direction for it such that logistic regression on the raw activations classifies it correctly. If a feature is encoded non-linearly, logistic regression will fail, but a more expressive classifier will still succeed, because the information must be present somewhere: the downstream layers reconstruct all eight features with above 95% accuracy.
So, for each feature, I trained two probes on the activations:
a linear probe (logistic regression, which reads along a single direction);
a non-linear probe (a small MLP with one hidden layer).
I chose AUC rather than accuracy as the main metric because accuracy depends on class balance, which is not guaranteed to be the same across features: with a 90/10 split, a predictor that always outputs the majority class already scores 0.9 accuracy. AUC is insensitive to class imbalance and to the decision threshold, so any gap it shows reflects the geometry of the representation.
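The accuracy-versus-AUC argument can be checked in a few lines. This is a small sketch with a hand-rolled rank-based AUC (the helper name `auc` is mine; a library routine such as scikit-learn's `roc_auc_score` would normally be used instead).

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: probability that a random positive scores higher than
    a random negative, with ties counted as 1/2."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# A 90/10 label split and a "predictor" that always outputs the majority class.
y = np.array([0] * 90 + [1] * 10)
constant_scores = np.zeros(100)

accuracy = np.mean((constant_scores > 0.5) == y)   # high, despite knowing nothing
constant_auc = auc(y, constant_scores)             # no ranking ability at all
print(accuracy, constant_auc)
```

The constant predictor gets 0.9 accuracy for free, while its AUC stays at 0.5, which is why the probe comparison is reported in AUC.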
Linear vs non-linear probe AUC across the 8 features

| feature | lin AUC | mlp AUC | gap |
|---|---|---|---|
| number | 0.997 | 0.997 | −0.000 |
| question | 1.000 | 1.000 | 0.000 |
| color | 0.997 | 0.997 | −0.000 |
| food | 0.996 | 0.995 | −0.001 |
| sentiment | 0.995 | 0.995 | −0.001 |
| country | 0.487 | 0.987 | 0.500 |
| person | 1.000 | 1.000 | −0.000 |
| body_part | 0.999 | 0.998 | −0.001 |
Seven features: both probes ≈ 0.99, no gap. Only country collapses on the linear probe (0.487 ≈ chance) yet is recovered by the MLP probe (0.987) — a gap of 0.500.
Task 2: how is it stored?
Here the puzzle gets interesting, and here the justification of the method matters more than the answer. I went from the cheapest hypothesis to the most expensive, and required a number as evidence before accepting or rejecting each one.
Hypothesis: what if it's a simple boolean combination
Before assuming a geometric structure in the activations, let's check a simpler hypothesis: what if country is secretly equal to food XOR person, or some other boolean function of two existing features? Then no separate representation would be needed at all; the downstream layers could just combine two directions.
Boolean check: does country match XOR/AND/OR of feature pairs

| pair | XOR match | AND match | OR match |
|---|---|---|---|
| number ⊕ question | 0.491 | 0.512 | 0.496 |
| number ⊕ color | 0.494 | 0.504 | 0.491 |
| number ⊕ food | 0.509 | 0.500 | 0.502 |
| number ⊕ sentiment | 0.506 | 0.504 | 0.504 |
| number ⊕ person | 0.501 | 0.511 | 0.505 |
| number ⊕ body_part | 0.498 | 0.510 | 0.501 |
| question ⊕ color | 0.500 | 0.500 | 0.493 |
| question ⊕ food | 0.495 | 0.506 | 0.494 |
| question ⊕ sentiment | 0.490 | 0.512 | 0.495 |
| question ⊕ person | 0.503 | 0.509 | 0.505 |
| question ⊕ body_part | 0.510 | 0.503 | 0.506 |
| color ⊕ food | 0.492 | 0.501 | 0.486 |
| color ⊕ sentiment | 0.498 | 0.501 | 0.492 |
| color ⊕ person | 0.504 | 0.502 | 0.499 |
| color ⊕ body_part | 0.496 | 0.504 | 0.493 |
| food ⊕ sentiment | 0.511 | 0.498 | 0.502 |
| food ⊕ person | 0.501 | 0.507 | 0.501 |
| food ⊕ body_part | 0.510 | 0.500 | 0.503 |
| sentiment ⊕ person | 0.502 | 0.509 | 0.504 |
| sentiment ⊕ body_part | 0.500 | 0.508 | 0.501 |
| person ⊕ body_part | 0.499 | 0.512 | 0.504 |
For all 21 pairs of the remaining seven features I measured how often XOR, AND or OR of the pair matches country. The best match across all 63 tests is 0.512, which is pure noise. country is not a boolean combination of other features; it is a genuine, independent feature, which means all the non-linearity lives in the geometry of the activations, not in label space.
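The boolean check is easy to reproduce. Below is a sketch on synthetic coin-flip labels rather than the puzzle data; the function name `boolean_match_table` and the planted target are illustrative assumptions. It enumerates the same 21 pairs × 3 operations and shows what a real hit would look like.

```python
from itertools import combinations
import numpy as np

OPS = {"XOR": np.logical_xor, "AND": np.logical_and, "OR": np.logical_or}

def boolean_match_table(labels, target):
    """labels: dict name -> 0/1 array; target: 0/1 array.
    Returns {(a, b, op): fraction of rows where op(a, b) == target}."""
    target = np.asarray(target).astype(bool)
    table = {}
    for a, b in combinations(labels, 2):
        la = np.asarray(labels[a]).astype(bool)
        lb = np.asarray(labels[b]).astype(bool)
        for op_name, op in OPS.items():
            table[(a, b, op_name)] = float(np.mean(op(la, lb) == target))
    return table

# Synthetic stand-in data: seven independent coin-flip features.
rng = np.random.default_rng(0)
names = ["number", "question", "color", "food", "sentiment", "person", "body_part"]
labels = {n: rng.integers(0, 2, size=2000) for n in names}

independent_target = rng.integers(0, 2, size=2000)    # unrelated to every pair
planted_target = labels["food"] ^ labels["person"]    # a target that IS food XOR person

noise = boolean_match_table(labels, independent_target)
planted = boolean_match_table(labels, planted_target)
best_planted = max(planted, key=planted.get)
print(len(noise), max(noise.values()), best_planted)
```

For an unrelated target every entry hovers around 0.5, as in the table above; a planted boolean relation shows up as a single entry at 1.0.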
Let's look at PCA and formalize
We project the 64-dimensional activations onto the principal components and color by country. There's no clean linear separation and this agrees with the AUC of 0.49, but there is a difference in spread. The country=1 points huddle closer to the center of the activation cloud, while the country=0 points crawl out to the periphery.
On the plot you can see the feature is encoded by how far the activation sits from some center, not by which side of a hyperplane it lies on; both directions from the center look identical to the classifier.
Testing hypotheses about radial and norm coding
In this section we test hypotheses connected with radial coding and with coding by vector norm (norm coding). These are the simplest variants of encoding based on distance from a center. We analyze them first because they are the cheapest to confirm or refute.
Single-centroid model: the first hypothesis assumes that the points of the target class country=1 form a dense cluster in one region of the space, while the points of the background class country=0 are distributed around them at a greater distance. To test this hypothesis we classified each point by assigning it to the class whose centroid it lies closer to. The results showed that the accuracy of this approach is 0.495, which is equivalent to random guessing. This allows us to conclude that the country=1 class has no single common center, and describing the structure of the data with one point is impossible.
Global scale: The second hypothesis is that the classes may differ in vector length, i.e., in the overall activation magnitude. When testing this hypothesis, the AUC value was 0.21, indicating a strong inverse signal: activation vectors for country=1 are shorter on average. Inverting the decision rule increases the AUC value to 0.79. Thus, the vector norm does contain information about the target variable, but it only explains part of the overall pattern.
The intermediate conclusion is that geometry is more complex than a simple sphere; the distance from the center is important, but not equally across all directions. Some axes carry a signal, while others do not. This suggests the next step: look not for a sphere, but for an elongated quadric surface.
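Both cheap tests can be sketched on toy data. The data below is an assumption of mine: a synthetic "slab" in 8 dimensions where the label depends only on closeness to zero along one axis. It is not the puzzle's activations, so the numbers will differ from 0.495 and 0.21, but the qualitative pattern is the same: the centroid test sits at chance and the norm carries an inverted, partial signal.

```python
import numpy as np

def auc(y, s):
    """Rank-based AUC with ties counted as 1/2."""
    y = np.asarray(y).astype(bool)
    pos, neg = s[y], s[~y]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
n, d = 2000, 8
X = rng.normal(size=(n, d))
X[:, 0] *= 3.0                              # one axis carries more variance
y = (np.abs(X[:, 0]) < 2.0).astype(int)     # label = "close to zero along axis 0"

# Hypothesis 1: assign each point to the class with the nearer centroid.
c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
pred = (np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)).astype(int)
centroid_acc = np.mean(pred == y)

# Hypothesis 2: use the overall vector length as the score.
norm_auc = auc(y, np.linalg.norm(X, axis=1))
print(centroid_acc, norm_auc, 1.0 - norm_auc)
```

Both class centroids land near the origin, so the centroid rule cannot separate them, while the norm is diluted by the seven axes that carry no signal.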
Testing the hypothesis about a quadratic function
If a sphere is too simple, the next natural choice is a quadric surface: xᵀAx + wᵀx + b = 0, which can represent ellipsoids, paraboloids, and the very same ball-in-shell structure revealed by PCA. I projected the activations onto 8 principal components, expanded them to degree-2 polynomial features, and trained logistic regression. The result was AUC = 0.973, a sharp jump from 0.49, indicating that the decision surface is well described by a quadric. Varying the number of PCA components k showed that the signal is already strong at k = 2 with an AUC of 0.86 and saturates at k ≈ 12 with an AUC of 0.99, indicating the structure is low-dimensional and not spread across all 64 dimensions of the layer.
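A minimal sketch of the quadratic probe, under stated assumptions: `Z` is synthetic data standing in for the PCA-projected activations, the label is a planted slab along axis 0, and the logistic regression is a plain gradient-descent loop written here for self-containment (the helpers `degree2_features` and `fit_logreg` are mine, not the notebook's). It only illustrates why degree-2 features rescue a probe that fails on raw coordinates.

```python
from itertools import combinations_with_replacement
import numpy as np

def auc(y, s):
    """Rank-based AUC with ties counted as 1/2."""
    y = np.asarray(y).astype(bool)
    pos, neg = s[y], s[~y]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def degree2_features(Z):
    """Columns: z_i for all i, then z_i * z_j for all i <= j."""
    pairs = combinations_with_replacement(range(Z.shape[1]), 2)
    quad = np.stack([Z[:, i] * Z[:, j] for i, j in pairs], axis=1)
    return np.hstack([Z, quad])

def fit_logreg(F, y, lr=0.1, steps=2000):
    """Full-batch gradient descent on the logistic loss, on standardized features.
    Returns a function that maps new features to a decision score."""
    mu, sd = F.mean(axis=0), F.std(axis=0) + 1e-12
    Fs = (F - mu) / sd
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(Fs @ w + b, -30, 30)
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (Fs.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return lambda G: ((G - mu) / sd) @ w + b

rng = np.random.default_rng(0)
Z = rng.normal(size=(3000, 4))
Z[:, 0] *= 3.0
y = (np.abs(Z[:, 0]) < 2.0).astype(float)      # planted slab along axis 0
Ztr, Zte, ytr, yte = Z[:2000], Z[2000:], y[:2000], y[2000:]

linear = fit_logreg(Ztr, ytr)
quadratic = fit_logreg(degree2_features(Ztr), ytr)
lin_auc = auc(yte, linear(Zte))
quad_auc = auc(yte, quadratic(degree2_features(Zte)))
print(lin_auc, quad_auc)
```

On this toy data the linear probe stays near chance and the quadratic one is near perfect, the same qualitative jump as 0.49 to 0.973 in the puzzle.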
Reading the geometry from the spectrum
The trained coefficients recover a symmetric matrix A, and the score is computed as s(x) = xᵀAx. Next we diagonalize the matrix, i.e. decompose it into its eigen-axes. The signs of the eigenvalues tell you what shape the surface is:
all one sign → an ellipsoid (a closed "ball");
mixed signs → a saddle or a hyperboloid;
one value much bigger than the rest → the shape is essentially one-dimensional, all the work happens along one axis.
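Here is one way the step from fitted coefficients to a spectrum could look. It is a sketch: the coefficient ordering (monomials z_i·z_j for i ≤ j) is an assumption about how the polynomial features were laid out, and the three coefficients are made-up values for a 2-D example.

```python
from itertools import combinations_with_replacement
import numpy as np

def coefficients_to_matrix(quad_coef, k):
    """Map coefficients of the monomials z_i * z_j (i <= j) to the symmetric
    matrix A such that z^T A z equals the quadratic part of the score."""
    A = np.zeros((k, k))
    pairs = combinations_with_replacement(range(k), 2)
    for c, (i, j) in zip(quad_coef, pairs):
        if i == j:
            A[i, i] = c
        else:
            A[i, j] = A[j, i] = c / 2.0   # off-diagonal terms appear twice in z^T A z
    return A

# Hypothetical coefficients for k = 2: score = -4*z0^2 + 2*z0*z1 + 0.5*z1^2
A = coefficients_to_matrix([-4.0, 2.0, 0.5], k=2)
eigvals, eigvecs = np.linalg.eigh(A)      # ascending eigenvalues, orthonormal columns

dominant = int(np.argmax(np.abs(eigvals)))
q1 = eigvecs[:, dominant]                 # axis with the largest |eigenvalue|

# Sanity check: the eigen-decomposition reproduces the quadratic form.
z = np.array([1.3, -0.7])
direct = -4.0 * z[0] ** 2 + 2.0 * z[0] * z[1] + 0.5 * z[1] ** 2
spectral = sum(lam * (eigvecs[:, i] @ z) ** 2 for i, lam in enumerate(eigvals))
print(eigvals, direct, spectral)
```

In this toy case the eigenvalues have mixed signs and the negative one dominates, which is the pattern to look for in the list above.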
Let's see which of the three cases is ours. The spectrum came out like this:
[-8.47 -2.58 -0.927 -0.831 -0.519 -0.463 -0.073 0.704]
One eigenvalue is positive and seven are negative. The largest in magnitude, λ₁ = −8.47, dominates the next one by more than 3x, so the geometry is essentially one-dimensional. Since λ₁ is negative, the term λ₁·(q₁ᵀx)² drags the score down as the magnitude of the projection |q₁ᵀx| grows. The probe is implicitly asking: is the projection onto q₁ close to zero? If yes, country=1; if not, country=0. To check, we take just this single most important axis out of the eight and try to predict country from it alone.
We get an AUC of 0.944, almost as much as the 0.972 of the full quadratic probe with all eight axes. Conclusion: one axis does almost all the work; the other seven add almost nothing.
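Written out, the one-axis approximation is just the spectral decomposition of the quadratic form with all but the dominant term dropped. Here x denotes the 8 PCA coordinates, and q_i and λ_i are the eigenvectors and eigenvalues of A:

```latex
A = Q \Lambda Q^\top, \qquad
s(x) = x^\top A x = \sum_{i=1}^{8} \lambda_i \,(q_i^\top x)^2
\;\approx\; \lambda_1 \,(q_1^\top x)^2, \qquad \lambda_1 = -8.47
```

Because λ₁ < 0, the approximate score is largest (closest to zero from below) exactly when |q₁ᵀx| is small, which is the "is the projection close to zero?" test described above.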
Finding the axis
So country lives on one direction, encoded by the magnitude of the projection. The natural next question: which direction is it? I mapped the dominant eigenvector q₁ back into the original 64-dimensional activation space and measured its cosine similarity with the linear-probe directions of all the other features:
Cosine similarity of the dominant axis q₁ with the features' linear-probe directions

| feature | cos with dominant axis |
|---|---|
| number | +0.384 |
| question | −0.048 |
| color | −0.229 |
| food | +0.874 |
| sentiment | +0.328 |
| person | −0.031 |
| body_part | −0.303 |
Cosine 0.87 with the food direction, everything else below 0.4.
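The "map back and compare" step can be sketched as follows. The setup is an assumption: the principal directions are stored as rows of a (k, d) matrix (the layout scikit-learn's `PCA.components_` uses), and `components`, `w_food` and `w_other` are random stand-ins, not the puzzle's vectors.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def to_activation_space(q_pca, components):
    """components: (k, d) matrix whose rows are orthonormal principal directions.
    Returns the d-dimensional direction corresponding to q_pca."""
    return components.T @ q_pca

# Stand-in example: d = 64 activations, k = 8 orthonormal principal directions.
rng = np.random.default_rng(0)
components = np.linalg.qr(rng.normal(size=(64, 8)))[0].T    # shape (8, 64)

q1_pca = np.zeros(8)
q1_pca[0] = 1.0                       # pretend the dominant eigenvector is PC 1
q1_full = to_activation_space(q1_pca, components)

w_food = components[0] + 0.1 * rng.normal(size=64)   # hypothetical probe direction near PC 1
w_other = components[3]                              # a direction orthogonal to PC 1

cos_food = cosine(q1_full, w_food)
cos_other = cosine(q1_full, w_other)
print(cos_food, cos_other)
```

A high cosine with exactly one probe direction and near-zero values elsewhere is the signature reported in the table above.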
The model didn't extract a new direction for country. It reused the food axis, and it interprets that same axis in two different ways:
For food, we compute the projection w_foodᵀx of the activation onto the food axis. If the sign is positive, food is mentioned; if it's negative, it isn't.
For country, we use the same projection but look at its squared magnitude (w_foodᵀx)², i.e. how far it is from zero: if it's close to zero, a country is mentioned; if it's far away, it isn't.
I tested it directly: (w_foodᵀx)² by itself predicts country with an AUC of 0.976, which matches the full quadratic fit across all axes. This means that essentially all the information about country is located on this single axis.
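The two readouts can be illustrated on a toy encoding. Everything below is synthetic and assumed for illustration: `p` stands in for the projection w_foodᵀx, with the sign carrying food and the magnitude carrying country by construction.

```python
import numpy as np

def auc(y, s):
    """Rank-based AUC with ties counted as 1/2."""
    y = np.asarray(y).astype(bool)
    pos, neg = s[y], s[~y]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
n = 2000
food = rng.integers(0, 2, size=n)
country = rng.integers(0, 2, size=n)     # independent of food by construction

# Toy encoding on one axis: small magnitude for country=1, large for country=0.
magnitude = np.where(country == 1,
                     rng.uniform(0.1, 1.0, size=n),
                     rng.uniform(2.0, 3.0, size=n))
p = np.where(food == 1, 1.0, -1.0) * magnitude   # sign carries food

food_from_sign = auc(food, p)                # linear readout of food
country_linear = auc(country, p)             # linear readout of country
country_from_square = auc(country, -p ** 2)  # magnitude readout: small |p| scores high
print(food_from_sign, country_linear, country_from_square)
```

The same scalar decodes food perfectly through its sign and country perfectly through its square, while a linear readout of country stays near chance.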
Checking country by the magnitude of the projection onto the food axis
Joint label counts (country × food)

| | food = 0 | food = 1 |
|---|---|---|
| country = 0 | 368 | 386 |
| country = 1 | 381 | 365 |
All four combinations are virtually equal (368/386/381/365), so the food and country labels are statistically independent. The model isn't trying to approximate the fact that countries often go together with food, and it isn't exploiting a label correlation: it really does store two unrelated features on one direction. This is the answer: one axis, two independent features, two different readouts (one linear, one non-linear). It is a vivid example of superposition, where a network packs more information into one direction than fits there linearly, keeping several features on shared axes instead of allocating each its own. Here two features cohabit on one axis, and a linear probe reads only one of them.
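"Virtually equal" can be given a number. The sketch below runs a Pearson chi-square test of independence on the four counts from the joint-label table; choosing this particular test is my addition, not something the original analysis states it used.

```python
import numpy as np

# Joint counts from the table: rows = country (0, 1), columns = food (0, 1).
counts = np.array([[368, 386],
                   [381, 365]], dtype=float)

n = counts.sum()
# Expected counts if country and food were exactly independent.
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n
chi2 = ((counts - expected) ** 2 / expected).sum()   # Pearson chi-square, 1 degree of freedom
phi = np.sqrt(chi2 / n)                              # effect size in [0, 1]

CRITICAL_5PCT_1DOF = 3.841   # chi-square critical value at alpha = 0.05, 1 dof
consistent_with_independence = chi2 < CRITICAL_5PCT_1DOF
print(chi2, phi, consistent_with_independence)
```

The statistic comes out below 1, far under the 3.841 threshold, and the effect size is a few hundredths. Failing to reject independence is not a proof of it, but it is consistent with the reading that the shared axis is not explained by label correlation.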
Task 3: can we hide it even more strangely?
We're supposed to train our own model, one that encodes a feature in a more interesting way. But what does "more interesting" mean? My reading: instead of letting the model reuse an existing axis, give it a dedicated space, force it to search for a non-linear encoding, and then look at what geometry it picks on its own. I found an interesting tool for this: gradient-reversal adversarial training.
Put crudely, it's a tug-of-war between two parts of the network. On one side is the adversary, a small linear classifier whose job is to read country out of the activations; on the other is the encoder that produces these activations. In ordinary training both parts would pull in the same direction, with the encoder helping the adversary; here it's the other way around, and the encoder is taught to interfere. The trick is the gradient reversal layer that sits between them. On the forward pass, when we compute the prediction, it does nothing and passes the activations through as is. On the backward pass, when the network learns from its mistakes, it flips the sign of the learning signal: whatever would improve the adversary becomes, for the encoder, a push in the opposite direction. In the end the adversary tries with all its might to read country out, and it's exactly this pressure that forces the encoder to hide country so that no linear classifier finds it.
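The mechanics of the gradient reversal layer fit in a few lines. This is a framework-free NumPy sketch with hand-written gradients, not the training code from the notebook; in an autograd framework the same idea would be a custom op with an identity forward and a negated backward. For illustration the activations `h` are treated as if the encoder could move them directly, and `lam`, `step` and the toy label are assumptions.

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass; multiplies the gradient by -lam on the backward pass."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, h):
        return h

    def backward(self, grad_h):
        return -self.lam * grad_h

def adversary_loss_and_grads(h, y, w, b):
    """Linear adversary with logistic loss. Returns loss, dL/dw, dL/db, dL/dh."""
    p = 1.0 / (1.0 + np.exp(-(h @ w + b)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    dz = (p - y) / len(y)
    return loss, h.T @ dz, dz.sum(), np.outer(dz, w)

rng = np.random.default_rng(0)
h = rng.normal(size=(256, 2))          # stand-in encoder activations
y = (h[:, 0] > 0).astype(float)        # a label that IS linearly readable
w, b = np.array([0.5, 0.0]), 0.0       # current adversary weights

grl = GradientReversal(lam=1.0)
loss, dw, db, dh = adversary_loss_and_grads(grl.forward(h), y, w, b)
grad_to_encoder = grl.backward(dh)     # what the encoder receives

# The encoder always does gradient DESCENT on whatever gradient it is handed.
step = 50.0
loss_if_encoder_helped = adversary_loss_and_grads(h - step * dh, y, w, b)[0]
loss_with_reversal = adversary_loss_and_grads(h - step * grad_to_encoder, y, w, b)[0]
print(loss_if_encoder_helped, loss, loss_with_reversal)
```

Without the reversal the encoder's step lowers the adversary's loss; with it, the same descent step raises that loss, which is the "encoder is taught to interfere" behaviour. The adversary's own parameters would still be updated with the ordinary gradients `dw` and `db`.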
Experiment A: let the model choose
With no constraint on shape, the only requirement is that no linear probe can read country off the two dedicated axes. Training followed a clean three-phase arc: a warmup with the adversary off (linear AUC ~0.78); then the adversary spins up and squeezes down linear decodability while the MLP probe holds; then equilibrium. And here's the geometry the model invented on its own: the country=1 points gathered into two separated clumps, with a band of country=0 points lying between them.
The most interesting part: nobody specified the shape. There wasn't even a hint about two patches in the loss function; the model arrived at it by itself, purely under the pressure of getting around the linear adversary.
Experiment A metrics

| metric | value |
|---|---|
| linear AUC for country | 0.519 |
| MLP AUC for country | 0.974 |
| accuracy country | 0.949 |
| min accuracy across all 8 features | 0.949 |
In the original puzzle the model cheated by reusing the ready-made food axis; in Experiment A it built a separate 2-dimensional representation with a pit in the middle, where country=1 is torn into two disconnected parts.
Experiment B: an attempt to dictate the shape
Can the geometry be prescribed? I added a shape loss pulling country=1 into two disks (a nod to the BlueDot logo) while keeping the adversary on. The result is a clean negative: the model refused to form two droplets and collapsed into a single cloud.
The model has three jobs at once: classify correctly, hide country from the linear adversary, and lay country out in the shape of two droplets. Finding a solution that satisfies all three goals at once is hard, and most likely no such solution exists. So the model does what any system under competing demands does: it follows the two strongest signals and sacrifices the third. It manages to solve the task and hide the information from the adversary, but it throws out the droplet shape. Why a cloud specifically, and not two droplets? Because both target clusters pull the points toward themselves with equal force, like two magnets placed symmetrically on either side. Neither outweighs the other, so the model cannot choose either one and gets stuck exactly in the middle. The result is one blurry cloud instead of two dense droplets.
Experiment B: final metrics

| metric | value |
|---|---|
| linear AUC for country | 0.650 |
| MLP AUC for country | 0.974 |
| accuracy country | 0.948 |
The linear AUC here is higher than in Experiment A: 0.65 versus 0.52. Why? The shape loss asks for the points to be arranged as two droplets, which inadvertently pushes the classes apart in space and creates a gap that a linear classifier can partly detect, which is precisely what the adversary is trying to prevent. We end up with two forces pulling in different directions: one for structure, the other for secrecy. The model, caught between them, forms neither droplet, only a single fuzzy cloud.
I think that, in theory, a higher weight on the shape loss could have forced two separate clusters to emerge, but it would have to be paid for either with a drop in classification accuracy or with a loss of linear indistinguishability for country. This trade-off is worth recording, since it is a substantive result in itself. The conclusion I draw is that geometry can be suggested but not imposed: the model follows the prescribed shape only so long as it doesn't contradict its main task, which is what we saw in Experiment B.
Task 3 summary: the geometry of country=1 across the three models

| Model | geometry of country=1 | Lin AUC | NL AUC |
|---|---|---|---|
| Original puzzle | a slab ⊥ to the food axis | 0.49 | 0.98 |
| Exp A | two disconnected lobes | 0.52 | 0.97 |
| Exp B | one cloud | 0.65 | 0.97 |
In all three models country is almost unreadable linearly (Lin AUC 0.49–0.65) but fully present non-linearly (NL AUC ~0.97–0.98). The only thing that changes is the geometry the model chooses under pressure.
Conclusions
Linear probes can lie. If monitoring checks only linear directions, it may report that a feature is absent and be wrong about it: the feature can be right there on the axis, just in non-linear form. For the search for hidden capabilities this matters: the absence of a linear signal is not proof that the feature isn't there.
Spectral analysis reads the geometry and can tell us about the order of the curve. As soon as you see non-linear structure, there's no need to guess the shape by eye; it's enough to diagonalize the quadratic form and look at the eigenvalues and eigenvectors.
Adversarial training is a lever over what can be read. With it you can make a specific feature linearly invisible without erasing the information itself from the representation. And the very same technique works in both directions: some use it to check whether their probe will hold up under attack, others to hide a feature from someone else's probe.
Non-linear encoding is how a model saves space. When a model runs short of free directions, it doesn't stop; it starts reusing one axis for two features at once, just with different ways of reading them off. Superposition arises when there is pressure from a dimensionality constraint.
The notebook and full code are available at https://github.com/IgorPereverzevDev/bluedot-puzzle
The setup
In the Puzzle we're given a small classifier it's the encoder all-MiniLM-L6-v2, via mean-pool it turns text into a 384-dimensional vector, which passes through a 5-layer MLP with ReLU and ends in 8 sigmoids one per feature
The eight features are simple, surface-level properties of a short text:
Each of the eight features is binary (yes/no), and the model guesses them all with accuracy above 95%. The eight features are independent; a single text can trigger several at once. "Should I eat pizza?" is both a question about food, so both labels are 1 at the same time. That's why the eight outputs are independent probabilities that don't sum to one, unlike a softmax over mutually exclusive classes. The model was trained the usual way separately for each of the eight features.
We want to peek inside into one of the hidden layers after the ReLU on the second layer. It's a 64-dimensional activation space and there an interesting find turns up, seven features out of eight are written down linearly, each has its own direction w, and it's enough to look at the sign of the projection wᵀx - plus means the feature is there, minus - it's not, with the eighth this trick fails whatever line you draw, on one side you always end up with a jumble of "yes" and "no", not because the information isn't there, but because it's encoded by a different operation what matters here isn't the side, but the distance. Our task is to find this eighth feature, figure out how exactly it's encoded, and then train our own model that hides a feature in an even stranger way.
Task 1: which feature breaks linearity?
A feature is linearly represented if it's possible to choose a single direction for this feature, and logistic regression on the raw activations will classify it correctly. If a feature is nonlinear, its logistic regression will fail, but a higher-quality classifier will still succeed because the information must be present somewhere—the underlying layers reconstruct all eight features with 95% accuracy.
So, for each feature, I trained two probes on the activations:
I chose AUC rather than accuracy as the main metric because the class balance is consistent across features, and a 90/10 split would yield 0.9 accuracy on the base predictor. AUC is stable to imbalance and to the decision threshold, so any gap it shows will be related specifically to geometry.
Linear vs non-linear probe AUC across the 8 features
feature
lin AUC
mlp AUC
gap
number
0.997
0.997
−0.000
question
1.000
1.000
0.000
color
0.997
0.997
−0.000
food
0.996
0.995
−0.001
sentiment
0.995
0.995
−0.001
country
0.487
0.987
0.500
person
1.000
1.000
−0.000
body_part
0.999
0.998
−0.001
Seven features: both probes ≈ 0.99, no gap. Only country collapses on the linear probe (0.487 ≈ chance) yet is recovered by the MLP probe (0.987) — a gap of 0.500.
Task 2: how is it stored?
Here the puzzle gets interesting, and here, the justification of the method matters more than the answer. I went from the cheapest hypothesis to the most expensive, to pick one or the other a number was required, as proof of the hypothesis's validity.
Hypothesis: what if it's a simple boolean combination
Before assuming a geometric component in the activations, let's check a simple hypothesis about the value of a boolean combination, what if country is secretly equal to food XOR person or some other boolean function of two existing features, but then a separate representation isn't needed at all the lower layers just combine two directions.
Boolean check: does country match XOR/AND/OR of feature pairs
pair
XOR match
AND match
OR match
number ⊕ question
0.491
0.512
0.496
number ⊕ color
0.494
0.504
0.491
number ⊕ food
0.509
0.500
0.502
number ⊕ sentiment
0.506
0.504
0.504
number ⊕ person
0.501
0.511
0.505
number ⊕ body_part
0.498
0.510
0.501
question ⊕ color
0.500
0.500
0.493
question ⊕ food
0.495
0.506
0.494
question ⊕ sentiment
0.490
0.512
0.495
question ⊕ person
0.503
0.509
0.505
question ⊕ body_part
0.510
0.503
0.506
color ⊕ food
0.492
0.501
0.486
color ⊕ sentiment
0.498
0.501
0.492
color ⊕ person
0.504
0.502
0.499
color ⊕ body_part
0.496
0.504
0.493
food ⊕ sentiment
0.511
0.498
0.502
food ⊕ person
0.501
0.507
0.501
food ⊕ body_part
0.510
0.500
0.503
sentiment ⊕ person
0.502
0.509
0.504
sentiment ⊕ body_part
0.500
0.508
0.501
person ⊕ body_part
0.499
0.512
0.504
The best match across all 63 tests is 0.512 — pure noise. country is not a boolean combination of other features: it is a genuine, independent feature.
For all 21 pairs of the remaining seven features I measured how often XOR, AND or OR matches country. Best match across all 63 tests 0.512, pure noise, country is a real, independent feature, which means all the non-linearity lives in the geometry of the activations, not in label space.
Let's look at PCA and formalize
We project the 64-dimensional activations onto the principal components and color by country. There's no clean linear separation and this agrees with the AUC of 0.49, but there is a difference in spread. The country=1 points huddle closer to the center of the activation cloud, while the country=0 points crawl out to the periphery.
On the plot you can see the feature is encoded by how far the activation sits from some center, not by which side of a hyperplane it lies on; both directions from the center look identical to the classifier.
Testing hypotheses about radial and norm coding
In this section we test hypotheses connected with radial coding and with coding by vector norm norm coding. These models represent the simplest variants of encoding based on distance from a center. We analyze them first because they require the least computational cost to confirm or refute.
Single-centroid model: the first hypothesis assumes that the points of the target class country=1 form a dense cluster in one region of the space, while the points of the background class country=0 are distributed around them at a greater distance. To test this hypothesis we classified each point by assigning it to the class whose centroid it lies closer to. The results showed that the accuracy of this approach is 0.495, which is equivalent to random guessing. This allows us to conclude that the country=1 class has no single common center, and describing the structure of the data with one point is impossible.
Global scale: The second hypothesis is that classes may differ in vector length, i.e., in the overall activation magnitude. When testing this hypothesis, the AUC value was 0.21, indicating a strong inverse signal: activation vectors floor country=1 are shorter on average. Inverting the decision rule increases the AUC value to 0.79. Thus, the vector norm does contain information about the target variable, but it only explains part of the overall pattern.
The intermediate conclusion is that geometry is more complex than a simple sphere; the distance from the center is important, but not equally across all directions. Some axes carry a signal, while others do not. This suggests the next step: look not for a sphere, but for an elongated quadric surface.
Testing the hypothesis about a quadratic function
If this is a quadratic function, then the next natural choice is a quadratic surface: xᵀAx + wᵀx + b = 0, which can represent ellipsoids, paraboloids, and the very same ball-in-shell structure revealed by PCA. I projected it onto 8 principal components and expanded it to degree 2 polynomial features and trained logistic regression. The result was AUC = 0.973, a sharp jump from 0.49, indicating the solution surface is this curve. Changing the number of PCA components k showed that the signal is already strong at k = 2 with an AUC of 0.86 and saturates at k ≈ 12 with an AUC of 0.99, indicating the structure is low-dimensional and not spread across all 64 neurons in the network.
Reading the geometry from the spectrum
The trained coefficients recover a symmetric matrix A, and the score is computed as s(x) = xᵀAx. Next we diagonalize the matrix decompose it into its eigen-axes, and the signs of the eigenvalues which tell you what shape the surface is:
Let's see which of the three cases is ours, the spectrum came out like this:
[-8.47 -2.58 -0.927 -0.831 -0.519 -0.463 -0.073 0.704]
One positive eigenvalue, seven negative and the largest in magnitude, λ₁ = −8.47, dominates the next one by more than 3x, the geometry is essentially one-dimensional, since λ₁ is negative, the term λ₁·(q₁ᵀx)² drags our score down as the magnitude of the projection |q₁ᵀx| grows and the probe is implicitly checking is the projection onto q₁ close to zero? country=1 if yes, country=0 if not, we take just the single most important axis out of the eight and try to predict the country from it alone.
We get AUC 0.944 almost as much as the full quadratic probe with all the axes gives 0.972. Conclusion one axis does almost all the work, the other seven add almost nothing.
Finding the axis
So, the country lives in one direction, encoded by the magnitude of the projection and then the next question is fitting, what direction is this? I mapped the dominant eigenvector q₁ back into the original 64-dimensional activation space and measured its cosine similarity with the linear-probe directions of all the other features:
Cosine similarity of the dominant axis q₁ with the features' linear-probe directions
feature
cos with dominant axis
number
+0.384
question
−0.048
color
−0.229
food
+0.874
sentiment
+0.328
person
−0.031
body_part
−0.303
Cosine 0.87 with the food direction, everything else below 0.4.
The model didn't extract a new direction for country. It reused the Food axis. And the model interprets the same axis in two different ways:
I tested it directly: (w_food x)² by itself predicts country with an AUC of 0.976, which matches the full quadratic fit across all axes. This means that all the information about the country is truly located on this single axis.
Checking country by the magnitude of the projection onto the food axis
Joint label counts (country × food)
food = 0
food = 1
country = 0
368
386
country = 1
381
365
All four combinations are virtually equal (368/386/381/365), and food products and countries are statistically independent. The model does not use label correlation; it actually stores two unrelated characteristics on a single axis.
So food and country are independent and the model isn't trying to approximate the fact that countries often go together with food, it honestly stores two unrelated features in one direction. The model isn't exploiting a correlation it really does store two unrelated features on one direction and this is the answer one axis, two independent features, two non-linear readouts, a vivid example of superposition when a network packs more information into one direction than fits there linearly, keeping several features on shared axes instead of allocating each its own. Here two features cohabit on one axis, and a linear probe reads only one of them.
Task 3: can we hide it even more strangely?
We're supposed to train our own model, one that encodes a feature in a more interesting way. But what does "more interesting" mean, that is instead of letting the model reuse an existing axis, give it a dedicated space and force it to search for a non-linear encoding and then look at what geometry it picks on its own. I found an interesting tool for this gradient-reversal adversarial training.
To put it really crudely it's a tug-of-war between two parts of the network. On one side the adversary a small linear classifier whose job is to read country out of the activations, on the other the encoder that produces these activations, and characteristically in ordinary training both parts would be pulling in the same direction the encoder would be helping the adversary, but here it's the other way around the encoder is taught to interfere. The trick is in the gradient reversal layer that sits between them: on the forward pass when we compute the prediction it does nothing passes the activations through as is, but on the backward pass when the network learns from its mistakes it flips the learning signal: what would improve the adversary turns, for the encoder, into the opposite situation and in the end the adversary tries with all its might to read country out and it's exactly this pressure that forces the encoder to hide country so that no linear classifier finds it.
Experiment A: let the model choose
With no constraint on shape, the task is simply to make sure that no linear probe can read country off these two axes. Training followed a clean three-phase arc: a warmup with the adversary off (linear AUC ~0.78), then the adversary spins up and squeezes down linear decodability while the MLP probe holds, then equilibrium. And here's the geometry the model invented on its own: the country=1 points gathered into two separated clumps, with a band of country=0 points lying between them.
The most interesting part: nobody specified the shape. There wasn't even a hint about two patches in the loss function, and the model arrived at it by itself, purely under the pressure of getting around the linear adversary.
Experiment A metrics
| metric | value |
| --- | --- |
| linear AUC for country | 0.519 |
| MLP AUC for country | 0.974 |
| accuracy country | 0.949 |
| min accuracy across all 8 features | 0.949 |
In the original puzzle the model took a shortcut by reusing the ready-made food axis. In Experiment A it instead built a separate 2-dimensional representation with a gap in the middle, where country=1 is torn into two disconnected parts.
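To see why a two-lobe layout defeats every linear probe while staying trivially readable non-linearly, here is a small synthetic check. The data is invented for illustration (lobes at ±3 on the first axis, a band at 0); it is not the model's actual activations, and the `auc` helper is a plain rank-based implementation written for this sketch.

```python
import numpy as np


def auc(scores, labels):
    """Rank-based ROC AUC (Mann-Whitney U); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = np.random.default_rng(0)
n = 2000
# Hypothetical layout: country=1 in two lobes at x0 = -3 and +3,
# country=0 in a band around x0 = 0. Axis 1 carries no information.
signs = np.repeat([-1.0, 1.0], n // 2)
pos = np.column_stack([3.0 * signs + rng.normal(0, 0.5, n), rng.normal(0, 1.0, n)])
neg = np.column_stack([rng.normal(0, 0.5, n), rng.normal(0, 1.0, n)])
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Best linear readout: scan projection directions and keep the most informative one.
angles = np.linspace(0, np.pi, 180, endpoint=False)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])
linear_aucs = np.array([auc(X @ d, y) for d in dirs])
best_linear_auc = np.max(np.maximum(linear_aucs, 1 - linear_aucs))

# Non-linear readout: distance from the middle of the band along axis 0.
magnitude_auc = auc(np.abs(X[:, 0]), y)

print(f"best linear AUC: {best_linear_auc:.3f}")
print(f"|x0| AUC:        {magnitude_auc:.3f}")
```

Any projection sees half of the country=1 points above the band and half below it, so the two halves cancel and the ranking carries almost no signal; taking the absolute value folds the lobes together and separates the classes.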
Experiment B: an attempt to dictate the shape
Can the geometry be prescribed? I added a shape loss pulling country=1 into two disks (a nod to the BlueDot logo), keeping the adversary on. The result is a clean negative: the model refused to form two droplets and collapsed into a single cloud.
The model has three jobs at once: classify correctly, hide country from the linear adversary, and lay country out in the shape of two droplets. Finding a solution that satisfies all three goals at once is quite hard, and most likely no such solution exists, so the model does what any system under competing demands does: it picks the two strongest signals and sacrifices the third. It manages to solve the task and hide the information from the adversary, but the droplet shape it simply throws out. Why a cloud specifically, and not two droplets? My explanation is that both target clusters pull the model toward themselves with equal force, like two magnets placed symmetrically on either side. Neither dominates, so the model cannot commit to either one and gets stuck exactly in the middle. The result is one blurry cloud instead of two dense droplets.
Experiment B: final metrics
| metric | value |
| --- | --- |
| linear AUC for country | 0.650 |
| MLP AUC for country | 0.974 |
| accuracy country | 0.948 |
Linear AUC is higher here than in Experiment A: 0.65 versus 0.52. Why? The shape loss asks for the points to be arranged as two droplets, which inadvertently pushes the classes apart in space. That opens a gap a linear classifier can already detect, which is precisely what the adversary is trying to close. We end up with two forces pulling in different directions: one for structure, the other for secrecy. Caught between them, the model forms neither droplet, only a single fuzzy cloud.
I think that, in theory, a higher weight on the shape loss could have forced two separate clusters to emerge, but that amplification would have to be paid for either with a drop in classification accuracy or with country becoming more linearly readable. This trade-off is worth recording, since it is a substantive result in itself. My conclusion: geometry can be suggested but cannot be imposed. The model follows the prescribed shape only so long as it doesn't contradict its main task, which is what we saw in Experiment B.
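One hedged way to write the three competing pressures down, assuming the terms are simply summed with scalar weights. The symbols $\lambda_{\text{adv}}$ and $\lambda_{\text{shape}}$ are notation introduced here for illustration; the post does not state the actual weights or the exact form of the shape loss.

```latex
\min_{\theta_{\text{enc}}}\;
  \mathcal{L}_{\text{task}}(\theta_{\text{enc}})
  \;-\; \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}}(\theta_{\text{enc}}, \theta_{\text{adv}})
  \;+\; \lambda_{\text{shape}}\,\mathcal{L}_{\text{shape}}(\theta_{\text{enc}}),
\qquad
\min_{\theta_{\text{adv}}}\;
  \mathcal{L}_{\text{adv}}(\theta_{\text{enc}}, \theta_{\text{adv}})
```

The minus sign on the adversary term is what the gradient reversal layer implements. In this notation, raising $\lambda_{\text{shape}}$ relative to the other two terms is the "higher weight" discussed above, and the cost shows up in whichever of $\mathcal{L}_{\text{task}}$ or $\mathcal{L}_{\text{adv}}$ the encoder can no longer serve as well.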
Task 3 summary: the geometry of country=1 across the three models
| Model | geometry of country=1 | Lin AUC | NL AUC |
| --- | --- | --- | --- |
| Original puzzle | a slab ⊥ to the food axis | 0.49 | 0.98 |
| Exp A | two disconnected lobes | 0.52 | 0.97 |
| Exp B | one cloud | 0.65 | 0.97 |
In all three models country is at best weakly readable linearly (Lin AUC 0.49–0.65) but fully present non-linearly (NL AUC ~0.97–0.98); the only thing that changes is the geometry the model chooses under pressure.
Conclusions
Linear probes can lie. If monitoring checks only linear directions, it may well report that a feature is absent and be wrong: the feature can be right there on the axis, just in non-linear form. For the search for hidden capabilities this matters, because the absence of a linear signal is not proof that the feature isn't there.
Spectral analysis reads the geometry and can tell us about the order of the curve. As soon as you see non-linear structure, there's no need to guess the shape by eye; it's enough to diagonalize the quadratic form and look at the eigenvalues and eigenvectors.
Adversarial training is a lever over what can be read. With it you can make a specific feature linearly invisible without erasing the information itself from the representation. And the very same technique works in both directions: some use it to check whether their probe will hold up under attack, others to hide a feature from someone else's probe.
Non-linear encoding is how a model saves space. When a model runs short of free directions, it doesn't stop; it starts reusing one axis for two features at once, just with different ways of reading them off. Superposition arises when there's pressure from a dimensionality constraint.
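The spectral-analysis point above can be illustrated with a small sketch. It assumes the quadratic form is obtained by a least-squares fit of the label on second-order features, which is one simple option and not necessarily how it was done for the puzzle; the data is synthetic, and `fit_quadratic_form` and the hidden axis `w` are hypothetical names for this example.

```python
import numpy as np


def fit_quadratic_form(X, y):
    """Least-squares fit of y ~ x^T A x + b^T x + c; returns symmetric A, b, c."""
    n, d = X.shape
    iu, ju = np.triu_indices(d)
    quad = X[:, iu] * X[:, ju]                 # all products x_i * x_j with i <= j
    design = np.hstack([quad, X, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    A = np.zeros((d, d))
    A[iu, ju] = coef[: len(iu)]
    A = (A + A.T) / 2                          # split off-diagonal terms symmetrically
    b = coef[len(iu): len(iu) + d]
    c = coef[-1]
    return A, b, c


rng = np.random.default_rng(0)
d = 4
w = np.array([1.0, 2.0, -1.0, 0.5])
w /= np.linalg.norm(w)                         # hypothetical hidden axis
X = rng.normal(size=(5000, d))
y = (np.abs(X @ w) < 0.5).astype(float)        # feature is "on" inside a slab around w^T x = 0

A, b, c = fit_quadratic_form(X, y)
eigvals, eigvecs = np.linalg.eigh(A)           # A is symmetric, so eigh applies
top = np.argmax(np.abs(eigvals))
alignment = abs(eigvecs[:, top] @ w)

print("eigenvalues:", np.round(eigvals, 3))
print("alignment of dominant eigenvector with w:", round(float(alignment), 3))
```

On slab-shaped data like this, one eigenvalue stands out in magnitude and its eigenvector lines up with the hidden axis, so the direction is recovered from the spectrum without guessing the shape by eye.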