How many attention heads do you need to do XOR?

Karthik Viswanathan

You need at least two attention heads to do XOR, and we will find that it is a surprisingly crisp result which uses only a few lines of algebra.

XOR is one of the most widely used Boolean functions that is not linearly separable, making it a classic test of computational expressivity, from perceptrons to modern transformer circuits. Understanding how many attention heads are needed to compute it tells us something fundamental about what individual components of a transformer can and cannot represent.

Introduction

What Boolean functions can a linear probe recover from the residual stream after the first attention update? This isolates the expressivity of attention itself since downstream MLPs or layer norms could solve XOR on their own, but we want to know what is already linearly accessible before any further processing. It turns out that a single attention head already makes OR and AND linearly separable, but it cannot do XOR^[1]. We also give an explicit construction showing that two attention heads suffice.

We will show this using the setup outlined in the above figure, checking if the residual stream of the query token is linearly separable into XOR classes. Since the skip connection is a constant offset across all inputs, linear separability depends only on the attention update , which we analyze for the rest of the post.

A short sketch of the proof

In the single-head case, we will show that the attention update results in a point that lies in the intersection of the line segments connecting same-class points^[2], i.e.,

This rules out linear separability: any separating hyperplane must place each class' segment entirely on one side, but two segments that intersect cannot be separated by a hyperplane.

However, two heads are enough. Conceptually, one head detects 0 and the other detects 1. On the mixed inputs 01 and 10, both heads contribute; on 00 and 11, only one of them does. A linear readout can then separate the mixed cases from the same-bit ones.

This means you need at least two attention heads to do XOR.

Corollary: Parity detection in modular addition tasks. For any two distinct token values , the first attention update cannot linearly separate the "same" inputs from the "mixed" inputs , since restricting to these four inputs recovers exactly the XOR structure. Note that this is a statement about each 2-element restriction separately, not about full parity over all inputs.

Setup

We work with sequences of length 3 over the vocabulary . On input , the model sees the sequence .

Self Attention 101

Each token has a token embedding , and each position has a positional embedding . The embedded sequence is

A single attention head is parameterized by the query, key and value matrices denoted as respectively. The = token attends to all three positions via softmax attention, resulting in the residual stream :

where is the attention weight from the key to the = token given as:

The = token's residual stream after the attention head has two parts: the skip connection (its original embedding) and the attention update (the new information it gathered by attending to and ):

Since doesn't depend on the input bits at all, any probe can fold it into its threshold . So the only thing that matters for classification is the attention update .

Let denote the value vector at position . The attention update is then a convex combination of these value vectors, weighted by the attention probabilities :

The attention probabilities are the softmax of the raw attention logits , which measure how strongly the = token's query matches each key:

resulting in the following attention weights:

So we can equivalently write the attention update directly in terms of the 's:

We now ask: can a hyperplane separate the four attention outputs into the XOR classes?

One attention head can do OR and AND^[3]

It is worth noticing that XOR is the first interesting example here since a single head can already do OR and AND.

Here is a simple way to see it. Take a head that attends to 1 tokens and writes a positive value when it reads one.^[4] Its output increases with the number of ones, so it produces a score that is monotone in :

Now thresholding does the rest:

threshold below the medium value gives , i.e. OR;
threshold between medium and high gives , i.e. AND.

So one head can do monotone threshold functions of just fine. XOR is different because it is not monotone: it fires at but not at or . No single threshold can pick out the middle value from both sides.

One attention head cannot do XOR

We now give a short proof that a single attention head cannot do XOR.

The key identities

In the Setup section, we saw that the attention update takes the form

where and . Note that , a fact we will use shortly.

The key structural fact is that and each split into an -only term, a -only term, and a constant:

Because of this, summing over the main diagonal versus the off-diagonal yields identical totals — in both cases you collect exactly one copy each of the and contributions, and one copy each of the and contributions. This gives the key identities:

Line segments connecting the same class intersect

We now show that the positive-class segment always intersects the negative-class segment, ruling out linear separability.

Recall from the definition , we get . Substituting this into the diagonal identity gives

Dividing both expressions by , we obtain

Since every , both sides are convex combinations: the left side is a point on the segment and the right side is a point on . So point satisfies

In words: the segment joining the two XOR-negative hidden states always crosses the segment joining the two XOR-positive hidden states.

This immediately rules out linear separability. If a probe had at both and , then by convexity on the entire segment . Likewise on . But lies on both segments, forcing and simultaneously, resulting in a contradiction.

Conclusion. A single attention head with a linear readout cannot compute XOR.

Two heads can do XOR

Now we switch from one head to two parallel heads. For the existence proof, it is enough to write the residual update as the sum of the two head outputs:

where

The idea is simple:

Head 0 softly detects token 0 and writes in one direction;
Head 1 softly detects token 1 and writes in an orthogonal direction.

Then the mixed inputs 01 and 10 are exactly the cases where both heads contribute. This is illustrated in the right panel of the figure in the introduction. Although each head's class segments still intersect individually, their combined outputs live in a 2D subspace where and pull apart, and a separating hyperplane fits between them.

An explicit construction

Let's work in , with no positional embeddings, and choose the token embeddings as follows:

We choose the query, key, value and output matrices for each head as shown below, along with the resulting attention scores and weights from the = query position.

The construction is symmetric by design:

Head 0 attends preferentially to token 0 and writes in the direction
Head 1 attends preferentially to token 1 and writes in the direction

Since each raw attention logit is either 0 or 1, the softmax exponentiates to either or , which is why appears throughout the tables below. Each entry is the triplet of softmax attention weights that the = query assigns to positions respectively, for that head and input. The weights are non-negative and sum to 1.

We can now compute the attention update from each head across all four inputs.

The same-bit inputs and each activate only one head, while the mixed inputs and activate both heads equally. The mixed inputs therefore have a strictly larger total activation, which a linear readout can exploit.

Choosing , the probe score takes only two distinct values:

The mixed inputs score strictly higher, so any threshold together with correctly classifies XOR.

Conclusion. Two attention heads are sufficient to compute XOR with a linear readout.

Takeaway

The single-head impossibility holds for any embedding dimension, any positional encoding, and any attention parameters. It is a purely structural consequence of how softmax attention computes a weighted average: the additive decomposition of the numerator and denominator forces the class segments to cross, ruling out any function that requires separating the diagonal from the off-diagonal .

Two heads break this by giving the outputs a second dimension to spread into. Each head's outputs still satisfy the crossing constraint individually, but the sum of two heads' contributions lives in a 2D subspace where the class segments pull apart. The mixed inputs and are the only cases where both heads contribute, creating a gap that a linear readout can exploit.

More broadly, the segment-crossing argument applies whenever a single attention head must separate the diagonal from the off-diagonal . XOR is the simplest such function, but the same geometric obstruction rules out any target that requires this checkerboard sign pattern. This geometric obstruction is reminiscent of the topological constraints on neural network classification analysed by Chris Olah in his blogpost: just as low-dimensional networks cannot separate linked manifolds without sufficient width, a single attention head cannot separate the XOR classes because its outputs are forced into a configuration where the class segments cross.

In a single-layer, attention-only model with a linear readout from the query position, we saw that

1 head is not enough for any choice of dimension, embeddings, positional embeddings, or linear readout.
2 heads are enough, via the explicit construction above.

Together, these establish that two attention heads are necessary and sufficient to compute XOR with a logistic regression probe.

Open questions. A few natural directions this raises:

Parity on bits: How does the minimum number of heads scale with input length?
Wider implications: The geometric constraint is not specific to XOR or binary inputs: for any two token values , a single attention head cannot linearly separate "same" inputs from "mixed" inputs . This is a purely structural consequence of the weighted-average form of attention, independent of dimension or parameters. Where else does this diagonal-vs-off-diagonal bottleneck limit single-head expressivity?

^{^}

XOR is denoted by in equations.
^{^}
We call and the XOR-negative outputs (where ) and and the XOR-positive outputs (where ).
^{^}
To clarify, it performs these operations independently (either OR or AND), not simultaneously
^{^}
For example, this can be done by using a value matrix that is "parallel" to e(1) and "orthogonal" to e(0).

23