Hmm... these hyper-connections seem to be functionally equivalent to just duplicating tokens at the beginning and then pooling tokens together before the attention mechanism or FFN. I wonder if you could do even better than this manifold hyper-connections idea by adding filler tokens and then just using plain convolutional layers before the attention mechanism along the sequence dimension with stride >1 to downsample back to the original sequence length.
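To make the filler-token idea concrete, here is a toy sketch of duplicating tokens and then downsampling with a strided convolution along the sequence dimension. It only illustrates the shapes involved: the helper names are hypothetical, tokens are plain Python lists, and the kernel has one scalar weight per tap rather than a full learned convolutional layer.

```python
def duplicate_tokens(seq, n):
    """Repeat each token n times: [a, b] -> [a, a, b, b] for n = 2."""
    return [list(tok) for tok in seq for _ in range(n)]


def strided_conv1d(seq, kernel, stride):
    """Valid 1D convolution along the sequence dimension.

    seq    : list of tokens, each a list of floats (d_model entries)
    kernel : one scalar weight per tap (a simplifying assumption)
    stride : step between windows; stride > 1 shortens the sequence
    """
    k = len(kernel)
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        window = seq[start:start + k]
        dim = len(window[0])
        out.append([sum(kernel[t] * window[t][d] for t in range(k))
                    for d in range(dim)])
    return out


tokens = [[1.0, 2.0], [3.0, 4.0]]          # 2 tokens, d_model = 2
expanded = duplicate_tokens(tokens, 4)      # 8 tokens
pooled = strided_conv1d(expanded, [0.25] * 4, stride=4)  # back to 2 tokens
```

With kernel size equal to the stride and uniform weights, the convolution is just average pooling over each group of copies, so `pooled` has the original sequence length.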
Background:
Manifold-Constrained Hyper-Connections (mHC) is a new architecture introduced by DeepSeek and recently implemented in DeepSeek v4.
mHC is a fix for the vanishing or exploding gradients caused by Hyper-Connections (HC) that still keeps HC's performance gains. The weights and biases that HC adds to the residual stream made earlier layers harder to update, making the residual stream less residual-streamy.
HC is a cursed method of adding weights and biases onto the residual stream to simulate a wider residual stream.
mHC builds on HC by using Sinkhorn-Knopp to make the weight matrices on the residual stream doubly stochastic. A doubly stochastic matrix has non-negative entries with every row and every column summing to one, like applying softmax along rows and columns simultaneously. mHC-lite is similar to mHC but uses a different method, based on Birkhoff-von Neumann, to obtain the doubly stochastic matrix.
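As a rough illustration of the two routes to a doubly stochastic matrix, here is a minimal pure-Python sketch. It is not the code from either paper or from the mhc-lite repo: the function names, the iteration count, and the use of a softmax over permutation weights are all assumptions made for the example. The first function alternates row and column normalization (Sinkhorn-Knopp); the second builds the matrix as a convex combination of permutation matrices, which the Birkhoff-von Neumann theorem says covers every doubly stochastic matrix.

```python
import itertools
import math


def sinkhorn_knopp(logits, n_iters=50):
    """Approximately doubly stochastic matrix from a square matrix of logits."""
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in logits]
    n = len(m)
    for _ in range(n_iters):
        # Normalize each row to sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Normalize each column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def birkhoff_von_neumann(perm_logits, n):
    """Doubly stochastic matrix as a convex combination of permutation matrices.

    perm_logits holds one logit per permutation of range(n), i.e. n! values.
    """
    perms = list(itertools.permutations(range(n)))
    assert len(perm_logits) == len(perms)
    # Softmax turns the logits into non-negative weights that sum to 1.
    mx = max(perm_logits)
    exps = [math.exp(z - mx) for z in perm_logits]
    total = sum(exps)
    m = [[0.0] * n for _ in range(n)]
    for w, p in zip(exps, perms):
        for i, j in enumerate(p):
            m[i][j] += w / total   # permutation p maps row i to column j
    return m


sk = sinkhorn_knopp([[0.1, -0.3, 0.5, 1.0],
                     [0.7, 0.2, -0.4, 0.0],
                     [-1.0, 0.3, 0.6, 0.1],
                     [0.4, 0.4, -0.2, 0.9]])
bvn = birkhoff_von_neumann([0.0, 1.0, -1.0, 0.5, 0.2, -0.3], n=3)
```

The Sinkhorn-Knopp result is only doubly stochastic up to the tolerance reached after `n_iters` iterations, while the Birkhoff-von Neumann construction is exact by design but needs n! weights, which is only practical for a small number of streams such as 4.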
Quick Summary:
Experimental setup
MHC
To train my mHC and base models I used https://github.com/FFTYYY/mhc-lite with the training arguments from https://arxiv.org/pdf/2603.14833, adapted to work with mhc-lite. The adapted code is at https://github.com/Realmbird/mhc-lite-Dolma-781M.
Trained models are at https://huggingface.co/collections/Realmbird/mhc-model-diff
I trained the mHC models with 4 residual streams. The mHC and mHC-lite models each have 781M parameters once the parameters from the residual streams are included.
Ablation Detector setup for Attention Heads
| Probe | Prompt | Why it works |
| --- | --- | --- |
| prev-token | "When Mary and John went to the store, John gave a drink to Mary" | Has repeated names ("John", "Mary") that the model can only resolve correctly using positional/previous-token info from earlier in the sentence. |
| induction | [EOT] + R + R where R = 25 random token IDs | Random tokens repeated twice. The only way to predict the second copy is by looking back to the first copy → forces the induction circuit. |
| duplicate | Same prompt as induction | Reuses the random-repeat structure. |
| successor | 3 prompts averaged: days ("... Friday" → " Saturday"), numbers ("... five" → " six"), letters ("... E" → " F") | Three independent probes prevent single-prompt artifacts. The model has to "increment" by one. |
| copy-suppression | "When John and Mary went to the store, John gave a drink to", target = " Mary", distractor = " John" | Tests whether ablating a head makes the model more likely to predict the duplicated name (" John") instead of the correct one (" Mary"). |
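The induction and duplicate probes both use the [EOT] + R + R prompt. A minimal sketch of how such a prompt could be built is below; the function name, the seed, and the `eot_id` and `vocab_size` values are placeholders, not the values used in my runs, and they should come from the actual tokenizer.

```python
import random


def build_induction_prompt(eot_id, vocab_size, n_random=25, seed=0):
    """Return token IDs laid out as [EOT] + R + R, with R random token IDs."""
    rng = random.Random(seed)
    r = []
    while len(r) < n_random:
        tok = rng.randrange(vocab_size)
        # Skip the EOT id so the only EOT token is the one at the start.
        if tok != eot_id:
            r.append(tok)
    return [eot_id] + r + r


# Placeholder values for illustration only.
prompt_ids = build_induction_prompt(eot_id=0, vocab_size=1000)
```

Because R is random, nothing in the first copy is predictable, and only the positions in the second copy can be predicted by looking back.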
ΔNLL = NLL(target | ablated) − NLL(target | clean)
Each detector is scored on prompts where that particular type of attention head is needed.
For example, the prev-token prompt is "When Mary and John went to the store, John gave a drink to Mary".
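The ΔNLL score above can be computed directly from the logits at the target position. This is a small sketch with hypothetical function names, assuming the clean and ablated logits over the vocabulary have already been extracted as plain lists.

```python
import math


def nll(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def delta_nll(clean_logits, ablated_logits, target_id):
    """ΔNLL = NLL(target | ablated) − NLL(target | clean)."""
    return nll(ablated_logits, target_id) - nll(clean_logits, target_id)


# Toy 2-token vocabulary: ablation removes the model's preference for token 0.
score = delta_nll(clean_logits=[2.0, 0.0], ablated_logits=[0.0, 0.0], target_id=0)
```

A positive ΔNLL means the ablation made the target less likely, so the head was helping on that prompt.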
| Pass | Setup | Output |
| --- | --- | --- |
| Baseline | nothing ablated | NLL_baseline |
| Total ablation | ablate (l, h) only | NLL_total |
| Direct ablation | ablate (l, h) AND freeze every block at layer ≥ l+1 to its baseline output | NLL_direct |
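The three passes can be illustrated on a toy residual model. This sketch makes several assumptions that are not from my actual setup: the residual stream is a single float, each "head" is a plain function, ablation means zeroing the head's output, and the final stream value stands in for the NLL that would be computed from the real model's logits.

```python
def run(blocks, x, ablate=None, frozen=None):
    """Run a toy residual model.

    blocks : list of layers, each a list of head functions (float -> float)
    ablate : optional (layer, head) whose output is zeroed
    frozen : optional dict {layer: cached block output} used instead of
             recomputing that block on the current stream
    Returns (final stream value, list of per-layer block outputs).
    """
    outputs = []
    for l, heads in enumerate(blocks):
        if frozen is not None and l in frozen:
            out = frozen[l]                      # reuse the baseline output
        else:
            out = sum(h(x) for i, h in enumerate(heads) if ablate != (l, i))
        outputs.append(out)
        x = x + out                              # residual update
    return x, outputs


blocks = [[lambda x: 2 * x, lambda x: x + 1],   # layer 0 with two heads
          [lambda x: x * x]]                     # layer 1 with one head
l, h = 0, 0

baseline, cache = run(blocks, 1.0)
total, _ = run(blocks, 1.0, ablate=(l, h))
later = {k: cache[k] for k in range(l + 1, len(blocks))}
direct, _ = run(blocks, 1.0, ablate=(l, h), frozen=later)
```

In the total pass, later layers see the changed stream and react to it. In the direct pass they are pinned to their baseline outputs, so the difference between the two isolates how much of a head's effect is routed through downstream layers.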
Experiments
Logit Lens
Attention Heads
Do heads look the same regardless of parallel residual streams?
Confirming Attention Sinks previous token heads
Future Work
Appendix
Full Logit Lens images
The mHC models were trained with 4 residual streams (0, 1, 2, 3). I ran a logit lens on each residual stream individually to see its output, and also tried it on sums of the streams.
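A rough sketch of a logit lens over individual streams and their sums is below. It is a simplification, not the code behind the images: the names are hypothetical, the final layer norm that a logit lens usually applies before unembedding is skipped, and trying every non-empty subset of streams is an assumption about which sums to look at.

```python
from itertools import combinations


def logits_from(hidden, unembed):
    """Project a hidden vector to logits; unembed has one row per vocab token."""
    return [sum(w * x for w, x in zip(row, hidden)) for row in unembed]


def lens_over_streams(streams, unembed):
    """Map each non-empty subset of stream indices to its top-1 token id."""
    results = {}
    n, dim = len(streams), len(streams[0])
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            # Sum the chosen streams elementwise, then read out the top token.
            summed = [sum(streams[i][d] for i in subset) for d in range(dim)]
            logits = logits_from(summed, unembed)
            results[subset] = max(range(len(logits)), key=logits.__getitem__)
    return results


# Toy example: 2 streams, d_model = 2, vocabulary of 3 tokens.
toy_streams = [[1.0, 0.0], [0.0, 2.0]]
toy_unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
top_tokens = lens_over_streams(toy_streams, toy_unembed)
```

For 4 streams this gives 15 readouts per layer and position: the 4 single streams plus 11 sums.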
Pattern of Attention Head