I agree that the results are legit; I'm just taking issue with the authors presenting them without prior-work context (e.g. setting the wrong reference class such that the improvement over baselines appears larger). RNNs getting outsized performance on maze/sudoku is to be expected, and the main ARC result seems to be more of a strong data augmentation + SGD baseline than something unique to the architecture; ARC-1 was pretty susceptible to this (e.g. ARC-AGI Without Pretraining).
That said, I think it's a big deal that various RNN architectures have such different characteristics on these limit cases for transformers; it points to a pretty large jump in capabilities once scaling/pretraining is cracked. I think it'd be good for more people working on alignment to study what types of behaviors are exhibited in these sorts of models at small scale, with the expectation that the paradigm will eventually shift in this direction.
Flagging that the HRM paper strongly reads as low-substance. After seeing this post I revisited it for a deeper read to fully understand their method, and for me this confirmed my initial impressions. I used to get very excited about every novel architecture published, but over time I think there's some amount of cognitive immunity you can build up: e.g. spending most of the paper rehashing vague "inspirations" tends to be a dark pattern employed when you want to make your use of a standard method seem more novel than it is.
I don't really have the time to dissect the paper, but a good general heuristic is to understand something well enough to, e.g., re-implement it in PyTorch before accepting the results at face value. If that's the case here and you still believe it's a meaningful research advance, then it's probably just a difference in research taste and you should ignore this comment.
Otherwise I agree with the general take here.
Update: ARC has published a blog post analyzing this: https://arcprize.org/blog/hrm-analysis. As expected, swapping in a transformer performs approximately the same.