Ambiguous out-of-distribution generalization on an algorithmic task
Introduction
It's now well known that simple neural network models often "grok" algorithmic tasks. That is, when trained for many epochs on a subset of the full input space, the model quickly attains perfect train accuracy and then, much later, near-perfect test accuracy. In the first phase, the model memorizes...
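To make the setup concrete, here is a minimal sketch of a grokking-style experiment: train a small network on a random subset of the full input space of an algorithmic task and hold out the rest as a test set. The task (modular addition), model size, and hyperparameters below are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 7  # modulus for the toy task (a + b) mod p; hypothetical small example

# Full input space: all (a, b) pairs, with labels (a + b) mod p.
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Train on a fraction of the full input space; the remainder is the test set.
perm = rng.permutation(len(pairs))
n_train = int(0.6 * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

def one_hot(idx, n):
    out = np.zeros((len(idx), n))
    out[np.arange(len(idx)), idx] = 1.0
    return out

# Inputs are concatenated one-hot encodings of a and b.
X = np.concatenate([one_hot(pairs[:, 0], p), one_hot(pairs[:, 1], p)], axis=1)
Y = one_hot(labels, p)

# Two-layer ReLU MLP trained by full-batch gradient descent with weight
# decay (weight decay is commonly used in grokking setups).
d_in, d_hidden = 2 * p, 64
W1 = rng.normal(0, 0.1, (d_in, d_hidden))
W2 = rng.normal(0, 0.1, (d_hidden, p))

def forward(X):
    h = np.maximum(X @ W1, 0.0)  # ReLU hidden layer
    return h, h @ W2             # hidden activations, logits

def accuracy(idx):
    _, logits = forward(X[idx])
    return float(np.mean(np.argmax(logits, axis=1) == labels[idx]))

lr, wd = 1.0, 1e-4
for epoch in range(2000):
    h, logits = forward(X[train_idx])
    # Softmax cross-entropy gradient, averaged over the training batch.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    grad_logits = (probs - Y[train_idx]) / len(train_idx)
    gW2 = h.T @ grad_logits + wd * W2
    grad_h = (grad_logits @ W2.T) * (h > 0)
    gW1 = X[train_idx].T @ grad_h + wd * W1
    W1 -= lr * gW1
    W2 -= lr * gW2

print(f"train acc: {accuracy(train_idx):.2f}, test acc: {accuracy(test_idx):.2f}")
```

In the memorization phase the train accuracy saturates while test accuracy can stay near chance for many more epochs, so a real grokking run logs both curves over a much longer horizon than this sketch trains for.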
For that earlier section, we used smaller models trained on S4 intersect A4×2 (4,000 parameters) rather than S5 intersect A5×2 (80,000 parameters); the only reason for this was to allow a larger sample size of 10,000 models within our compute budget. All subsequent sections use the S5 models.