Hypothesis: gradient descent prefers general circuits — LessWrong