What Math Teaching Taught Me About AI Training and Evaluation

by Mingus
5th Aug 2025
5 min read

As a former classroom teacher, I was fascinated by how students think about mathematics. I never expected my daily classroom moves (how I framed questions, when I let learners struggle, how I listened to their reasoning) to converge with AI development on a shared fundamental question: how do we nurture environments where complex reasoning can emerge?

Below, I connect these classroom principles to recent AI breakthroughs, arguing that intelligence, whether human or artificial, thrives not when it is simply optimized for efficiency, but when it is allowed to explore, struggle, and adapt.


1. From "How Many?" to "What Can You Tell Me?"

Question Framing Shapes What Reasoning Emerges

Consider a growing pyramid pattern of blocks. I could ask one of two very different questions:

  • Option A: “How many blocks are in Case 10?”
  • Option B: “What can you tell me about Case 10?”

The first question has a single correct answer (121 blocks) but reveals almost nothing about how the student thinks. The second opens up multiple, equally valid lines of attack: some students attend to the center stack, others to the height symmetry, others to the quadratic rule. In math-education studies, tasks framed like Option B predict 2–3× larger gains on transfer tests (Kapur, 2010) because they elicit a richer, multi-pathway understanding.
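
To make the multiple pathways concrete, here is a minimal sketch in Python. It assumes (my assumption, since the figure is not reproduced here) the classic growing pattern in which case n stacks rows of 1, 3, 5, ..., 2n+1 blocks, so the total is (n+1)². Three different "student" strategies all land on 121:

```python
# Three equally valid "student pathways" to Case 10 of the pyramid pattern.
# Assumed pattern (not confirmed by the original figure): case n stacks rows
# of 1, 3, 5, ..., 2n+1 blocks, so the total is (n+1)**2.

def count_by_rows(n: int) -> int:
    """Pathway 1: add the rows directly, like a student counting layer by layer."""
    return sum(2 * k + 1 for k in range(n + 1))

def count_by_symmetry(n: int) -> int:
    """Pathway 2: a center stack of height n+1 plus two mirrored triangular wings."""
    center = n + 1
    one_wing = n * (n + 1) // 2  # 1 + 2 + ... + n
    return center + 2 * one_wing

def count_by_formula(n: int) -> int:
    """Pathway 3: jump straight to the quadratic rule."""
    return (n + 1) ** 2

assert count_by_rows(10) == count_by_symmetry(10) == count_by_formula(10) == 121
```

An evaluator that only checks the final answer cannot distinguish these strategies; Option B-style framing is precisely what surfaces the difference.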

AI analogue. This principle recently produced a stunning result. An advanced version of Google DeepMind's Gemini with Deep Think achieved gold-medal-level performance at the International Mathematical Olympiad by generating multiple solution candidates and exploring different proof pathways for each problem. Crucially, this was not a change to the model's core training objective. Instead, at inference time, the system mirrored the diversity of student pathways elicited by Option B: it generated many candidate proofs in parallel and then selectively verified the most promising ones. The result suggests that decoding-time diversity can be a powerful substitute for the richer training signals we give human learners.
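
As a rough illustration of the inference-time pattern (not DeepMind's actual pipeline, which is unpublished), a best-of-n loop with a verifier captures the idea: sample diverse candidates, then spend verification effort only on the most promising one. `generate_candidate` and `score_candidate` below are hypothetical stand-ins for a sampled LLM call and a proof checker:

```python
import random

# A minimal best-of-n sketch of decoding-time diversity. Both helper functions
# are hypothetical placeholders, not any vendor's API.

def generate_candidate(problem: str, temperature: float) -> str:
    # Placeholder for one sampled chain-of-thought / proof sketch.
    return f"candidate proof for {problem!r} (t={temperature}, tag={random.random():.3f})"

def score_candidate(candidate: str) -> float:
    # Placeholder for a verifier or reward model returning a promise score in [0, 1].
    return random.random()

def best_of_n(problem: str, n: int = 16, temperature: float = 1.0) -> str:
    """Sample n diverse candidates, then keep the one the verifier rates highest."""
    candidates = [generate_candidate(problem, temperature) for _ in range(n)]
    return max(candidates, key=score_candidate)

print(best_of_n("What can you tell me about Case 10?"))
```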


2. The Cognitive-Load Paradox

Why Productive Struggle Creates Stronger Intelligence

Math education research reveals a crucial insight that AI development has only partially embraced: while cognitive load theory shows that extraneous load (wasteful effort) hampers learning, germane load (the effort that builds mental schemas) is essential.

Recent transformer compression work (e.g., cross-layer attention) has reduced extraneous memory overhead by roughly 45% (Zhang, 2025), optimizing for efficiency. But what about the struggle that leads to genuine understanding?

Human evidence. The concept of "productive failure" comes from studies where students are given ill-structured problems before any direct instruction. Meta-analyses of 47 classroom experiments show a significant average benefit (Loibl & Rummel, 2014). This suggests that the very struggle that seems computationally wasteful is often what leads to the deepest learning.

AI evidence (still thin). A fascinating parallel emerged in reinforcement-learning fine-tuning. DeepSeek's R1 model spontaneously began inserting a metacognitive marker into its chain of thought, "Wait, wait... let me rethink this," and then backtracking. The phrase was not programmed; the model discovered that this kind of self-interruption improved its accuracy. However, the effect has only been demonstrated in post-training RL, not during pre-training. We do not yet know whether "struggle-first" curricula at the gradient-update level will outperform the conventional easy-to-hard ordering.

This creates a paradox: the most sophisticated reasoning, whether human or artificial, emerges not just from efficient processing but from the ability to engage in and recover from intentional, productive struggle.

Actionable hypotheses:

  • Experiment with curriculum RL that deliberately injects "productive confusion" sequences.
  • Measure whether adding a "struggle budget" (compute spent on rollouts before any parameter update) improves final accuracy per FLOP; a toy version of this comparison is sketched below.
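
A toy version of the second experiment, in Python: a two-armed bandit trained with REINFORCE, where the "struggle budget" is the number of rollouts grouped before each parameter update, compared at a fixed total-rollout budget (a crude FLOP proxy). This is illustrative scaffolding under my own assumptions, not a claim about how production RL systems are trained:

```python
import math
import random

def p_better(theta: float) -> float:
    """Probability of picking the better arm under a logistic policy."""
    return 1.0 / (1.0 + math.exp(-theta))

def train(struggle_budget: int, total_rollouts: int = 4000, lr: float = 0.5) -> float:
    theta, baseline = 0.0, 0.5  # policy parameter; running reward baseline
    for _ in range(total_rollouts // struggle_budget):
        grads, rewards = [], []
        for _ in range(struggle_budget):  # rollouts collected before any update
            p = p_better(theta)
            action = 1 if random.random() < p else 0
            reward = 1.0 if action == 1 else 0.2  # the better arm pays more
            grads.append(action - p)  # d(log-likelihood)/d(theta) for this action
            rewards.append(reward)
        # One parameter update per group, against the pre-update baseline.
        theta += lr * sum(g * (r - baseline) for g, r in zip(grads, rewards)) / len(grads)
        baseline = 0.9 * baseline + 0.1 * (sum(rewards) / len(rewards))
    return p_better(theta)

for k in (1, 4, 16):
    print(f"struggle budget {k:2d}: P(better arm) ~ {train(k):.2f}")
```

Whether grouped rollouts actually win per FLOP is exactly the open question; the point is that the comparison is cheap to run.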

3. Professional Noticing

The Missing Link in AI Interpretability

Education research describes "professional noticing" as the sophisticated framework expert teachers use to observe, interpret, and respond to student thinking (Jacobs, Lamb, & Philipp, 2010). It consists of three micro-steps:

  1. Attend to the learner’s strategy.
  2. Interpret what that reveals about their understanding.
  3. Decide how to respond.

What's remarkable is that AI systems are beginning to implement this same loop, but again, often at inference time rather than during gradient updates.

| Teacher move | AI analogue | Example |
| --- | --- | --- |
| Attend | Real-time attention heat-maps | Anthropic's attention-viewer or Reflexion traces |
| Interpret | Reward-model critique | R1 labels its own rollouts as "promising / flawed" |
| Decide | Beam-pruning or instruction refinement | The system prunes a search path that is not working |

This represents a fundamental shift from static AI systems to ones that can dynamically observe their own reasoning, interpret what different approaches reveal, and adapt their future problem-solving accordingly.
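
In code, the loop might look like the following Reflexion-style sketch, where the model critiques its own rollout and prunes failing paths. `llm` is a hypothetical text-completion callable, not a specific product API:

```python
from typing import Callable

def solve_with_noticing(problem: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Attend / interpret / decide loop at inference time."""
    attempt = llm(f"Solve step by step: {problem}")
    for _ in range(max_rounds):
        # Attend: treat the reasoning trace itself as the object of inspection.
        trace = attempt
        # Interpret: have the model label its own rollout, "promising" or "flawed".
        verdict = llm(f"Label this solution 'promising' or 'flawed', with one reason:\n{trace}")
        if "promising" in verdict.lower():
            break
        # Decide: prune the failing path and retry with the critique folded in.
        attempt = llm(f"Solve step by step: {problem}\nAvoid this flaw: {verdict}")
    return attempt
```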

Subliminal learning caveat. A crucial dimension that math education has long recognized is the "hidden curriculum": the implicit messages students absorb about what math is and who can do it. This parallels Anthropic's research on subliminal learning, which shows that unintended traits can transfer from teacher to student models, shifting log-probabilities by ~3%. Even when explicit references to a trait are filtered out of the training data, models may still absorb subtle statistical regularities. The invisible patterns that shape learning may be far more powerful than the explicit content we consciously design.


4. Toward Convergent Design

Rather than AI simply rediscovering educational principles, perhaps the two communities are converging on the same design principles for how robust intelligence develops:

  1. Diversity of reasoning pathways beats single-correct-answer optimization.
  2. Germane cognitive load (productive struggle) is necessary even when it looks computationally wasteful.
  3. Real-time interpret-and-adapt loops outperform static, black-box systems.

References

  • Kapur, M. (2010). Productive failure in mathematical problem solving. Instructional Science, 38(6), 523–550.
  • Loibl, K., & Rummel, N. (2014). Knowing what you don’t know makes failure productive. Learning and Instruction, 34, 74–85.
  • Jacobs, V. R., Lamb, L. L. C., & Philipp, R. A. (2010). Professional noticing of children’s mathematical thinking. Journal for Research in Mathematics Education, 41(2), 169–202.
  • Zhang, Y. (2025, July 1). Cognitive Load-Aware Inference: A neuro-symbolic framework for optimizing the token economy of large language models. arXiv:2507.00653. (Pre-print; not yet peer-reviewed.)

COLLABORATION NOTE: This piece was written in collaboration with Claude (Anthropic's AI assistant), which helped with research integration, structural organization, and editorial refinement.