I largely agree with this, but I think it's also pretty hard to put into practice when training an AI proof system.
Fermat's theorem is actually a great example. Imagine my goal is to "align" a mathematician to solve FLT. Imagine a bizarre counterfactual where we have access to advanced AI models and computational power but, for some reason, we don't have algebraic number theory or algebraic geometry -- they were just never invented. If you want something less far-fetched: imagine the abc conjecture today, if you're of the camp that believes the inter-universal Teichmüller theory stuff is not going to work out, or perhaps the Collatz conjecture (very easy to state, very hard to make progress on), if you believe that we might need some truly out-there stuff to resolve it. I'll say "FLT" as a stand-in for "easy-to-state problem that everyone agrees is important, but is completely resilient to the methods that exist(ed) in its day."
So Fermat just died, we opened his notebook, and read his famous "the margins aren't wide enough" note. We decide that a major goal of the world's mathematical enterprise is to regenerate his proof, and we set our anachronistic advanced AI systems to work. Most of them, trained on a corpus of text from Fermat (and Fermat-like authors), perform a bunch of weird factoring tricks, equation rewrites, reductions mod n, etc. -- fairly standard number theory for the time (I think; I'm not a historian).
We have the benefit in our timeline of knowing that a proof of FLT (at least very strongly) seems to require a rather gigantic castle in the sky to be built first. It just doesn't look like there's a road to FLT that doesn't pass through number rings, elliptic curves, modular forms, cohomology theories, schemes, etc.
Suppose that one of our AI systems (one of the ones that's allowed to try to be creative) starts drifting out of the factoring/rewrites/reduction mod n realm into doing more pie-in-the-sky stuff: it starts inventing words like "primary decomposition" and "Noetherianity / ascending chain conditions" and produces a massive volume of text on something called "Krull dimension." Should we punish or reward it? A similar AI goes off and produces stuff about "cohesive rings" and "fusible quasitowers" and "entwineable bundles" and such. Should we punish or reward it? Of course, in reality, they'd be taking baby steps, and we'd be punishing/rewarding far before they can start building towers. Perhaps this means we have even less information about where they're going. (If the math police were in my office punishing/rewarding me after every line I write down, I'd certainly never get anything done, that's for sure!)
The kinds of people who invented primary decomposition, and the ascending chain condition, and Krull dimension were "well-aligned." They cared a lot about problems like FLT. In fact, primary decomposition (more or less) directly arose out of various failed proof attempts of FLT, and Noetherianity/ACC was (largely, initially; disclaimer that I am not a historian) a technical tool to get primary decomposition working.
It's not so easy to run a test like "does this proof shorten the distance to things we care about," because primary decomposition, while a necessary part of the tower, was still upwards of a century away from a proof. We can't, from Fermat's time, look at the statement and "obviously" tell that it's going to make a proof of FLT shorter. The farther you climb up the tower (someone in the corner inventing cohomology theories on schemes), the more strained the link to FLT, but all those links do appear to be ultimately necessary. If FLT is the main thing we care about, we are sort of stuck relying on incredibly vague intuitions from people like Noether and Lasker and Krull (and much later Grothendieck and Tate and Shimura and Wiles, and all the people in between) that the tower they're building might actually be the "shortest bridge" to the elementary-to-state statements we care about (like FLT), even though (pre-Wiles) they have absolutely no way to prove that. They can in no way prove to their punish/reward trainers that "primary decomposition" and "sheaf cohomology" and "modular forms" aren't a mis-aligned waste of time. They can argue how it's all related, but they cannot know for a fact that these things will shorten the proof.
So, how do we foster the kinds of thinking we got from the Noethers and Laskers and Krulls, who were building a tower based on vague intuitions like "FLT is important and I think this is a part of understanding what's going on," while disincentivizing the time-wasters who want to invent a random theory of "cohesive rings" or "entwineable bundles" or whatever? Listed side-by-side, there's no obvious sticker that says "this one is important for FLT, and this one is useless." Plausibly, a sufficiently persuasive mathematician (who is sufficiently motivated to obtain grant money) could spin a tale that convinces us that what they're doing is the most important thing in the world for (*grabs random scrap of paper from a hat*) Fermat's Last Theorem. If we can't trust our ability to resist persuasion by mathematicians hungry for grant money, how do we train them to have motivations "aligned with" building towers that contribute to FLT, instead of building towers to nowhere that trip random reward centers?
I think what I mean by "promising at math" is that I would assign a reasonably confident belief that a scaled-up model with the same basic architecture as this one, with maybe a more mathematical training set, would be able to accomplish tasks that a mathematician might find helpful to a nonzero degree. For example, I could imagine a variant that is good at 'concept-linking,' better than Google Scholar at least, where you ask it something like "What are some ideas similar to valuation rings that people have explored?" or "Where do Gorenstein rings come up in combinatorics?" and get back a list of some interesting leads, some nonsense, and some lies with fabricated sources. If you try this with ChatGPT, you get mostly either (1) very well-known things that you could find by searching MathOverflow or StackExchange, or (2) made-up sources that sound really cool and believable, until you figure out they aren't real. I would imagine that prompt-engineering could weed out type (1), and that failure mode (2) is something we'll see gradual progress on. It seems that there is a lot of interest in fixing the "making stuff up" issue, and I tend to believe that if a lot of smart people gravitate towards an issue, it will probably get at least somewhat better over time.
As to "bad at arithmetic," I agree, it definitely goes quite a bit farther than not being able to reliably compute 416*31 or something. I was able to get it to solve some very simple Sylow theorem type problems with a lot of hand-holding and pointing out its answers, but the failure you describe definitely sounds about right to me. It can say something like [P and Q and R] therefore [S]. Then you convince it [P] is false, and it will say [not P and Q and R] therefore [S], as though correcting [P] to [not P] fixes the argument (rather than breaking the implication). You may prompt it again, and it says "I apologize for the mistake! Here is a corrected version: [Q and R] therefore [S]" when [Q] and [R] are not nearly enough. I do not expect these things to be good theorem-provers or logicians without significant changes. I'm not even sure how you would coax a natural language model into doing formal reasoning competently. At one end of the spectrum, you could ham-string it to only produce strings that are well-formed in a formal system, only using valid rules of inference, but then it's going to be so limited in the leaps it's able to make that it will never be able to discuss high-level concepts with you. I don't know how you get a stochastic parrot to map the word salad it's generating onto something that resembles valid formal reasoning.
Thanks for the link, this is great!
If anyone knows where to find an equivalent concept, definitely let me know! It's hard to prove that a concept has not been investigated before, so I'll just give my reasoning on why I think it's unlikely:
Here is an additional perspective, if it's helpful. Suppose you live in some measure space $X$, perhaps $\mathbb{R}^n$ for concreteness. For any $p \in [1, \infty]$ we can make sense of a space $L^p(X)$ of functions $f : X \to \mathbb{R}$ for which the integral $\int_X |f|^p$ is defined (this technicality can be ignored at a first pass), and we use the norm $\|f\|_p := \left(\int_X |f|^p\right)^{1/p}$. Of course, $L^1$ and $L^2$ are options, and when $X$ is a finite set, these norms give the sum of absolute values, and the square root of the sum of squares of absolute values, respectively.
Fun exercise: If $X$ is two points, then $L^p(X)$ is just $\mathbb{R}^2$ with the norm $\|(a,b)\|_p = (|a|^p + |b|^p)^{1/p}$. Plot the unit circle in this norm (i.e., the set of all points at a distance of 1 from the origin). If $p=1$, you get a square (rotated 45 degrees). If $p=2$ you get the usual circle. If $p=\infty$ you get a square. As $p$ varies, the "circle" gradually interpolates between these shapes. You could imagine trying to develop a whole theory of geometry --- circumferences, areas, etc. --- in each of these norms (for example, $p=1$ gives taxicab geometry). What the other comments are saying (correctly) is that you really need to know about your distribution of errors. If you expect error vectors to form a circular shape (for the usual notion of circle), you should use $L^2$. If you know that they form a diamond-ish shape, please use $L^1$. If they form a box (uniform distribution across both axes, i.e., the two are genuinely uncorrelated), use $L^\infty$.
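If you want to see the interpolation without doing the exercise by hand, here is a minimal Python sketch (assuming numpy and matplotlib are available) that traces the unit circle of the $p$-norm on $\mathbb{R}^2$ for a few values of $p$:

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample directions around the origin, then rescale each direction
# vector so its p-norm is exactly 1; the rescaled points trace out
# the "unit circle" of the p-norm in R^2.
theta = np.linspace(0, 2 * np.pi, 400)
directions = np.stack([np.cos(theta), np.sin(theta)], axis=1)

for p in [1, 1.5, 2, 4, np.inf]:
    norms = np.linalg.norm(directions, ord=p, axis=1)
    circle = directions / norms[:, None]
    plt.plot(circle[:, 0], circle[:, 1], label=f"p = {p}")

plt.gca().set_aspect("equal")
plt.legend()
plt.title("Unit circles of the p-norm")
plt.show()
```

You should see the rotated square ($p=1$) fatten through the round circle ($p=2$) into the axis-aligned square ($p=\infty$).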
To have a larger and more useful example, let $X$ be a 1000 x 1000 grid, and regard $\mathbb{R}$ as the set of possible colors between absolute black at $-\infty$ and absolute white at $+\infty$ (or use any other scheme you wish). Then a function $f : X \to \mathbb{R}$ is just a 1000 x 1000 pixel black-and-white image. The space of all such functions has dimension 1,000,000 as a real vector space, and the concern about whether integrals exist goes away ($X$ is a finite set). If $f$ and $g$ are two such images, the whole discussion about $L^p$ concerns precisely the issue of how we measure the distance $\|f - g\|_p$.
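Concretely (a sketch, with random arrays standing in for actual photos), the candidate distances differ only in which norm you apply to the pixel-wise difference vector:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal((1000, 1000))  # stand-in for one image
g = rng.standard_normal((1000, 1000))  # stand-in for another

diff = (f - g).ravel()  # flatten to a vector in R^1000000
for p in [1, 2, np.inf]:
    print(f"L^{p} distance:", np.linalg.norm(diff, ord=p))
```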
Suppose I have a distinguished subspace $K$ of my vector space. Maybe I took ten thousand pictures of human faces perfectly centered in the frame, with the faces scaled to exactly fit the 1000 x 1000 grid from top to bottom. Then $K$ can be the span of those photos, which is roughly the "space of human face images," or at least the "space of images that look vaguely face-like" (principal component analysis tells us that we actually want to look for a much smaller-dimensional subspace of $K$ that contains most of the real "human face" essence, but let's ignore that).
I have a random image $h : X \to \mathbb{R}$. I want to know if this image is a human face. The question is "how far away is $h$ from the distinguished subspace $K$?" I might have some cutoff distance where images below the cutoff are classified as faces, and images above the cutoff as non-faces.
Speaking in very rough terms, a Banach space is a vector space that comes with a notion of size and a notion of limits. A Hilbert space is a vector space that comes with a notion of size, a notion of angles, and a notion of limits. The nice thing about Hilbert spaces is that the question "how far away is $h$ from the distinguished subspace $K$?" has a canonical best answer: the orthogonal projection. It is exactly analogous to the picture in the plane where you take $K$, draw the unique line from $h$ to $K$ that makes a right angle, and take the point of intersection. That point is the unique closest point, and the projection onto the closest point is a linear operator.
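A minimal numpy sketch of this (random data, and a smaller ambient dimension than 1,000,000 for brevity): if the columns of a matrix $B$ span $K$, then the orthogonal projection of $h$ onto $K$ is $Bx$ where $x$ solves the least-squares problem, and the $L^2$ distance from $h$ to $K$ is the norm of the residual.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((10_000, 50))  # columns span the subspace K
h = rng.standard_normal(10_000)        # the "image" to classify

# The least-squares solution x minimizes ||B x - h||_2, so B @ x is
# the orthogonal projection of h onto K, and the residual norm is
# exactly the distance from h to K.
x, *_ = np.linalg.lstsq(B, h, rcond=None)
projection = B @ x
distance = np.linalg.norm(h - projection)
print("distance from h to K:", distance)
```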
In a Banach space, as you have observed in this post, there need not exist a unique closest point. The set of closest points may not even be a subspace of K, it could be an unpleasant nonlinear object. There are no projection operators, and the distance between a point and a subspace is not very well-behaved.
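A tiny concrete instance of this non-uniqueness (a standard example, not from the original post): in $\mathbb{R}^2$ with the $L^\infty$ norm, take $K = \{(t, 0) : t \in \mathbb{R}\}$ and $h = (0, 2)$. The distance from $h$ to $(t, 0)$ is $\max(|t|, 2)$, which achieves its minimum value $2$ for every $t \in [-2, 2]$, so the set of closest points is a whole segment rather than a single point.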
It is a theorem that $L^p(X)$ is a Hilbert space if and only if $p=2$. That is, only the $L^2$ norm admits a compatible notion of angles (hence perpendicularity/orthogonality, hence orthogonal projection). You can talk about the "angle between" two images in the $L^2$ norm, and subtract off the component of one image "along" another image. I can find which directions in $K$ are the "face-iest" by finding the vector along which my data points have the strongest components, projecting that direction away, finding the vector orthogonal to my "face-iest" vector along which the components are next strongest, projecting that away, etc., and thereby identify a lower-dimensional subspace of $K$ that captures the essential directions (principal components) of "face-iness." There are all kinds of fun things you can do with projection to analyze structure in a data set. You can project faces onto face-subspaces, e.g., project a face onto the space of female faces, or male faces, and find the "best female approximation" or "best male approximation" of a face; you could try to classify faces by age, or try to recognize whether a face belongs to one specific person (shot independently across many different photos) by projecting onto their personal subspace, etc.
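A sketch of that "find the strongest direction, project it away, repeat" procedure, which the singular value decomposition performs in one shot (random data standing in for the face photos; this is the classic "eigenfaces" recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
photos = rng.standard_normal((1_000, 4096))  # one flattened image per row

# Center the data, then take the SVD: the rows of Vt are orthonormal
# directions, ordered by how strongly the data's components lie along
# them -- exactly the iterative project-away procedure, done at once.
mean_face = photos.mean(axis=0)
U, S, Vt = np.linalg.svd(photos - mean_face, full_matrices=False)

k = 50               # keep the 50 "face-iest" directions
components = Vt[:k]  # orthonormal basis for the essential subspace

# Orthogonally project a new image onto that subspace: its best
# L^2 approximation by a combination of principal components.
h = rng.standard_normal(4096)
approx = mean_face + components.T @ (components @ (h - mean_face))
```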
In the infinite-dimensional $L^2(X)$ function spaces, these projections let you reconstruct the best approximation of a function by polynomials (least-squares polynomial approximations, e.g. via Legendre polynomials) or by sinusoids (the Fourier approximations) or by whatever you want (e.g. wavelet approximations). You could project onto eigenspaces of some operator, where the projection might correspond to various energy levels of some Hamiltonian.
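For instance, here is a minimal sketch of projecting a function onto sinusoids in $L^2([0, 2\pi])$: the coefficient of each basis function $\sin(kx)$ is the inner product $\langle f, \sin(kx) \rangle / \langle \sin(kx), \sin(kx) \rangle$, approximated below by Riemann sums (the grid spacing cancels in the ratio).

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 2048, endpoint=False)
f = np.sign(np.sin(x))  # a square wave to approximate

# Project f onto span{sin(kx) : k = 1..9}: each coefficient is an
# L^2 inner product ratio, computed discretely.
approx = np.zeros_like(x)
for k in range(1, 10):
    basis = np.sin(k * x)
    coeff = np.sum(f * basis) / np.sum(basis * basis)
    approx += coeff * basis
```

The nonzero coefficients come out near $\frac{4}{\pi k}$ for odd $k$, the familiar Fourier series of the square wave.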
The reason we prefer $L^2$ in quantum mechanics is that, while eigenvectors and eigenvalues make sense more generally, projection operators onto eigenspaces are very much a special $L^2$ thing. Thus, things like spectral decomposition, and projection onto eigenstates of observable operators, all require you to live inside of $L^2$.
None of this works in $L^p(X)$ for $p \neq 2$ precisely because "orthogonal" projection does not make sense. That is not to say that these spaces are useless. If you are in a function space, you might sometimes do things that take you outside of $L^2$ (i.e., they make the integral in question undefined) and land you inside of some other $L^p$, and that is not necessarily the end of the world. Something as innocent as pointwise multiplication can take you outside the $L^p$ space you started in and put you into a different one (i.e., the integral of $|f|^p$ might fail to exist while the integral of $|f|^q$ for some $q \neq p$ converges). That is to say, these spaces are Banach spaces but not Banach algebras --- they do not have a nice multiplication law. The flexibility of moving to different $p$'s can help get around this difficulty. Another example is taking dual spaces, which generally requires a different $p$ unless, funnily enough, $p = 2$ (the only $L^p$ space that is self-dual).
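A concrete instance of multiplication escaping the space (a standard textbook example): on $(0,1)$, the function $f(x) = x^{-1/2}$ satisfies $\int_0^1 x^{-1/2}\,dx = 2 < \infty$, so $f \in L^1(0,1)$; but $(f \cdot f)(x) = x^{-1}$ has $\int_0^1 x^{-1}\,dx = \infty$, so $f^2 \notin L^1(0,1)$.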
tl;dr if you want geometry with angles, you are required to use $p=2$. If you don't feel like you need angles (hence, no orthogonality, no projections, no spectral decompositions, no "component of $x$ along $y$," no PCA, etc.), then you are free to use any $L^p$. Sometimes the latter is forced upon you, but usually only if you're in the infinite-dimensional function space setting.