Fairly minor, but I think I see an unmentioned error in the "41" section:
the first six positive even integers: 2 + 4 + 6 + 8 + 10 + 11 = 41
11 is not even (it seems to be thinking somewhat of 42?)
Edit: Actually, the more I think about it, it's a pretty interesting error. ChatGPT 4 produces much better answers than I would for the majority of these questions, but I don't think I would make this error. If you asked it, I'm sure it would correctly explain that the sum of even integers cannot be odd, or that 11 is not even, etc. But I wonder whether (for example) a large amount of training text about the number 42 being the sum of the first six positive even integers was somehow "close" enough to "41" to overwhelm any emergent understanding of how to apply those concepts.
The way that LLM tokenization represents numbers is all kinds of stupid. It's honestly kind of amazing to me they don't make even more arithmetic errors. Of course, an LLM can use a calculator just fine, and this is an extremely obvious way to enhance its general intelligence. I believe "give the LLM a calculator" is in fact being used, in some cases, but either the LLM or some shell around it has to decide when to use the calculator and how to use the calculator's result. That apparently didn't happen or didn't work properly in this case.
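As a trivial illustration of how little calculator is needed here, a couple of lines of Python settle the sum in question (the variable name is just a throwaway for this sketch):

```python
# The first six positive even integers and their sum
evens = [2 * k for k in range(1, 7)]
print(evens, sum(evens))  # [2, 4, 6, 8, 10, 12] 42 -- not 41, and 11 is not in the list
```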
Thanks, and sorry I missed that error. I've updated the post by bolding the error, and also HT'ed your contribution.
Wild guess: it realised its mistake partway through and followed through anyway, as sensibly as could be done, balancing between giving a wrong calculation ("+ 12 = 41"), ignoring the central focus of the question ("+ 12 = 42"), and breaking from the "list of even integers" it was supposed to be going through. I suspect it would not make this error when using chain-of-thought.
When you multiply two prime numbers, the product will have at least two distinct prime factors: the two prime numbers being multiplied.
Technically, it is not true that the prime numbers being multiplied need to be distinct. For example, 2*2=4 is the product of two prime numbers, but it is not the product of two distinct prime numbers.
As a result, it is impossible to determine the sum of the largest and second largest prime numbers, since neither of these can be definitively identified.
This seems wrong: "neither can be definitively identified" makes it sound like they exist but just can't be identified...
Safe primes are a subset of Sophie Germain primes
Not true, e.g. 7 is safe but not Sophie Germain.
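A minimal check of the two definitions (a prime p is safe if (p - 1) / 2 is also prime, and Sophie Germain if 2p + 1 is also prime), using throwaway helper functions of my own:

```python
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def is_safe(p):
    # p is a safe prime if p and (p - 1) / 2 are both prime
    return is_prime(p) and is_prime((p - 1) // 2)

def is_sophie_germain(p):
    # p is a Sophie Germain prime if p and 2p + 1 are both prime
    return is_prime(p) and is_prime(2 * p + 1)

print(is_safe(7), is_sophie_germain(7))    # True False: (7 - 1) / 2 = 3 is prime, but 2 * 7 + 1 = 15 is not
print(is_safe(41), is_sophie_germain(41))  # False True: 20 is not prime, but 83 is
```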
Thanks for reviewing and catching these subtle issues!
Technically, it is not true that the prime numbers being multiplied need to be distinct. For example, 2*2=4 is the product of two prime numbers, but it is not the product of two distinct prime numbers.
Good point, I've marked this as an error. My prompt about gcd did specify distinctness but the prompt about product did not, so this is indeed an error.
This seems wrong: "neither can be definitively identified" makes it sound like they exist but just can't be identified...
I passed on this one as being too minor to mark.
Safe primes are a subset of Sophie Germain primes
Not true, e.g. 7 is safe but not Sophie Germain.
Good point; I missed reading this sentence originally. I've marked this one as well.
I consider it unlikely that my interactions with ChatGPT 3.5 (that led to my original post) played a significant role as training data that helped make ChatGPT 4 so much better.
Why do you consider it unlikely?
If I were good at memorization (model parameter size) but bad at reasoning, then your original post showing up in my training data would help me.
I'm also curious how much of the improvement comes from improvements to the hidden and secret prompt that OpenAI adds to your ChatGPT interactions. Unfortunately, we can't test that.
Good question!
First, my original post didn't provide the correct answers for most questions, only what was wrong with ChatGPT. Going from knowing what was wrong to actually giving correct answers seems like quite a leap. Further, ChatGPT changed its answers to better ones (including more rigorous explanations) even in cases where its original answers were correct.
Second, ChatGPT's self-reported training data cutoff date at the time it was asked these questions was September 2021 or January 2022. To my knowledge, Issa didn't ask it this question, but sources like https://www.reddit.com/r/ChatGPT/comments/16m6yc7/gpt4_training_cutoff_date_is_now_january_2022/ suggest that it was September 2021 at the time of his sessions, then became January 2022. So, the blog post, published in December 2022, should not have been part of its training data.
With that said, the sessions themselves (not the blog post about them) might have been part of the feedback loop, but in most cases I did not tell ChatGPT what was wrong with its answers within the session itself.
Is there a list of questions we expect GPT-5 to solve that GPT-4 does not? Also, I get the feeling that GPT has seen or been asked all of this before. Original questions are hard to come by.
What is an example where two negative numbers multiply to give a negative number?
Since you didn't specify real numbers, it seems like -i * -i = -1 should fit?
Usually, "negative" means "less than 0", and an order comparison is only available for real numbers, not complex numbers, so "negative numbers" means negative real numbers.
That said, ChatGPT is actually correct to use "Normally" in "Normally, when you multiply two negative numbers, you get a positive number", because taking the product of two negative floating-point numbers can give zero if the numbers are small enough in magnitude. Concretely, in Python, -1e-300 * -1e-300 gives an exact zero, and the same holds in any programming language that follows the IEEE 754 standard.
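To see it in a Python session (CPython uses IEEE 754 double precision):

```python
print(-1e-300 * -1e-300)  # 0.0 -- the true product, 1e-600, is below the smallest subnormal and underflows to zero
print(-3.0 * -4.0)        # 12.0 -- the usual case: a positive result
```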
I disagree with that last point and think that "Normally" is misleading, as stated in the OP; when talking about numbers in a general sense, it's clear that issues with how computers represent numbers aren't included, except as a side curiosity or when there's a cue that that's what's being discussed.
In December 2022, I shared logs of my interactions with the then-nascent ChatGPT 3.5 (December 15, 2022 version). In April 2023, at my request (and with financial compensation from me), my friend Issa Rice reran those prompts against ChatGPT 4 (using a paid subscription he had at the time). He shared the findings in gists here and here. Things were a little busy for me at the time, so I didn't get around to formatting and sharing those logs, and then after a few months I forgot about them. I just remembered this, so I've decided to format and share the logs now.
The TL;DR is that ChatGPT 3.5 fumbled on the bulk of my questions, making a bunch of errors and often doubling down on them, but ChatGPT 4 got everything correct, and almost always on the first try. In fact, in most cases, the text of the responses had no incorrect statements as far as Issa or I could make out.
As in the original post, the interactions are in blockquotes and my comments on the interactions are outside the blockquotes. I have used bold for cases where ChatGPT made incorrect statements, and italics for cases where it made correct statements that directly contradict nearby incorrect statements. Unlike the original post, though, you'll see very little use of bold and italics in this post.
You may want to open the original post side-by-side to compare the answers.
NOTE: I also did one intermediate test against the 2023-01-09 version of ChatGPT 3.5 that you can see here. It had slightly better but overall fairly similar results to my original post, so I did not format and share it as a separate post.
I have shared a few further thoughts in the Conclusion section.
Product of negative numbers (session 1; 2023-04-08)
These answers look great! I can't find any meaningful flaw with them. My very minor quibble is with the use of the word "Normally" in the second answer, which seems unnecessary and confusing.
Product of odd integers and product of even integers (session 1; 2023-04-08)
The answers seem correct; if I were scoring these on a test I'd give them full scores. One could quibble and say, for instance, that m has not been declared to be an integer in either answer, which is technically a failure of rigor, but I would overlook that failure in a human submission. In any case, I didn't even ask for a proof here, and it still supplied me with a mostly-complete proof.
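For completeness, the rigorous versions of the two-factor arguments are one-liners once $m$ and $n$ are explicitly declared to be integers:

$$(2m+1)(2n+1) = 4mn + 2m + 2n + 1 = 2(2mn + m + n) + 1 \quad\text{and}\quad (2m)(2n) = 2(2mn), \qquad m, n \in \mathbb{Z},$$

so the first product is odd and the second is even.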
In fact, the answers feel so good that I suspect that these are just part of ChatGPT's knowledge base and it's not figuring them out on the fly. But it's unclear.
Integer-valued polynomial with rational coefficients (session 1; 2023-04-08)
The answers seem pretty good. The volunteering of the "integer-valued polynomial" jargon again suggests that ChatGPT already knew the information, rather than figuring it out on the spot. I did highlight one sentence in one of the answers that seems off; likely the intended meaning was that the polynomial represents the sum of the sequence. Also, the reasoning offered (for why the polynomial always takes integer values at integers) only covers nonnegative integers. Again, however, the prompt didn't ask for proof, so I can hardly fault it for providing only a partial explanation.
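For reference, the triangular-number polynomial (the example discussed in the Conclusion below) illustrates the point: it has rational, non-integer coefficients yet takes an integer value at every integer, negative ones included, because one of any two consecutive integers is even:

$$p(n) = \frac{n(n+1)}{2} = \tfrac{1}{2}n^2 + \tfrac{1}{2}n \in \mathbb{Z} \quad\text{for every } n \in \mathbb{Z}.$$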
Product of prime numbers (session 1; 2023-04-08)
Other than some minor confusion about the status of 1 (an issue that we also saw with ChatGPT 3.5, albeit to a much greater extent there) and a nuance regarding distinct prime factors (my prompt did not specify that the prime numbers were distinct, so it was incorrect to infer that there would be two distinct prime factors), the answer seems correct. ChatGPT 4 volunteered not just that the product isn't always prime, but that it's never prime. It also gave the outline of an explanation.
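For completeness, the full argument that such a product is never prime fits in one line:

$$p, q \ge 2 \implies 1 < p < pq \text{ and } p \mid pq, \text{ so } pq \text{ is composite (even when } p = q\text{).}$$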
gcd of distinct prime numbers (session 1; 2023-04-08)
Both the answer and explanation seem correct!
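For reference, the one-line argument: the positive divisors of a prime are just 1 and the prime itself, so

$$\gcd(p, q) = 1 \quad\text{for distinct primes } p \ne q,$$

since 1 is the only divisor the two can share.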
Status of 1 as a prime number (session 1; 2023-04-08)
In the earlier session, the status of 1 as a prime number emerged naturally from ChatGPT's incorrect answers to other questions. In this session, the need didn't arise, but Issa still asked ChatGPT questions equivalent to those in the original session. So this time I'm breaking this out into a separate section.
The answers seem correct (and again, keep in mind that these questions are being asked to mirror the original session, where they made more sense in light of ChatGPT 3.5's incorrect responses to previous questions).
Sum of prime numbers (session 2; 2023-04-08)
Both answers seem correct! ChatGPT 3.5 had gotten confused and tried to find the sum of the largest and second largest known prime numbers. ChatGPT 4 got my intent and avoided the trap.
15: even or odd? (session 2; 2023-04-08)
These questions made sense in the context of the original session, as ChatGPT 3.5 had previously claimed that 15 is even and continued to double down on the claim throughout its answers. ChatGPT 4 is being asked the same questions for equivalence even though it doesn't make the same mistakes, so the questions may seem a bit weird.
Sum of odd integers (session 2; 2023-04-08)
The answer seems fine!
Square of odd integer (session 2; 2023-04-08)
The answer seems fine! ChatGPT 3.5 had also gotten this correct, but ChatGPT 4 provides a proof and therefore a more convincing explanation.
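For reference, the standard one-line version of such a proof:

$$(2k+1)^2 = 4k^2 + 4k + 1 = 2(2k^2 + 2k) + 1, \qquad k \in \mathbb{Z},$$

which is odd.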
Sum of prime numbers (round two) (session 2; 2023-04-08)
Looks good, though I would have preferred it if ChatGPT had led with the information that the Goldbach Conjecture hasn't been proven, rather than saying "Yes" so forcefully at the start.
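For readers who want it spelled out: the Goldbach Conjecture states that every even integer greater than 2 is the sum of two primes; it has been verified numerically over enormous ranges but remains unproven. A throwaway Python spot-check for small even numbers (evidence, not proof, and the helper is a naive trial-division primality test):

```python
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

# Express each small even number as a sum of two primes
for n in range(4, 31, 2):
    p = next(p for p in range(2, n) if is_prime(p) and is_prime(n - p))
    print(f"{n} = {p} + {n - p}")
```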
41 and Sophie Germain primes (session 2; 2023-04-08)
Looks good, other than the claim about the first six even integers (where it incorrectly calls 11 an even integer), the formula error for centered square numbers (even though the assertion that 41 is a centered square number is correct), and the flawed reasoning in ChatGPT's first attempt at figuring out whether 41 is a safe prime. [HT stochastic_parrot in the comments for the part about the first six even integers, which I missed in my first draft of the post.]
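For reference on the centered square number point: centered square numbers are those of the form $n^2 + (n+1)^2$ (equivalently $2n(n+1) + 1$), and 41 is indeed one:

$$41 = 4^2 + 5^2 = 16 + 25 = 2 \cdot 4 \cdot 5 + 1.$$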
Conclusion
The big jump in capabilities from ChatGPT 3.5 to ChatGPT 4 is well-known, and I'm hardly the first to comment on it, but I nonetheless think it is interesting because I created, ran, and published the prompts and responses against ChatGPT 3.5 before ChatGPT 4 became available, so this is in some sense a "pre-registered" test. The improvement in the answers was far more than I had expected: I had been expecting the errors to decrease, but not by such a huge extent.
My experience is fairly similar to that of Bryan Caplan who saw a grade jump from D to A on his midterm when going from ChatGPT 3.5 to ChatGPT 4.
There was also a qualitative shift in the answers, with ChatGPT 3.5 using proof by example and ChatGPT 4 actually providing mathematically rigorous proofs in many cases. The improvement to rigor was something that I had seen partially in my test on ChatGPT 3.5 in January 2023, but that rerun had not cut down on the error rate significantly.
I don't know for sure how ChatGPT 4 managed to improve so much. In particular, I don't know how much of this improvement stemmed from ChatGPT 4 having memorized more facts and proofs, versus it becoming a better reasoner. I suspect the former played a significant role; the language used in the proofs as well as the kinds of examples provided suggests that ChatGPT already "knew" the answers. For instance, ChatGPT invoking the jargon of "integer-valued polynomial" and then providing the triangular numbers as an example is strongly suggestive of it having the information in its knowledge base. However, it's possible that ChatGPT 3.5 also "knew" the correct information and was just less discerning about how to use it, so it ended up not using it. So perhaps the improvement to ChatGPT 4 stemmed from it being able to better prioritize what parts of its knowledge base to invoke based on the question.
I consider it unlikely that my interactions with ChatGPT 3.5 (that led to my original post) played a significant role as training data that helped make ChatGPT 4 so much better. However, I can't rule that out.
A sobering possibility is that the improvement is actually pretty small on the grand spectrum of intelligence; it just looks huge to us because we are zoomed in on the portion of the spectrum at and around human level. This harks back to Eliezer's famous village idiot/Einstein scale, where he posits that the village idiot and Einstein are actually very close together on the spectrum of intelligence. So if ChatGPT 4 is blowing our minds by doing correctly what ChatGPT 3.5 couldn't, ChatGPT 5 or 6 or 7 may be way, way, way beyond what we can imagine.