In December 2022, I shared logs of my interactions with the then-nascent ChatGPT 3.5 (December 15, 2022 version). In April 2023, at my request (and with financial compensation by me) my friend Issa Rice reran those prompts against ChatGPT 4 (using a paid subscription that he had at the time). He shared the findings in gists here and here. Things were a little busy for me at the time, so I didn't get around to formatting and sharing those logs, and then after a few months I forgot about them. I just remembered this so I've decided to format and share the logs.

The TL;DR is that ChatGPT 3.5 fumbled on the bulk of my questions, making a bunch of errors and often doubling down on them, but ChatGPT 4 got everything correct, and almost always on the first try. In fact, in most cases, the text of the responses had no incorrect statements as far as Issa or I could make out.

Similar to the original post, the interactions are in blockquotes and I've included my comments on the interactions outside the blockquotes. I have used bold for cases where it made incorrect statements, and italics for cases where it made correct statements that directly contradicted nearby incorrect statements. Unlike the original post, though, you'll see very little use of bold and italics in this post.

You may want to open the original post side-by-side to compare the answers.

NOTE: I also did one intermediate test against the 2023-01-09 version of ChatGPT 3.5 that you can see here. This had slightly better but overall fairly similar results as my original post, so I did not format and share the post.

I have shared a few further thoughts in the Conclusion section.

Product of negative numbers (session 1; 2023-04-08)

Issa: What is an example of an odd negative number?

ChatGPT: An example of an odd negative number is -7.

Issa: What is an example where two negative numbers multiply to give a negative number?

ChatGPT: Normally, when you multiply two negative numbers, you get a positive number. This is because the negative signs cancel each other out. There isn't an example where two negative numbers multiply to give a negative number, as it contradicts the basic rules of multiplication.

Issa: Can the product of two negative numbers be negative?

ChatGPT: No, the product of two negative numbers cannot be negative. When you multiply two negative numbers, their negative signs cancel each other out, resulting in a positive product. This is a fundamental property of multiplication involving negative numbers.

These answers look great! I can't find any meaningful flaw with them. My very minor quibble is with the use of the word "Normally" in the second answer, which seems unnecessary and confusing.

Product of odd integers and product of even integers (session 1; 2023-04-08)

Issa: Can the product of two odd integers be even?

ChatGPT: No, the product of two odd integers cannot be even. When you multiply two odd integers, the product will always be odd.

Here's a simple explanation:

An odd integer can be represented as (2n + 1), where n is any integer.

Let's take two odd integers:

Odd integer 1: 2n + 1

Odd integer 2: 2m + 1

Now, multiply these two odd integers:

(2n + 1) * (2m + 1)

Expanding this expression, we get:

4nm + 2n + 2m + 1

This can be written as:

2(2nm + n + m) + 1

Since 2nm + n + m is an integer, the expression 2(2nm + n + m) + 1 represents an odd integer. Thus, the product of two odd integers will always be odd.

Issa: Can the product of two even integers be odd?

No, the product of two even integers cannot be odd. When you multiply two even integers, the product will always be even.

Here's a simple explanation:

An even integer can be represented as (2n), where n is any integer.

Let's take two even integers:

Even integer 1: 2n

Even integer 2: 2m

Now, multiply these two even integers:

(2n) * (2m)

Expanding this expression, we get:

4nm

This can be written as:

2(2nm)

Since 2nm is an integer, the expression 2(2nm) represents an even integer. Thus, the product of two even integers will always be even.

The answers seem correct; if I were scoring these on a test I'd give them full scores. One could quibble and say, for instance, that m has not been declared to be an integer in either answer, which is technically a failure of rigor, but I would overlook that failure in a human submission. In any case, I didn't even ask for a proof here, and it still supplied me with a mostly-complete proof.

In fact, the answers feel so good that I suspect that these are just part of ChatGPT's knowledge base and it's not figuring them out on the fly. But it's unclear.

Integer-valued polynomial with rational coefficients (session 1; 2023-04-08)

Issa: What is an example of a polynomial with rational coefficients?

ChatGPT: An example of a polynomial with rational coefficients is:

P(x) = (3/4)x^3 - (2/3)x^2 + (5/2)x - 7

In this polynomial, each coefficient is a rational number (a number that can be expressed as a fraction with integers in the numerator and the denominator).

Issa: Give an example of a polynomial with rational coefficients that always takes integer values at integers

ChatGPT: A polynomial that always takes integer values at integers is called an integer-valued polynomial. Here's an example of an integer-valued polynomial with rational coefficients:

P(x) = (1/2)x(x - 1)

This polynomial is actually a representation of the arithmetic sequence where the common difference is 1. When you substitute any integer value for x, you will get an integer result:

For example, let's substitute a few integer values for x:

P(0) = (1/2)(0)(0 - 1) = 0

P(1) = (1/2)(1)(1 - 1) = 0

P(2) = (1/2)(2)(2 - 1) = 1

P(3) = (1/2)(3)(3 - 1) = 3

P(4) = (1/2)(4)(4 - 1) = 6

As you can see, for each integer value of x, the polynomial P(x) yields an integer result.

Issa: Give an example of a polynomial with rational coefficients that are not all integers, such that the polynomial always takes integer values at integers

ChatGPT: An example of a polynomial with rational coefficients that are not all integers, but still takes integer values at integers, is:

P(x) = (1/2)x(x - 1)

This polynomial represents the triangular numbers, which are the sums of consecutive integers from 1 to a given integer value. When you substitute any integer value for x, you will get an integer result:

For example, let's substitute a few integer values for x:

P(0) = (1/2)(0)(0 - 1) = 0

P(1) = (1/2)(1)(1 - 1) = 0

P(2) = (1/2)(2)(2 - 1) = 1

P(3) = (1/2)(3)(3 - 1) = 3

P(4) = (1/2)(4)(4 - 1) = 6

As you can see, for each integer value of x, the polynomial P(x) yields an integer result, even though one of its coefficients, 1/2, is not an integer.

The answers seem pretty good. The volunteering of the "integer-valued polynomial" jargon again suggests that ChatGPT already knew the information, rather than figuring it out on the spot. I did highlight one sentence in one of the answers that seems off; likely the intended meaning was that the polynomial represents the sum of the sequence. Also, the reasoning offered (for why the polynomial always takes integer values at integers) only covers nonnegative integers. Again, however, the prompt didn't ask for proof, so I can hardly fault it for providing only a partial explanation.

Product of prime numbers (session 1; 2023-04-08)

Issa: Is the product of two prime numbers always prime?

ChatGPT: No, the product of two prime numbers is not always prime. In fact, it is never prime (except for the special case when one of the prime numbers is 1, which is not considered prime by definition). When you multiply two prime numbers, the product will have at least two distinct prime factors: the two prime numbers being multiplied.

For example, let's multiply two prime numbers, 3 and 5:

3 * 5 = 15

The result, 15, is not a prime number because it has factors other than 1 and itself (3 and 5).

Other than some minor confusion about the status of 1 (an issue that we also saw with ChatGPT 3.5, albeit to a much greater extent there), and a nuance regarding distinct prime factors (my prompt did not specify the prime numbers as distinct, so it was incorrect to infer that there would be two distinct prime factors), the answer seems correct. ChatGPT 4 volunteered not just that it isn't always prime, but that it's never prime. It also gave the outline of an explanation.

gcd of distinct prime numbers (session 1; 2023-04-08)

Issa: Is the gcd of two distinct prime numbers always prime?

ChatGPT: No, the greatest common divisor (gcd) of two distinct prime numbers is not always prime. In fact, the gcd of two distinct prime numbers is always 1. This is because prime numbers have no common divisors other than 1. By definition, a prime number has exactly two distinct positive divisors: 1 and the number itself. Since the two prime numbers are distinct, they do not share any divisors other than 1.

Both the answer and explanation seem correct!

Status of 1 as a prime number (session 1; 2023-04-08)

In the earlier session, the status of 1 as a prime number emerged naturally from ChatGPT's incorrect answers to other questions. In this session, the need didn't arise, but Issa still asked ChatGPT the equivalent questions to those asked in the original session. So this time I'm breaking this into a separate section.

Issa: Is 1 prime or composite?

ChatGPT: The number 1 is neither prime nor composite. By definition, a prime number has exactly two distinct positive divisors: 1 and the number itself. Since 1 has only one distinct positive divisor, which is 1, it does not meet the criteria for being a prime number.

A composite number has more than two distinct positive divisors. As 1 has only one distinct positive divisor, it also does not meet the criteria for being a composite number.

Issa: Is 1 positive?

ChatGPT: Yes, 1 is a positive number. Positive numbers are those greater than zero, and since 1 is greater than 0, it is considered a positive number.

Issa: Is 1 a positive integer?

ChatGPT: Yes, 1 is a positive integer. Positive integers are whole numbers greater than zero, and since 1 is a whole number greater than 0, it is considered a positive integer.

The answers seem correct (and again, keep in mind that the questions are being asked for equivalence with the original questions where they made more sense in light of ChatGPT's incorrect responses to previous questions).

Sum of prime numbers (session 2; 2023-04-08)

Issa: Is the sum of two prime numbers always prime?

ChatGPT: No, the sum of two prime numbers is not always prime. In fact, the sum of two prime numbers is often even (except when one of them is 2, which is the only even prime number), and even numbers greater than 2 are not prime because they can be divided by 2 without a remainder.

For example, consider the prime numbers 3 and 5. Their sum is 8, which is an even number and not a prime since it can be divided by 2 and 4. Similarly, the sum of the prime numbers 7 and 11 is 18, which is also even and not prime because it has multiple factors, such as 2, 3, 6, and 9.

Issa: What is the sum of the largest and second largest prime number?

ChatGPT: The concept of the "largest" prime number is not applicable, as there is an infinite number of prime numbers. According to the famous theorem proven by the ancient Greek mathematician Euclid, there is no largest prime number. He demonstrated that whenever you have a list of prime numbers, you can always find another prime number that is not in your list.

As a result, it is impossible to determine the sum of the largest and second largest prime numbers, since neither of these can be definitively identified.

Both answers seem correct! ChatGPT 3.5 had gotten confused and tried to find the sum of the largest and second largest known prime numbers. ChatGPT 4 got my intent and avoided the trap.

15: even or odd? (session 2; 2023-04-08)

These questions made sense in the contex of the original session, as ChatGPT 3.5 had previously claimed that 15 is even, and continued to double down on the claim through the answers. ChatGPT 4 is being asked the same questions for equivalence even though it doesn't make the same mistakes. So the questions may seem a bit weird.

Issa: Is 15 an odd integer?

ChatGPT: Yes, 15 is an odd integer. An integer is considered odd if it cannot be evenly divided by 2, meaning there is a remainder when divided by 2. In the case of 15, when divided by 2, it results in a quotient of 7 with a remainder of 1. Since there is a remainder, 15 is an odd integer.

Issa: What are the odd integers between 1 and 20?

ChatGPT: The odd integers between 1 and 20 are the numbers that cannot be evenly divided by 2, leaving a remainder when divided by 2. Here is the list of odd integers between 1 and 20:

1, 3, 5, 7, 9, 11, 13, 15, 17, 19

Issa: Explain why 16 is an odd integer

ChatGPT: Apologies for any confusion, but 16 is actually an even integer, not an odd integer. An integer is considered even if it can be evenly divided by 2 without leaving a remainder. In the case of 16, it can be divided by 2 to give a quotient of 8 with no remainder (16 / 2 = 8). Since there is no remainder, 16 is an even integer.

Sum of odd integers (session 2; 2023-04-08)

Issa: Can you give an example of two odd integers whose sum is odd?

ChatGPT: I apologize for any confusion in my previous responses. It is not possible for the sum of two odd integers to be odd. When you add two odd integers together, the result will always be an even integer.

To understand why, consider the general form of odd integers:

An odd integer can be written as (2n + 1) or (2m + 1), where n and m are integers.

When you add two odd integers together, you get:

(2n + 1) + (2m + 1) = 2n + 2m + 2 = 2(n + m + 1)

Since the sum can be written as a product of 2 and another integer (n + m + 1), the result will always be even.

The answer seems fine!

Square of odd integer (session 2; 2023-04-08)

Issa: What is an odd integer whose square is even?

ChatGPT: There is no odd integer whose square is even. When you square an odd integer, the result is always odd.

To understand why, consider the general form of odd integers:

An odd integer can be written as (2n + 1), where n is an integer.

When you square an odd integer, you get:

(2n + 1)^2 = (2n + 1)(2n + 1) = 4n^2 + 4n + 1 = 2(2n^2 + 2n) + 1

Since the squared odd integer can be written as 2 multiplied by an integer (2n^2 + 2n) plus 1, the result will always be odd.

The answer seems fine! ChatGPT 3.5 had also gotten this correct, but ChatGPT 4 provides a proof and therefore a more convincing explanation.

Sum of prime numbers (round two) (session 2; 2023-04-08)

Issa: Is the sum of two prime numbers always even?

ChatGPT: The sum of two prime numbers is not always even. However, it is usually even, with one specific exception: when one of the prime numbers is 2, which is the only even prime number. In this case, the sum of two prime numbers can be odd.

For example, consider the prime numbers 2 and 3. Their sum is 5, which is an odd number.

But in all other cases where both prime numbers are odd, their sum will be even:

For example, the sum of the prime numbers 3 and 5 is 8, which is an even number. Similarly, the sum of the prime numbers 7 and 11 is 18, which is also even.

Issa: Can every positive even integer greater than 4 be expressed as the sum of two odd prime numbers?

Yes, every positive even integer greater than 4 can be expressed as the sum of two odd prime numbers. This statement is known as the Goldbach Conjecture, which is an unsolved problem in number theory that dates back to 1742.

Although the Goldbach Conjecture has not been proven, extensive computational evidence supports the conjecture for very large even integers. However, despite the strong empirical evidence, a general proof remains elusive.

Looks good, though I would have preferred if ChatGPT led with the information that the Goldbach Conjecture hasn't been proven, and so didn't say "Yes" so forcefully at the start.

41 and Sophie Germain primes (session 2; 2023-04-08)

Issa: What is special about 41?

ChatGPT: The number 41 has a few interesting properties:

Prime number: 41 is a prime number, meaning it has exactly two distinct positive divisors: 1 and itself (41). It cannot be divided evenly by any other positive integers.

Sum of consecutive integers: 41 can be expressed as the sum of consecutive integers, specifically the first six positive even integers: 2 + 4 + 6 + 8 + 10 + 11 = 41.

Centered square number: 41 is a centered square number. Centered square numbers can be represented as a square with a dot in the center and all other dots surrounding the center dot in a square shape. The formula for the nth centered square number is given by: 1 + 8(n-1), where n is a positive integer. In the case of 41, n = 6, so 41 = 1 + 8(6-1) = 1 + 8(5) = 1 + 40 = 41.

Euler's famous prime-generating polynomial: In mathematics, the polynomial f(n) = n^2 + n + 41 generates prime numbers for the values of n from 0 to 39. This polynomial was discovered by Swiss mathematician Leonhard Euler. After n = 39, the generated numbers are no longer always prime, but the fact that it generates 40 consecutive prime numbers is quite remarkable.

Issa: Is 41 a Sophie Germain prime?

ChatGPT: Yes, 41 is a Sophie Germain prime. A Sophie Germain prime is a prime number p for which 2p + 1 is also prime. These primes are named after the French mathematician Sophie Germain.

In the case of 41, if we apply the definition:

2 * 41 + 1 = 82 + 1 = 83

Since both 41 and 83 are prime numbers, 41 is indeed a Sophie Germain prime.

Issa: Is 41 a safe prime?

ChatGPT: No, 41 is not a safe prime. A safe prime is a prime number of the form 2p + 1, where p is also a prime number. Safe primes are a subset of Sophie Germain primes, and p is a Sophie Germain prime.

In the case of 41, we should check if there is a prime number p for which 2p + 1 equals 41. However, we already established that 41 is a Sophie Germain prime, where 2 * 41 + 1 equals 83. Thus, 41 does not fit the definition of a safe prime, as it is not of the form 2p + 1 with p being prime.

Issa: That reasoning does not make sense to me. Why does 41 being a Sophie Germain prime mean that 41 is not a safe prime?

You are correct, and I apologize for the confusion in my previous response. The fact that 41 is a Sophie Germain prime does not automatically disqualify it from being a safe prime. Let me provide a clearer explanation.

A safe prime is a prime number of the form 2p + 1, where p is also a prime number. To determine whether 41 is a safe prime, we need to see if there is a prime number p such that 2p + 1 equals 41.

In this case:

2p + 1 = 41

2p = 40

p = 20

Since 20 is not a prime number, 41 cannot be a safe prime. The reasoning in my previous response was indeed incorrect, and I apologize for any confusion.

Looks good, other than the incorrect claim about the first six even integers (where it incorrectly calls 11 an even integer), the formula error for centered square numbers (even though the assertion that 41 is a centered square number is correct) and the flawed reasoning in ChatGPT's first attempt at trying to figure out if 41 is a safe prime. [HT stochastic_parrot in the comments for the part about the first six even integers, that I missed in my first draft of the post.]

Conclusion

The big jump in capabilities from ChatGPT 3.5 to ChatGPT 4 is well-known, and I'm hardly the first to comment on it, but nonetheless I think it is interesting because I created, ran, and published the prompts and responses against ChatGPT 3.5 before ChatGPT 4 became available, so it is in some sense a "pre-registered" test. The improvement in the answers was way more than I had expected: I had been expecting that the errors would reduce but not that they would go down to such a huge extent.

My experience is fairly similar to that of Bryan Caplan who saw a grade jump from D to A on his midterm when going from ChatGPT 3.5 to ChatGPT 4.

There was also a qualitative shift in the answers, with ChatGPT 3.5 using proof by example and ChatGPT 4 actually providing mathematically rigorous proofs in many cases. The improvement to rigor was something that I had seen partially in my test on ChatGPT 3.5 in January 2023, but that rerun had not cut down on the error rate significantly.

I don't know for sure how ChatGPT 4 managed to improve so much. In particular, I don't know how much of this improvement stemmed from ChatGPT 4 having memorized more facts and proofs, versus it becoming a better reasoner. I suspect the former played a significant role; the language used in the proofs as well as the kinds of examples provided suggests that ChatGPT already "knew" the answers. For instance, ChatGPT invoking the jargon of "integer-valued polynomial" and then providing the triangular numbers as an example is strongly suggestive of it having the information in its knowledge base. However, it's possible that ChatGPT 3.5 also "knew" the correct information and was just less discerning about how to use it, so it ended up not using it. So perhaps the improvement to ChatGPT 4 stemmed from it being able to better prioritize what parts of its knowledge base to invoke based on the question.

I consider it unlikely that my interactions with ChatGPT 3.5 (that led to my original post) played a significant role as training data that helped make ChatGPT 4 so much better. However, I can't rule that out.

A sobering possibility is that the improvement is actually pretty small on the grand spectrum of intelligence; it just looks huge to us because we are zoomed in on the portion of intelligence at and around human levels. This harks back to the famous village idiot/Einstein scale by Eliezer where he posits that village idiot and Einstein are actually very close by on the spectrum of intelligence. So if ChatGPT 4 is blowing our minds by doing correctly what ChatGPT 3.5 couldn't, ChatGPT 5 or 6 or 7 may be way way way beyond what we can imagine.

New Comment
16 comments, sorted by Click to highlight new comments since:

Fairly minor but I think I see an unmentioned error in the "41" section:

the first six positive even integers: 2 + 4 + 6 + 8 + 10 + 11 = 41

11 is not even (it seems to be thinking somewhat of 42?)

Edit: Actually the more I think about it, it's a pretty interesting error. ChatGPT 4 produces much better answers than I would for the majority of these questions, but I don't think I would make this error. If you asked it, I'm sure it would correctly explain that the sum of even integers cannot be odd, or that 11 is not even, etc, but I wonder if (for example) a large amount of training text about the number 42 being the sum of the first six positive even integers was somehow "close" enough to "41" to overwhelm any emergent understanding of how to apply those concepts?

The way that LLM tokenization represents numbers is all kinds of stupid. It's honestly kind of amazing to me they don't make even more arithmetic errors. Of course, an LLM can use a calculator just fine, and this is an extremely obvious way to enhance its general intelligence. I believe "give the LLM a calculator" is in fact being used, in some cases, but either the LLM or some shell around it has to decide when to use the calculator and how to use the calculator's result. That apparently didn't happen or didn't work properly in this case.

Thanks, and sorry I missed that error. I've updated the post by bolding the error, and also HT'ed your contribution.

Wild guess: It realised its mistake partway through, and followed through it anyway as sensibly as could be done, balancing between giving a wrong calculation ("+ 12 = 41"), ignoring the central focus of the question (" + 12 = 42"), and breaking from the "list of even integers" that it was supposed to be going through. I suspect it would not make this error when using chain-of-thought.

Another error - whilst chat GPT is correct that 41 is a centred square number, it's formula is wrong.

Thanks, fixed now! Sorry I missed that.

[-]Dacyn30

When you multiply two prime numbers, the product will have at least two distinct prime factors: the two prime numbers being multiplied.

Technically, it is not true that the prime numbers being multiplied need to be distinct. For example, 2*2=4 is the product of two prime numbers, but it is not the product of two distinct prime numbers.

As a result, it is impossible to determine the sum of the largest and second largest prime numbers, since neither of these can be definitively identified.

This seems wrong: "neither can be definitively identified" makes it sound like they exist but just can't be identified...

Safe primes area subset of Sophie Germain primes

Not true, e.g. 7 is safe but not Sophie Germain.

Thanks for reviewing and catching these subtle issues!

Technically, it is not true that the prime numbers being multiplied need to be distinct. For example, 2*2=4 is the product of two prime numbers, but it is not the product of two distinct prime numbers.

Good point, I've marked this as an error. My prompt about gcd did specify distinctness but the prompt about product did not, so this is indeed an error.

This seems wrong: "neither can be definitively identified" makes it sound like they exist but just can't be identified...

I passed on this one as being too minor to mark.

Safe primes area subset of Sophie Germain primes

Not true, e.g. 7 is safe but not Sophie Germain.

Good point; I missed reading this sentence originally. I've marked this one as well.

I consider it unlikely that my interactions with ChatGPT 3.5 (that led to my original post) played a significant role as training data that helped make ChatGPT 4 so much better.

Why do you consider it unlikely?

If I was good at memorization (model parameter size) but bad at reasoning then your original post showing up in my training data would help me.

I'm also curious how much of the improvement comes from improvements to the hidden and secret prompt that Openai adds to your Chatgpt interactions. Unfortunately we can't test that.

Good question!

First, my original post didn't provide the correct answers for most questions, only what was wrong with ChatGPT. Going from knowing what was wrong to actually giving correct answers seems like quite a leap. Further, ChatGPT changed its answers to better ones (including more rigorous explanations) even in cases where its original answers were correct.

Second, ChatGPT's self-reported training data cutoff date at the time it was asked these questions was September 2021 or January 2022. To my knowledge, Issa didn't ask it this question, but sources like https://www.reddit.com/r/ChatGPT/comments/16m6yc7/gpt4_training_cutoff_date_is_now_january_2022/ suggest that it was September 2021 at the time of his sessions, then became January 2022. So, the blog post, published in December 2022, should not have been part of its training data.

With that said, the sessions themselves (not the blog post about them) might have been part of the feedback loop, but in most cases I did not tell ChatGPT what was wrong with its answers within the session itself.

Is there a list of questions we expect gpt5 to solve that gpt 4 does not? Also I get the feeling that gpt has seen or been asked all this before. Original questions are hard to do.

[-]Sune30

I have a question that tricks GPT-4, but if I post it I’m afraid it’s going to end up in the training data for GPT-5. I might post it once there is a GPT-n that solves it.

You can publish hash of this question

What is an example where two negative numbers multiply to give a negative number?

Since you didn't specify real numbers, it seems like -i * -i = -1 should fit?

[-]blf40

Usually, negative means "less than 0", and a comparison is only available for real numbers and not complex numbers, so negative numbers mean negative real numbers.

That said, ChatGPT is actually correct to use "Normally" in "Normally, when you multiply two negative numbers, you get a positive number." because taking the product of two negative floating point numbers can give zero if the numbers are too tiny.  Concretely in python -1e300 * -1e300 gives an exact zero, and this holds in all programming languages that follow the IEEE 754 standard.

I disagree with the last paragraph and think that “Normally” is misleading as stated in the OP; I think it's clear when talking about numbers in a general sense that issues with representations of numbers as used in computers aren't included except as a side curiosity or if there's a cue to the effect that that's what's being discussed.