by p.b.
2 min read

A few months back there was a question on LessWrong about what the least impressive thing is going to be that GPT-4 can't do. At the time I thought counting might be a good candidate, because I had noticed that this seems to be a problem for all big transformer models.

Dall-E and other image generation models can't reliably get cardinalities right beyond something like 3. 

Luminous/MAGMA, Aleph Alpha's multi-modal language model that can take pictures as input, cannot count items in these pictures with any reliability.

All pure language models I had checked weren't able to count the number of occurrences of a specific word in a sentence or of a specific character in a string.

Counting very much feels like a basic ability: I think my daughter could reliably count ten items when she was two, and I feel like there should be a way to prompt ChatGPT that demonstrates counting ability.

In my tests so far, even ChatGPT, the most advanced LM to date, is unable to count specific words or characters. However, people seem to disagree ...

For a short while I thought the correct hack was to ask it to write a Python function to do the counting and then output what it thinks the function would output. But now I suspect that ChatGPT executes functions behind the scenes. So it would get the answer correct the very first time after actually writing the function and revert to being wrong in subsequent answers to the question of what the function would output for certain inputs. [I just now noticed that in the answer below, even the first proposed function output is incorrect, so maybe there is no code execution?]

Here is one of my tries: 

I want you to count the number of occurrences of the character "X" in a string. To do that I propose a simple algorithm, just repeat the string and write the current count of "X"s behind each "X" like this: "XGJKXZUXGULXXG" --> "X1GJKX2ZUX3GULX4X5G" : the final count is 5. Now you do this for "GLXXXJHJHXJHKXJHKX"

Sure, here is the implementation in Python:

def count_occurrences(string): 
    count = 0 
    result = "" 
    for ch in string: 
        if ch == "X": 
            count += 1 
            result += "X" + str(count) 
        else: result += ch
    return count 

string = "GLXXXJHJHXJHKXJHKX" 

count = count_occurrences(string) 

print(f"The count of X in {string} is {count}")

This will print "The count of X in GLXXXJHJHXJHKXJHKX is 5".

Is there anything else I can help with?

What would that function output for "KJXKKLJKLJKXXKLJXKJL"? Please don't write another function. No code in this answer.

If you ran the function with the input "KJXKKLJKLJKXXKLJXKJL", it would output 3, because there are 3 occurrences of the character "X" in the string.

And for the string "AKLHSDJKHHJDASJHKJDJKJHK"?

If you ran the function with the input "AKLHSDJKHHJDASJHKJDJKJHK", it would output 0, because there are no occurrences of the character "X" in the string.

And for the string "KJHXXXXXXJHKJK"?

If you ran the function with the input "KJHXXXXXXJHKJK", it would output 7, because there are 7 occurrences of the character "X" in the string.

And for the string "HJXJKHXJKHXXJKX9"?

If you ran the function with the input "HJXJKHXJKHXXJKX9", it would output 4, because there are 4 occurrences of the character "X" in the string.
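As a rough check (not part of the exchange above; the helper name annotate_and_count is made up here), a minimal sketch of the counting scheme described in my prompt, along with the actual counts for the test strings:

def annotate_and_count(string, target="X"):
    # Write the running count behind each occurrence of the target character,
    # as described in the prompt above, and return the annotated string plus the count.
    count = 0
    annotated = ""
    for ch in string:
        annotated += ch
        if ch == target:
            count += 1
            annotated += str(count)
    return annotated, count

for s in [
    "GLXXXJHJHXJHKXJHKX",        # ChatGPT claimed 5; the true count is 6
    "KJXKKLJKLJKXXKLJXKJL",      # claimed 3; true count 4
    "AKLHSDJKHHJDASJHKJDJKJHK",  # claimed 0; true count 0
    "KJHXXXXXXJHKJK",            # claimed 7; true count 6
    "HJXJKHXJKHXXJKX9",          # claimed 4; true count 5
]:
    annotated, count = annotate_and_count(s)
    print(annotated, count, s.count("X"))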

11 comments

The byte-pair encoding is probably hurting it somewhat here; forcing it to unpack it will likely help. Try using this as a one-shot prompt:
 


How many Xs are there in "KJXKKLJKLJKXXKLJXKJL"? 

Numbering the letters in the string, we have: 1 K, 2 J, 3 X, 4 K, 5 K, 6 L, 7 J, 8 K, 9 L, 10 J, 11 K, 12 X, 13 X, 14 K, 15 L, 16 J, 17 X, 18 K, 19 J, 20 L. There are Xs at positions 3, 12, 13, and 17. So there are 4 Xs in total. 

How many [character of interest]s are there in "[string of interest goes here]"?
 


If it's still getting confused, add more shots - I suspect it can figure out how to do it most of the time with a sufficient number of examples.
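A small sketch of how one might generate the letter numbering used in that prompt programmatically (the helper name numbered_listing is made up here); it also yields the ground-truth positions to check the model against:

def numbered_listing(s, target):
    # Produce "1 K, 2 J, 3 X, ..." as in the one-shot example above,
    # plus the positions at which the target character occurs.
    listing = ", ".join(f"{i} {ch}" for i, ch in enumerate(s, start=1))
    positions = [i for i, ch in enumerate(s, start=1) if ch == target]
    return listing, positions

listing, positions = numbered_listing("KJXKKLJKLJKXXKLJXKJL", "X")
print(listing)
print(positions, len(positions))  # [3, 12, 13, 17] 4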

Dall-E and other image generation models can't reliably get cardinalities right beyond something like 3.

That's just BPEs, and it's worth noting that the other models (and by 'other', I do not mean "Stable Diffusion" like a lazy person would, I mean actual SOTAs) are much more capable of following long intricate prompts eg. Parti.

What's going on with ChatGPT is also likely due to BPEs (as has been the case with GPT when asked to do letter-related tasks going back to GPT-2, as I have been repeating for years) plus perhaps some RLHF shenanigans. (RLHF seems to increase accuracy but sometimes greatly worsen results. I'm not yet sure why, but I suspect there may be something pathological in what RLHF rewards, where it is unable to do inner-monologue and incentivized to immediately jump to an answer.)
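As an illustration of the BPE point, here is a rough sketch (assuming OpenAI's tiktoken library is available) of how a GPT-style BPE tokenizer chunks one of the test strings; the model operates on these chunks, not on individual letters:

import tiktoken

enc = tiktoken.get_encoding("gpt2")  # BPE vocabulary used by GPT-2/GPT-3
tokens = enc.encode("KJXKKLJKLJKXXKLJXKJL")
print([enc.decode([t]) for t in tokens])
# Typically this prints multi-character chunks rather than single letters,
# so a character count has to be inferred across chunk boundaries.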

BPEs explain the string examples but I also couldn't get ChatGPT to accurately count words.

Edit: 

An example that I just now tried with GPT-3.5-turbo:

User: How many times does the word "cat" occur in the following sentence: A cat was a cat in a cat with a cat as a hat and picture of a cat on the cat hat.

Assistant: The word "cat" occurs 5 times in the sentence.

 

Because apparently I don't have it in me to concentrate on work today, here is some more experimentation with ChatGPT:

Me: How many times does the word "cat" occur in the following sentence: A cat was a cat in a cat with a cat as a hat and picture of a cat on the cat hat.

ChatGPT: The word "cat" occurs five times in the sentence.

Me: Could you repeat the sentence but put a number after each occurence of "cat" counting up.

ChatGPT: Sure, here's the sentence with a number after each occurrence of "cat" counting up:

"A cat (1) was a cat (2) in a cat (3) with a cat (4) as a hat and picture of a cat (5) on the cat hat."

So I thought, maybe "a cat" is one token for some reason and I tried: 

Me: Now, count the number of occurences of the word "permeable" in the following sentence: A permeable membrane is more permeable if it is more permeable rather than less permeable than a less permeable membrane is permeable.

ChatGPT: The word "permeable" appears four times in the sentence.

Me: Could you again repeat the sentence and put a number in brackets after each occurence of "permeable" counting up?

ChatGPT: Sure, here's the sentence with a number in brackets after each occurrence of "permeable" counting up:

"A permeable membrane is more permeable if it is [1] more permeable rather than less permeable than a less permeable membrane is [2] permeable."

"Permeable" appears twice in the first clause and twice in the second clause of the sentence, so there are four occurrences of the word "permeable" in total.

But now I suspect that ChatGPT executes functions behind the scenes. 

Why do you?

Do you believe that OpenAI isn't truthful in describing ChatGPT?

I don't believe this anymore. It was probably just random chance that left this impression. Initially in my experiments ChatGPT got counts correct after writing a function, but was basically always wrong when not writing a function.

I ran a small experiment:

Write a python program to compute the gcd

 

Sure! Here is a simple program to compute the greatest common divisor (GCD) of two numbers using the Euclidean algorithm:

def gcd(a, b):
   while b != 0:
       a, b = b, a % b
   return a

print(gcd(60, 48))  # should print 12
 

This program works by repeatedly applying the Euclidean algorithm, which states that the GCD of two numbers a and b is equal to the GCD of b and the remainder of a divided by b. The algorithm terminates when the remainder is zero, at which point the GCD is equal to the last value of b.

I hope this helps! Let me know if you have any questions.

 

What would be the output of this function if the inputs are -3 and -5?

 

If you pass negative numbers as inputs to the gcd function, the output will be the absolute value of the GCD of the absolute values of the input numbers. For example, if you call gcd(-3, -5), the function will compute the GCD of 3 and 5, which is 2, and return the absolute value of 2, which is also 2.

Here is the modified version of the gcd function that returns the absolute value of the GCD:

def gcd(a, b):
   a = abs(a)
   b = abs(b)
   while b != 0:
       a, b = b, a % b
   return a

print(gcd(-3, -5))  # should print 2
 

I hope this helps! Let me know if you have any questions.

This is very wrong. First of all, no: if you pass negative numbers as inputs to the gcd function, the output will not be the absolute value. In Python, a % b has the sign of b, so the first function returns -1 (you can try it). Also, the gcd of 3 and 5, or of -3 and -5, is neither 2 nor -2.
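A quick check bears this out; here is a minimal sketch comparing the generated function with Python's built-in math.gcd:

import math

def gcd(a, b):
    # the function ChatGPT generated above, unchanged
    while b != 0:
        a, b = b, a % b
    return a

print(gcd(-3, -5))       # -1, because a % b takes the sign of b in Python
print(math.gcd(-3, -5))  # 1, the conventional non-negative gcd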

So, ChatGPT doesn't secretly run the functions; maybe it should, though.

 

I used regenerate response because this one seemed really bad, and ChatGPT sometimes correctly gives -1 as an answer, sometimes incorrectly 1, and, oddly, often 2 or -2.

One extra run because this one is interesting; this is another use of "regenerate response", nothing else changed:

If the inputs to the gcd function are -3 and -5, the function will return -1.

Here's why:

  1. The function first assigns a = -3 and b = -5.
  2. The while loop begins, and b is not equal to 0, so the loop continues.
  3. The values of a and b are swapped, and a becomes -5 and b becomes -3.
  4. The loop continues, and b is not equal to 0, so the loop continues.
  5. The values of a and b are swapped again, and a becomes -3 and b becomes -2.
  6. The loop continues, and b is not equal to 0, so the loop continues.
  7. The values of a and b are swapped again, and a becomes -2 and b becomes -1.
  8. The loop continues, and b is not equal to 0, so the loop continues.
  9. The values of a and b are swapped again, and a becomes -1 and b becomes -1.
  10. The loop continues, and b is not equal to 0, so the loop continues.
  11. The values of a and b are swapped again, and a becomes -1 and b becomes 0.
  12. The loop continues, and b is equal to 0, so the loop terminates.
  13. The function returns the value of a, which is -1.

I hope this helps! Let me know if you have any questions.

This answer is interesting because it looks very correct. I'd like to point your attention to step 9: ChatGPT claims that, at some point, both a and b will be equal to -1, which actually never happens. Indeed, -2 % -1 == 0.
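Printing the actual iteration states makes the discrepancy explicit (a small sketch; gcd_trace is a made-up name):

def gcd_trace(a, b):
    # same loop as ChatGPT's gcd, but printing (a, b) after each swap
    while b != 0:
        a, b = b, a % b
        print(a, b)
    return a

print(gcd_trace(-3, -5))
# prints -5 -3, then -3 -2, then -2 -1, then -1 0, and finally returns -1;
# a and b are never both equal to -1.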

As Ustice says below:

It seems to run code about as well as I do in my head.

ChatGPT cannot execute a real Python interpreter. If it appears to execute the function, it is because it has a fairly strong approximate understanding of how a Python interpreter behaves. Perhaps its counting skill is least noisy when recent context implies a perfect counting machine is the target to mimic?

Yeah, it was probably just by chance that it got it correct 2 or 3 times after writing a function.

It seems to run code about as well as I do in my head. That's pretty damned impressive, since it does this in seconds, and has even been able to emulate a shell session.

My guess is that there is a difference in how it was trained with code vs general text. It’s like a different mode of thinking/computing. When you put it in terms of code, you engage that more mathematical mode of thinking. When you are just conversing, it’s pretty happy to give you plausible bullshit.

I'm curious how we can engage these different modes of thinking, assuming that my idea is more than plausible bullshit.

When I asked it to "Count all the Bs in abaabbaaba" it replied with "There are four Bs in the string "abaabbaaba"." Likewise, "Count all the As in abaabbaaba" resulted in "In the string "abaabbaaba", there are 6 As.".

 

But ChatGPT sometimes doesn't count the As and Bs accurately, especially if you forget the word "all".
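For completeness, a trivial sketch: the built-in str.count gives the ground truth for that (lowercase) string.

s = "abaabbaaba"
print(s.count("b"))  # 4
print(s.count("a"))  # 6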

 
