What's the source of that 505 employees letter? I mean the contents aren't too crazy, but isn't it strange that the only thing we have is a screenshot of the first page?
Re: TikTok viral videos. I think the cliff is simply because recent videos have had too little time to be watched 10m times. The second graph in the article doesn't show the same metric for 0.1m views; it shows average views per week (among videos with >0.1m views), which stays stable.
I don't understand the point of questions 1 and 3.
If we forget about the details of how the model works, question 1 essentially checks whether the entity in question has a good enough RNG. Which doesn't seem particularly relevant? A human with a vocabulary and random.org can do that. AutoGPT with access to a vocabulary and random.org also has a rather good shot. A superintelligence that for some reason decides not to use an RNG and answers deterministically will fail. I suppose it would be very interesting to learn that, say, GPT-6 can do it without an external RNG, but what would that tell us about its other capabilities?
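To make the "good enough RNG" point concrete, here's a minimal sketch (with a made-up four-word vocabulary, not data from any real model) of how one might check whether an entity's "random" word choices look uniform, using a chi-square statistic. The deterministic responder fails badly; anything wired to a decent RNG passes easily.

```python
import random
from collections import Counter

def chi_square_uniformity(samples, vocabulary):
    """Chi-square statistic for samples drawn from a vocabulary;
    large values mean the sampler is far from uniform."""
    counts = Counter(samples)
    expected = len(samples) / len(vocabulary)
    return sum((counts.get(w, 0) - expected) ** 2 / expected
               for w in vocabulary)

vocab = ["apple", "river", "stone", "cloud"]
random.seed(0)

# An entity with a real RNG (random.choice stands in for random.org here)
good = [random.choice(vocab) for _ in range(4000)]
# A deterministic "superintelligence" that always answers the same word
bad = ["apple"] * 4000

print(chi_square_uniformity(good, vocab))  # small: consistent with uniform
print(chi_square_uniformity(bad, vocab))   # 12000.0: clearly non-random
```

The test distinguishes the two instantly, but that's exactly the point: it measures access to randomness, not intelligence.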
Question 3 checks for something weird. If I wanted to pass it, I'd probably have to precommit to answering certain weird questions in a particular way (and also make sure I always have access to some RNG). Which is a weird thing to do? I expect humans to fail at that, but I also expect almost every possible intelligence to fail at it.
In contrast, question 2 checks for something like "which part of the input do you find most surprising", which seems like a really useful skill to have, and we should probably watch out for it.
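One natural way to operationalize "which part of the input is most surprising" is per-token surprisal. Here's a minimal sketch with hypothetical hand-picked log-probabilities (not real model output): the most surprising token is simply the one the model assigned the lowest probability.

```python
import math

def most_surprising(tokens, logprobs):
    """Return the token with the highest surprisal (-log2 p),
    i.e. the part of the input the model found least expected."""
    surprisal = [-lp / math.log(2) for lp in logprobs]  # nats -> bits
    i = max(range(len(tokens)), key=lambda k: surprisal[k])
    return tokens[i], surprisal[i]

# Hypothetical per-token log-probabilities (in nats) from some LM
tokens = ["the", "cat", "sat", "on", "the", "teapot"]
logprobs = [-0.5, -2.1, -1.3, -0.4, -0.3, -7.2]

token, bits = most_surprising(tokens, logprobs)
print(token)  # "teapot" -- the least expected word
```

Whether a model can *report* this about its own input, rather than us reading it off the logits, is of course the interesting part of the question.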
Yeah, you're right. It seems it was actually one of the harder ones I tried. This particular problem was solved by 4 of 28 members of a relatively strong group. I distinctly remember also trying some easy problems from a relatively weak group, but I don't have notes, and Bing doesn't save chats.
I guess I should just try again, especially in light of gwillen's comment. (By the way, if somebody with access to the actual GPT-4 is willing to help me test it on some math problems, I'd really appreciate it.)
That would explain a lot. I've heard this rumor, but when I tried to trace the source, I couldn't find anything better than guesses. So I dismissed it, but maybe I shouldn't have. Do you have a better source?
I agree that there are some impressive improvements from GPT-3 to GPT-4. But they seem to me a lot less impressive than the jump from GPT-2 producing barely coherent text to GPT-3 (somewhat) figuring out how to play chess.
I disagree with your take on LLMs' math abilities. Wolfram Alpha helps with tasks like the SAT -- and GPT-4 is doing well enough on those. But for some reason it (at least in its Bing incarnation) has trouble with simple logic puzzles like the one I mentioned in another comment.
Can you say more about the successes with theoretical physics concepts? I don't think I've seen anybody try that.
I didn't say "it's worse than a 12-year-old at any math task". I meant nonstandard problems. Perhaps that's the wrong English terminology? Something like an easy olympiad problem?
The actual test I performed was "take several easy problems from a math circle for 12-year-olds and try various 'let's think step-by-step' prompts to make Bing write solutions".
Example of such a problem:
Several ropes are stretched between 20 poles (each rope connects two different poles; no two poles are joined by more than one rope). It is known that at least 15 ropes are attached to each pole. The poles are divided into groups so that each rope connects poles from different groups. Prove that there are at least four groups.
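For reference, here is one way the intended argument might go (my own reconstruction of a standard proof, sketched in LaTeX):

```latex
% Sketch: suppose the poles fall into only three groups of sizes
% $g_1 + g_2 + g_3 = 20$. No rope joins two poles in the same group,
% so a pole in a group of size $g_i$ has all of its $\ge 15$
% neighbours among the other $20 - g_i$ poles:
\[
  20 - g_i \ge 15 \quad\Longrightarrow\quad g_i \le 5 .
\]
% But then $g_1 + g_2 + g_3 \le 15 < 20$, a contradiction,
% so at least four groups are needed. \qed
```

In graph-theoretic terms: a graph on 20 vertices with minimum degree 15 has independence number at most 5, hence chromatic number at least 4.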
Two questions about capabilities of GPT-4.
What improvements do you suggest?
Can it in some way describe itself? Something like a "picture of DALL-E 2" drawn by DALL-E 2.