What's the source of that 505 employees letter? I mean the contents aren't too crazy, but isn't it strange that the only thing we have is a screenshot of the first page?
Re: TikTok viral videos. I think the cliff is simply because recent videos have had too little time to be watched 10m times. The second graph in the article doesn't show the same metric for 0.1m views; it shows average views per week (among videos with >0.1m views), which stays stable.
I don't understand the point of questions 1 and 3.
If we forget about the details of how the model works, question 1 essentially checks whether the entity in question has a good enough RNG. Which doesn't seem particularly relevant? A human with a vocabulary and random.org can do that. AutoGPT with access to a vocabulary and random.org also has a rather good shot. A superintelligence that for some reason decides not to use an RNG and answers deterministically will fail. I suppose it would be very interesting to learn that, say, GPT-6 can do it without an external RNG, but what would that tell us about its other capabilities?
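To make the "good enough RNG" point concrete, here's a minimal sketch (with a made-up four-word vocabulary, not data from any real model) of how one might check whether an entity's "random" word choices look uniform, using a chi-square statistic. The deterministic responder fails badly; anything wired to a decent RNG passes easily.

```python
import random
from collections import Counter

def chi_square_uniformity(samples, vocabulary):
    """Chi-square statistic for samples drawn from a vocabulary;
    large values mean the sampler is far from uniform."""
    counts = Counter(samples)
    expected = len(samples) / len(vocabulary)
    return sum((counts.get(w, 0) - expected) ** 2 / expected
               for w in vocabulary)

vocab = ["apple", "river", "stone", "cloud"]
random.seed(0)

# An entity with a real RNG (random.choice stands in for random.org here)
good = [random.choice(vocab) for _ in range(4000)]
# A deterministic "superintelligence" that always answers the same word
bad = ["apple"] * 4000

print(chi_square_uniformity(good, vocab))  # small: consistent with uniform
print(chi_square_uniformity(bad, vocab))   # 12000.0: clearly non-random
```

The test distinguishes the two instantly, but that's exactly the point: it measures access to randomness, not intelligence.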
Question 3 checks for something weird. If I wanted to pass it, I'd probably have to precommit to answering certain weird questions in a particular way (and also make sure I always have access to some RNG). Which is a weird thing to do? I expect humans to fail at that, but I also expect almost every possible intelligence to fail at it.
In contrast, question 2 checks for something like "which part of the input do you find most surprising", which seems like a really useful skill to have, and we should probably watch out for it.
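One natural way to operationalize "which part of the input is most surprising" is per-token surprisal. Here's a minimal sketch with hypothetical hand-picked log-probabilities (not real model output): the most surprising token is simply the one the model assigned the lowest probability.

```python
import math

def most_surprising(tokens, logprobs):
    """Return the token with the highest surprisal (-log2 p),
    i.e. the part of the input the model found least expected."""
    surprisal = [-lp / math.log(2) for lp in logprobs]  # nats -> bits
    i = max(range(len(tokens)), key=lambda k: surprisal[k])
    return tokens[i], surprisal[i]

# Hypothetical per-token log-probabilities (in nats) from some LM
tokens = ["the", "cat", "sat", "on", "the", "teapot"]
logprobs = [-0.5, -2.1, -1.3, -0.4, -0.3, -7.2]

token, bits = most_surprising(tokens, logprobs)
print(token)  # "teapot" -- the least expected word
```

Whether a model can *report* this about its own input, rather than us reading it off the logits, is of course the interesting part of the question.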
Yeah, you're right. It seems it was actually one of the harder ones I tried. This particular problem was solved by 4 of 28 members of a relatively strong group. I distinctly remember also trying some easy problems from a relatively weak group, but I don't have notes, and Bing doesn't save chats.
I guess I should just try again, especially in light of gwillen's comment. (By the way, if somebody with access to the actual GPT-4 is willing to help me test it on some math problems, I'd really appreciate it.)
That would explain a lot. I've heard this rumor, but when I tried to trace the source, I couldn't find anything better than guesses. So I dismissed it, but maybe I shouldn't have. Do you have a better source?
I agree that there are some impressive improvements from GPT-3 to GPT-4. But they seem to me a lot less impressive than the jump from GPT-2 producing barely coherent text to GPT-3 (somewhat) figuring out how to play chess.
I disagree with your take on LLMs' math abilities. Wolfram Alpha helps with tasks like the SAT -- and GPT-4 is doing well enough on those. But for some reason it (at least in its Bing incarnation) has trouble with simple logic puzzles like the one I mentioned in another comment.
Can you say more about the successes with theoretical physics concepts? I don't think I've seen anybody try that.
I didn't say "it's worse than a 12-year-old at any math task". I meant nonstandard problems. Perhaps that's the wrong English terminology? Something like an easy olympiad problem?
The actual test I performed was "take several easy problems from a math circle for 12-year-olds and try various 'let's think step-by-step' prompts to make Bing write solutions".
Example of such a problem:
Several ropes are stretched between 20 poles (each rope connects two different poles; no two poles are joined by more than one rope). It is known that at least 15 ropes are attached to each pole. The poles are divided into groups so that each rope connects poles from different groups. Prove that there are at least four groups.
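For reference, here is one way the intended argument might go (my own reconstruction of a standard proof, sketched in LaTeX):

```latex
% Sketch: suppose the poles fall into only three groups of sizes
% $g_1 + g_2 + g_3 = 20$. No rope joins two poles in the same group,
% so a pole in a group of size $g_i$ has all of its $\ge 15$
% neighbours among the other $20 - g_i$ poles:
\[
  20 - g_i \ge 15 \quad\Longrightarrow\quad g_i \le 5 .
\]
% But then $g_1 + g_2 + g_3 \le 15 < 20$, a contradiction,
% so at least four groups are needed. \qed
```

In graph-theoretic terms: a graph on 20 vertices with minimum degree 15 has independence number at most 5, hence chromatic number at least 4.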
Two questions about capabilities of GPT-4.
What improvements do you suggest?
Can it in some way describe itself? Something like a "picture of DALL-E 2" drawn by DALL-E 2.