6 Answers

Play Go better than AlphaGo Zero. AlphaGo Zero was trained using millions of games. Even if GPT-4 is trained on all of the internet, there simply isn't enough training data for it to have comparable effectiveness.

Things that it can probably do sometimes, but will fail on some inputs:

- Factor numbers
- Solve NP-Complete or harder problems
- Execute code

There are other “tail end” tasks like this that should eventually become the hardest bits that optimization spends the most time on, once it manages to figure everything else out.

Know if its reply to a prompt is actually useful.

E.g., prompt it with "a helicopter is most efficient when ...", "a helicopter is more efficient when", and "helicopter efficiency can be improved by." GPT-4 will not be able to know which response is the best, or even whether any of the responses would actually move helicopter efficiency in the right direction.

So physics understanding.

How do you think it would perform on a simpler question closer to its training dataset, like "we throw a ball from a 500m building with no wind, and the same ball but with wind, which one hits the floor earlier?" (on average, after 1000 questions)? If this still does not seem plausible, what is something you would bet $100 at 2:1 but not at 1:1 that it would not be able to do?

12y

What do you mean by "on average after 1000 questions"? Because that is the crux
of my answer: GPT-4 won't be able to QA its own work for accuracy, or even
relevance.

12y

Well, if we're doing a bet then at some point we need to "resolve" the
prediction. So we ask GPT-4 the same physics question 1000 times and then some
human judges count how many it got right. If it gets it right more than, let's
say, 95% of the time (or any confidence interval), then we would resolve this
positively. Of course you could do more than 1000, and by the law of large
numbers it should converge to the true probability of giving the right answer?
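A minimal sketch of how such a bet could be resolved, assuming each judged answer is an independent right/wrong trial. The Wilson score interval, the 95% threshold, and the function names are illustrative choices, not something agreed in this thread.

```python
import math
import random


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p_hat + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return centre - half, centre + half


def resolve_bet(successes, trials, threshold=0.95):
    """Resolve positively only if the whole interval lies above the threshold."""
    low, _ = wilson_interval(successes, trials)
    return low > threshold


def simulate_judged_answers(true_accuracy, trials, seed=0):
    """Stand-in for human judges: count correct answers at a given true accuracy."""
    rng = random.Random(seed)
    return sum(rng.random() < true_accuracy for _ in range(trials))


if __name__ == "__main__":
    correct = simulate_judged_answers(true_accuracy=0.97, trials=1000)
    print(correct, wilson_interval(correct, 1000), resolve_bet(correct, 1000))
```

Under these assumptions, exactly 950 correct out of 1000 would not resolve positively, because the interval still straddles 95%. More trials narrow the interval, which is the law-of-large-numbers point above.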

12y

That wouldn't be useful, though.
My assertion is more like: After getting the content of elementary school
science textbooks (or high school physics, or whatever other school science
content makes sense), but not including the end-of-chapter questions (and
especially not the answers), GPT-4 will be unable to provide the correct answer
to more than 50% of the questions from the end of the chapters, constrained by
having to take the first response that looks like a solution as its "answer"
and not throwing away more than 3 obviously gibberish or bullshit responses per
question.
And that 50% number is based on giving it every question without discrimination.
If we only count the synthesis questions (as opposed to the memory/definition
questions), I predict 1%, but would bet on < 10%

12y

Let's say by concatenating your textbooks you get plenty of examples
of f=m⋅a with "blablabla object sky blablabla
gravity a=9.8m/s² blablabla m=12kg blabla f=12∗9.8≈120N". And then the exercise
is: "blablabla object of mass blablabla thrown from the sky, what's the force?
a) f=120 b) ... c) ... d) ...". Then what you need to do is just do some prompt
programming at the beginning by "for looping answer" and teaching it to return
either a, b, c or d. Now, I don't see any reason why a neural net couldn't
approximate linear functions of two variables. It just needs to map words like
"derivative of speed", "acceleration", "d²z/dt²" to the same concept and then
look at it with attention & multiply two numbers.

12y

Generally the answers aren't multiple choice. Here's a couple examples of
questions from a 5th grade science textbook I found on Google:
1. How would you state your address in space? Explain your answer.
2. Would you weigh the same on the sun as you do on Earth? Explain your answer.
3. Why is it so difficult to design a real-scale model of the solar system?

12y

If it's about explaining your answer with 5th grade gibberish then GPT-4 is THE
solution for you! ;)

Reason about code.

Specifically, I've been trying to get GPT-3 to outperform the Hypothesis Ghostwriter in automatic generation of tests and specifications, without any success. I expect that GPT-4 will also underperform; but that it *could* outperform if fine-tuned on the problem.

If I knew where to get training data I'd like to try this with GPT-3 for that matter; I'm much more attached to the user experience of "`hypothesis write mypackage` generates good tests" than any particular implementation (modulo installation and other manageable ops issues for novice users).

I think testing in general seems AGI-complete, in the sense that you have to understand the edge cases of a function (or the correct output for "normal" input).

If we take the simplest testing case, let's say Python using pytest, with typed code and some simple tests for each type (e.g. 0 and 1 for integers, empty/random strings, etc.), then you could show it examples of how to generate tests from function names... but then you could also just do it with regex, so I guess with Hypothesis.
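As a rough sketch of that "simple values per type" idea, using only the standard library: the `SIMPLE_VALUES` table, `simple_cases`, and the `repeat` function under test are all hypothetical names made up for illustration, and annotations outside the table would raise a `KeyError`.

```python
import itertools
import typing

# Hypothetical table of "simple" values to try for each annotated type.
SIMPLE_VALUES = {
    int: [0, 1, -1],
    str: ["", "a"],
    bool: [False, True],
    float: [0.0, 1.5],
}


def simple_cases(func):
    """Yield keyword-argument dicts built from the function's type hints."""
    hints = typing.get_type_hints(func)
    hints.pop("return", None)  # only parameters get generated values
    names = list(hints)
    pools = [SIMPLE_VALUES[hints[name]] for name in names]
    for combo in itertools.product(*pools):
        yield dict(zip(names, combo))


def repeat(text: str, times: int) -> str:
    """Toy function under test."""
    return text * times


def test_repeat_does_not_crash():
    # A smoke test: it checks the return type, not whether the output is right.
    for kwargs in simple_cases(repeat):
        assert isinstance(repeat(**kwargs), str)
```

This only covers the "does it crash on simple inputs" part. Knowing what the correct output should be is the part that needs an understanding of the function.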

so maybe the right question to ask is: what do you expect GPT-...

12y

Testing in full generality is certainly AGI-complete (and a nice ingredient for
recursive self-improvement!), but I think you're overestimating the difficulty
of pattern-matching your way to decent tests. Chess used to be considered
AGI-complete too; I'd guess testing is more like poetry+arithmetic in that if
you can handle context, style, and some details it comes out pretty nicely.
I expect GPT-4 to be substantially better at this 'out of the box' due to
* the usual combination of larger, better at generalising, scaling laws, etc.
* super-linear performance gains on arithmetic-like tasks due to
generalisation, with spillover to code-related tasks
* the extra GitHub (and blog, etc.) data is probably pretty helpful given steady
adoption since ~2015 or so
--------------------------------------------------------------------------------
Example outputs from Ghostwriter vs GPT-3:

    $ hypothesis write gzip.compress
    import gzip
    from hypothesis import given, strategies as st

    @given(compresslevel=st.just(9), data=st.nothing())
    def test_roundtrip_compress_decompress(compresslevel, data):
        value0 = gzip.compress(data=data, compresslevel=compresslevel)
        value1 = gzip.decompress(data=value0)
        assert data == value1, (data, value1)

while GPT-3 tends to produce examples like (first four that I generated just
now):

    @given(st.bytes(st.uuid4()))
    def test(x):
        expected = x
        result = bytes(gzip(x))
        assert bytes(result) == expected

    @given(st.bytes())
    def test_round_trip(xs):
        compressed, uncompressed = gzip_and_unzip(xs)
        assert is_equal(compressed, uncompressed)

    @given(st.bytes("foobar"))
    def test(xs):
        assert gzip(xs) == xs

    @given(st.bytes())
    def test(xs):
        zipped_xs = gzip(xs)
        uncompressed_xs = zlib.decompress(zipped_xs)
        assert zipped_xs == uncompressed_xs

So it's clearly 'getting the right idea', even without any fine-tuning at all,
but not there yet. It's also a lot worse at this without a natural-language
descript

22y

I object to the characterization that it is "getting the right idea." It seems
to have latched on to "given a foo of bar" -> "@given(foo.bar)" and that
"assert" should be used, but the rest is word salad, not code.

12y

It's at least syntactically valid word salad composed of relevant words, which is
a substantial advance - and per Gwern, I'm very cautious about generalising from
"the first few results from this prompt are bad" to "GPT can't X".

Directing a robot using motor actions and receiving camera data (translated into text I guess to not make it maximally unfair, but still) to make a cup of tea in a kitchen.

It's vaporware, so it can do whatever you imagine. It's hard to constrain a project that doesn't exist, as far as we know.

10 comments

It would really depend on how many parameters the model has, IMO. If the jump from GPT-3 to GPT-4 is on the order of 10-100x, then we could potentially see similar gains for multiplication. GPT-3 (175B) can do 2-digit multiplication with ~50% accuracy, so 5-6 digits might be possible. It really depends on how well the GPT model architecture scales in the future.

So from 2-digit subtraction to 5-digit subtraction it lost 90% accuracy, and scaling the model by ~10x gave a 3x improvement (from 10 to 30%) on two-digit multiplication. So assuming we get 3x more accuracy from each 10x increase and that 100% on two-digit corresponds to ~10% on 5-digit, we would need something like 3 more scalings like "13B -> 175B", so about 400 trillion params.
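To make that back-of-the-envelope estimate checkable, here is a small sketch. It takes the numbers in the comment at face value (a 13B -> 175B step, 3x accuracy per step, five-digit accuracy at roughly a tenth of two-digit accuracy), and the accuracy chain is one reading of the argument rather than a measured result. The function names are made up.

```python
def projected_parameters(current=175e9, previous=13e9, extra_steps=3):
    """Parameter count after repeating the 13B -> 175B scale-up extra_steps times."""
    step = current / previous  # about 13.5x per step
    return current * step ** extra_steps


def projected_accuracy(start, gain_per_step=3.0, steps=3):
    """Accuracy after `steps` scale-ups, assuming a constant gain, capped at 100%."""
    return min(1.0, start * gain_per_step ** steps)


if __name__ == "__main__":
    # Assumption: 30% two-digit accuracy today maps to ~3% on five digits.
    five_digit_now = 0.30 * 0.10
    print(f"parameters: {projected_parameters():.2e}")
    print(f"five-digit accuracy: {projected_accuracy(five_digit_now):.2f}")
```

Three such steps land at roughly 4.3e14 parameters, consistent with the "about 400 trillion" figure, and under the same assumptions five-digit accuracy would end up around 80%.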

That's fair. Depending on your stance on Moore's Law or supercomputers, 400 trillion parameters might or might not be plausible (not really IMO). But this is assuming that there are no advances in the model architecture (maybe changes to the tokenizer?) which would drastically improve the performance of multiplication / other types of math.

Going by GPT-2's BPEs [1], and based on the encoder downloaded via OpenAI's script, there are 819 (single) tokens/embeddings that uniquely map to the numbers from 0-1000, 907 when going up to 10,000, and 912 up to 200,000 [2]. These embeddings of course get preferentially fed into the model in order to maximize the number of characters in the context window and thereby leverage the statistical benefit of BPEs for language modeling. Which brings to mind that the above counts exclude numeric tokens that have a space at the beginning [3].

My point here being that, IIUC, for the language model to actually be able to manipulate individual digits, as well as pick up on the elementary operations of arithmetic (e.g. carry, shift, etc.), the expected number of unique tokens/embeddings might have to be limited to 10 – the base of the number system – when counting from 0 to the largest representable number [2].

[1] From the GPT-3 paper, it was noted:

This [GPT-3's performance on some other task] could be a weakness due to reusing the byte-level BPE tokenizer of GPT-2 which was developed for an almost entirely English training dataset.

[2] More speculatively, I think that this limitation makes extrapolation on certain abilities (arithmetic, algebra, coding) quite difficult without knowing whether its BPE will be optimized for the manipulation of individual digits/characters if need be, and that this limits the generalizability of studies such as GPT-3 not being able to do math.

[3] For such tokens, there are a total 505 up to 1000. Like the other byte pairs, these may have been automatically mapped based on the distribution of n-grams in some statistical sample (and so easily overlooked).
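For anyone wanting to reproduce counts like these, a minimal sketch of the counting step, assuming the vocabulary has already been loaded as a set of token strings. The toy vocabulary below is made up, and the real encoder file may represent a leading space differently from the plain `" "` used here. `digit_tokenize` shows the digit-per-token alternative suggested above.

```python
def count_single_token_numbers(vocab, upper, leading_space=False):
    """Count integers in [0, upper] whose decimal string is a single vocab entry."""
    prefix = " " if leading_space else ""
    return sum((prefix + str(n)) in vocab for n in range(upper + 1))


def digit_tokenize(number):
    """Digit-level tokenization: one token per decimal digit."""
    return list(str(number))


if __name__ == "__main__":
    # Hypothetical toy vocabulary, NOT the real GPT-2 encoder.
    toy_vocab = {str(d) for d in range(10)} | {"10", "42", "100", " 7"}
    print(count_single_token_numbers(toy_vocab, 1000))
    print(count_single_token_numbers(toy_vocab, 1000, leading_space=True))
    print(digit_tokenize(2048))
```

With digit-level tokens the model sees the same ten symbols no matter how large the number is, which is the property argued for above.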

Hm, not so sure about this one anymore, since training on correct multiplication is easy using synthetic training data.
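A minimal sketch of what generating such synthetic data could look like. The `a * b = c` text format and the function name are assumptions for illustration, not a known training recipe.

```python
import random


def make_multiplication_examples(count, digits, seed=0):
    """Return `count` lines like '123 * 456 = 56088' with `digits`-digit operands."""
    if digits < 1:
        raise ValueError("digits must be at least 1")
    rng = random.Random(seed)
    low, high = 10 ** (digits - 1), 10 ** digits - 1
    examples = []
    for _ in range(count):
        a = rng.randint(low, high)
        b = rng.randint(low, high)
        examples.append(f"{a} * {b} = {a * b}")  # answer is exact by construction
    return examples


if __name__ == "__main__":
    for line in make_multiplication_examples(count=3, digits=5):
        print(line)
```

Since the answers are computed rather than scraped, every example is correct, and the number of digits can be pushed as high as needed.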

Re right prompt: GPT-3 has a context window of 2048 tokens, so this limits quite a lot what it could do. Also, it's not accurate at two-digit multiplication (which you would at least need to multiply your $ by a %), and even worse at 5-digit. So in this case, we're sure it can't do your taxes. And in the more general case, gwern wrote some debugging steps to check if the problem is GPT-3 or your prompt.

Now, for GPT-4, given they keep scaling the same way, it won't be possible to have accurate enough digit multiplication (like 4-5 digits, cf. this thread) but with three more scalings it should do it. The prompt would be "here are a few examples of how to do tax multiplication and addition given my format, so please output the result in this format", and concatenate those two. I'm happy to bet $1 at 1:1 on GPT-7 doing tax multiplication to 90% accuracy (given only integer precision).

That's a good one. What would be a claim you would be less confident about (less than 80%) but still confident enough to bet $100 at 2:1 odds? For me it would be "GPT-4 would beat a random Go bot 99% of the time (in 1000 games) given the right input of less than 1000 bytes."