Hailey Collet


30,000 ft takeaway I got from this: we're ~<2 OOM from 95% performance. Which passes the sniff test, and is also scary/exciting.

93% in 2025 FEELS high, but ... Meta was already low for 2023: median 83%, yet GPT-4 scores 86.4%. If you plot the gap (100% minus MMLU SoTA) against training compute in FLOPs (e.g. RoBERTa at 1.32×10^21 FLOPs scores 27.9%, so a 72.1% gap; GPT-3.5 at 3.14×10^23, a 30% gap; GPT-4 at ~1.742×10^25, a 13.6% gap), it should take roughly 41x the training compute of GPT-4 to reach 93.1% ... so it totally checks out.
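The extrapolation above can be reproduced with a simple power-law fit. A minimal sketch, assuming a least-squares fit of log(gap) against log(FLOPs) on the three data points from the comment (the original may have used a slightly different fitting method, so the exact multiple will differ a bit from 41x, but it lands in the same tens-of-x ballpark):

```python
import math

# Points from the comment: (training FLOPs, gap to 100% on MMLU).
points = [
    (1.32e21, 72.1),   # RoBERTa
    (3.14e23, 30.0),   # GPT-3.5
    (1.742e25, 13.6),  # GPT-4
]

# Least-squares fit of log10(gap) = a + b * log10(FLOPs),
# i.e. a power law: gap proportional to FLOPs^b.
xs = [math.log10(c) for c, _ in points]
ys = [math.log10(g) for _, g in points]
n = len(points)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# Compute needed for a 6.9% gap (i.e. 93.1% MMLU), as a multiple of GPT-4's.
target_gap = 100.0 - 93.1
log_c = (math.log10(target_gap) - a) / b
multiple = 10 ** (log_c - math.log10(1.742e25))
print(f"~{multiple:.0f}x GPT-4 training compute")
```

With three points this fit is very sensitive to the compute estimates, which is consistent with the "20x or 80x, but not 500x" caveat below.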

(My estimate for GPT-4 compute is based on the 1-trillion-parameter leak; the approximate number of V100 GPUs they had (they didn't have A100s, let alone H100s, in hand during the GPT-4 training interval); the possible range of training intervals; scaling laws and the training-time ≈ 8TP/nX law; etc. I ran some squiggles ... it should be taken with a grain of salt, but the final number doesn't change in very meaningful ways under any reasonable assumptions. E.g., it might take 20x or 80x, but it's not going to take 500x the training compute to get to 93%.)
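The training-time law mentioned above can be sanity-checked numerically. A minimal sketch, assuming the Megatron-style estimate time ≈ 8TP/(nX) (T tokens, P parameters, n GPUs, X achieved per-GPU FLOP/s) plus C ≈ 6TP total training FLOPs; the GPU count and per-V100 throughput below are illustrative assumptions, not leaked figures:

```python
# Back-of-envelope check of the training-time law: time ≈ 8*T*P / (n*X).
# All hardware numbers below are illustrative assumptions.
P = 1e12         # parameters (per the 1-trillion-parameter leak)
C = 1.742e25     # assumed total training FLOPs for GPT-4
T = C / (6 * P)  # tokens implied by C ≈ 6*T*P

n = 20_000       # hypothetical V100 count
X = 30e12        # hypothetical achieved FLOP/s per V100 (~30 TFLOP/s)

seconds = 8 * T * P / (n * X)
days = seconds / 86_400
print(f"T ≈ {T:.2e} tokens, training time ≈ {days:.0f} days")
```

Under these assumptions the implied training interval comes out on the order of a year, which is the kind of consistency check that constrains the compute estimate from both ends.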

In the short term, job loss will happen through compression of teams more than anything ... think seniors taking on junior work: e.g., on a development team with multiple senior & junior engineers, I could see some juniors getting canned with how things are now. But the restriction to "exactly as it is right now" is pretty severe: you don't have to train new models to vastly increase their real-world capabilities.

5thed, or whatever. I won't assume anything about the author, but the conclusions of this article are nonsense.

I had it play hundreds of games against Stockfish, mostly at the lowest skill level, using the API. After a lot of experimentation, I settled on giving it a fresh slate every prompt. The prompt basically told it that it was playing chess, what color it was, and gave the PGN (it did not do as well with the FEN, or with both in either order). If it made invalid move(s), the next prompt(s) for that turn added a list of the invalid moves it had attempted. After a few tries I had it forfeit the game.

I had a system set up to rate it, but it wasn't able to complete nearly enough games. As described, it finished maybe 1 in 40. I then added a list of all legal moves on the second and third attempts for a turn. With that it was able to complete about 1 in 10 and won about half of them. Counting the forfeits and calling this a legal strategy, that's something like a 550 Elo, IIRC? But: it's MUCH worse in the late-middle and endgame, even with the fresh slate every turn. Until that point, including well past any opening book it could possibly have "lossless in its database" (not how it works), it plays much better, subjectively 1300-1400.
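The per-turn retry protocol described above can be sketched as a small loop. A minimal sketch, assuming a hypothetical `ask_model` callable wrapping the API and a caller that supplies the PGN and the legal-move list; none of these names come from the original setup:

```python
def get_move(ask_model, color, pgn, legal_moves, max_tries=3):
    """Query the model for one turn; return a legal SAN move, or None to forfeit.

    ask_model(prompt) -> str is a hypothetical wrapper around the chat API.
    Fresh slate every prompt: the full game state travels in the prompt text.
    """
    invalid = []
    for attempt in range(1, max_tries + 1):
        prompt = (
            f"You are playing chess as {color}. Game so far (PGN): {pgn}\n"
            "Reply with your next move in SAN."
        )
        if invalid:  # re-prompt with the invalid moves attempted this turn
            prompt += f"\nThese moves were invalid: {', '.join(invalid)}"
        if attempt >= 2:  # second/third attempt: also list all legal moves
            prompt += f"\nLegal moves: {', '.join(legal_moves)}"
        move = ask_model(prompt).strip()
        if move in legal_moves:
            return move
        invalid.append(move)
    return None  # forfeit after max_tries failed attempts


# Stub model for illustration: tries a bad move first, then a legal one.
def stub_model_factory():
    calls = {"n": 0}
    def ask(prompt):
        calls["n"] += 1
        return "Qxh9" if calls["n"] == 1 else "e4"
    return ask

move = get_move(stub_model_factory(), "white", "", ["e4", "d4", "Nf3"])
print(move)  # "e4" on the second attempt
```

In a real harness the legal-move list and PGN would come from a chess library such as python-chess, and `ask_model` would call the actual API.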

Ahh, I should have thought of having it repeat the history! Good prompt engineering. Will try it out. The GPT-4 gameplay in your lichess study is not bad!

I tried by just asking it to play and use SAN. I had it explain its moves, which it did well, and it also commented on my (intentionally bad) play. It quickly made a mess of things, though: it clearly lost track of the board state (to the extent it's "tracking" it at all ... it's really hard to say exactly how it's playing past common openings), even though the state should've been in the context window.

How did you play? Just SAN?

I used the first chart, the compute required for GPT-3, and my personal assessment that ChatGPT clearly meets the cutoff for tweet length, very probably meets it for a short blog post (but not by a wide margin), and clearly does not meet it for a research paper, to create my own 75th-percentile estimate for human slowdown of 25-75. It moves the P(TAI <= year) = 50% point from ~2041 to ~2042, and the 75% point from ~2060 to ~2061. Big changes! 😂

Your assertion that we don't have many things left to reduce cost per transistor may be true, but it isn't supported by the rest of your comment or your links: reduction in transistor size and similar performance-improving measures are not the only ways to improve cost performance.