yo-cuddles

Can you make some noise in the direction of the shockingly low numbers it gets on early ARC-2 benchmarks? This feels like pretty open-and-shut proof that it doesn't generalize, no?

The fact that the model was trained on 75 percent of the training set feels like they jury-rigged a test set and RL'd the thing to success. If the <30% score on the second test ends up being true, I feel like that should shift our guesses about what it's actually doing heavily away from genuine intelligence and toward a brute-force search for verifiable answers.

The FrontierMath results just feel unconvincing. Chances are there are well-known problem structures with well-known solution structures, and it's just plugging and chugging. Mathematicians who have looked at some sample problems have indicated that both tier 1 and tier 2 problems have solutions they know by reflex, which implies these o3 results are not indicative of anything super interesting.

This just feels like a nothingburger, and I'm waiting for someone to convincingly tell me why my doubts are misplaced.

Thanks for the reply! Still trying to learn how to disagree properly, so let me know if I cross into being nasty at all:

I'm sure they've gotten better. o1 probably improved more from its heavier use of intermediate reasoning, compute at runtime, and such, but that said, at least up to 4o there look to have been improvements in the model itself; they've been getting better.

They can do incredible stuff in well-documented processes but don't survive well off the trodden path. They seem to string things together pretty well, so I don't know if I would say there's nothing else going on besides memorization, but it seems to be a lot of what they're doing: working with building blocks of memorized material and learning to stack them using the same sort of logic they use to chain natural language. They fail exactly in the ways you'd expect if that were true, and they have done well in coding exactly as if that were true. The fact that the SWE benchmarks give fantastic scores despite my criticism and yours means those benchmarks are missing a lot and probably not measuring the shortfalls they historically have.

See below: GPT-4 was scoring pretty well on toolbox-oriented code exercises like Codeforces and did super well on more complex problems on LeetCode... until the problems were outside its training data, at which point it dropped from near perfect to barely able to do anything.

https://x.com/cHHillee/status/1635790330854526981?t=tGRu60RHl6SaDmnQcfi1eQ&s=19

This was GPT-4, but I don't think o1 is much different. It looks like they update more frequently, so this is harder to spot in major benchmarks, but I still see it constantly.

Even if I stop seeing it myself, I'm going to assume that the problem is still there and just getting better at hiding, unless there's a revolutionary change in how these models work. Catching lies up to this point seems to have selected for better lies.

I would say that, barring strong evidence to the contrary, this should be assumed to be memorization.

I think that's useful! LLMs obviously encode a ton of useful algorithms and can chain them together reasonably well.

But I've tried to get those bastards to do something slightly weird and they just totally self-destruct.

But let's just drill down to demonstrable reality: if past SWE benchmarks were correct, these things should be able to do incredible amounts of work more or less autonomously, and yet all the LLM SWE replacements we've seen have stuck to highly simple, well-documented tasks that don't vary all that much. The benchmarks here have been meaningless from the start, and without evidence we should assume increments on them are equally meaningless.

The lying liar company run by liars that lie all the time probably lied here, and we keep falling for it like Wile E. Coyote.

Hmm, mixed agree/disagree. Scale probably won't work, algorithms probably would, but I don't think it's going to be that quick.

Namely, I think that if the company struggling with fixed capital costs could accomplish much more, much quicker, using the salary expenses of the top researchers they already have, they'd have done it, or at least given it a good try.

I'm at 5 percent that a serious switch to algorithms would result in AGI within 2 years. You might be better read than me on this, so I'm not quite taking side bets right now!

I think algorithmic progress is doing some heavy lifting in this model. I think if we had a future textbook on AGI we could probably build one, but AI is kinda famous for minor and simple things just not being implemented despite all the parts being there.

See ReLU activations versus sigmoid activations.
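To make that example concrete (a minimal sketch of my own, not something from the thread): ReLU is about as simple as a function gets, yet it took years to displace sigmoid as the default activation.

```python
import numpy as np

def sigmoid(x):
    # Smooth squashing to (0, 1); saturates for large |x|, which starves gradients.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # One elementwise max; no saturation on the positive side.
    return np.maximum(0.0, x)

x = np.linspace(-3.0, 3.0, 7)
print(sigmoid(x))
print(relu(x))
```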

If we're bottlenecking on algorithms alone, is there a reason that isn't a really bad bottleneck?

I haven't had warm receptions when critiquing points, which has frustratingly left me with bad detection for when I'm being nasty, so if I sound thorny it's not my intent.

One place I think you might have misstepped is the FrontierMath questions: the quotes you've heard are almost certainly about tier 3 questions, the hardest ones, meant for math researchers in training. The middle tier is grad-student-level problems, and tier 1 is bright-high-schooler-to-undergrad-level problems:

Tier 1: 25% of the test

Tier 2: 50% of the test

Tier 3: 25% of the test

o3 got 25%, probably answering none of the hard questions and suspiciously matching almost exactly the proportion of easy questions. From some accounts, there seems to be disagreement about whether tier 2 questions are consistently harder than tier 1 questions.
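As a back-of-the-envelope check (my arithmetic, not the benchmark authors'): if the model solved the tier 1 problems and essentially nothing else, the headline score lands right on the reported number.

\[
\text{score} \approx \underbrace{0.25}_{\text{tier 1 share}} \times 1 + \underbrace{0.50}_{\text{tier 2 share}} \times 0 + \underbrace{0.25}_{\text{tier 3 share}} \times 0 = 25\%
\]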

Regardless, some (especially the easier) problems are of the sort that can be verified and have explicitly been said to have instantly recognizable solutions. This is not an incredibly flattering picture of o3.

THIS IS THE END OF WHERE I THINK YOU WERE MISTAKEN, TEXT PAST THIS IS MORE CONJECTURE AND/OR NOT DIRECTLY REFUTING YOU

The ARC test looks like it was taken by an overfit model. If the test creators are right, then the 85 percent on the ARC test came from a tuned model that was probably spamming candidate answers it could verify. It trained on 75 percent of the questions from what I understand, so some of that score seems like memorization plus a mildly okay score on the 25 percent that was held out as test data.

And this part is damning: the ARC-2 test, which is the successor to the first one, made by the same people, gets a 95 percent pass rate from humans (so it's easier than the first test, which had an 85 percent pass rate), but o3's score dropped to around 30%: a 55-point drop, and now 65 points below human on a similar test made by the same people.
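Spelling out the two gaps, taking the figures above as given:

\[
85\% - 30\% = 55 \text{ points (o3's drop from ARC-1 to ARC-2)}, \qquad 95\% - 30\% = 65 \text{ points (gap to the human pass rate on ARC-2)}
\]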

Let me be clear: if that isn't VERY inaccurate, then this is irrefutably a cooked test and o3 is overfit to the point of invalidating the results for any kind of generalizability.

There are other problems, like the fact that this pruning-search method is really, really bad for some problems, and that it seems to ride on validation being somewhat easy in order to work at all, but that's not material to the benchmarks.

I can cite sources if these points turn out to be important, not obviously incorrect, etc.; I might write my first post about it if I'm digging that much!

Ah wait, I was reading it wrong. I thought each tick was an order of magnitude; that looks to be standard notation for a log scale. Mischief managed.

Maybe a dumb question, but those log-scale graphs have uneven ticks on the x-axis; is there a reason they structured it like that beyond trying to draw a straight line? I suspect there is a good reason and it's not dishonesty, but this does look like something one would do if one wanted to exaggerate the slope.
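For anyone else puzzling over the same thing, a minimal sketch of my own (not the graph in question) of why ticks inside a decade look unevenly spaced on a log axis even when nothing is being fudged:

```python
import numpy as np

# On a log-scaled axis, a tick's position is proportional to log10(value),
# so equal ratios (1, 10, 100, ...) are evenly spaced while equal
# increments (1, 2, ..., 10) bunch up toward the end of each decade.
values = np.arange(1, 11)
positions = np.log10(values)
for v, p in zip(values, positions):
    print(f"value {v:>2} -> axis position {p:.3f}")
# Gaps shrink: log10(2) - log10(1) ≈ 0.301, but log10(10) - log10(9) ≈ 0.046.
```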

I do not have a gauge for how much I'm actually bringing to this convo, so you should weigh my opinion lightly, however:

I believe your third point kinda nails it. There are models for gains from collective intelligence (groups of agents collaborating), and the benefits of collaboration bottleneck hard on your ability to verify which outputs from the collective are the best; even then, the dropoff happens pretty quickly the more agents collaborate.

10 people collaborating with no communication issues and accurate discrimination between good and bad ideas are better than a lone person on some tasks, 100 more so.

You do not see jumps like that moving from 1,000 to 1,000,000 unless you set unrealistic variables.

I think inference-time scaling probably works in a similar way: dependent on discrimination between right and wrong answers, and steeply falling off as inference time increases.
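A toy Monte Carlo of that bottleneck (entirely my own illustration, with made-up numbers for per-sample accuracy and verifier reliability): each of n samples is right with some probability, a noisy verifier flags answers that look right, and you pick from the flagged pool. Gains saturate well below 100% once the pool fills with convincing wrong answers.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_success(n, p_correct=0.2, tpr=0.9, fpr=0.1, trials=10_000):
    """Estimate the chance of picking a correct answer out of n samples
    when a noisy verifier flags candidates as 'looks right'."""
    wins = 0
    for _ in range(trials):
        correct = rng.random(n) < p_correct               # which samples are actually right
        flagged = np.where(correct, rng.random(n) < tpr,  # verifier hit rate on right answers
                                    rng.random(n) < fpr)  # false alarms on wrong answers
        pool = np.flatnonzero(flagged)
        pick = rng.choice(pool) if pool.size else rng.integers(n)
        wins += bool(correct[pick])
    return wins / trials

for n in (1, 10, 100, 1000):
    print(f"n={n:>4}  success ≈ {best_of_n_success(n):.3f}")
# Success climbs from ~0.20 toward roughly p*tpr / (p*tpr + (1-p)*fpr) ≈ 0.69,
# then flattens: past a point, more samples mostly add plausible wrong answers.
```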

My understanding is that o3 is similar to o1, but probably with some specialization to make long chains of thought stay coherent? The cost per token, from leaks I've seen, is the same as o1's; it came out very quickly after o1, and o1 was bizarrely better at math and coding than 4o.

Apologies if this was no help; responding with the best intentions.
