Thanks, this is a helpful comment. Fixed the typo.
Edit: The situation has evolved but is still somewhat confusing. There is now a leaderboard of scores on the public test set that Ryan is #1 on (see here). But this tweet from Jack Cole indicates that his (many-month-old) solution gets a higher score on the public test set than Ryan's top score on that leaderboard. I'm not really sure what's going on here.
One important caveat to the presentation of results in this post (and the discussion on Twitter) is that there are reasons to think this approach may not be SOTA, as it performs similarly to the prior best-performing approach when tested apples-to-apples, i.e. on the same problems.
There are three sets of ARC problems: the public training set, the public eval set, and the private eval set.
My two main deductions from this are:
Apparently, lots of people get better performance on the public test set than the private one, which is a little surprising, given that if you read this page from the ARC folks, you'll see the following:
The public training set is significantly easier than the others (public evaluation and private evaluation set) since it contains many "curriculum" type tasks intended to demonstrate Core Knowledge systems. It's like a tutorial level.
The public evaluation sets and the private test sets are intended to be the same difficulty.
Two explanations come to mind: maybe the public and private test sets are not IID, and/or maybe past SOTA methods overfit to the public set. Chollet claims it's (accidentally) the latter here, but he doesn't rule out the former. He says the tasks across the public and private test sets are meant to be equally hard for a human, but he doesn't say they're divided in an IID manner.
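To illustrate why the IID question matters: here's a toy sketch (my own illustration, not anything from the ARC benchmark itself) showing how a difficulty-skewed split between a "public" and "private" set produces a large score gap for a fixed solver, while an IID split doesn't. The difficulty model and the 0.5 success threshold are made up for the example.

```python
import random

random.seed(0)

# Toy model: each task has a difficulty in [0, 1], and a hypothetical
# solver succeeds on a task iff its difficulty is below 0.5.
tasks = [random.random() for _ in range(10_000)]

def solver_score(ts):
    return sum(d < 0.5 for d in ts) / len(ts)

# IID split: shuffle, then cut in half -> both halves score similarly.
random.shuffle(tasks)
iid_public, iid_private = tasks[:5_000], tasks[5_000:]

# Non-IID split: easier tasks land in "public", harder in "private".
by_difficulty = sorted(tasks)
skewed_public, skewed_private = by_difficulty[:5_000], by_difficulty[5_000:]

print(solver_score(iid_public), solver_score(iid_private))      # both ~0.5
print(solver_score(skewed_public), solver_score(skewed_private))  # ~1.0 vs ~0.0
```

So a public-vs-private score gap on its own doesn't distinguish the two explanations: you'd see the same gap from a skewed split as from overfitting to the public set.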
I guess we'll see how the results on the public leaderboard shake out.
(Expanding on a tweet)
What are the considerations around whether to structure the debate to permit the judge to abstain (as Michael et al. do, by allowing the judge to end the round with low credence) versus forcing the judge to pick an answer each time? Are there pros/cons to each approach? Any arguments about the similarity of one or the other to the real AI debates that might be held in the future?
It's possible I'm misremembering/misunderstanding the protocols used for the debate here/in that other paper.
"Follow the right people on twitter" is probably the best option. People will often post twitter threads explaining new papers they put out. There's also stuff like:
I appreciate you transcribing these interviews William!
Did/will this happen?
I've been loving your optimization posts so far; thanks for writing them. I've been feeling confused about this topic for a while and feel like "being able to answer any question about optimization" would be hugely valuable for me.
We're expecting familiarity with PyTorch, unlike MLAB. The level of Python background expected is otherwise similar. The bar will vary somewhat depending on each applicant's other traits, e.g. mathematical and empirical-science backgrounds.
Thanks, edited my post to reference this (lmk if you understand what's happening here better than I do).