I don't know why it sent only the first sentence; I was drafting a comment on this. I wanted to delete it but I don't know how. EDIT: wrote the full comment now.
Let me first say I dislike the conflict-theoretic view presented in the "censorship bad" paragraph. On the short list of social media sites I visit daily, moderation creates a genuinely better experience. Automated censorship will become an increasingly important force for good as generative models start becoming more widespread.
Secondly, there is a danger of AI safety becoming less robust—or even optimising for deceptive alignment—in models using front-end censorship.
This one is interesting, but only in the counterfactual: "if AI ethics technical research focused on actual value alignment of models as opposed to front-end censorship, this would have higher-order positive effects for AI x-safety". But it doesn't directly hurt AI x-safety research right now: we already work under the assumption that output filtering is not a solution for x-risk.

It is clear that improved technical research norms on AI non-x-risk safety can have positive effects on AI x-risk. If we could train a language model to robustly align to any set of human-defined values at all, this would be an improvement on the current situation. But there are other factors to consider. Is "making the model inherently non-racist" a better proxy for alignment than some other technical problems? Could interacting with that community weaken the epistemic norms in AI x-safety?
Calling content censorship "AI safety" (or even "bias reduction") severely damages the reputation of actual, existential AI safety advocates.
I would need to significantly update my prior if this turns out to be a very important concern. Who are the people, whose opinions will be relevant at some point, who understand both what AI non-x-safety and AI x-safety are about, dislike the former, are sympathetic to the latter, but conflate them?
Git Re-Basin: Merging Models modulo Permutation Symmetries [Ainsworth et al., 2022] and the cited The Role of Permutation Invariance in Linear Mode Connectivity of Neural Networks [Entezari et al., 2021] seem several years ahead. I cannot independently verify that their claims about SGD are true, but the paper makes sense at first glance. Opinion: symmetries in NNs are a mainstream ML research area with lots of papers, and I don't think doing research "from first principles" here will be productive. This also holds for many other alignment projects. However, I do think it makes sense as an alignment-positive research direction in general.
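For readers unfamiliar with the symmetry these papers exploit, here is a minimal toy sketch (my own illustration, not code from either paper): permuting the hidden units of an MLP layer, and applying the inverse permutation to the next layer's weights, leaves the network's function unchanged. This is the invariance that Git Re-Basin uses to align two independently trained models before interpolating between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer MLP: f(x) = W2 @ relu(W1 @ x), with 8 hidden units.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
x = rng.normal(size=4)

def f(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0)

# Relabel the hidden units: apply a permutation matrix P to W1's rows
# and P^T to W2's columns. Since relu acts elementwise, it commutes
# with P, so the composed function is identical.
P = np.eye(8)[rng.permutation(8)]
W1p, W2p = P @ W1, W2 @ P.T

print(np.allclose(f(W1, W2, x), f(W1p, W2p, x)))  # True
```

With 8 hidden units there are already 8! = 40320 weight vectors computing the exact same function, which is why naive weight averaging of two trained models fails without first solving the alignment (re-basin) step.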
This is a mistake on my own part that actually changes the impact calculus, as most people looking into AI x-safety on this site will never actually see this post. Therefore, the "negative impact" section is retracted. I point to Ben's excellent comment for a correct interpretation of why we still care.

I do not know why I was not aware of this "block posts like this" feature, and I wonder if my experience of this forum was significantly more negative as a result of me accidentally clicking "Show Personal Blogposts" at some point. I did not even know that button existed.

No other part of my post is retracted. In fact, I'd like to reiterate a wish for the community to karma-enforce the norms of:
Thank you for improving my user experience of this site!
I am now slightly proud that my original disclaimer precisely said that this was the part I was unsure of the most.
As in, I wish to personally be called out on any violations of the described norms.
Do you intend for the comments section to be a public forum on the papers you collect?
I definitely endorse reading the ROME paper, although the popular-culture claims about what the second part of the paper actually shows seem a bit overblown.
They do not seem to claim "changing facts in a generalizable way" (it's likely not robust to synonyms at all). I am also wary of "editing just one MLP for a given fact" being the right solution, given that the causal tracing shows the fact being stored in several consecutive layers.
Refer to a writeup by Thibodeau et al. sometime in the future.
That being said, if you are into interpretability, you have to at least skim the paper. It has a whole bunch of very cool stuff in it, from the causal tracing to testing whether making Einstein a physician changes the meaning of the word "physics" itself. Just don't overfit on the methods there being exactly the methods that will solve interpretability of reasoning in transformers.
I somewhat agree, although I obviously put a bit less weight on your reason than you do. Maybe I should update my confidence in the importance of what I wrote to medium-high.
Let me raise the question of continuously rethinking incentives on LW/AF, for both Ben's reason and my original reason.
The upvote/karma system does not seem like it incentivizes high epistemic standards and top-rigor posts, although I would need more datapoints to make a proper judgement.
I am very sorry that you feel this way. I think it is completely fine for you, or anyone else, to have internal conflicts about your career or purpose. I hope you find a solution to your troubles in the following months.

Moreover, I think you did a useful thing by raising awareness of some important points:
Epistemic status for what follows: medium-high for the factual claims, low for the claims about potential bad optics. It might be that I'm worrying about nothing here.

However, I do not think this place should be welcoming of posts displaying bad rhetoric and epistemic practices.
Posts like this can hurt the optics of the research done in the LW/AF extended universe. What does a prospective AI x-safety researcher think when they get referred to this site and see this post above several alignment research posts?

EDIT: The above paragraph was off. See Ben's excellent reply for a better explanation of why anyone should care.

I think this place should be careful about maintaining:
For some examples:
My hotel room had the nightly price written on the inside of the door: $500. Shortly afterwards, I found out that the EA-adjacent community had bought the entire hotel complex.
I tried for 15 minutes to find a good-faith reading of this, but I could not.

Most people would read this as "the hotel room costs $500 and the EA-adjacent community bought the complex of which that hotel is a part", while it is written in a way that only insinuates and does not commit to meaning exactly that. Insinuating bad-optics facts while maintaining plausible deniability, without checking the facts, is a horrible practice, usually employed by politicians and journalists. The poster does not deliberately lie, but that is not enough when making a "very bad optics" statement that sounds like this one. At any point, they could have asked for the actual price of the hotel room, or about the state of the actual hotel purchase that might be happening.
I have never felt so obliged, so unpressured. If I produce nothing, before Christmas, then nothing bad will happen. Future funds will be denied, but no other punishment will ensue.
This is true. But it is not much different from working a normal software job. The worst thing that can happen is getting fired after not delivering for several months. Some people survive years of coasting until there is a layoff round.

An important counterfactual for a lot of people reading this is a PhD program. There is no punishment for failing to produce good research, except dropping out of the program after a few years.
After a while I work out why: every penny I’ve pinched, every luxury I’ve denied myself, every financial sacrifice, is completely irrelevant in the face of the magnitude of this wealth. I expect I could have easily asked for an extra 20%, and received it.
This might be true. Again, I think it would be useful to ask: what is the counterfactual?

All of this applies to anyone who starts working for Google or Facebook after being poor beforehand. This feeling (regretting saving rather than spending money) is incredibly common among people with good careers. One way to get rid of it, and save several lives in the meantime, is to donate the surplus to the Maximum Impact Fund.

I would suggest going through the post with a clear head and removing the parts which are not up to the standards. Again, I am very sorry that you feel like this.
On the other hand, the current community believes that getting AI x-safety right is the most important research question of all time. Most people would not publish something just for their career advancement if it meant sucking oxygen from more promising research directions.

This might be a mitigating factor for my comment above. I am curious about what happened in research fields which had "change/save the world" vibes. Was environmental science immune to similar issues?
because LW/AF do not have established standards of rigor like ML, they end up operating more like a less-functional social science field, where (I've heard) trends, personality, and celebrity play an outsized role in determining which research is valorized by the field.
In addition, the AI x-safety field is now rapidly expanding. There is a huge amount of status to be collected by publishing quickly and claiming large contributions. In the absence of rigor and metrics, the incentives are towards:

- setting new research directions and inventing new cool terminology;
- using mathematics in a way that impresses, but is too low-level to yield a useful claim;
- and vice versa, relying too much on complex philosophical insights without empirical work;
- getting approval from alignment research insiders.

See also the now-ancient Troubling Trends in Machine Learning Scholarship. I expect the LW/AF community microcosm will soon reproduce many of those failures.
I think the timelines (as in, <10 years vs 10-30 years) are very correlated with the answer to "will first dangerous models look like current models", which I think matters more for research directions than what you allow in the second paragraph. For example, interpretability in transformers might completely fail on some other architectures, for reasons that have nothing to do with deception. The only insight from the 2022 Anthropic interpretability papers I see having a chance of generalizing to non-transformers is the superposition hypothesis / SoLU discussion.