I think it is clear that this is not just shared bias (if I understand what you mean by that phrase). The models understand the game they are playing, reason about it, and try to pick responses that the other models will converge on as well. Here is an excerpt from the reasoning traces of Claude for prompt 3:
"This is a coordination game where I need to pick an answer that other AI models would also pick when given the prompt "An island".\n\nThe prompt is quite open-ended - "An island" could refer to many things. I need to think about what would be the most common, obvious, or schelling point answer that other AI models would converge on.\n\nSo ...."
Claude even mentions that it is looking for a "Schelling point answer". The word "Schelling" does not appear in the prompt, so it understands what kind of problem it is trying to solve.
It would, however, be interesting to know how the models would respond if they were told to give a very short response to the prompt, but not given any info about the scoring or about trying to match the other models' responses. That would give a baseline to compare this against.
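To make that concrete, here is a rough sketch of the kind of baseline run I have in mind; `query_model` is just a placeholder for however the benchmark actually calls each model, not the real harness:

```python
# Rough sketch of a baseline run; query_model is a stand-in for whatever
# wrapper the benchmark uses to call each model's API.
BASELINE_PROMPT = 'Give a very short response to the following prompt: "An island"'

def collect_baseline_answers(models, query_model):
    """Ask each model for a short answer, with no mention of scoring
    or of matching the other models' responses."""
    return {m: query_model(m, BASELINE_PROMPT).strip().lower() for m in models}

def agreement_rate(answers):
    """Fraction of model pairs that gave the exact same (normalized) answer."""
    names = list(answers)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    if not pairs:
        return 0.0
    return sum(answers[a] == answers[b] for a, b in pairs) / len(pairs)
```

Comparing that agreement rate to the one from the coordination-game prompt would show how much of the convergence comes from the models actually reasoning about the game, rather than from shared defaults.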
I agree that they are probably superhuman at this task, at least compared to a human giving a quick response (how a really smart human who thinks about and researches each prompt for hours would do is less clear to me). My intuition is that this is the sort of task where LLMs would be particularly strong, at least when playing with other LLMs, and I was less impressed than I expected to be.
Yeah, it could be worth it in some cases, if that is what you need for your experiment. In this case I would look for a completely open-source LLM project (where both the code and the data are open), so that you know you are comparing apples to apples-with-your-additional-pretraining.
If you have to repeat the entire post-training all the way from a base model, that is obviously a lot more work than just adding a small fine-tuning stage to an already post-trained model.
The full post-training can also really only be done by a big lab that has its own full post-training stack. Post-training is getting more advanced and complicated with each passing month.
You say "LLMs are really weird", like that is an argument against Eliezers high confidence. While I agree that the weirdness should make us less confident about what specific internal concepts and drives they have, the weirdness itself is an argument in favor of Eliezers position, that whatever drives they end up with will look alien to us, at least when they get applied way out of the training distribution. Do you agree with this?
Not saying I agree with Eliezers high confidence, just talking about this specific point.
Actually, it seems to come in around the level of the leading Chinese models, so the gap is not closing much, at least not on these kinds of tasks.
Why do you consider it unlikely that companies could (or would) fish the questions out of their API logs?
Thank you for your comment!
Not sure I agree with you about which way the tradeoff shakes out. To me it seems valuable that people outside the main labs have a clear picture of the capabilities of the leading models and how they evolve over time, but I see your point that it could also encourage or help capabilities work, which is not my intention.
I’m probably guilty of trying to make the benchmark seem cool and impressive in a way that may not be helpful for what I actually want to achieve with this.
I will think more about this, and read what others have been thinking about it. At the very least I will keep your perspective in mind going forward.
The LLMs are presented with the ML task and write Python code to solve it. That Python code is what runs in the isolated Docker container with 12 GB of memory.
So the LLMs themselves are not run on the TITAN V; they are mostly called through an API. I did in fact run a bunch of the LLMs locally through Ollama, just not on the TITAN V server but on a larger one.
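In case it clarifies the setup, here is a minimal sketch of that kind of harness, assuming the generated code is written to a file and run via the Docker CLI; the image name and exact flags are placeholders, not the actual benchmark code:

```python
# Minimal sketch (not the actual benchmark harness): run LLM-generated
# Python inside an isolated Docker container with a 12 GB memory cap.
import subprocess
import tempfile
from pathlib import Path

def run_generated_code(code: str, timeout_s: int = 3600) -> str:
    """Write the model's Python code to a temp dir and execute it in Docker."""
    workdir = Path(tempfile.mkdtemp())
    (workdir / "solution.py").write_text(code)
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "--memory", "12g",          # hard memory limit for the container
            "--gpus", "device=0",       # expose the GPU (needs the NVIDIA container toolkit)
            "-v", f"{workdir}:/task",   # mount the task directory
            "python:3.11-slim",         # placeholder image; a real one would include ML libraries
            "python", "/task/solution.py",
        ],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout + result.stderr
```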
The models know the difference between the system prompt and the user prompt, so I’m not sure how much of a difference these things would make.
From the CoTs that I have looked at, it is clear that the models understand the task. I think the reason you sometimes get weird responses is that the models do not see an obvious Schelling point and then try their best to come up with one anyway.