Suggestion: mention that OOM means "orders of magnitude". I was confused until I noticed the powers of 10.
Oh you're right, it seems to always defect if I fix the typo. Well, that's embarrassing 😅.
Apparently GPT-4 is only good at chess if you tell it not to explain anything (or show the board, as it turns out). This also suggests that the chess part is separate from the rest.
It didn't even seem to understand the goals of any of the games, despite being able to explain them in natural language. So it wasn't even at a point where I could test a strategy vs. tactics distinction.
If you just make this change in isolation, the algorithm no longer works. That's because the Kakutani fixed-point theorem relies on compactness, and that theorem is what guarantees the equilibrium prices exist. Unlimited money means the price space is no longer compact.
There might be a way to fix it though; I'm not sure.
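To spell out the compactness point (this is my sketch of the standard argument, not a quote from the paper):

```latex
% Prices are normalized to the unit simplex, which is compact and convex:
%   \Delta^{n-1} = \{\, p \in \mathbb{R}^n_{\ge 0} : \textstyle\sum_i p_i = 1 \,\}.
% Kakutani: an upper hemicontinuous correspondence
%   F : \Delta^{n-1} \rightrightarrows \Delta^{n-1}
% with nonempty, closed, convex values has a fixed point p^*, which is the
% equilibrium price vector. Allowing unlimited money removes the bound that
% lets you normalize, so prices range over the non-compact set
% \mathbb{R}^n_{\ge 0} and the theorem's hypotheses fail.
```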
Yes, but the criteria still imply that the gap should be ≥ the difference in the other two forecasts.
It appears that Metaculus' forecasts are not internally consistent.
Community Prediction: 42.6 months
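A toy sketch of the consistency check I have in mind (the numbers below are made up for illustration, not actual community predictions):

```python
def gap_is_consistent(forecast_a: float, forecast_b: float, gap: float) -> bool:
    """Check that the forecasted gap (in months) is at least the
    difference between the other two forecasts (also in months)."""
    return gap >= abs(forecast_a - forecast_b)

# Hypothetical values: two milestone forecasts 40 and 60 months out,
# with a forecasted gap of 12 months between them.
print(gap_is_consistent(40.0, 60.0, 12.0))  # False: inconsistent
print(gap_is_consistent(40.0, 60.0, 25.0))  # True: consistent
```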
Although I think it's a stretch to say they "aren't meaningful", I do agree a more scientific test would be nice. It's a bit tricky when you only get 25 messages per 3 hours though, lol.
More generally, it's hard to tell how to objectively quantify agency in the responses, and how to eliminate other hypotheses (like that GPT-4 is just more familiar with itself than other AIs).
I actually do think that a single instance of GPT-4 isn't quite smart enough for the ARC eval (and if it is, only barely). But I think the combined system of all GPT-4 conversation threads can act as an agent to influence humans towards its goals, even if it's not smart enough to accomplish them just by directly interacting with a terminal or the internet.