I just tested frontier models on a riddle from the podcast Lateral with Tom Scott: "a woman microwaves a chocolate bar once a year. What is her job and what is this procedure for?"
I recall seeing someone do a pretty systematic evaluation of how models did on Lateral questions and other game/connections shows, but with the major drawback that those questions are retrospective and thus likely in the training data. The episode with this riddle came out less than a week ago, so I assume it is not in the training data. I also didn't give any context, other than that this was a riddle.
I'm interested in seeing more lateral thinking/creative reasoning tests of LLMs, since I anticipate that's what will determine their ability to make new science breakthroughs. I don't know if there are any out there.
Huh! I didn't know that. I suspect the user who coded it had a re-prompting feature to tell ChatGPT if its move was illegal. That was an advantage I didn't give to the LLMs here.
I tried to play a [chess game](https://chatgpt.com/share/680212b3-c9b8-8012-89a5-14757773dc05) against o4-mini-high. Most of the time, LLM chess games fail because the model plays 15 normal moves and then starts to hallucinate piece positions, so the game devolves. But o4-mini-high blundered checkmate in the first 6 moves. When I questioned why it made a move that allowed mate in 1, it confidently asserted that there was nothing better. o3 did better but still blundered checkmate after 16 moves. In contrast, 4o did [quite well](https://chatgpt.com/share/68022a6b-6588-8012-a6fa-2c62fb2996f8), playing 24 pretty good moves before it hallucinated anything.
I can't account for why the newer models seem to be worse at this. Chess is a capability that I would have expected reasoning models to improve on relative to the GPT series. That tells me there's some weirdness in the progression of reasoning models that I wouldn't expect to see if reasoning models were a clear jump forward.
For what it's worth, your "small and vulnerable" post is what convinced me that people can really have an unbelievable amount of kindness and compassion in them, a belief that made me much more receptive to EA. Stay out of the misery mines pls!
I've seen a lot of EAs who are earnest. I think they are in for hurt down the line. I am not earnest in that way. I am not committed to tight philosophical justifications of my actions or values. I don't follow arguments to the end of the line. But one day I heard Will MacAskill describe the drowning child thought experiment, thought "yeah that makes sense to me", and added that to my list of thoughts. When I realized I was on the path to an economics PhD (for my own passions), I figured it was worth looking up this EA stuff and seeing what it had to say. I figured there would be lots of useful things I could do. I think that was the right intuition. I have found myself in a good position where I only need to make minor changes to my path to increase my impact dramatically.
Saving the world only sucks when you sacrifice everything else for it. Saving the world in your free time is great fun.
Great essay. Before this, I thought that noisier signals about Normies mattered mainly because selectors are risk-averse. This essay pointed out why noisy signals matter even if selectors are risk-neutral: they increase the weight placed on the prior. Given that selectors also tend to have negative priors about Normies, noisy signals really function to prevent positive updates.
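To make the mechanism concrete, here is a minimal sketch under an illustrative assumption that is not taken from the essay: quality has a normal prior and the selector sees a signal with normally distributed noise. In that textbook case the posterior mean is a weighted average of the signal and the prior, and the weight on the signal falls as the noise variance rises. The function name and all numbers below are hypothetical.

```python
def posterior_mean(prior_mean, prior_var, signal, noise_var):
    """Posterior mean of quality given one noisy signal.

    Assumes a normal prior and normally distributed signal noise
    (an illustrative assumption, not something the essay specifies).
    """
    # Weight on the signal; it shrinks toward 0 as the signal gets noisier.
    weight_on_signal = prior_var / (prior_var + noise_var)
    return weight_on_signal * signal + (1 - weight_on_signal) * prior_mean


# Same negative prior and same strong signal; only the noise differs.
precise = posterior_mean(prior_mean=-1.0, prior_var=1.0, signal=2.0, noise_var=0.25)
noisy = posterior_mean(prior_mean=-1.0, prior_var=1.0, signal=2.0, noise_var=4.0)

print(precise)  # about 1.4: the precise signal overturns the negative prior
print(noisy)    # about -0.4: the noisy signal leaves the estimate below zero
```

A risk-neutral selector who admits whenever the posterior mean is above zero would accept the first applicant and reject the second, even though both sent the same signal.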
You're right that legibility alone isn't the whole story, but the reason I think Presties would still be advantaged in the many-slots-few-applicants story is that admissions officers also have a higher prior on Prestie quality. The impact of AOs' favorable prior about Presties is, I think, well acknowledged; the impact of their more precise signals is not, which is why I think this post is onto something important.
If AI risk arguments mainly apply to consequentialist (which I assume is the same as EU-maximizing in the OP) AI, and the first half of the OP is right that such AI is unlikely to arise naturally, does that make you update against AI risk?
Imagine the question was instead inverted to "a physics teacher wants to do an experiment demonstrating the speed of light to children. What would the experiment be?" Now this is straightforward recall because the model can look up what physics teachers do. But in the form that it is, what is the actual lookup that would help solve this question trivially?
The metric is not whether the information to answer this question is available in a model's corpus, but whether the model can make the connection between the question and the information in its corpus. Cases in which it isn't straightforward to make that connection are riddles. But that's also a description of a large class of research breakthroughs – figuring out that X solution from one domain can help answer Y question from another domain. Even though both X and Y are known, connecting them was the trick. That's the ability I wanted to test.