GPT-5 still loses at tic-tac-toe in the typical way, but GPT-5-thinking does much better: it blocks the initial fork. I tested it by opening another fork rather than playing for the optimal draw, and it beat me, though its CoT before the final move seems very discordant. Chat below.
https://chatgpt.com/share/68999afc-5378-8004-a9f0-588c7e2a183d
EDIT: I probably didn't play optimally, but I let 5-thinking go first in 4x4x4 tic tac toe and it beat me.
https://chatgpt.com/share/689ab471-eca0-8004-ac34-b47b3af48c36
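For reference on the ordinary 3x3 game, here is a minimal minimax sketch showing why the optimal result is a draw. The board encoding (a 9-character string) and the function names are my own choices for illustration, not anything taken from the chats.

```python
from functools import lru_cache

# All eight winning lines on a 3x3 board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Game value with `player` to move: +1 X wins, -1 O wins, 0 draw."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    if " " not in board:
        return 0
    nxt = "O" if player == "X" else "X"
    results = [value(board[:i] + player + board[i + 1:], nxt)
               for i, cell in enumerate(board) if cell == " "]
    # X picks the best outcome for X, O picks the best outcome for O.
    return max(results) if player == "X" else min(results)

print(value(" " * 9, "X"))  # value of the empty board under optimal play
```

A value of 0 for the empty board means neither side can force a win, so any loss comes from a mistake such as leaving a fork unblocked.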
Grok 4 was able to guess my rule of "three rational numbers." Haven't tested out other models yet.
https://grok.com/share/c2hhcmQtMw%3D%3D_748b1b41-eda9-4619-868e-5bb4cb022d50
EDIT: Claude Opus 4 is also able to guess the rule on the first attempt.
https://claude.ai/share/4dcd8fcf-4fcb-4d48-a18f-70c56a9c4be7
It seems that o4-mini-high (released today) is able to solve the first problem in one attempt, though it needs some prompting to explain its solution. It first asserts that the minimal number of moves is 15. If you ask it to list the moves, it is able to do so, and the list of moves appears valid when I check it. If asked to prove that 15 is minimal, it reports that a BFS shows that 15 is minimal.
I'm not sure if this fully counts as a success, as I suspect it wrote code to perform the BFS while generating the answer. It was also unable to point out that,...
Looking back at the parameters of the bet, it's interesting to me that the benchmark and math components have all fallen, but that the two "real world" components of the bet are still standing.
It also seems a patent was filed for this material in 2021 and was granted earlier this year prior to publication.
Notably, if you tell it to think step by step it gets the question right and answers with:
This problem is known as the Monty Hall problem. In the original problem, the car is placed randomly behind one of the three doors, and the host always opens a door with a goat after you make your initial choice. However, in your variation, the car is always behind door number 1.
Let's analyze your version step by step:
You initially choose door No. 1, which always has the car behind it.
The host, knowing what's behind the doors, will always open one of the other two doo...
I've found its ability to be much better as well. In contrast to GPT-3, which often seemed to be unable to keep track of board state and made illegal moves toward the end of the game, it not only played legal moves, it actually mated me. Granted, I'm a terrible player and I was deliberately not reading ahead to see if it would be able to mate a weak player. My method was to tell it I wanted to play and then give my move in algebraic notation. It would respond with a move, then I would respond with another. After it beat me, I asked it to list all the moves...
I recently got access to Bing and asked it about the bullet in temporary gravity of varying duration. It does quite a bit better than GPT-3, though it's very verbose. It does do a search during its answer, but only to find the typical initial velocity of a bullet. It makes an error regarding the final velocity of the bullet after three seconds, but correctly determines that the bullet will go up forever if gravity lasts three seconds but will fall back to Earth if it lasts five minutes. Bold is me, everything else is Bing.
Okay, I’ve cleared the slate ...
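The physics behind the bullet question can be checked with a rough sketch. Assumptions that are mine rather than from the chat: a hypothetical muzzle velocity of 400 m/s, constant gravity of 9.81 m/s^2 while it is on, a bullet fired straight up, and no air resistance.

```python
G = 9.81  # m/s^2, assumed constant while gravity is switched on

def velocity_when_gravity_ends(v0, duration, g=G):
    """Upward velocity (m/s) at the moment gravity switches off, ignoring drag."""
    return v0 - g * duration

def falls_back(v0, duration, g=G):
    """True if the bullet returns to the ground.

    Once gravity is off the bullet coasts at constant velocity, so it comes
    back only if it is already moving downward (or has landed) by that time.
    """
    return velocity_when_gravity_ends(v0, duration, g) < 0

V0 = 400.0  # m/s, hypothetical muzzle velocity (an assumption)
for label, seconds in [("3 seconds", 3.0), ("5 minutes", 300.0)]:
    outcome = "falls back" if falls_back(V0, seconds) else "goes up forever"
    print(f"gravity for {label}: {outcome}")
```

Under these assumptions the bullet loses only about 29 m/s in three seconds, so it is still climbing when gravity ends, while five minutes is far longer than the roughly 41 seconds it takes to stop rising.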
It would cause a severe heat dissipation problem. All that energy is going to be radiated as waste heat and, in equilibrium, will be radiated as fast as it comes in. The temperature required to radiate at the requisite power level would be in excess of the temperature at the surface of the sun; any harvesting machinery on the surface of the planet would melt unless it is built from something unknown to modern chemistry.
In my particular case it wasn't really all that hard. I went to an extremely small school so classes weren't tracked the way they might be at a larger school. Since I was much better at taking tests than my peers I didn't really have to study to get A's on tests. We didn't even have all that much homework, though I guess it probably was hundreds of hours over the course of my high school career. I would have had to do that regardless though.
For me the answer is yes, but my situation is quite non-central. I got into MIT since I was a kid from a small rural town with really good grades, really good test scores, and was on a bunch of sports teams. Because I was from a small rural town and was pretty smart, none of this required special effort other than being on sports teams (note: being on the teams required no special skill as everyone who tried out made the team given small class size). The above was enough to get me an admission probably for reasons of diversity I'm a white man but I'm fairl...
But this doesn’t solve the problem of angry customers and media the way firing a misbehaving employee would. Though I suppose this is more an issue of friction/aversion to change than an actual capabilities issue.
Yeah, but Putin’s been president of Russia for over 20 years and already has a very large, loyal following. There will always be those that enthusiastically follow the party line of the leader. It’s somewhat harder to actually seize power. (None of this is to excuse the actions of Putin or those who support him.)
Likely higher than one in a million, but they can be fired after a failure to allow the company to save face. Harder to do that with a $50M language model.
I think the issue here is that the tasks in question don't fully capture everything we care about in terms of language facility. I think this is largely because even very low probabilities of catastrophic actions can preclude deployment in an economically useful way.
For example, a prime use of a language model would be to replace customer service representatives. However, if there is even a one in a million chance that your model will start cursing out a customer, offer a customer a million dollars to remedy an error, or start spewing racial epithets, the model cannot be usefully deployed in such a fashion. None of the metrics in the paper can guarantee, or even suggest, that level of consistency.
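To make the one-in-a-million figure concrete, here is a small sketch of how quickly a rare failure becomes likely at scale. It assumes each conversation fails independently with the same probability, and the conversation counts are hypothetical deployment sizes chosen for illustration.

```python
import math

def p_at_least_one(p, n):
    """Probability of at least one failure in n independent trials.

    Computes 1 - (1 - p)**n via log1p/expm1 for numerical stability when p is tiny.
    """
    return -math.expm1(n * math.log1p(-p))

P_FAIL = 1e-6  # the one-in-a-million chance from the comment above
for conversations in (10_000, 1_000_000, 10_000_000):  # hypothetical volumes
    risk = p_at_least_one(P_FAIL, conversations)
    print(f"{conversations:>10,} conversations: {risk:.1%} chance of at least one incident")
```

At a million conversations the chance of at least one incident is already about 63%, which is why a per-conversation rate that sounds negligible can still block deployment.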
One small quibble: you can actually live much more cheaply on rice. A pound of dry rice contains 1600 calories; if you eat 2000 calories a day, you need 5 pounds every 4 days, so a 50-pound bag will last 40 days, meaning you need about 9 per year. This has a total cost of $450 at your price. Probably less if you shop around or buy in bulk.
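The rice arithmetic can be reproduced in a few lines. The $50 per bag is inferred from the $450 total for 9 bags rather than stated directly, so treat it as an assumption.

```python
CALORIES_PER_POUND = 1600   # dry rice, from the comment
CALORIES_PER_DAY = 2000
BAG_POUNDS = 50
PRICE_PER_BAG = 50.0        # USD, inferred from $450 / 9 bags (assumption)

pounds_per_day = CALORIES_PER_DAY / CALORIES_PER_POUND  # 1.25 lb/day, i.e. 5 lb per 4 days
days_per_bag = BAG_POUNDS / pounds_per_day              # 40 days
bags_per_year = 365 / days_per_bag                      # 9.125, rounded to 9 in the comment
cost_nine_bags = 9 * PRICE_PER_BAG                      # 450 USD

print(f"{pounds_per_day} lb/day, {days_per_bag:.0f} days per bag")
print(f"{bags_per_year:.3f} bags per year, ${cost_nine_bags:.0f} for 9 bags")
```

Nine bags cover 360 days, so the $450 figure is a slight underestimate for a full 365-day year.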
I think you added an extra three zeros during your total year calculations. You list 2.23E15 as the total number of years experienced, but multiplying the total time of 5E4 by the current population of 8E9 gives a total of only 4E14 experience years. The true number must be quite a bit lower as the human population was quite a bit lower than 8 billion for most of that time. This also affects the proportion of experience years which have occurred in living memory. My guess is 20% have occurred since the birth of Kane Tanaka and 10% experienced by living peo...
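The upper bound is easy to verify. This sketch only uses the figures quoted above and treats the whole 5E4-year span as if the population had always been 8E9, which deliberately overestimates the total.

```python
SPAN_YEARS = 5e4          # total time considered, from the comment
POPULATION_NOW = 8e9      # current population, from the comment
CLAIMED_TOTAL = 2.23e15   # figure listed in the original post

# Upper bound: pretend the population was at today's level for the whole span.
upper_bound = SPAN_YEARS * POPULATION_NOW

print(f"upper bound: {upper_bound:.2e} experience years")
print(f"claimed total exceeds the bound: {CLAIMED_TOTAL > upper_bound}")
```

Since the claimed total is larger than a bound that already overcounts, it cannot be right, whatever the exact population history was.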
Our position early in the Stelliferous Era may be less surprising than it first seems. Red dwarves are likely unsuitable for complex life as they have a pre-main sequence phase lasting quite some time during which they are notably hotter. Any planet in the habitable zone for the main sequence phase will be baked for hundreds of millions of years during the pre-main sequence phase and could easily lose all its water.
Though the Stelliferous Era will go on for quite some time, star formation will slow dramatically (and in fact has already slowed). If you exclude red dwarves and weight by star-years, we're early but not nearly so early as a pure accounting of years suggests.