I really wish someone would try out o3/Gemini with a weaker harness (say, one equal to Claude's). That is where it would be more interesting, and it would also make a cross-model comparison easier.
I just want to note another data point about reforming institutions: postwar Iraq. De-Baathification was an explicit policy to remove and replace members of the government associated with Saddam's Ba'ath Party. It's generally considered a failure that led to a lot of sectarian violence, contributed to the rise of ISIS, and left behind an ineffective government.
It's a somewhat different situation, since that was more of an ideological project, but I think it is notable and relevant.
Meta is delaying their Behemoth model launch because of disappointing evals.
This is another major lab (OpenAI and Anthropic have both experienced this too) that has seen disappointing results from trying to scale its model into the next generation via raw parameter count, which suggests to me that there really is some sort of soft or hard wall at this size. It's good news for people favoring a slowdown or pause, though of course there is now RL to pursue. I am genuinely curious what's going on, though; it seems like maybe the issue is just getting enough high-quality tokens, with synthetic data being too hard to get, or it could be a qualitative...
I am curious which aspects of sci-fi utopias seem unappealing to you. Something like 'The Culture', for instance, doesn't have any downsides that I can think of.