Oh, the type of weirdness has definitely changed a lot. But I'm just contending that the level of deviancy is a lot lower these days.
You go to a LW meetup now and there's a lot of wealthy, well-scrubbed/dressed AI researchers (they even lift) and academics and executives and bright-eyed Stanford undergrads sniffing for an internship or YC application fodder. They can't really talk about what they do, so you make small talk about what you can. (Have you heard that Trump might increase the SALT deduction? Big [RSU sale tax savings] if true!)
You go to a LW meetup c. 2010 and if you are sniffing anywhere, you'll notice a bit of a smell, perhaps from that one dude who looks like he just got out of a prison hunger strike and dresses like Asian Kurt Gödel while the other sus dude looks like he just got out of prison, period (and about 10:1 odds he did serve a stint), and the crazy homeless-looking dude over in the corner really is homeless and a bit crazy, not some billionaire (and will jump in front of a train in a few years), while the guy on the other side of the room in a huddle is the actual cult leader everyone accuses Eliezer of being (and the guy next to him just might be Satoshi Nakamoto), and so on and so forth. (You bond over your shared experiences of nearly being expelled from school over zero tolerance policies, possibly involving explosives, and ordering drugs off Silk Road.)
I'm not saying it's good or bad (arguably it's good: if AI timelines are short, there is negative value in screwing around with all the stuff we used to), but pound for pound, LWers were a lot weirder back then.
This refers only to the regular old finetuning, for 4o, and not to the fancy new RL finetuning for o1 that they recently opened up to alpha users, right?
I think this is missing a major piece of the self-play scaling paradigm: much of the point of a model like o1 is not to deploy it, but to generate training data for the next model. Every problem that an o1 solves is now a training data point for an o3 (eg. any o1 session which finally stumbles into the right answer can be refined to drop the dead ends and produce a clean transcript to train a more refined intuition). This means that the scaling paradigm here may wind up looking a lot like the current train-time paradigm: lots of big datacenters laboring to train a final frontier model of the highest intelligence, which will usually be used in a low-search way and be turned into smaller, cheaper models for the use-cases where even low/no-search use of it is overkill. Inside those big datacenters, the workload may be almost entirely search-related (as the actual finetuning is so cheap and easy compared to the rollouts), but that doesn't matter to everyone else; as before, what you see is basically the same: high-end GPUs & megawatts of electricity go in, you wait 3-6 months, and a smarter AI comes out.
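To make the bootstrapping concrete, here is a minimal sketch of that 'search, filter, distill' cycle; every name in it is hypothetical (my illustration, not OA's actual pipeline): sample many expensive rollouts per problem, keep only those a verifier accepts, strip the dead ends, and finetune the next model on the cleaned transcripts.

```python
# Hypothetical sketch of the search-then-distill loop described above;
# none of these functions correspond to a real OpenAI (or other) API.
from typing import Callable, List, Tuple

def bootstrap_iteration(
    model: Callable[[str], List[str]],     # search-heavy sampler: problem -> candidate transcripts
    verifier: Callable[[str, str], bool],  # does this transcript end in a correct answer?
    clean: Callable[[str], str],           # strip dead ends/backtracking from a winning transcript
    finetune: Callable[[List[Tuple[str, str]]], Callable[[str], List[str]]],
    problems: List[str],
) -> Callable[[str], List[str]]:
    """One round of self-improvement: search hard, keep the wins, distill."""
    training_data: List[Tuple[str, str]] = []
    for problem in problems:
        for transcript in model(problem):      # many costly rollouts per problem
            if verifier(problem, transcript):  # only verified solutions become data
                training_data.append((problem, clean(transcript)))
                break                          # one clean solution per problem suffices here
    # The finetuning step is cheap relative to generating the rollouts above.
    return finetune(training_data)
```

Almost all of the compute in such a loop goes into the rollouts; the finetune call at the end is comparatively trivial, which is how the datacenter workload can be dominated by search even though what comes out the other end is just a smarter (and not necessarily more expensive to run) model.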
I am actually mildly surprised OA has bothered to deploy o1-pro at all, instead of keeping it private and investing the compute into more bootstrapping of o3 training etc. (This is apparently what happened with Anthropic and Claude-3.6-opus - it didn't 'fail', they just chose to keep it private and distill it down into a small cheap but strangely smart Claude-3.6-sonnet.)
If you're wondering why OAers are suddenly weirdly, almost euphorically, optimistic on Twitter, watching the improvement from the original 4o model to o3 (and wherever it is now!) may be why. It's like watching the AlphaGo Elo curves: it just keeps going up... and up... and up...
There may be a sense that they've 'broken out', and have finally crossed the last threshold of criticality, from merely cutting-edge AI work which everyone else will replicate in a few years, to takeoff - having cracked intelligence to the point of it being recursively self-improving, where o4 or o5 will be able to automate AI R&D and finish off the rest: Altman in November 2024 saying "I can see a path where the work we are doing just keeps compounding and the rate of progress we've made over the last three years continues for the next three or six or nine or whatever" turns, a week ago, into "We are now confident we know how to build AGI as we have traditionally understood it... We are beginning to turn our aim beyond that, to superintelligence in the true sense of the word. We love our current products, but we are here for the glorious future. With superintelligence, we can do anything else." (Let DeepSeek chase their tail lights; they can't get the big iron they need to compete once superintelligence research can pay for itself, quite literally.)
And then you get to have your cake and eat it too: the final AlphaGo/Zero model is not just superhuman but very cheap to run too. (Just searching out a few plies gets you to superhuman strength; even the forward pass alone is around pro human strength!)
If you look at the relevant scaling curves - may I yet again recommend reading Jones 2021?* - the reason for this becomes obvious. Inference-time search is a stimulant drug that juices your score immediately, but asymptotes hard. Quickly, you have to switch to a smarter model to improve the search itself, rather than simply doing more of it. (If simply searching could work so well, chess would've been solved back in the 1960s. It's not hard to search more than the handful of positions a grandmaster human searches per second. If you want a text which reads 'Hello World', a bunch of monkeys on typewriters may be cost-effective; if you want the full text of Hamlet before all the protons decay, you'd better start cloning Shakespeare.) Fortunately, you have the training data & model you need right at hand to create a smarter model...
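As a stylized gloss of those curves (my paraphrase of the Jones-style fits, with free constants, not his exact numbers): in the regime the paper measures, playing strength is roughly linear in the log of both train-time and test-time compute, so the two trade off at a constant exchange rate; but hold the model fixed and the returns to extra search saturate at a ceiling set by that model.

```latex
% Stylized form only (my paraphrase, not Jones's exact fit); a, b, c are free constants.
\[
  \text{Elo} \;\approx\; a \,\log C_{\text{train}} \;+\; b \,\log C_{\text{test}} \;+\; c ,
\]
% whereas for a fixed model the returns to search flatten out:
\[
  \text{Elo}(C_{\text{test}}) \;\longrightarrow\; \text{Elo}_{\max}(\text{model})
  \quad \text{as } C_{\text{test}} \to \infty ,
\]
% so past some point only a smarter model, not more search, raises the ceiling.
```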
Sam Altman (@sama, 2024-12-20) (emphasis added):
seemingly somewhat lost in the noise of today:
on many coding tasks, o3-mini will outperform o1 at a massive cost reduction!
i expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be really strange
So, it is interesting that you can spend money to improve model performance on some outputs... but 'you' may be 'the AI lab', and you may simply be spending that money to improve the model itself, not just a one-off output for some mundane problem.
This means that outsiders may never see the intermediate models (any more than Go players got to see random checkpoints from a third of the way through AlphaZero training). And to the extent that it is true that 'deploying costs 1000x more than now', that is a reason not to deploy at all. Why bother wasting that compute on serving external customers, when you can instead keep training, and distill that back in, and soon have a superior model whose deployment cost is only 100x, and then 10x, and then 1x, and then <1x...?
Thus, the search/test-time paradigm may wind up looking surprisingly familiar, once all of the second-order effects and new workflows are taken into account. It might be a good time to refresh your memories about AlphaZero/MuZero training and deployment, and what computer Go/chess looked like afterwards, as a forerunner.
* Jones is more relevant than several of the references here like Snell, because Snell is assuming static, fixed models and looking at average-case performance, rather than hardest-case (even though the hardest problems are also going to be the most economically valuable - there is little value in solving easy problems that other models already solve, even if you can solve them more cheaply). In such a scenario, it is not surprising that spamming small dumb cheap models to solve easy problems can outperform a frozen large model. But that is not relevant to the long-term dynamics where you are training new models. (This is a similar error to when everyone was really enthusiastic about how 'overtraining small models is compute-optimal' - true only under the obviously false assumption that you cannot distill/quantize/prune large models. But you can.)
Reddit now blocks scrapers aggressively, because it's charging a fortune for access, and The Pile could no longer be created today (Pushshift is down). Reddit is not the worst place to post, but it's also not the best.
Tolkien invented their exact usage, but he didn't invent the words. "Elf", obviously, goes way back, but "orc" also goes way back, with meanings similar to the Tolkien usage.
"Zerg", "Protoss", & "SCV", are all neologisms; notably, the least weird ones, "Kerrigan" and "Terran", are quite ordinary words. ('Hydralisk' is a bit in between. 'Hydra' as a prefix is familiar, albeit increasingly hopelessly overloaded with SF/comic connotations, but 'lisk' as a suffix is a very unfamiliar one: 'obelisk' is the only one that comes to mind, and that appears to get 'lisk' as a butchering of Greek and then French.)
An interesting comparison here would be Gene Wolfe's Book of the New Sun, which does something similar: it uses old words in place of neologisms, and for that reason, despite being stuffed with weird terms (so much so you can publish a dictionary of it), words like 'pelagic argosy' or 'fuligin' or 'capote' nevertheless worked as well in the 1980s as they do now, despite not having achieved the cultural currency of 'elves' or 'orcs', and so demonstrating that the 'use old words' trick works in its own right and not simply by mere familiarity.
(But conversely, writing old-timey is no surefire solution. Wolfe's dying-earth fiction was influenced by Hodgson's The Night Land, which is imaginative and influential... and the style is almost ludicrously unreadable, whether in 1912 or 2025.)
Now, why is that? I suspect that it's a mix of unrealized familiarity (you may not have seen 'destrier' often enough to consciously recognize it as a real word, much less define or use it*, but unconsciously you do) and linguistic 'dark knowledge' in recognizing that somehow, the word 'autarch' is valid and a plausible word which could exist, in a way that 'Zerg' or 'Protoss' could not exist. It somehow respects the laws of languages and etymology and spelling, and you recognize that by not immediately rejecting it like most neologisms. (And to some extent, Tolkien's own conlangs, by having their long fictional history to justify various twists & turns, gain a hidden realism that a tidy rationalized hobbyist conlang will not. Something something greebles fractal dimension worldbuilding pattern language something something.)
* this is why vocab can be a good IQ test: word use frequency is the original power law, so you have been exposed to many more words than you consciously know, and how many of those words 'stick' reflects your intelligence's efficiency at learning from 1 or 2 uses of a word, and thus provides a good proxy
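As a toy illustration of that power-law point (made-up parameters, purely illustrative): under a Zipfian word-frequency distribution, a large share of the distinct word types a reader ever encounters are encountered only once or twice, so which of those stuck is close to a direct readout of one-or-two-shot learning efficiency.

```python
# Toy illustration with made-up parameters (not real psycholinguistic data):
# under Zipf's law, a large share of the distinct word types ever encountered
# are encountered only once or twice, so retention of them mostly measures
# learning from 1-2 exposures.
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 1_000_000     # hypothetical lexicon size (incl. rare words & names)
tokens_read = 10_000_000   # hypothetical lifetime reading exposure, in word tokens

ranks = np.arange(1, vocab_size + 1)
probs = 1.0 / ranks        # Zipf's law: frequency proportional to 1/rank
probs /= probs.sum()

counts = rng.multinomial(tokens_read, probs)
seen = counts[counts > 0]
rare = (seen <= 2).sum()
print(f"distinct word types encountered: {len(seen):,}")
print(f"encountered only once or twice:  {rare:,} ({rare / len(seen):.0%})")
```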
In my case, as a former military firefighter in Brazil
FWIW, I would be interested in any memoirs or lessons learned about that career, quite aside from any formal research. I don't think there are many firefighters, former, military, or otherwise, on LW, and I bet you saw some interesting things.
But humans don't seem to optimize for reward all that often!
You might be interested in an earlier discussion on whether "humans are a hot mess": https://www.lesswrong.com/posts/SQfcNuzPWscEj4X5E/the-hot-mess-theory-of-ai-misalignment-more-intelligent https://www.lesswrong.com/posts/izSwxS4p53JgJpEZa/notes-on-the-hot-mess-theory-of-ai-misalignment
Yeah, I was afraid that might apply here. It seems like you should still be able to do something like a "government employee tier" subscription, not targeted at an individual but at a class like 'GS-8 and up', set low enough that it would appeal to such customers? It is not a gift but a discount, it is not to an individual but to a class, it is part of a market, and it is not conditional on any government action or inaction; such discounts are very common for 'students', 'veterans', 'first responders' etc, and I've never seen any fine print warning government employees about the >$20 threshold, despite many such discounts potentially crossing it (eg. Sam's Club offers $50 off a new membership, which is clearly >$20, and does so through a whole company devoted to this sort of discount, ID.me).
But I suppose that might be too complex for SA to be interested in bothering with?
Given the other reports, like OA's own benchmarking (as well as the extremely large dataset of chess games they mention training on), I am skeptical of this claim, and wonder if this has the same issue as other 'random chess game' tests, where the 'random' part is not neutral but screws up the implied persona.