Full tweet for anyone curious:
i'm reminded today of a dinner conversation i had once w one of the top MIRI folks...
we talked AI safety and i felt he was playing status games in our conversation moreso than actually engaging w the substance of my questions- negging me and implying i was not very smart if i didn't immediately react w fear to the parable of the paperclip, if i asked questions about hardware & infrastructure & connectivity & data constraints...
luckily i don't define myself by my intelligence so i wasn't cowed into doom but instead joined the budding e/acc movement a few weeks later.
still i was unsettled by the attempted psychological manipulation and frame control hiding under the hunched shoulders and soft ever so polite voice.
Ben Evans' Are better models better? (from a business/consumer perspective, not LW/AF etc):
Part of the concept of ‘Disruption’ is that important new technologies tend to be bad at the things that matter to the previous generation of technology, but they do something else important instead. Asking if an LLM can do very specific and precise information retrieval might be like asking if an Apple II can match the uptime of a mainframe, or asking if you can build Photoshop inside Netscape. No, they can’t really do that, but that’s not the point and doesn’t mean they’re useless. They do something else, and that ‘something else’ matters more and pulls in all of the investment, innovation and company creation. Maybe, 20 years later, they can do the old thing too - maybe you can run a bank on PCs and build graphics software in a browser, eventually - but that’s not what matters at the beginning. They unlock something else.
What is that ‘something else’ for generative AI, though? How do you think conceptually about places where that error rate is a feature, not a bug?
Machine learning started working as image recognition, but it was much more than that, and it took a while to work out that the right way to think about it was as pattern recognition. You could philosophise for a long time about the ‘right way’ to think about what PCs, the web or mobile really were. What is that for generative AI? I don’t think anyone has really worked it out yet, but using it as a new set of API calls within traditional patterns of software feels like using the new thing to do the old things.
By analogy:
These kinds of puzzles also remind me of a meeting I had in February 2005, now almost exactly 20 years ago, with a VP from Motorola, at the MWC mobile conference in Cannes. The iPod was the hot product, and all the phone OEMs wanted to match it, but the micro-HDD that Apple was using would break very reliably if you dropped your device. The man from Motorola pointed out that this was partly a problem of expectation and perception: if you dropped your iPod and it broke, you blamed yourself, but if you dropped your phone and it broke, you blamed the phone maker, even though it was using the same hardware.
Six months later Apple switched from HDDs to flash memory with the Nano, and flash doesn’t break if you drop it. But two years later Apple started selling the iPhone, and now your phone does break if you drop it, but you probably blame yourself. Either way, we adopted a device that breaks if you drop it, with a battery that lasts a day instead of a week, in exchange for something new that came with that. We moved our expectations. This problem of expectation and perception seems to apply right now to generative AI.
This seems loosely reminiscent of his other essay How to lose a monopoly (emphasis mine):
... what is ‘power’? When we talk about ‘power’ and ‘dominance’ and perhaps ‘monopoly’ in tech, we actually mean two rather different things, and we generally conflate them:
- There is having power or dominance or a monopoly around your own product in that product’s own market…
- but then there is whether that position also means you control the broader industry.
In the 1970s dominating mainframes meant dominating tech, and in the 1990s dominating PC operating systems (and productivity software) meant dominating tech. Not any more. IBM still dominates mainframes, and Microsoft still dominates PCs, but that isn’t where broader dominance of the tech industry comes from. Once upon a time, IBM, and then Microsoft, could make people do things they didn’t want to do. Not today. Being rich is not the same as being powerful. ...
Today, it’s quite common to hear the assertion that our own dominant tech companies - Google, Facebook et al - will easily and naturally transfer their dominance to any new cycle that comes along. This wasn’t true for IBM or Microsoft, the two previous generations of tech dominance, but then there’s another assertion - that this was because of anti-trust intervention, especially for Microsoft. This tends to be said as though it can be taken for granted, but in fact it’s far from clear that this is actually true.
The end of Microsoft’s dominance of tech actually came in two phases. First, as above, it lost the development environment to the web, but it still had the client (the Windows PC) and it then provided lots and lots of clients to access the web and so became a much bigger company. But second, a decade or so later, Apple proposed a better client model with the iPhone, and Google picked that up and made a version for every other manufacturer to use. Microsoft lost dominance of development to the web, and then lost dominance of the client to smartphones.
As we all know, there were major anti-trust cases around what Microsoft tried to do with the web, and specific regulatory interventions, and so you can at least argue for some direct connection to Microsoft’s failure to take the lead online, although this can be disputed. But those cases ended in 2001 and none of them said anything about mobile, and yet Microsoft lost that as well. So what happened?
Here, the argument for anti-trust as the decisive factor generally acknowledges that nothing in the actual judgement or remedies that were imposed had any specific effect on Microsoft’s mobile efforts, but instead says that Microsoft somehow became less good at execution or aggression as a result.
There are two problems with this. The first is that it wasn’t remotely apparent in 2007 that Microsoft wasn’t being aggressive in mobile. After all, Microsoft didn’t ‘miss’ mobile - it had started with the launch of Windows CE in 1996, and accelerated with PocketPC in 2001, and it had a whole bunch of ‘Windows’ smartphones on the market when the iPhone launched.
Rather, the iPhone created such a radical change in every assumption about how you would make a ‘smartphone’ that everyone else had to start again from scratch. It’s important to remember that none of the smartphone companies who’d been building things since the late 1990s - Nokia/Symbian, Palm, RIM and Microsoft - managed the transition. None of the others had anti-trust issues. But, they all had platforms, and just as importantly cultures and assumptions, that were based on the constraints of hardware and networks in 2000, whereas the iPhone was based on what hardware and networks would look like in 2010. The only way to compete was with a totally new platform and totally new assumptions about how it would work, and ‘dump our platform and build an entirely new one’ is always a near-death experience in technology. Failing to make it isn’t about a lack of aggression or execution - it’s that it’s really hard.
Indeed, even knowing quite what to do is hard. For Microsoft, we know now that the answer would have been to create an entirely new operating system, with no cross-compatibility with Windows apps, and make it open source, and give it away for free. Imagine saying that to Bill Gates in 2007 - he’d have looked at you as though you’d grown a third arm.
which segued into a discussion on 'moats' (emphasis mine):
The tech industry loves to talk about ‘moats’ around a business - some mechanic of the product or market that forms a fundamental structural barrier to competition, so that just having a better product isn’t enough to break in. But there are several ways that a moat can stop working. Sometimes the King orders you to fill in the moat and knock down the walls. This is the deus ex machina of state intervention - of anti-trust investigations and trials. But sometimes the river changes course, or the harbour silts up, or someone opens a new pass over the mountains, or the trade routes move, and the castle is still there and still impregnable but slowly stops being important. This is what happened to IBM and Microsoft. The competition isn’t another mainframe company or another PC operating system - it’s something that solves the same underlying user needs in very different ways, or creates new ones that matter more. The web didn’t bridge Microsoft’s moat - it went around, and made it irrelevant. Of course, this isn’t limited to tech - railway and ocean liner companies didn’t make the jump into airlines either. But those companies had a run of a century - IBM and Microsoft each only got 20 years.
The main nuance that your description
LLMs are already good at solving complicated, Ph.D. level mathematical problems
misses out on is that these are very specific kinds of problems:
Problems must be novel and unpublished, with answers that can be automatically verified through computation—either as exact integers or mathematical objects like matrices and symbolic expressions in SymPy.
That excludes nearly all of research math.
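As a minimal sketch of what "automatically verified through computation" can look like in practice (the specific problem and checker below are hypothetical illustrations, not taken from the benchmark):

```python
import sympy as sp

# Hypothetical case: a problem whose final answer is a closed-form expression.
# The grader only has to check symbolic equality, not read a proof.
x = sp.symbols('x')

submitted_answer = sp.sin(x)**2          # what the model outputs
reference_answer = 1 - sp.cos(x)**2      # the pre-registered ground truth

# simplify(a - b) == 0 is a common SymPy idiom for testing symbolic equivalence
assert sp.simplify(submitted_answer - reference_answer) == 0

# Exact-integer answers are even easier: direct comparison, no tolerance needed.
assert sp.Integer(42) == 42
```

Only problems whose answers collapse to an object a program can compare (an integer, a matrix, a SymPy expression) fit this kind of pipeline, which is exactly why so much of research math falls outside it.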
(Tangent: I had no idea what that sentence meant; Sonnet 4 says
This is Gen Z/Gen Alpha internet slang, often called "brainrot" language. Here's the translation:
- "mogged" = dominated/outperformed (from "AMOG" - Alpha Male of Group)
- "sigma" = a supposed personality type above "alpha" in internet masculinity hierarchies
- "rizz" = charisma, particularly with romantic interests (from "charisma")
- "gyatt" = exclamation expressing attraction (corruption of "goddamn")
- "skibidi" = meaningless word from viral YouTube videos, often used as filler
- "What the sigma?" = "What the hell?" but using sigma slang
- "Sus" = suspicious
- "No cap" = "no lie" or "I'm serious"
- "fanum tax" = taking someone's food (from streamer Fanum)
in case anyone else was as confused)
Benjamin Todd: Dropping the error rate from 10% to 1% (per 10min) makes 10h tasks possible.
In practice, the error rate has been halving every 4 months(!).
In fact we can’t rule out that individual humans have a fixed error rate – just one that’s lower than current AIs.
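As a back-of-the-envelope sketch of the arithmetic behind Todd's claim, assuming (as a simplification he doesn't necessarily make) that each 10-minute step fails independently and any unrecovered failure sinks the task:

```python
# Probability of finishing an N-step task with no unrecovered error,
# treating each 10-minute step as an independent Bernoulli trial.
def p_success(error_rate_per_step: float, n_steps: int) -> float:
    return (1 - error_rate_per_step) ** n_steps

steps_in_10h = 10 * 60 // 10  # 60 ten-minute steps

print(p_success(0.10, steps_in_10h))  # ~0.002: 10h tasks essentially never finish
print(p_success(0.01, steps_in_10h))  # ~0.55:  10h tasks become feasible
```

On this toy model, dropping the per-step error rate from 10% to 1% moves a 10-hour task from hopeless to roughly a coin flip.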
Ever since I read Sarah Constantin's Errors vs. Bugs and the End of Stupidity, I find myself immediately skeptical of claims like "humans have a fixed error rate".
A common mental model for performance is what I'll call the "error model." In the error model, a person's performance of a musical piece (or performance on a test) is a perfect performance plus some random error. You can literally think of each note, or each answer, as x + c*epsilon_i, where x is the correct note/answer, and epsilon_i is a random variable, iid Gaussian or something. Better performers have a lower error rate c. Improvement is a matter of lowering your error rate. This, or something like it, is the model that underlies school grades and test scores. Your grade is based on the percent you get correct. Your performance is defined by a single continuous parameter, your accuracy.
But we could also consider the "bug model" of errors. A person taking a test or playing a piece of music is executing a program, a deterministic procedure. If your program has a bug, then you'll get a whole class of problems wrong, consistently. Bugs, unlike error rates, can't be quantified along a single axis as less or more severe. A bug gets everything that it affects wrong. And fixing bugs doesn't improve your performance in a continuous fashion; you can fix a "little" bug and immediately go from getting everything wrong to everything right. You can't really describe the accuracy of a buggy program by the percent of questions it gets right; if you ask it to do something different, it could suddenly go from 99% right to 0% right. You can only define its behavior by isolating what the bug does.
Often, I think mistakes are more like bugs than errors. My clinkers weren't random; they were in specific places, because I had sub-optimal fingerings in those places. A kid who gets arithmetic questions wrong usually isn't getting them wrong at random; there's something missing in their understanding, like not getting the difference between multiplication and addition. Working generically "harder" doesn't fix bugs (though fixing bugs does require work).
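A toy contrast between the two models (the specific "bug" here, treating multiplication as addition, is just the kid-arithmetic example from the quote, not anything from Constantin's own code):

```python
import random

def error_model(correct: float, c: float) -> float:
    # "Error model": performance is the right answer plus iid Gaussian noise;
    # improvement means shrinking the single parameter c.
    return correct + c * random.gauss(0, 1)

def buggy_multiply(a: int, b: int) -> int:
    # "Bug model": a deterministic procedure with one wrong rule. It treats
    # multiplication as addition, so it is consistently wrong on a whole
    # class of inputs rather than randomly wrong everywhere.
    return a + b

print(buggy_multiply(2, 2))  # 4: happens to be right
print(buggy_multiply(3, 5))  # 8: wrong, and wrong every single time you ask
```

The buggy procedure's "accuracy" depends entirely on which questions you happen to ask it, so there is no single error-rate parameter to estimate, which is the point of the quote.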
I asked ChatGPT o3 to read 6 links (nothing intensive, just the familiar essays on cognitive decoupling by Jacob Falkovich, Sarah Constantin, John Nerst, and drossbucket), and its reasoning repeatedly mentions something I've never seen before (emphasis mine):
I’m planning to review each article carefully. I’ll first open lines around the 250-line mark to get a deeper view, but I’m also mindful of the word limit. ...
I’m thinking I’ll need to open each article and focus on the part about "decoupling." To get more detailed, I’ll try opening further into the text — maybe around 250 lines — to ensure I capture enough context without overwhelming myself.
I plan to open each article and examine lines near the bottom to view relevant content. There's a limit of 200 lines, so I might need to focus on searching for specific keywords like "decoupling."
I’m trying to locate specific references to "decoupling" in each article. I want to open around 250 lines to make sure I find enough content for a thorough summary. I’ll use a quick search to find "decoupling" precisely, but if it’s not yielding precise results, I’ll expand the search to ensure I get more lines. It’s about balancing deep accuracy with not overwhelming myself with too much text.
I’ve found the decoupling discussion between lines 249-255, which looks promising. Now, I’ll open lines 300 onward to make sure I capture enough content for a well-rounded summary. This will help me ensure that I’m not missing any relevant points or context. It’s all about balance — getting enough details without reading too much.
(...)
Maybe I’m missing something, but I thought the 200k token context window would obviate "overwhelming myself with >250 lines of text", which seems oddly low; I'm not sure what's going on...
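For what it's worth, here is a guess at the kind of line-windowed tool interface the trace suggests; the function names and behaviour are assumptions for illustration, not a documented API, and only the 200-line cap comes from the trace itself:

```python
LINE_LIMIT = 200  # per-call cap the reasoning trace mentions

def open_window(lines: list[str], start: int, span: int = LINE_LIMIT) -> list[str]:
    """Return at most LINE_LIMIT lines starting at `start` (0-indexed)."""
    return lines[start:start + min(span, LINE_LIMIT)]

def find_keyword(lines: list[str], keyword: str) -> list[int]:
    """Line numbers containing the keyword, e.g. 'decoupling'."""
    return [i for i, line in enumerate(lines) if keyword.lower() in line.lower()]
```

If something like this is in play, the ~200-line ceiling would be a property of each tool call rather than of the 200k context window, which might explain the apparent mismatch.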
I remain highly skeptical of claims about my alleged effects (via my commenting, anyhow) on anyone or anything
Out of curiosity, what evidence would change your mind? I naively expected habryka's comment would, but you don't seem to agree.
(Feel free not to reply – the rate-limit on you is pretty severe, and my query is mere idle curiosity.)
Your interpretation seems like the opposite of sunwillrise's remark about Said?
... I do suspect some of the moderator pushback to Said comes from the (IMO correct) observation that... LW is just missing something, something that Said contributed, at least a bit, to pushing away in the aggregate (even if any one given action of his was by itself worthwhile from a cost/benefit perspective). Something that every single one of these authors used to provide in the past...
FWIW I'm glad you posted it here, albeit mainly because it's by you.