There's an improvement in LLMs I've seen that is important but has wildly inflated people's expectations beyond what's reasonable:
LLMs have hit a point, on some impressive tests, where they don't reliably fail past the threshold of being unrecoverable. They are conservative enough to run a search on a problem, failing a million times until they mumble their way into an answer.
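To make "search" concrete, here's a toy sketch (my own illustration; `verifier` and `babble_search` are made-up names, not anything from a real lab): generate cheap random guesses and keep only what passes a check.

```python
import random

def verifier(candidate, target):
    # Toy stand-in for a real checker (unit tests, a proof checker, etc.)
    return candidate == target

def babble_search(target, attempts=1_000_000, seed=0):
    """Directed randomness: sample candidates until one passes the verifier.

    The searcher needs no understanding of the problem, only a cheap guess
    generator and a reliable way to reject bad guesses without blowing up.
    """
    rng = random.Random(seed)
    for n in range(1, attempts + 1):
        guess = rng.randrange(10_000)
        if verifier(guess, target):
            return guess, n  # mumbled into an answer after n-1 failures
    return None, attempts
```

With a 1-in-10,000 guess and a million attempts, `babble_search(4242)` essentially always gets there; the point is that a million *recoverable* failures still ends in success.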
I'm going to try writing something of at least not-embarrassing quality about my thoughts on this, but I'm really confused by people's hype around this sort of thing; it feels like directed randomness.
Gotcha, you didn't sound overconfident, so I assumed it was much-less-than-certain. Still refreshingly concrete.
Ah, okay.
I'll throw in my moderately strong disagreement for future Bayes points. Respect for the short-term, unambiguous prediction!
This is not going to be a high quality answer, sorry in advance.
I noticed this with someone in my office who is learning robotic process automation: people are very bad at measuring their own productivity; they notice certain kinds of gains more readily than certain kinds of losses. I know someone who swears emphatically that he is many times as productive, but he has become almost totally unreliable. He's in denial about it, and a couple of people have now openly told me they try to remove him from workflows because of all the problems he causes.
I think the situation is...
By "solve", what do you mean? Like, provably secure systems, create a AAA game from scratch, etc?
I feel like any system that could do that would implicitly have what the OP says these systems might lack, but you seem to half-agree with them. Am I misunderstanding something?
Definitely! However, there is more money and "hype" in the direction of wanting these to scale into AGI.
Hype and anti-hype don't cancel each other out. If someone invests a billion dollars into LLMs, someone else can't spend negative one billion to cancel it: the billion-dollar spender is the one moving markets and getting a lot of press attention.
We have Yudkowsky going on Destiny, I guess?
I think there's some miscommunication here, on top of a fundamental disagreement on whether more compute takes us to AGI.
On the miscommunication: we're not talking about the falling cost per FLOP; we're talking about a world where OpenAI either does or does not have a price war eating its margins.
On the fundamental disagreement: I assume you don't take very seriously the idea that AI labs are seeing a breakdown of scaling laws? No problem if so; reality should resolve that disagreement relatively soon!
This is actually a good use case, which fits what GPT does well, and where very cheap tokens help!
Pending some time for people to pick at it and test its limits, this might be really good. My instinct is that legal research, case law, etc. will be the test of how good it is; if it does well there, this might be its foothold into real commercial use that actually generates a profit.
My prediction is that we will be glad this exists. It will not be "PhD level", a phrase which defaces all who utter it, but it will save some people a lot of time and effort.
Where I think we d...
Also, Amodei needs to cool it. There's a reading of the things he's been saying lately that could be taken as sane, but there's also a plausible reading that makes him look like a buffoon. Credibility is a scarce resource.
I feel like this comes down a lot to intuition. All I can do is gesture at the thinning gap between marginal cost and prices, wave my hand in the direction of discount rates and the valuation of OpenAI, and ask... are you sure?
The demand curve on this seems textbook inelastic at current margins. Slashing the price of milk by 10x would have us cleaning our driveways with it; slashing the price of eggs would have us using crushed eggshells as low-grade building material. A 10x decrease in the price per token of AI is barely even noticed; in fact, in some...
I'm kind of the opposite on the timelines thing? This is probably a timeline delayer even if I thought LLMs scaled to AGI, which I don't, but let's play along.
If a pharma company could look at another company's product, copy it, and release it for free with no consequences, but the copy itself could only be marginally improved without massive investment, what would that do to the landscape?
It kills the entire industry. This HURTS anyone trying to fundraise, reckless billions will have a harder time finding their way into the hands of developers ...
Can you make some noise in the direction of the shockingly low numbers it gets on the early ARC-AGI-2 benchmarks? This feels like pretty open-and-shut proof that it doesn't generalize, no?
The fact that the model was trained on 75 percent of the training set feels like they jury-rigged a test set and RL'd the thing to success. If the <30% score on the second test ends up being true, that should shift our guesses about what it's actually doing heavily away from genuine intelligence and toward a brute-force search for verifiable answers.
The frontier t...
Thanks for the reply! I'm still trying to learn how to disagree properly, so let me know if I cross into being nasty at all:
I'm sure they've gotten better; o1 probably improved more from its heavier use of intermediate logic, runtime compute, and such. That said, at least up through 4o there look to have been improvements in the models themselves; they've been getting better.
They can do incredible stuff in well-documented processes but don't survive well off the trodden path. They seem to string things together pretty well, so I don't know if I would say there's ...
I would say that, barring strong evidence to the contrary, this should be assumed to be memorization.
I think that's useful! LLMs obviously encode a ton of useful algorithms and can chain them together reasonably well.
But I've tried to get those bastards to do something slightly weird, and they just totally self-destruct.
But let's just drill down to demonstrable reality: if past SWE benchmarks were correct, these things should be able to do incredible amounts of work more or less autonomously, yet all the LLM SWE replacements we've seen have stuck to high...
Can we bet karma?
Edit: sarcasm
Hmm, mixed agree/disagree. Scale probably won't work, algorithms probably would, but I don't think it's going to be that quick.
Namely, I think that if the company struggling with fixed capital costs could accomplish much more, much quicker using the salaried top researchers they already have, they'd have done it, or at least given it a good try.
I'm at 5 percent that a serious switch to algorithms would result in AGI in 2 years. You might be better read than me on this, so I'm not quite taking side bets right now!
I think algorithmic progress is doing some heavy lifting in this model. If we had a future textbook on AGI, we could probably build one, but AI is kind of famous for minor and simple things just not being implemented despite all the parts being there.
See ReLU activations and sigmoid activations.
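To unpack why that example stings (toy numbers of mine, not from the original comment): the gradient math behind sigmoid's problem and ReLU's fix is a few lines, and every part of it existed long before the switch actually happened.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)  # at most 0.25, vanishing for large |x|

def relu_grad(x):
    return 1.0 if x > 0.0 else 0.0  # exactly 1 on the active side

# Backprop multiplies one such factor per layer, so a 20-layer net's
# gradient signal scales roughly like grad**20 in this toy worst case:
print(sigmoid_grad(0.0) ** 20)  # ~9.1e-13: the vanishing-gradient problem
print(relu_grad(1.0) ** 20)     # 1.0: the signal survives
```

Nothing here required new hardware or new theory, which is the point: "all the parts being there" is not the same as anyone putting them together.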
If we're bottlenecked on algorithms alone, is there a reason that isn't a really bad bottleneck?
I haven't had warm receptions when critiquing points, which has frustratingly left me with bad detection for when I'm being nasty, so if I sound thorny it's not my intent.
Somewhere I think you might have misstepped is on the FrontierMath questions: the quotes you've heard are almost certainly about Tier 3 questions, the hardest ones, meant for math researchers in training. The mid tier is for grad-student-level problems, and Tier 1 is for bright-high-schooler to undergrad-level problems.
Tier 1: 25% of the test
Tier 2: 50% of the test
Tier 3: 25% of the test
O3 got 25%, probably a...
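A back-of-envelope on those tier weights (the solve rates are hypothetical, my own illustration): acing Tier 1 alone already produces a 25% headline score.

```python
# Tier weights as stated above; solve rates are a hypothetical scenario
# in which the model aces Tier 1 and solves nothing harder.
weights = {"tier1": 0.25, "tier2": 0.50, "tier3": 0.25}
solve_rates = {"tier1": 1.0, "tier2": 0.0, "tier3": 0.0}

score = sum(weights[t] * solve_rates[t] for t in weights)
print(f"{score:.0%}")  # 25% -- no Tier 3 solves needed for this headline
```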
Ah wait, I was reading it wrong. I thought each tick was an order of magnitude; that looks to be standard notation for a log scale. Mischief managed.
No, sorry, that's not a typo; that's a linguistic norm that I probably assumed was more common than it actually is.
The people I talk with and I have used the words "mumble" and "babble" to describe LLM reasoning. It's sort of like human babble; see https://www.lesswrong.com/posts/i42Dfoh4HtsCAfXxL/babble