Vladimir_Nesov

Wiki Contributions

Comments

We start with an LLM trained on 50T tokens of real data, however capable it ends up being, and ask how to reach the same level of capability with synthetic data. If it takes more than 50T tokens of synthetic data, then it was less valuable per token than real data.

But at the same time, 500T tokens of synthetic data might train an LLM more capable than if trained on the 50T tokens of real data for 10 epochs. In that case, synthetic data helps with scaling capabilities beyond what real data enables, even though it's still less valuable per token.

With Go, we might just be running into the contingent fact of there not being enough real data to be worth talking about, compared with LLM data for general intelligence. If we run out of real data before some threshold of usefulness, synthetic data becomes crucial (which is the case with Go). It's unclear if this is the case for general intelligence with LLMs, but if it is, then there won't be enough compute to improve the situation unless synthetic data also becomes better per token, and not merely mitigates the data bottleneck and enables further improvement given unbounded compute.

I would be genuinely surprised if training a transformer on the pre2014 human Go data over and over would lead it to spontaneously develop alphaZero capacity.

I expect that if we could magically sample much more pre-2014 unique human Go data than was actually generated by actual humans (rather than repeating the limited data we have), from the same platonic source and without changing the level of play, then it would be possible to cheaply tune an LLM trained on it to play superhuman Go.

The best method of improving sample efficiency might be more like AlphaZero. The simplest method that's more likely to be discovered might be more like training on the same data over and over with diminishing returns. Since we are talking low-hanging fruit, I think it's reasonable that first forays into significantly improved sample efficiency with respect to real data are not yet much better than simply using more unique real data.

a reduction in training data will not necessarily reduce the amount of computation needed. But once again, that’s the way to bet

I'm ambivalent on this. If the analogy between improvement of sample efficiency and generation of synthetic data holds, synthetic data seems reasonably likely to be less valuable than real data (per token). In that case we'd be using all the real data we have anyway, which with repetition is sufficient for up to about $100 billion training runs (we are at $100 million right now). Without autonomous agency (not necessarily at researcher level) before that point, there won't be investment to go over that scale until much later, when hardware improves and the cost goes down.

We receive about a billion tokens growing to adulthood. The leading LLMs get orders of magnitude more than that. We should be able to do much better.

Capturing this would probably be a big deal, but a counterpoint is that compute necessary to achieve an autonomous researcher using such sample efficient method might still be very large. Possibly so large that training an LLM with the same compute and current sample-inefficient methods is already sufficient to get a similarly effective autonomous researcher chatbot. In which case there is no effect on timelines. And given that the amount of data is not an imminent constraint on scaling, the possibility of this sample efficiency improvement being useless for the human-led stage of AI development won't be ruled out for some time yet.

(Re: Difficult to Parse react on the other comment
I was confused about relevance of your comment above on chunky innovations, and it seems to be making some point (for which what it actually says is an argument), but I can't figure out what it is. One clue was that it seems like you might be talking about innovations needed for superintelligence, while I was previously talking about possible absence of need for further innovations to reach autonomous researcher chatbots, an easier target. So I replied with formulating this distinction and some thoughts on the impact and conditions for reaching innovations of both kinds. Possibly the relevance of this was confusing in turn.)

There are two kinds of relevant hypothetical innovations: those that enable chatbot-led autonomous research, and those that enable superintelligence. It's plausible that there is no need for (more of) the former, so that mere scaling through human efforts will lead to such chatbots in a few years regardless. (I think it's essentially inevitable that there is currently enough compute that with appropriate innovations we can get such autonomous human-scale-genius chatbots, but it's unclear if these innovations are necessary or easy to discover.) If autonomous chatbots are still anything like current LLMs, they are very fast compared to humans, so they quickly discover remaining major innovations of both kinds.

In principle, even if innovations that enable superintelligence (at scale feasible with human efforts in a few years) don't exist at all, extremely fast autonomous research and engineering still lead to superintelligence, because they greatly accelerate scaling. Physical infrastructure might start scaling really fast using pathways like macroscopic biotech even if drexlerian nanotech is too hard without superintelligence or impossible in principle. Drosophila biomass doubles every 2 days, small things can assemble into large things.

See "Zero Sum" is a misnomer, shifting and rescaling of utility functions breaks formulations that simply ask to take a sum of payoffs, but we can rescue the concept to mean that all outcomes/strategies of the game are Pareto efficient.

"Positive sum" seems to be about Kaldor-Hicks improvement, an outcome that in principle admits a subsequent redistribution of resources that would turn the outcome into a Pareto improvement (over some original situation of "not playing"), but there is no commitment or possibly even practical feasibility to actually perform the redistribution. This hypothetical redistribution step takes care of comparing utilities of different players. A whole game/interaction/project would then be "positive-sum" if each outcome/strategy is equivalent via a redistribution to some hypothetical "outcome" that is a Pareto improvement over the status quo of not engaging in the game/interaction/project. In actuality, without the hypothetical redistribution step, some players can end up worse off.

a story that's in conflict with itself

The story involves phase changes. Just scaling is what's likely to be available to human developers in the short term (a few years), it's not enough for superintelligence. Autonomous agency secures funding for a bit more scaling. If this proves sufficient to get smart autonomous chatbots, they then provide speed to very quickly reach the more elusive AI research needed for superintelligence.

It's not a little speed, it's a lot of speed, serial speedup of about 100x plus running in parallel. This is not as visible today, because current chatbots are not capable of doing useful work with serial depth, so the serial speedup is not in practice distinct from throughput and cost. But with actually useful chatbots it turns decades to years, software and theory from distant future become quickly available, non-software projects get to be designed in perfect detail faster than they can be assembled.

it seems scaffolding tricks haven't really improved the baseline performance of models that much. Overwhelmingly, the capability comes down to whether the rlfhed base model can do the task.

That's what I'm also saying above (in case you are stating what you see as a point of disagreement). This is consistent with scaling-only short timeline expectations. The crux for this model is current chatbots being already close to autonomous agency and to becoming barely smart enough to help with AI research. Not them directly reaching superintelligence or having any more room for scaling.

Obligation to answer makes questions/criticism cause damage unrelated to their content, so they are marginally more withheld or suppressed. If they won't be withheld or suppressed, like in a debate, they still act as motivation to avoid getting into that situation. The cost has a use for signaling that you can easily provide answers, but it's still a cost, higher for those who can't easily provide answers or don't value conveying that particular signal.

Load More