TL;DR: Fine-tuned 8B models are beating 100B+ models on specific tasks at 1/10th the cost and energy. The evidence suggests we're over-investing in "one model to rule them all" and under-investing in ecosystems of specialists. I'm not saying stop frontier scaling (we need the big models to distill from), but the current 90/10 allocation toward scale seems wrong. Portfolio diversification, maybe 60/40, would give us more useful AI sooner, cheaper, and greener.
Ever since listening to Ilya's interview with Dwarkesh, one part has stuck with me: the observation that humans learn quite efficiently, with a lot less data than the current SOTA large language models. I've been wondering why that is, and thinking about what that means for the SOTA and the year ahead.
Where It Seems We're At
Here is what the field believes: Intelligence scales. Add more compute, data, and parameters, and you get more capability. The end-goal is AGI, a single system that can do everything.
Here is what I believe: we have systematically undervalued what small, specialized models can do. The evidence is accumulating that fine-tuned 8B models, optimized for specific tasks, often outperform general-purpose models 10x-100x their size at a fraction of the training cost, inference cost, and energy consumption.
I'm not arguing that large models are useless, but I do think that the marginal utility of the next trillion parameters may be lower than the marginal utility of the next thousand fine-tuned specialists. And I'm arguing that we should take this possibility seriously rather than treating scale as the only path to AGI.
I call this alternative vision Artificial Manifold Intelligence: not one AGI model, but an ecosystem of specialists that can be composed, orchestrated, and swapped. Manifold captures the multiplicity, but also connects to the manifold hypothesis in ML: high-dimensional data concentrates on lower-dimensional manifolds. A generalist tries to learn all manifolds simultaneously; a specialist can focus its capacity on learning one well.
Steelmanning the Scaling Hypothesis
To be fair, the foundations of scaling are robust:
Scaling Laws: Kaplan et al. and subsequent work demonstrated that language model performance follows predictable power laws as you increase compute, data, and parameters.
Emergence: At sufficient scale, models appear to acquire capabilities that weren't explicitly trained. GPT-3's few-shot learning, chain-of-thought reasoning in larger models, the jump from GPT-3.5 to GPT-4 all suggest that scale unlocks qualitatively new behaviors.
These are real. However, I disagree that they mandate pouring all resources into the single biggest bucket.
Responding to The Bitter Lesson
The strongest theoretical objection is Sutton's Bitter Lesson: historically, general methods leveraging computation beat clever domain-specific engineering. Am I just advocating for the losing approach with extra steps?
I don't think so. Fine-tuning isn't hand-coded heuristics. It's gradient descent on narrower data. The specialist models I'm describing don't encode human expertise through rules; they extract and concentrate learned representations from data. This is still general methods leveraging computation, just applied to specific problems rather than all problems simultaneously.
Sutton's losers were researchers who tried to build in human knowledge about chess positions or speech phonemes. The winners let the model learn. Fine-tuned specialists do exactly that; they just learn from a more focused distribution.
If anything, specialization might be more aligned with the Bitter Lesson than the AGI approach. Trying to build one model that handles everything could be its own form of over-engineering where we are imposing an architectural constraint (generality) that computation alone doesn't demand.
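To make "gradient descent on narrower data" concrete, here is a minimal sketch of what fine-tuning a specialist looks like in practice, using Hugging Face transformers with LoRA adapters. The base model name, the code_review_pairs.jsonl file, and every hyperparameter are placeholder assumptions for illustration, not the recipe behind any of the results cited below.

```python
# Minimal sketch: supervised fine-tuning of a small base model on a narrow,
# task-specific dataset. Base model, dataset file, and hyperparameters are
# illustrative placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"   # placeholder base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token         # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(base)
# LoRA: train low-rank adapters instead of all 8B weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]))

# A narrow, task-specific corpus (hypothetical file of code-review examples).
ds = load_dataset("json", data_files="code_review_pairs.jsonl")["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="specialist-8b",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=2,
                           learning_rate=2e-4),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

Nothing in this loop is a hand-coded heuristic: it's the same objective and the same optimizer as pretraining, pointed at a narrower distribution.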
The Evidence for Specialization
Small Models, Carefully Tuned, Can Beat Giants
The empirical evidence is quite interesting:
AWS's 350M-parameter specialist: Researchers at Amazon Web Services fine-tuned an OPT-350M model on agentic tool-calling data. The result? A 77.55% pass rate on the ToolBench benchmark, significantly outperforming ChatGPT-CoT (26.00%), ToolLLaMA-DFS (30.18%), and other models up to 500x larger.
NVIDIA's 8B code reviewer: NVIDIA fine-tuned Llama 3 8B using knowledge distillation from GPT-4. The result outperformed Llama 3 70B (8x larger) and Nemotron 4 340B (40x larger) on code review severity prediction. An 18% improvement over baseline, beating models that cost orders of magnitude more to run.
Parsed's healthcare scribe: A Gemma 3 27B model, fine-tuned for medical transcription, outperformed Claude Sonnet 4 by 60% on the task while using 10-100x less compute per inference. The pattern is consistent across medical, legal, and scientific domains: fine-tuned specialists show 40-100% improvements over base models.
Wiz's 1B security model: Wiz fine-tuned Llama 3.2 1B for detecting secrets in code, achieving 86% precision and 82% recall—significantly outperforming regex-based methods while running on standard CPU hardware. A 1-billion-parameter model, deployed at scale, doing a security-critical task better than the alternatives.
The 100-sample threshold: Research comparing specialized small models to general large models on text classification found that fine-tuned specialists need, on average, only 100 labeled samples to match or exceed the performance of much larger general-purpose models.
Microsoft's Phi Series: Small by Design
Microsoft's Phi models represent a deliberate bet against the "bigger is better" paradigm:
Phi-2 (2.7B parameters): surpasses Mistral-7B and Llama-2-13B on various benchmarks. Matches or exceeds Gemini Nano 2.
Phi-3-mini (3.8B parameters): performs better than models twice its size. Approaches GPT-3.5 on many tasks.
Phi-4 (14B parameters): outperforms GPT-4 on some STEM reasoning tasks despite being a fraction of the size.
Phi-4-reasoning (14B parameters): beats DeepSeek-R1-Distill-Llama-70B (5x larger) and approaches the full DeepSeek-R1 (671B parameters) on mathematical reasoning.
In this case, Microsoft attributes the results to "textbook-quality" training data.
As Sébastien Bubeck, Microsoft's VP of Generative AI, put it: "This is not necessarily the type of progress that we were expecting. I think nobody knew the size that you would need to get capabilities that get close to something like GPT-3.5."
The Economics
It'd be odd not to mention money and energy.
Training is a one-time cost; inference is forever. Every query to a 70B model costs roughly 70× the compute of a query to a 1B model. If you're serving millions of requests per day, a specialist that matches the generalist on your specific task is cheaper by orders of magnitude.
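A back-of-the-envelope way to see this, using the common approximation of roughly 2 FLOPs per parameter per generated token and made-up traffic numbers (both are assumptions, and this ignores KV-cache, batching, and memory-bandwidth effects):

```python
# Rough serving-cost comparison; all numbers are illustrative.
def flops_per_token(params: float) -> float:
    # ~2 FLOPs per parameter per generated token (standard approximation)
    return 2 * params

daily_queries = 5_000_000     # assumed traffic
tokens_per_query = 800        # assumed prompt + completion length

for name, params in [("1B specialist", 1e9),
                     ("8B specialist", 8e9),
                     ("70B generalist", 70e9)]:
    daily_flops = flops_per_token(params) * tokens_per_query * daily_queries
    print(f"{name:>14}: {daily_flops:.2e} FLOPs/day")

# The ratio between any two models is just the parameter ratio, which is the
# intuition behind "a 70B query costs roughly 70x a 1B query".
```

The toy numbers only matter for the ratio: at fixed traffic, serving cost scales roughly linearly with parameter count.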
This also matters for who gets to participate. You don't need $100 million and a data center to train and deploy a useful 8B model. This means more people in the world can have a stake in how AI is used and deployed.
No Company Works Like AGI
Here's something obvious that I don't see discussed enough, although people on X seem to be trying.
Companies don't hire one omniscient employee who handles engineering, legal, marketing, HR, finance, and janitorial work. They hire specialists. They create roles. They build org charts. A startup might have generalists early on, but as it scales, it specializes, because complex work demands distributed specialization.
Think about how a tech company operates:
Developers write code. They don't do the UX research.
Designers handle UX/UI. They don't negotiate contracts.
Legal reviews agreements. They don't debug production outages.
Product managers coordinate and prioritize. They don't write the marketing copy.
Managers (some more useful than others, admittedly...) handle coordination, routing, and resource allocation.
A great backend engineer and a great contract lawyer have both spent years developing expertise that a generalist simply cannot match. The organization's intelligence emerges from the composition of specialists, rather than some mythical 100x dev.
The AGI vision is essentially: "What if we had one employee who was superhuman at everything?" But the manifold intelligence vision is: "What if we had a well-organized company of superhuman specialists?"
The second vision is how humans actually solve complex problems. It's how ecosystems work. It's how markets work. The coordinator/specialist pattern appears everywhere because it works.
The claim that "AI specialists would have prohibitive coordination costs" implies that AI coordination is harder than human coordination. But AI systems can share context instantly, don't have egos, don't take vacations, and can be orchestrated programmatically. Deciding which specialist handles what is really just a management problem. And we know how to solve management problems: you build a good coordinator.
A well-designed router is essentially a PM who never gets tired, has perfect memory of what each team member is good at, and can delegate in milliseconds. The coordinator doesn't need to be as capable as the specialists; it just needs to know who to call. To me, that sounds like a much easier problem than building one model that's superhuman at everything. However, Opus 4.5 still believes I am underselling how difficult a problem this is.
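To make the coordinator idea concrete, here is a minimal routing sketch. The specialist registry and the keyword-based classify() are placeholder assumptions standing in for a real dispatcher, which could itself be a small fine-tuned classifier or an embedding lookup:

```python
# Minimal sketch of a coordinator that routes queries to specialist models.
# Specialist names, descriptions, and handlers are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Specialist:
    name: str
    description: str
    handle: Callable[[str], str]   # e.g. a call into a fine-tuned 8B model

SPECIALISTS = {
    "code_review": Specialist("code_review", "reviews diffs for defects", lambda q: "..."),
    "contracts":   Specialist("contracts", "answers contract-law questions", lambda q: "..."),
    "scribe":      Specialist("scribe", "summarizes medical transcripts", lambda q: "..."),
}

def classify(query: str) -> str:
    """Placeholder router: in practice this would be a small classifier or
    a nearest-neighbour lookup over the specialist descriptions."""
    if "diff" in query or "function" in query:
        return "code_review"
    if "clause" in query or "agreement" in query:
        return "contracts"
    return "scribe"

def answer(query: str) -> str:
    specialist = SPECIALISTS[classify(query)]
    return specialist.handle(query)
```

The lambdas would wrap real model calls; the point is that the coordinator only needs to know who to call, not how to do the work itself.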
Manifold Intelligence Is Slowly Emerging
It seems that manifold intelligence is already on its way in some labs:
Mixture-of-experts architectures within single models (DeepSeek, Mixtral), sketched after this list
Model routing systems that select specialists based on query type
Enterprise deployments of multiple fine-tuned models for different workflows
Distillation pipelines that extract specific capabilities from large teachers into small students
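As promised in the first bullet, here is a toy top-k mixture-of-experts layer in PyTorch, the within-model version of the same routing idea. It illustrates the gating mechanism only and is not how DeepSeek or Mixtral implement theirs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k mixture-of-experts layer: a learned gate sends each token
    to its k best experts and mixes their outputs by the gate weights."""
    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [tokens, d_model]
        scores = self.gate(x)                               # [tokens, n_experts]
        weights, idx = scores.topk(self.k, dim=-1)          # route to top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Production MoE layers add load-balancing losses and capacity limits, but the core idea is the same: activate a slice of capacity per token instead of everything for every token.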
Biological intelligence didn't evolve as a monolith either. Your immune system is "intelligent" in ways your prefrontal cortex isn't. Ecosystems solve problems no individual organism can. The question is whether we recognize distributed specialization as a viable alternative to the AGI race, or continue treating scale as the only path forward.
Implications
If I'm right, there are some pretty big implications:
For researchers: The 8B fine-tuning space is under-explored relative to its potential. There's likely low-hanging fruit in systematic study of how small models can be optimized for specific domains.
For practitioners: Before reaching for the biggest model, ask whether a fine-tuned specialist might do better for your use case. The economics often favor it.
For labs: Consider whether some R&D budget should shift from pure scaling to specialization infrastructure (better fine-tuning methods, better routing, better distillation).
For policy: The AGI race narrative drives compute accumulation and contributes to the idea that only the richest actors can participate in frontier AI. A specialization paradigm is more democratically accessible.
For the environment: We might be able to get much of the value of AI with dramatically less energy.
For privacy: The "One Model" paradigm typically implies an API-based relationship where sensitive data must leave the building. AMI makes it feasible to deploy specialists on-prem.
Epistemic Status
High confidence: Small fine-tuned models often outperform larger general models on specific tasks. This is empirically demonstrated across many domains.
Medium confidence: Specialist ecosystems will remain competitive with AGI-style systems as capabilities advance. This depends on how scaling curves evolve and whether emergent capabilities can be distilled.
Lower confidence: The marginal return on scale is diminishing faster than commonly assumed. Opus 4.5 and Gemini 3.0 have really impressed me, so my confidence has decreased slightly here.
Speculation: The optimal architecture for general AI might be more manifold-like than monolithic.
The Distillation Problem
The strongest objection to everything I've written: most successful small models today are distilled from frontier models or trained on synthetic data generated by those models.
This is true. NVIDIA's 8B code reviewer used knowledge distillation from GPT-4. DeepSeek distills R1 into smaller variants. Phi-4's remarkable performance comes partly from synthetic data generated by larger models. The students are standing on the shoulders of giants.
I concede this. If we stopped scaling frontier models entirely, the distillation pipeline would stagnate. You can't have better students without better teachers.
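For readers who haven't seen the mechanics, here is a minimal sketch of classic logit distillation (Hinton-style soft targets). Note that distillation from API-only teachers like GPT-4 is usually done at the sequence level on generated text, since teacher logits aren't available; the temperature and weighting below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of a soft-target KL term (teacher -> student) and ordinary
    cross-entropy on hard labels. Assumes teacher and student share a
    vocabulary; for language models the logits are flattened over the
    sequence dimension before calling this."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    # Scale by T^2 so gradients keep a comparable magnitude (Hinton et al.)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```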
But this refines my argument rather than defeating it. I'm not saying abandon frontier scaling... I'm saying the current resource allocation (call it 90/10 toward pure scale) seems suboptimal. Something like 60/40, with serious investment in distillation infrastructure, fine-tuning methods, and specialist deployment, might get us more useful AI faster.
The manifold needs the monolith. But maybe the monolith also needs the manifold.
I'm curious to hear your thoughts, especially from people working on scaling. Where am I still wrong? The distillation dependency objection in particular seems to constrain my thesis more than I'd like, and I wonder whether there's a path to capable specialists that doesn't require frontier teachers.