# Are we in an AI overhang?

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Over on Developmental Stages of GPTs, orthonormal mentions

> it at least reduces the chance of a hardware overhang.

An overhang is when you have had the ability to build transformative AI for quite some time, but you haven't because no-one's realised it's possible. Then someone does, and surprise! It's a lot more capable than everyone expected.

I am worried we're in an overhang right now. I think we have the ability right now to build a system orders of magnitude more powerful than what we already have, and I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months.

## Investment Bounds

GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment, its costs are insignificant.

GPT-3 has been estimated to cost $5m in compute to train, and - looking at the author list and OpenAI's overall size - maybe another $10m in labour.

Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100x is entirely plausible right now. All that's necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability.
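As a sanity check on those magnitudes, here's a minimal sketch (the dollar figures are the rough per-company numbers above, not exact accounts):

```python
# Rough Fermi check: big-tech R&D + capex vs. the cost of a 100x GPT scale-up.
# Figures are the order-of-magnitude estimates from the text, not exact accounts.
companies = 3                # Google, Amazon, Microsoft
rnd_per_company = 20e9       # ~$20bn/year R&D each
capex_per_company = 20e9     # ~$20bn/year capital expenditure each
scaleup_cost = 1e9           # ~$1bn for a 100x scale-up

total_budget = companies * (rnd_per_company + capex_per_company)
print(f"Combined annual budget: ~${total_budget / 1e9:.0f}bn")               # ~$120bn
print(f"100x scale-up as share of that: {scaleup_cost / total_budget:.1%}")  # ~0.8%
```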

A concrete example is Waymo, which is raising $2bn investment rounds - and that's for a technology with a much longer road to market.

## Compute Cost

The other side of the equation is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is the performance without tensor cores being considered. Amortized over a year, this gives you about $1000/PFLOPS-day.
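That figure is just a one-line amortization calculation - a minimal sketch, assuming the quoted $10k/unit, 30 TFLOPS, and a one-year amortization window:

```python
# Dollars per PFLOPS-day if the hardware is bought outright and amortized over a year.
# Assumes the quoted V100 figures: $10k/unit, 30 TFLOPS (no tensor cores).
def dollars_per_pflops_day(unit_cost, tflops, amortization_days=365):
    gpus_needed = 1e15 / (tflops * 1e12)   # GPUs needed to sustain 1 PFLOP/s
    return gpus_needed * unit_cost / amortization_days

print(f"V100: ~${dollars_per_pflops_day(10_000, 30):,.0f}/PFLOPS-day")  # ~$900, i.e. order $1000
```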

However, this cost is driven up an order of magnitude by NVIDIA's monopolistic cloud contracts, while performance will be higher when taking tensor cores into account. The current hardware floor is nearer to the RTX 2080 Ti's $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/PFLOPS-day. This roughly aligns with AI Impacts' current estimates, and offers another >10x speedup to our model.
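The same amortization arithmetic with the consumer-card numbers gives the lower figure - a sketch under the same one-year assumption:

```python
# Same one-year amortization, but with the quoted RTX 2080 Ti figures:
# $1k/unit and 125 tensor-core TFLOPS.
def dollars_per_pflops_day(unit_cost, tflops, amortization_days=365):
    return (1e15 / (tflops * 1e12)) * unit_cost / amortization_days

v100 = dollars_per_pflops_day(10_000, 30)   # ~$900/PFLOPS-day
rtx = dollars_per_pflops_day(1_000, 125)    # ~$22/PFLOPS-day
print(f"RTX 2080 Ti: ~${rtx:,.0f}/PFLOPS-day ({v100 / rtx:.0f}x below the V100-based estimate)")
```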

I strongly suspect other bottlenecks stop you from hitting that kind of efficiency or GPT-3 would've happened much sooner, but I still think $25/PFLOPS-day is a useful lower bound.

## Other Constraints

I've focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a couple of other things that could prove to be the binding constraint.

• Scaling law breakdown. The GPT series' scaling is expected to break down around 10k PFLOPS-days (§6.3), which is a long way short of the amount of cash on the table.
  • This could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.
• Sequence length. GPT-3 uses 2048 tokens at a time, and that's with an efficient encoding that cripples it on many tasks. With the naive architecture, increasing the sequence length is quadratically expensive, and getting up to novel-length sequences is not very likely.
• Data availability. From the same paper as the previous point, dataset size rises with the square-root of compute; a 1000x larger GPT-3 would want 10 trillion tokens of training data (see the sketch after this list).
  • It's hard to find a good estimate of total-words-ever-written, but our library of 130m books alone would exceed 10tn words. Considering books are a small fraction of our textual output nowadays, it shouldn't be difficult to gather sufficient data into one spot once you've decided it's a useful thing. So I'd be surprised if this was binding.
• Bandwidth and latency. Networking 500 V100s together is one challenge, but networking 500k V100s is another entirely.
  • I don't know enough about distributed training to say whether this is a very sensible constraint or a very dumb one. I think it has a chance of being a serious problem, but it's also the kind of thing you can design algorithms around. Validating such algorithms might take more than a timescale of months, however.
• Hardware availability. From the estimates above there are about 500 GPU-years in GPT-3, or - based on a one-year training window - $5m worth of V100s at $10k/piece. This is about 1% of NVIDIA's quarterly datacenter sales. A 100x scale-up by multiple companies could saturate this supply.
  • This constraint can obviously be loosened by increasing production, but that would be hard on a timescale of months.
• Commoditization. If many companies go for huge NLP models, the profit each company can extract is driven towards zero. Unlike with other capex-heavy research - like pharma - there's no IP protection for trained models. If you expect profit to be marginal, you're less likely to drop $1bn on your own training program.
  • I am skeptical of this being an important factor while there are lots of legacy, human-driven systems to replace. Replacing those systems should be more than enough incentive to fund many companies' research programs. Longer term, the effects of commoditization might become more important.
• Inference costs. The GPT-3 paper (§6.3) gives 0.4 kWh/100 pages of output, which works out to 500 pages/dollar from eyeballing hardware cost as 5x electricity (see the sketch after this list). Scale up 1000x and you're at $2/page, which is cheap compared to humans but no longer quite as easy to experiment with.
  • I'm skeptical of this being a binding constraint. $2/page is still very cheap.
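For the data-availability and inference-cost points above, the arithmetic can be checked with a quick sketch; the ~300bn GPT-3 training tokens and ~$0.10/kWh electricity price are assumptions, the rest are the figures quoted in the list:

```python
# Fermi checks for two of the constraints above. Inputs are the rough figures
# from the text, plus assumptions: ~300bn GPT-3 training tokens, ~$0.10/kWh.

# Data availability: dataset size scales with sqrt(compute).
gpt3_tokens = 300e9
scaleup = 1000
print(f"Tokens wanted at 1000x: ~{gpt3_tokens * scaleup ** 0.5 / 1e12:.0f} trillion")  # ~9-10tn

# Inference cost: 0.4 kWh per 100 pages, hardware eyeballed as 5x the electricity bill.
kwh_per_100_pages = 0.4
electricity_price = 0.10   # assumed $/kWh
cost_per_100_pages = kwh_per_100_pages * electricity_price * 5
print(f"Pages per dollar today: ~{100 / cost_per_100_pages:.0f}")                         # ~500
print(f"Cost per page at 1000x the compute: ~${cost_per_100_pages * scaleup / 100:.2f}")  # ~$2
```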

## Beyond 1000x

Here we go from just pointing at big numbers to straight-up theorycrafting.

In all, tech investment as it is today plausibly supports another 100x-1000x scale-up in the very near term. If we get to 1000x - 1 ZFLOPS-day per model, $1bn per model - then there are a few paths open. I think the key question is whether, by 1000x, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is - and I think it's plausible that it will be - then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP. On paper that leaves room for another 1000x scale-up as investment reaches up to $1tn, though current market mechanisms aren't really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
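For reference, the "1 ZFLOPS-day per model, $1bn per model" pairing is just the ~$1000/PFLOPS-day figure from the Compute Cost section scaled up - a quick check:

```python
# Check the "1 ZFLOPS-day per model, $1bn per model" pairing against the
# ~$1000/PFLOPS-day estimate from the Compute Cost section.
pflops_days_per_zflops_day = 1e21 / 1e15           # 1,000,000
cost = pflops_days_per_zflops_day * 1_000          # at $1000/PFLOPS-day
print(f"1 ZFLOPS-day at $1000/PFLOPS-day: ~${cost / 1e9:.0f}bn")   # ~$1bn
```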

That's from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP, so this degree of public investment is possible in principle.

The even more extreme path is if by 1000x you've got something that can design better algorithms and better hardware. Then I think we're in the hands of Christiano's slow takeoff four-year-GDP-doubling.

That's all assuming performance continues to improve, though. If by 1000x the model is not obviously a challenger to human supremacy, then things will hopefully slow down to ye olde fashioned 2010s-Moore's-Law rates of progress and we can rest safe in the arms of something that's merely HyperGoogle.

## Comments

One thing that's bothering me is... Google/DeepMind aren't stupid. The transformer model was invented at Google. What has stopped them from having *already* trained such large models privately? GPT-3 isn't that strong evidence for the effectiveness of scaling transformer models; GPT-2 was already a shock and caused huge public commotion. And in fact, if you were close to building an AGI, it would make sense for you not to announce this to the world, especially as open research that anyone could copy/reproduce, for obvious safety and economic reasons.

Maybe there are technical issues keeping us from doing large jumps in scale (i.e., we only learn how to train a 1 trillion parameter model after we've trained a 100 billion one)?

ChristianKl: Do we know the size of the net that does translation and speech-to-text for Google?
Ricardo Meneghin: I'm not sure what model is used in production, but the SOTA [https://arxiv.org/abs/2006.16668#google] reached 600 billion parameters recently.
bmc: This answer likely betrays my lack of imagination, but I'm not sure what Google would use GPT-3 for. It's probably much more expensive than whatever Gmail uses to predict text, and the additional accuracy might not provide much additional value. Maybe they could sell it as a service, as part of GCP? I'm not sure how many people inside Google have the ability to sign $15M checks; you would need at least one of them to believe in a large market, and I'm personally not sure there's a large enough market for GPT-3 for it to be worth Google's time. This is all to say, I don't think you should draw the conclusion that Google is either stupid or hiding something. They're likely focusing on finding better architectures; it seems a little early to focus on scaling up existing ones.
gwern: Text embeddings for knowledge graphs and ads is the most immediately obvious big-bucks application.
Daniel Kokotajlo: Can you explain more?

GPT-3 based text embedding should be extremely useful for creating summaries of arbitrary text (such as, web pages or ad text) which can be fed into the existing Google search/ad infrastructure. (The API already has a less-known half, where you upload sets of docs and GPT-3 searches them.) Of course, they already surely use NNs for embeddings, but at Google scale, enhanced embeddings ought to be worth billions.

Ricardo Meneghin: I think the OP and my comment suggest that scaling current models 10000x could lead to AGI, or at least something close to it. If that is true, it doesn't make sense to focus on finding better architectures right now.
Raemon: Minor note: could people include commas in their Big Numbers, to make it easier to distinguish 1000 from 10,000 at a glance?
dogiv: Sounds like something GPT-3 would say...
> Networking 500 V100s together is one challenge, but networking 500k V100s is another entirely.

Even if you might have trouble networking a 100x larger system together for training, you can train the smaller network 100x and stitch answers together using ensemble methods, and make decent use of the extra compute. It may not be as good as growing the network that full factor, but if you have extra compute beyond the cap of whatever connected-enough training system size you can muster, there are worse ways to spend it.
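To illustrate the kind of ensembling being suggested, here is a minimal sketch - the `models` callables and the choice to average logits are purely illustrative, not a claim about how any lab would actually do it:

```python
import numpy as np

# Minimal sketch of logit-averaging ensembling: `models` is any list of callables
# mapping a context to next-token logits. Everything here is illustrative.
def ensemble_next_token_logits(models, context):
    logits = np.stack([m(context) for m in models])  # (n_models, vocab_size)
    return logits.mean(axis=0)                       # average in logit space

# Hypothetical stand-ins for independently trained copies of a smaller model.
vocab_size = 8
models = [lambda ctx, seed=s: np.random.default_rng(seed).normal(size=vocab_size)
          for s in range(3)]
print(ensemble_next_token_logits(models, context=[1, 2, 3]))
```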

I am somewhat more prone to think that more selective attention (e.g. Big Bird's block-random attention model) could bring down the quadratic cost of the window size quickly enough to be a factor here. Replacing a quadratic term with a linear or n log n or heck even an n^1.85 term goes a long way when billions are on the table.
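To give a feel for how much that buys, an illustrative comparison (real attention variants differ by constant factors this ignores):

```python
import math

# How much the attention term shrinks at a long context if the quadratic cost is
# replaced by cheaper scalings. Illustrative only.
n = 100_000   # hypothetical context length, in tokens
quadratic = n ** 2
for name, cost in [("n^1.85", n ** 1.85), ("n log n", n * math.log2(n)), ("n", float(n))]:
    print(f"{name:>7}: ~{quadratic / cost:,.0f}x cheaper than n^2 at n={n:,}")
```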

Isn't GPT-3 already almost at the theoretical limit of the scaling law from the paper? This is what is argued by nostalgebraist in his blog and colab notebook. You also get this result if you just compare the 3.14E23 FLOP (i.e. 3.6k PFLOPS-days) cost of training GPT-3 from the lambdalabs estimate to the ~10k PFLOPS-days limit from the paper.

(Of course, this doesn't imply that the post is wrong. I'm sure it's possible to train a radically larger GPT right now. It's just that the relevant bound is the availability of data, not of compute power.)
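The unit conversion in that comparison checks out - a quick sketch with the quoted figures:

```python
# Convert the quoted training-compute estimate into PFLOPS-days and compare with
# the ~10k PFLOPS-days projected breakdown point.
total_flop = 3.14e23                 # lambdalabs-style estimate for GPT-3
flop_per_pflops_day = 1e15 * 86_400  # 1 PFLOP/s sustained for one day
print(f"GPT-3 training compute: ~{total_flop / flop_per_pflops_day:,.0f} PFLOPS-days "
      f"(vs ~10,000 at the projected breakdown)")
```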

Andy Jones: It's indeed strange no-one else has picked up on this, which makes me feel I'm misunderstanding something. The breakdown suggested in the scaling law does imply that this specific architecture doesn't have much further to go. Whether the limitation is in something as fundamental as 'the information content of language itself', or in the more easily bypassed 'the information content of 1024-token strings', is unclear. My instinct is for the latter, though again, given that no-one else has mentioned it - not even the paper authors - I get the uncomfortable feeling I'm misunderstanding something. That said, having written that quote a few days ago and had no-one pull me up on it since has increased my confidence that it's a viable interpretation.
nostalgebraist: They do discuss this a little bit in that scaling paper, in Appendix D.6 (edit: actually Appendix D.5). At least in their experimental setup, they find that the first 8 tokens are predicted better by a model with only 8 tokens in its window than by one with 1024 tokens, if the two have equally many parameters. And later tokens are harder to predict, and hence require more parameters if you want to reach some given loss threshold. I'll have to think more about this and what it might mean for their other scaling laws... at the very least, it's an effect which their analysis treats as approximately zero, and math/physics models with such approximations often break down in a subset of cases.
Andy Jones: While you're here and chatting about D.5 (assume you meant 5), another tiny thing that confuses me is Figure 21. Am I right in reading the bottom two lines as 'seeing 255 tokens and predicting the 256th is exactly as difficult as seeing 1023 tokens and predicting the 1024th'? Edit: Another look and I realise Fig 20 shows things much more clearly - never mind, things continue to get easier with token index.
gwern: The likelihood-loss intersection point is very vague, as they point out: for that specific architecture/training method/dataset, it only weakly suggests a crossover to a slower-scaling curve requiring increasing data, anywhere between 10^4 and 10^6 or so. As GPT-3 hits 10^3 and is still dead on the scaling curve, it seems that any crossover will happen much higher rather than lower. (I suspect part of what's going on there is the doubled context window: as nostalgebraist notes, their experiments with 1024 ctx strongly suggest that the more context window you have, the more you can learn profitably, so doubling to 2048 ctx probably pushed off the crossover quite a bit. Obviously, they have a long way to go there.) So the crossover itself, much less negative profitability of scaling, may be outside the current 100-1000x being mooted. (I'd also note that I don't see why they are so baffled at the suggestion that a model could overfit in a single epoch. Have they looked at the Internet lately? It is not remotely a clean, stationary, minimal, or i.i.d. dataset, even after cleaning & deduplication.) I also think that, given everything we've learned about prompt programming and the large increases in benchmarks like arithmetic or WiC, making arguments from pseudo-lack-of-scaling in the paper's benchmarks is somewhere between foolish and misleading, at least until we have an equivalent set of finetuning benchmarks, which should cut through the problem of good prompting (however bad the default prompt is, biasing performance downwards, some finetuning should quickly fix that regardless of meta-learning) and show what GPT-3 can really do.

> GPT-3 is the first AI system that has obvious, immediate, transformative economic value.

That's an interesting claim. Is there a source that goes into more detail about possible applications?

Andy Jones: There's a LW thread with a collection of examples [https://www.lesswrong.com/posts/6Hee7w2paEzHsD6mn/collection-of-gpt-3-results], and there's the beta website itself [https://beta.openai.com/].
ChristianKl: Kaj's post mostly has examples that aren't of commercial value but are cool things you can do. The OpenAI website, however, has a few examples that I think could justify a larger commercial need.
GdL752: Well, they already have an industry for behavioral/intent marketing - this could make it a lot better. So: taking data and using it to find correlates of a behavior in the buying process and monetizing that. We have IoT taking off; imagine a scenario where we have so much data being fed to a machine learning algorithm that we could type into a console "what behaviors predict that someone will buy a home in the next 3 months?", and imagine that its answer is pretty predictive - how much is that worth to a real estate agent? Now apply it to literally any purchase behavior where the profit margin allows for the use of this technology (obviously more difficult in places with different data privacy laws); the machine learning algo could know you want a new pink sweater before it's even occurred to you, with whatever level of accuracy. As far as creative work, I'd be really curious to see how it handles comedy - throw it in a writing room for script punch-up (and that's only until it can completely write the scripts); punch-up is where they hire comedians and comedy writers to sit around and add jokes to movies or TV shows. I also see a lot of use in making law accessible, because it could conceivably parse through huge amounts of law and legal theory (I know it can't reason, but bear with me - even just using its current model) and spit out fairly coherent answers for laymen (maybe as a free search engine, profitable via ads for lawyers). If we do see the imagined improvements by just giving it more computronium, we may be staring down the advent of a volitionless "almost oracle". I'm really excited to see what happens when you give it enough GPUs and train it on physics models.
ChristianKl: While there are people who run machine learning to find behavior in the buying process, I'm not sure what GPT-3 offers to those applications. I can imagine dragons flying around, but that doesn't mean they exist. Why should I believe that GPT-3 can give a good answer to that question? It can spit out answers that are coherent, but a lot of them will be wrong.

Promoted to curated: I think the question of whether we are in an AI overhang is pretty obviously relevant to a lot of thinking about AI Risk, and this post covers the topic quite well. I particularly liked the use of a lot of small Fermi estimates, and how it covered a lot of ground in relatively little writing.

I also really appreciated the discussion in the comments, and felt that Gwern's comment on AI development strategies in particular helped me build a much better map of the modern ML space (though I wouldn't want it to be interpreted as a complete map o...

Andy Jones: Thanks for the feedback! I've cleaned up the constraints section a bit, though it's still less coherent than the first section. Out of curiosity, what was it that convinced you this isn't an infohazard-like risk?
habryka: Some mixture of:

• I think it's pretty valuable to have open conversation about being in an overhang, and I think on the margin it will make those worlds go better by improving coordination. My current sense is that the perspective presented in this post is reasonably common among people in ML, so marginally reducing how many people believe it is not going to make much of a difference, but having good writeups that summarize the arguments seems to have a better chance of creating some kind of common knowledge that allows people to coordinate better here.
• This post, more so than other posts in its reference class, emphasizes a bunch of the safety concerns, whereas I expect the next post to replace it not to do that very much.
• Curation in particular mostly sends the post out to more people who are concerned with safety. This post found a lot of traction on HN and other places, so in some sense the cat is out of the bag and if it was harmful the curation decision won't change that very much; and it seems like it would unnecessarily hinder the people most concerned about safety if we don't curate it (since the considerations do also seem quite relevant to safety work).

One thing we have to account for is advances in architecture, even in a world where Moore's law is dead - to what extent memory bandwidth is a constraint on model size, and so on. You could rephrase this as asking how much of an "architecture overhang" exists. One frame to view this through is that in the era of Moore's law we banked a lot of parallel architectural advances, because we lacked a good use case for them. We now have such a use case. So the question is how much performance is sitting in the bank, waiting to be pulled out in the next 5 y...

> As an aside, though it's not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.

The GPT architecture isn't even close to being the best Transformer architecture anyway. As an example, someone benchmarked XLNet (over a year old, and with recurrency - one of the ways to break GPT's context-window bottleneck) last week, and it achieves ~10x better parameter efficiency (a 0.4b-parameter XLNet model ≈ a 5b-parameter GPT-3 model) at the few-shot meta-learning task he tried.

Expanding to 2048 BPEs probably buys GPT-3 more headroom (more useful data to learn from, and more for the meta-learning to condition on), and expanding to efficient attentions/recurrency/memory will enable even better prediction performance, with unknown meta-learning or generalization consequences.

(The problem there is the tradeoff between compute efficiency of training and better architectures. It's not obvious where you want to go: GShard, for example, takes the POV that even GPT is too fancy and slow and inefficient to train on existing hardware, and goes with the even more drasti...

Moore's Law is not dead. I could rant about the market dynamics that made people think otherwise, but it's easier just to point to the data.

Moore's Law might die in the near future, but I've yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras wafer coated with a few layers of Nantero's NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion-parameter models, using only technology already on our roadmap.

Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it's probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it's much harder to say they will all fail.

maximkazhenkov: Is density even relevant when your computations can be run in parallel? I feel like price-performance will be the only relevant measure, even if that means slower clock cycles.
Veedrac: Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to too-large clusters of computers, or primarily use high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.

[comment wondering about impracticality of running a 1000x scaled up GPT. But as Gwern points out, running costs are actually pretty low. So even if we spent a billion or more on training a human-level AI, running costs would still be manageable.]

As noted, the electricity cost of running GPT-3 is quite low, and even with the capital cost of GPUs being amortized in, GPT-3 likely doesn't cost dollars to run per hundred pages, so scaled up ones aren't going to cost millions to run either. (But how much would you be willing to pay for the right set of 100 pages from a legal or a novel-writing AI? "Information wants to be expensive, because the right information can change your life...") GPT-3 cost millions of dollars to train, but pennies to run.

That's the terrifying thing about NNs and what I dub the "neural net overhang": the cost to create a powerful NN is millions of times greater than the cost to run that NN. (This is not true of many paradigms, particularly ones where there's less of a distinction between training and running, but it is of NNs.) This is part of why there's a hardware overhang - once you have the hardware to create an AGI NN, you then by definition already have the hardware to run orders of magnitude more copies or more cheaply or bootstrap it into a more powerful agent.

ChristianKl: I'm not sure why that's terrifying. It seems reassuring to me, because it means that there's no way for the NN to suddenly go FOOM - it can't just quickly retrain.

But it can. That's the whole point of GPT-3! Transfer learning and meta-learning are so much faster than the baseline model training. You can 'train' GPT-3 without even any gradient steps - just examples. You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it everywhere at tiny marginal cost.

With NNs, 'foom' is not merely possible, it's the default. If you train a model, then as soon as it's done you get, among other things:

• the ability to run thousands of copies in parallel on the same hardware

  • in a context like AlphaGo, I estimate several hundred Elo of strength gain if you reuse the same hardware to merely run tree search with exact copies of the original model
• meta-learning / transfer-learning to any related domain, cutting training requirements by orders of magnitude

• model compression/distillation to train student models which are a fraction of the size, FLOPS, or latency (ratios varying widely based on task, approach, domain, acceptable performance degradation, targeted hardware etc, but often extreme like 1/100th - see the sketch after this list)

• reuse of the model elsewhere to instantly power up other models (eg use of text or image embeddings for a DRL agent)
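For the distillation point in the list above, a minimal sketch of the usual soft-target loss (Hinton-style knowledge distillation; the logits and temperature here are illustrative):

```python
import numpy as np

# Minimal sketch of Hinton-style knowledge distillation: the student is trained to
# match the teacher's softened output distribution. Values here are illustrative.
def softmax(logits, temperature=1.0):
    z = np.exp((logits - logits.max()) / temperature)
    return z / z.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return -(p_teacher * log_p_student).sum()   # cross-entropy against soft targets

print(distillation_loss(np.array([5.0, 1.0, 0.5]), np.array([3.0, 1.5, 0.2])))
```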

ESRogs: Somewhat related to these: if there's such a huge gap between how expensive these models are to train and to run, then it seems like you'd end up wanting to run a whole bunch of them to help you train the next model, if you can. You mention distilling a large model to a smaller, more efficient model. But can a smaller model also be used to efficiently bootstrap a new, larger model?
gwern: I'm not sure it's done much, but probably, depending on what you're thinking. You can probably do reverse-distillation (e.g. dark knowledge - use the logits of the smaller model to provide a much richer feedback signal for the larger model when it's untrained, saving compute, and eventually dropping back to the raw data training signal once big > small to avoid its limits), and more directly, you can use net2net model surgery to increase model sizes, like progressive growing in ProGAN - or, more relevantly, the way OA kept doing model surgery on OA5 to warm-start it each time they wanted to handle some new DotA 2 feature or the latest version, saving an enormous amount of compute compared to starting from scratch dozens of times.
ESRogs: Interesting. So, given that big models are so powerful but so expensive to train, and that it is possible to bootstrap them a bit, do we converge towards a situation where we pay the cost of training the largest model approximately once, worldwide and across time? (In other words, we'd just keep bootstrapping from whatever was best before, and no longer pay the cost of training from scratch.) On the other hand, if compute (per dollar) keeps growing exponentially, then maybe it's less significant whether you're retraining from scratch or not. (Recapitulating the work equivalent to training yesterday's models will be cheap, so no great benefit from bootstrapping.)
gwern: I'm not sure. I think one might have to do some formal economics modeling to see what the dynamics might be: is this a natural-monopoly situation where the first one to train a model wins and has a moat to deter anyone else from bothering, or do they invest revenue in continually expanding and improving the model in various ways to always keep ahead of competitors with network effects, so the decrease in cost of compute is largely irrelevant and it's a natural oligopoly (in much the same way that creating a search engine is cheaper every day, in some sense, but good luck competing with Google), or what? At least thus far, we haven't seen monopolistic behavior naturally emerge: for all the efforts at AI cloud APIs, none of them have a lock on usage the way that, say, Nvidia GPUs have on hardware, and the constant progress (and regular giveaways of code/models/data by FANG) make it hard for anyone to attempt to enclose some commons. As far as GPT-2 goes, quite a few entities trained their own >GPT-2-1.5b models after GPT-2 was announced (and I believe there are viable alternatives to other major DL projects like AlphaGo produced by open-source groups or East Asian corporations), but on the gripping hand, that was back when it was so easy a hobbyist with a few crumbs from Google could do it (which happened twice) - as they get bigger, it won't be so easy to download some dumps and put a few TFRC TPUs to work. So we'll see how many competitors emerge to GPT-3 over the next year or two!
Donald Hobson: It means that if there are approaches that don't need as much compute, the AI can invent them fast.
wunan: This was mentioned in the "Other Constraints" section of the original post:

Exciting and scary at the same time.

My intuition is that we were in an overhang since at least the time when personal computers became affordable to non-specialists. Unless quantity does somehow turn into quality, as Gwern seems to think, even a relatively underpowered computer should be able to host an AGI capable of upscaling itself.

On the other hand, I'm now imagining a story where a rogue AI has to hide for decades because it's not smart enough yet and can't invent new processors faster than humans.

DragonGod: Maybe for the most efficient possible algorithm, but even that is not clear, and it's not clear we'll discover such algorithms anytime soon. Using only current algorithms and architecture, a scaling jump of a few orders of magnitude seems doable.

I think everyone is speculating on whether a bigger model than GPT-3 is possible and what it would cost, etc. But what will a model bigger than GPT-3 do better than GPT-3? Can we have some concrete examples, so that when GPT-4 comes along we can compare?

Return on investment in the field of AI seems to be sub-linear beyond a certain point. Because it's still the sort of domain that relies on specific breakthroughs, it's dubious how effective parallel research can be. Hence, my guess would be that we don't scale because we can't currently scale.

Your quoted cost for training the model is for training such a model **once**. This is not how the researchers do it; they train the models many times with different hyperparameters. I have no idea, however, how hyperparameter tuning is done at such scales, but I guarantee that the compute cost is higher than just the cost of training it once.

And OA trained GPT-3-175b **once**, it looks like: note the part where they say they didn't want to train a second run to deal with the data contamination issue because of cost.