Inference Speed is Not Unbounded

[-]rotatingpaguro2y60

I don't completely understand your point because I don't have a calibration for your "slow" in "training an AI must be slow". How slow is "slow"? Compared to what? (Leaving aside Solomonoff inductors and other incomputable things.)

Do you consider the usual case that "a toddler requires fewer examples" as a reference for "not slow"? If so: human DNA is < 1GB, so humans get at most 1GB of free knowledge as inductive bias. Does your argument for "AI slow" then rely on us not getting to that <1GB of stuff to preconfigure in a ML system? If not so (humans slow too): do you think humans are a ceiling, or close to one, on data efficiency?

[-]OneManyNone2y21

You're right that my points lack a certain rigor. I don't think there is a rigorous answer to questions like "what does slow mean?".

However, there is a recurring theme I've seen in discussions about AI where people express incredulity about neural networks as a method for AGI since they require so much "more data" than humans to train. My argument was merely that we should expect things to take a lot of data, and situations where they don't are illusory. Maybe that's less common in this space, so it I should have framed it differently. But I wrote this mostly to put it out there and get people's thoughts.

Also, I see your point about DNA only accounting for 1GB. I wasn't aware it was so low. I think it's interesting and suggests the possibility of smaller learning systems than I envisioned, but that's as much a question about compression as anything else. Don't forget that that DNA still needs to be "uncompressed" into a human, and at least some of that process is using information stored in the previous generation of human. Admittedly, it's not clear how much that last part accounts for, but there is evidence that part of a baby's development is determined by the biological state of the mother.

But I guess I would say my argument does rely on us not getting that <1GB of stuff, with the caveat that that 1GB is super highly compressed through a process that takes a very complex system to uncompress.

I should add as well that I definitely don't believe that LLMs are remotely efficient, and I wouldn't necessarily be surprised if humans are as close to the maximum on data efficiency as possible. I wouldn't be surprised if they weren't, either. But we were built over millions (billions?) of years under conditions that put a very high price tag on inefficiency, so it seems reasonable to believe our data efficiency is at least at some type of local minima.

EDIT: Another way to phrase the point about DNA: You need to account not just for the storage size of the DNA, but also the Kolmogrov complexity of turning that into a human. No idea if that adds a lot to its size, though.

[-]tarwatirno2y40

1GB for DNA is a lower bound. That's how much it takes to store the abstract base pair representation. There's lots of other information you'd need to actually build a human and a lot of it is common to all life. Like, DNA spends most of its time not in the neat little X shapes that happen during reproduction, but in coiled up little tangles. A lot of the information is stored in the 3D shape and in the other regulatory machinery attached to the chromosomes.

If all you had was a human genome, the best you could do would be to do a lot of simulation to reconstruct all the other stuff. Probably doable, but would require a lot of "relearning."

The brain also uses DNA for storing information on the form of methylation patterns in individual neurons.

[-]rotatingpaguro2y43

I expect that the mother does not add much to the DNA as information; so yes it's complex and necessary, but I think you have to count almost only the size of DNA for inductive bias. That said, this is a gut guess!

However, there is a recurring theme I've seen in discussions about AI where people express incredulity about neural networks as a method for AGI since they require so much "more data" than humans to train. My argument was merely that we should expect things to take a lot of data, and situations where they don't are illusory.

Yeah I got this, I have the same impression. The way I think about the topic is: "The NN requires tons of data to learn human language because it's a totally alien mind, while humans have produced themselves their language, so it's tautologically adapted to their base architecture, you learn it easily only because it's designed to be learned by you".

But after encountering the DNA size argument myself a while ago, I started doubting this framework. It may be possible to do much, much better that what we do now.

[-]OneManyNone2y42

Yeah, I agree that it's a surprising fact requiring a bit of updating on my end. But I think the compression point probably matters more than you would think, and I'm finding myself more convinced the more I think about it. A lot of processing goes into turning that 1GB into a brain, and that processing may not be highly reducible. That's sort of what I was getting at, and I'm not totally sure the complexity of that process wouldn't add up to a lot more than 1GB.

It's tempting to think of DNA as sufficiently encoding a human, but (speculatively) it may make more sense to think of DNA only as the input to a very large function which outputs a human. It seems strange, but it's not like anyone's ever built a human (or any other organism) in a lab from DNA alone; it's definitely possible that there's a huge amount of information stored in the processes of a living human which isn't sufficiently encoded just by DNA.

You don't even have to zoom out to things like organs or the brain. Just knowing which bases match to which amino acids is an (admittedly simple) example of processing that exists outside of the DNA encoding itself.

[-]M. Y. Zuo2y0-1

Even if you include a very generous epigenetic and womb-environmental component 9x bigger then the DNA component, any possible human baby at birth would need less then 10 GB to describe them completely with DNA levels of compression.

A human adult at age 25 would probably need a lot more to cover all possible development scenarios, but even then I can't see it being more then 1000x, so 10TB should be enough.

For reference Windows Server 2016 supports 24 TB of RAM, and many petabytes of attached storage.

[-]OneManyNone2y51

I think you're broadly right, but I think it's worth mentioning that DNA is a probabilistic compression (evidence: differences in identical twins), so it gets weird when you talk about compressing an adult at age 25 - what is probabilistic compression at that point?

But I think you've mostly convinced me. Whatever it takes to "encode" a human, it's possible to compress it to be something very small.

[-]M. Y. Zuo2y0-1

A minor nitpick, DNA, the encoding concept, is not probabilistic, it's everything surrounding such as the packaging, 3D shape, epigenes, etc., plus random mutations, transcription errors, etc., that causes identical twins to deviate.

Of course it is so compact because it doesn't bother spending many 'bits' on ancilliary capabilities to correct operating errors.

But it's at least theoretically possible for it to be deterministic under ideal conditions.

[-]OneManyNone2y31

To that first sentence, I don't want to get lost in semantics here. My specific statement is that the process that takes DNA into a human is probabilistic with respect to the DNA sequence alone. Add in all that other stuff, and maybe at some point it becomes deterministic, but at that point you are no longer discussing the <1GB that makes DNA. If you wanted to be truly deterministic, especially up to the age of 25, I seriously doubt it could be done in less than millions of petabytes, because there are such a huge number of miniscule variations in conditions and I suspect human development is a highly chaotic process.

As you said, though, we're at the point of minor nitpicks here. It doesn't have to be a deterministic encoding for your broader points to stand.

[-]M. Y. Zuo2y10

Perhaps I phrased it poorly, let me put it this way.

If super-advanced aliens suddenly showed up tomorrow and gave us the near-physically-perfectly technology, machines, techniques, etc., we could feasibly have a fully deterministic, down to the cell level at least, encoding of any possible individual human stored in a box of hard drives or less.

In practical terms I can't even begin to imagine the technology needed to reliably and repeatably capture a 'snapshot' of a living, breathing, human's cellular state, but there's no equivalent of a light speed barrier preventing it.

[-]green_leaf2y10

How did you estimate the number of possible development scenarios till the age 25?

[-]M. Y. Zuo2y*1-3

Total number of possible permutations an adult human brain could be in and still remain conscious, over and above that of a baby's. The most extreme edge cases would be something like a Phineas Gage, where a ~1 inch diameter iron rod was rammed through a frontal lobe and he still could walk around.

So fill in the difference with guesstimation

I doubt there's literally 1000x more permutations, since there's already a huge range of possible babies, but I chose it anyways as a nice round number.

[-]johnlawrenceaspden2y31

All of RL’s successes, even the huge ones like AlphaGo (which beat the world champion at Go) or its successors, were not easy to train. For one thing, the process was very unstable and very sensitive to slight mistakes. The networks had to be designed with inductive biases specifically tuned to each problem.
And the end result was that there was no generalization. Every problem required you to rethink your approach from scratch. And an AI that mastered one task wouldn’t necessarily learn another one any faster.

I had the distinct impression that AlphaZero (the version of AlphaGo where they removed all the tweaks) could be left alone for an afternoon with the rules of almost any game in the same class as go, chess, shogi, checkers, noughts-and-crosses, connect four, othello etc, and teach itself up to superhuman performance.

In the case of chess, that involved rediscovering something like 400 years of human chess theorizing, to become the strongest player in history including better than all previous hand-constructed chess programs.

In the case of go, I am told that it not only rediscovered a whole 2000 year history of go theory, but added previously undiscovered strategies. "Like getting a textbook from the future", is a quote I have heard.

That strikes me as neither slow nor ungeneral.

And there was enough information in the AlphaZero paper that it was replicated and improved on by the LeelaChessZero open-source project, so I don't think there can have been that many special tweaks needed?

[-]Noosphere891y20

Admittedly, the success of AlphaZero relied on it being essentially able to generate very, very large amounts of very high-quality data, so this is a domain where synthetic data was very successful.

So a weaker version of the post is "you need either a lot of data, and high quality ones, or high amounts of compute, and there's little going around it."

[-][anonymous]2y30

One aspect you skipped over was how a superintelligence might reason if given data that has many possible hypotheses for an explanation.

You mentioned occam's razor, and kinda touched on inductive biases, but I think you left out something important.

If you think about it, Occam's razor is part of a process of : consider multiple hypothesis. Take the minimum complexity hypothesis, discard the others.

We can do better than that, trivially. See particle filters. In that case the algorithm is : consider up to n possible hypotheses and store them in memory in a hypothesis space able to contain them. (so the 2d particle filter exists in a space for only 2d coordinates, but it is possible to have an n dimension filter for the space of coherent scientific theories etc).

A human intelligence using occam's razor is just doing a particle filter where you carry over only 1 point from step to step. And during famous scientific debates, another "champion" of a second theory held onto a different point.

Since a superintelligence can have an architecture with more memory and more compute, it can hold n points. It could generate millions of hypotheses (or near infinite) from the "3 frames" example, and some would contain correct theories of gravity. It can then reason using all hypotheses it has in memory, weighted by probability or by a clustering algorithm or other methods. This means it would be able to act, controlling robotics in the real world or making decisions, without having found a coherent theory of gravity yet, just a large collection of hypotheses biased towards 'objects fall'. (3 frames is probably not enough information to act effectively)

I'm not sure "architecture" isn't distinct from inductive bias. "Architecture" is the dimensions of each network, the topology connecting each network, the subcomponents of each network, and the training loss functions used at each stage. What's different is that a model cannot learn it's way past architectural limits, a model the scale of GPT-2 cannot approach the performance of GPT-4 no matter the amount of training data.

So inductive bias = information you started with, isn't necessary because a general enough network can still learn the same information if it has more training information.

architecture = technical way the machine is constructed, it puts a ceiling on capabilities even with infinite training data.

Another aspect of this is considering the particle filter case, where a superintelligence tracks n hypotheses for what it believes during a decision making process. Each time you increment n+1, you increase compute needed per decision by O(n) or some cases much worse than that. There's probably a way to mathematically formalize this and estimate how a superintelligence's decision making ability scales with compute, since each additional hypothesis you track has diminishing returns. (probably per the same power law for training loss for llm training)

[-]OneManyNone2y10

To your point about the particle filter, my whole point is that you can’t just assume the super intelligence can generate an infinite number of particles, because that takes infinite processing. At the end of the day, superintelligence isn’t magic - those hypotheses have to come from somewhere. They have to be built, and they have to be built sequentially. The only way you get to skip steps is by reusing knowledge that came from somewhere else.

Take a look at the game of Go. The computational limits on the number of games that could be simulated made this “try everything” approach essentially impossible. When Go was finally “solved”, it was with an ML algorithm that proposed only a limited number of possible sequences - it was just that the sequences it proposed were better.

But how did it get those better moves? It didn’t pull them out of the air, it used abstractions it had accumulated form playing a huge number of games.

_____

I do agree with some of the things you’re saying about architecture, though. Sometimes inductive bias imposes limitations. In terms of hypotheses, it can and does often put hard limits on which hypotheses you can consider, period.

I also admit I was wrong and was careless in saying that inductive bias is just information you started with. But I don’t think it’s imprecise to say that “information you started with” is just another form of inductive bias, of which ”architecture” is another.

But at a certain point, the line between architecture and information is going to blur. As I’ve pointed out, a transformer without some of the explicit benefits of a CNN’s architecture can still structure itself in a way that learns shift invariance. I also don’t think any of this effects my key arguments.

[-]Hastings2y10

Lets assume that as part of pondering the three webam frames, the AI thought of the rules of Go- ignoring how likely this is.

In that circumstance, in your framing of the question, would it be allowed to play several million games against itself to see if that helped it explain the arrays of pixels?

[-]OneManyNone2y10

I guess so? I'm not sure what point you're making, so it's hard for me to address it.

My point is that if you want to build something intelligent, you have to do a lot of processing and there's no way around it. Playing several million games of Go counts as a lot of processing.

[-]johnlawrenceaspden2y20

My basic argument is that the there are probably mathematical limits on how fast it is possible to learn.

Doubtless there are! And limits to how much it is possible to learn from given data.

But I think they're surprisingly high, compared to how fast humans and other animals can do it.

There are theoretical limits to how fast you can multiply numbers, given a certain amount of processor power, but that doesn't mean that I'd back the entirety of human civilization to beat a ZX81 in a multiplication contest.

What you need to explain is why learning algorithms are a 'different sort of thing' to multiplication algorithms.

Maybe our brains are specialized to learning the sorts of things that came in handy when we were animals.

But I'd be a bit surprised if they were specialized to abstract reasoning or making scientific inferences.

[-]AnthonyC2y20

I always assumed the original apple frames and grass quote to be...maybe not a metaphor, but at least acknowledged as a theoretical rather than practical ideal. What a hypercomputer executing Solomonoff induction might be able to accomplish.

The actual feat of reasoning described in the story itself is that an entire civilization of people approaching the known-attainable upper reaches of human intelligence, with all the past data and experience that entails, devoting its entire thought and compute budget for decades towards what amounts to a single token prediction problem with a prompt of a few MB in size.

I think we can agree that those are, at least, sufficiently wide upper and lower bounds for what would be required in practice to solve the Alien Physics problem in the story.

Everything else, the parts about spending half a billion subjective years persuading them to let us out of the simulation, is irrelevant to that question. So what really is the practical limit? How much new input to how big a pre-existing model? I don't know. But I do know that while humans have access to lots of data during our development, we throw almost all of it away, and don't have anywhere near enough compute to make thorough use of what's left. Which in turn means the limit of learning with the same data and higher compute should be much faster than human.

In any case, an AI doesn't need to be anywhere near the theoretical limit, in a world where readily available sources of data online include tens of thousands of years of video and audio, and hundreds of terabytes of text.

[-]Max H2y20

But there is no way to solve that problem exactly without doing a whole lot of work.^[3] For a couple hundred cities, we’re talking about more work than you could fit into the lifespan of the universe with computers millions of times stronger than the best supercomputers in existence.

It's true that an exact solution might be intractable, but approximate solutions are often good enough. According to wikipedia:

Modern methods can find solutions for extremely large problems (millions of cities) within a reasonable time which are with a high probability just 2–3% away from the optimal solution.

Perhaps there is a yet-undiscovered heuristic algorithm for approximating Solomonoff induction relatively efficiently.

[-]OneManyNone2y72

Yes, I wasn’t sure if it was wise to use TSP as an example for that reason. Originally I wrote it using the Hamiltonian Path problem, but thought a non-technical reader would be more able to quickly understand TSP. Maybe that was a mistake. It also seems I may have underestimated how technical my audience would be.

But your point about heuristics is right. That’s basically what I think an AGI based on LLMs would do to figure out the world. However, I doubt there would be one heuristic which could do Solomonoff induction in all scenarios, or even most. Which means you’d have to select the right one, which means you’d need a selection criteria, which takes us back to my original points.

[-]jacob_cannell2y20

Perhaps there is a yet-undiscovered heuristic algorithm for approximating Solomonoff induction relatively efficiently.

There are - approximate inference on neural networks, such as variants of SGD. Neural networks are a natural universal circuit language, so you can't get any more than a constant improvement by moving to another universal representation. And in the class of all learning algorithms which approximately converge to full bayesian inference (ie solomonoff induction), SGD style Langevin Dynamics are also unique and difficult to beat in practice - the differences between that and full bayesian inference reduce to higher order corrections which rapidly fall off in utility/op.

[-]David Johnston2y10

Do you have a link to a more in-depth defense of this claim?

[-]jacob_cannell2y61

I mean it's like 4 or 5 claims? So not sure which ones you want more in-depth on, but

Neural networks are universal is obvious, as arithmetic/analog circuits they fully generalize (reduce to) binary circuits, which are circuit complete.
A Full Bayesian Inference and Solomonoff Induction are equivalent - fairly obvious
B Approximately converge is near guaranteed if the model is sufficiently overcomplete and trained long enough with correct techniques (normalization, regularization, etc) - as in the worst case you can recover exhaustive exploration ala solomonoff. But SGD on NN is somewhat exponentially faster that exhaustive program search, as it can explore not a single solution at a time, but a number of solutions (sparse sub circuits embedded in the overcomplete model) that is exponential with NN depth (see lottery tickets, dropout, and sum product networks).
C " differences between that and full bayesian inference reduce to higher order corrections which rapidly fall off in utility/op". This is known perhaps experimentally in the sense that the research community has now conducted large-scale extensive (and even often automated) exploration of much of the entire space of higher order corrections to SGD, and come up with almost nothing much better than stupidly simple inaccurate but low cost 2nd order correction approximations like Adam. (The research community has come up with an endless stream of higher order optimizers that improve theoretical convergence rate, and near zero that improve wall time convergence speed. ) I do think there is still some room for improvement here, but not anything remotely like "a new category of algorithm".

But part of my claim simply is that modern DL techniques encompasses nearly all of optimization that is relevant, it simply ate everything, such that the possibility of some new research track not already considered is would be just nomenclature distinction at this point.

[-]David Johnston2y10

Neural networks being universal approximators doesn't mean they do as well at distributing uncertainty as Solomonoff, right (I'm not entirely sure about this)? Also, are practical neural nets actually close to being universal?

in the worst case you can recover exhaustive exploration ala solomonoff

Do you mean that this is possible in principle, or that this is a limit of SGD training?

known perhaps experimentally in the sense that the research community has now conducted large-scale extensive (and even often automated) exploration of much of the entire space of higher order corrections to SGD

I read your original claim as "SGD is known to approximate full Bayesian inference, and the gap between SGD and full inference is known to be small". Experimental evidence that SGD performs competitively does not substantiate that claim, in my view.

[-]jacob_cannell2y10

Also, are practical neural nets actually close to being universal?

Trivially so - as in they can obviously encode a binary circuit equivalent to a CPU, and also in practice in the sense that transformers descend from related research (neural turing machines, memory networks, etc) and are universal.

Do you mean that this is possible in principle, or that this is a limit of SGD training?

I mean in the worst case where you have some function that is actually really hard to learn - as long as you have enough data (or can generate it ) - big overcomplete NNs with SGD can obviously perform a strict improvement over exhaustive search .

"SGD is known to approximate full Bayesian inference, and the gap between SGD and full inference is known to be small"

Depends on what you mean by "gap" - whether you are measuring inference per unit data or inference per unit compute.

There are clearly scenarios where you can get faster convergence via better using/approximating the higher order terms, but that obviously is not remotely sufficient to beat SGD - as any such extra complexity must also pay for itself against cost of compute.

Of course if you are data starved, then that obviously changes things.

[-]David Johnston2y10

they can obviously encode a binary circuit equivalent to a CPU

A CPU by itself is not universal. Are you saying memory augmented neural networks are practically close to universality?

as long as you have enough data (or can generate it ) - big overcomplete NNs with SGD can obviously perform a strict improvement over exhaustive search

Sorry, I'm being slow here:

Solomonoff does exhaustive search for any amount of data; is part of your claim that as data -> infinity, NN + SGD -> Solomonoff?
How do we actually do this improved exhaustive search? Do we know that SGD gets us to a global minimum in the end?

[-]jacob_cannell2y4-4

A CPU by itself is not universal.

Any useful CPU is by my definition - turing universal.

Solomonoff does exhaustive search for any amount of data; is part of your claim that as data-> infinity, NN + SGD -> Solomonoff?

You can think of solomonoff as iterating over all programs/circuits by size, evaluating each on all the data, etc.

A sufficiently wide NN + SGD can search the full circuit space up to a depth D across the data set in an efficient way (reusing all subcomputations across sparse subcircuit solutions (lottery tickets)).

[-]David Johnston2y*27

Thanks for explaining the way to do exhaustive search - a big network can exhaustively search smaller network configurations. I believe that.

However, a CPU is not Turing complete (what is Turing universal?) - a CPU with an infinite read/write tape is Turing complete. This matters, because Solomonoff induction is a mixture of Turing machines. There are simple functions transformers can’t learn, such as “print the binary representation of the input + 1”; they run out of room. Solomonoff induction is not limited in this way.

Practical transformers are also usually (always?) used with exchangeable sequences, while Solomonoff inductors operate on general sequences. I can imagine ways around this (use a RNN and many epochs with a single sequence) so maybe not a fundamental limit, but still a big difference between neural nets in practice and Solomonoff inductors.

[-]ponkaloupe2y10

To learn gravity, you need additional evidence or context; to learn that the world is 3D, you need to see movement. To understand that movement, you have to understand how light moves, etc. etc.

for the 3d part: either the object of observation needs to move, or the observer needs to move: these are equivalent statements due to symmetry. consider two 2D images taken simultaneously from different points of observation: this provides the same information relevant here as were there to be but 2 images of a moving object from a stationary observer at slightly different moments in time.

in fact then, you don’t need to see movement in order to learn that the world is 3D. making movement a requirement to discover the dimensionality of a space mandates the additional dimension of time: how then could we discover the 4 dimensional space-time without access to some 5th dimensional analog of time? it’s an infinite regress.

similarly, you don’t need to understand the movement of light. certainly, we didn’t for a very long time. you just need to understand the projection from object to image. that’s where the bulk of these axiomatic properties of worldly knowledge reside (assumptions about physics being regular, or whatever else you need so that you can leverage things like induction in your learning).

[-]OneManyNone2y10

My objection applied at a different level of reasoning. I would argue that anyone who isn't blind understands light at the level I'm talking about. You understand that the colors you see are objects because light is bouncing off them and you know how to interpret that. If you think about it, starting from zero I'm not sure that you would recognize shapes in pictures as objects.

^{^}

Admittedly, it’s very rare that these limits on efficiency are actually proven, at least in the most general case, since no one’s proven that P != NP. But there is a lot of evidence that this is true.

^{^}

Technically this isn’t proven, but a whole lot of smart people believe it’s true. P does not equal NP and all that.

^{^}

It’s possible, maybe even likely, that if you actually could do the math on this you would find that the challenge of discovering gravity is really just doable in linear time given the minimal amount of required data. Maybe, who knows? None of this is well defined. I suspect the constant factors would still be very large, though.

^{^}

Unless you already built the first N-1 steps into your system. Let's not get ahead of ourselves though; I'll address that.

^{^}

Here's one last salient example: the field of mathematics itself. Technically there is no input data at all and all provable things are already provable before you even start to do any work. And yet I’d bet good money that there’s a hard limit to how fast any intelligence could infer certain mathematical facts. And of course, many formal proof systems in mathematics actually have the property that there will always exist statements that take an arbitrary amount of effort to prove.

^{^}

Actually, what I really did as a little kid was guess a number close to what I thought it was and refine from there, but that’s just another less elegant (and highly probabilistic) abstraction.

^{^}

Pretrained Transformers as Universal Computation Engines

^{^}

I guess you could come up with a different term, but that’s not the point. The point is whatever that knowledge is, it isn’t “learned.”

^{^}

Their attention mechanisms develop into shift-invariant Toeplitz matrices. (Pay Attention to MLPs)

^{^}

Although, is human learning actually more like fine-tuning? Maybe. Let’s not get into that; the argument would follow the same trajectory as the rest of this post anyway.

^{^}

Yes, evolution is a design process, it’s just not an intelligent design process.

^{^}

Obviously I'm not saying that LLMs won't get far more efficient to train in the coming years, just that they'll always require a certain minimum of resources. I’m also not giving a rigorous definition of “fast.” The exact value of that doesn’t matter; my points are more about the dynamics of learning.

^{^}

Playing Atari with Deep Reinforcement Learning

^{^}

If you want to get technical, the LLM is trained with RL during the whole process, since “next token prediction” is a special case of RL. But I don’t want to get that technical and I think my point is clear enough.

LESSWRONG
LW

LESSWRONG
LW

35

Inference Speed is Not Unbounded

35

35

Three Apples and a Blade of Grass

Maximum Inference: Data Only Goes So Far

There Are Limits to How Fast You Can Perform Inference

Precomputation: Intelligence is Just Accumulated Abstraction

Inductive Bias: The Knowledge You Start With

Sutton’s Bitter Lesson

There's No Skipping the Line

The Evolution of Human Intelligence

Conclusion: Training an AI System Must Be Slow

Final Thoughts: Why the Bet on Reinforcement Learning Didn't Pay Off