Comments

I'm quite unsure as well.

On one hand, I have the same feeling that it has a lot of weirdly specific, surely-not-universalizing optimizations when I look at it.

But on the other -- it does seem to do quite well on different envs, and if this wasn't hyper-parameter-tuned then that performance seems like the ultimate arbiter. And I don't trust my intuitions about what qualifies as robust engineering v. non-robust tweaks in this domain. (Supervised learning is easier than RL in many ways, but LR warm-up still seems like a weird hack to me, even though it's vital for a whole bunch of standard Transformer architectures and I know there are explanations for why it works.)

Similarly -- I dunno, human perceptions generally map to something like log-space, so maybe symlog on rewards (and on observations (?!)) makes deep sense? And maybe you need something like the gradient clipping and KL balancing to handle the not-iid data of RL? I might just stare at the paper for longer.
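For concreteness, here is a minimal sketch of the symlog transform and its inverse, assuming the usual definition symlog(x) = sign(x) * ln(|x| + 1). The function names are just illustrative and this is not the paper's code. It shows the log-space intuition: large magnitudes get squashed, the sign is kept, and values near zero are left roughly alone.

```python
import math


def symlog(x: float) -> float:
    """Symmetric log: sign(x) * ln(|x| + 1).

    Compresses large magnitudes, preserves sign, and is close to the
    identity near zero.
    """
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x: float) -> float:
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)


if __name__ == "__main__":
    # Rewards spanning several orders of magnitude land in a narrow range.
    for reward in (-1000.0, -1.0, 0.0, 0.01, 1.0, 1000.0):
        print(reward, symlog(reward), symexp(symlog(reward)))
```

A network would predict in symlog space and the prediction would be mapped back with symexp, so one set of hyperparameters can cope with very different reward scales.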

It's working for me? I disabled the cache in devtools and am still seeing it. It looks like it's hitting a LW-specific CDN also. (https://res.cloudinary.com/lesswrong-2-0/image/upload/v1674179321/mirroredImages/mRwJce3npmzbKfxws/kadwenfpnlvlswgldldd.png)

Thanks for this, this was a fun review of a topic that is both intrinsically and instrumentally interesting to me!

I remain pretty happy with most of this, looking back -- I think this remains clear, accessible, and about as truthful as possible without getting too technical.

I do want to grade my conclusions / predictions, though.

(1). I predicted that this work would quickly be exceeded in sample efficiency. This was wrong -- it's been a bit over a year and EfficientZero is still SOTA on Atari. My 3-to-24-month timeframe hasn't run out, but I said that I expected "at least a 25% gain" towards the start of the time, which hasn't happened.

(2). There has been a shift to multitask domains, or to multi-benchmark papers. This wasn't too hard of a prediction, but I think it was correct. (Although of course good evidence for such a shift would require comprehensive lit review.)

To sample two -- DreamerV3 is a very recently released model-based DeepMind algorithm. It does very well at Atari100k -- it gets a better mean score than everything but EfficientZero -- but it also does well at DMLab + 4 other benchmarks + even crafting a Minecraft diamond. The paper emphasizes the robustness of the algorithm, and is right to do so -- once you get human-level sample efficiency on Atari100k, you really want to make sure you aren't just overfitting to that!

And of course the infamous Gato is a multitask agent across a host of different domains, although its ultimate impact remains unclear at the moment.

(3). And finally -- well, the last conclusion, that there is still a lot of space for big gains in performance in RL even without field-overturning new insights, is inevitably subjective. But I think the evidence still supports it.

Thermodynamics is the deep theory behind steam engine design (and many other things) -- it doesn't tell you how to build a steam engine, but to design a good one you probably need to draw on it somewhat.

This post feels like a gesture at a deep theory behind truth-oriented forum / community design (and many other things) -- it certainly doesn't help tell you how to build one, but you have to think at least around what it talks about to design a good one. Also applicable to many other things, of course.

It also has the virtue of being very short. Per-word, one of my favorite posts.

I like this post because it:
-- Focuses on a machine which is usually non-central to accounts of the industrial revolution (at least in the others which I've read), which makes it novel and interesting to those interested in the roots of progress
-- Has a high ratio of specific empirical detail to speculation
-- Separates speculation from historical claims pretty cleanly

This post is a good review of a book and a good introduction to a space where small regulatory reform could result in great gains, and it also changed my mind about LNT. As an introduction to the topic, more focus on economic details would be great, but you can't be all things to all men.

There's a scarcity of stories about how things could go wrong with AI which are not centered on the "single advanced misaligned research project" scenario. This post (and the mentioned RAAP post by Critch) helps partially fill that gap.

It definitely helped me picture / feel some of what some potential worlds look like, to the degree I currently think something like this -- albeit probably slower, as mentioned in the story -- is more likely than the misaligned research project disaster.

It also (1) is a pretty good / fun story and (2) mentions the elements within the story which the author feels are unlikely, which is virtuous and helps prevent higher detail from being mistaken for plausibility.

I like this post in part because of the dual nature of the conclusion, aimed at two different audiences. Focusing on the cost of implementing various coordination schemes seems... relatively unexamined on LW, I think. The list of life-lessons is intelligible, actionable, and short.

On the other hand, I think you could probably push it even further in the "Secret of Our Success" tradition / culture direction. Because there's... a somewhat false claim in it: "Once upon a time, someone had to be the first person to invent each of these concepts."

This seems false about markets, for instance. Markets in goods can exist without any specific person understanding them or how they work, I think? (Far too much of history, after all, is people stumbling across markets, saying "This seems bad", breaking it, and suffering consequences.) And similarly money-like things can certainly arise without anyone understanding them.

It's also false (almost certainly?) about language, the o.g. coordination mechanism.

(And if you wanted to reach: Is monogamy a coordination scheme that makes men work harder, as some anthropologists think? If so, it's doubtful it was conceived of as such by more than a tiny handful of people! Or maybe that's just stretching "coordination scheme" way too far, I don't know.)

I don't really have a greater conclusion from this, though. These are all points in the same direction, moodwise, as the original article is pointing, I think.

That's 100% true about the quote above being false for environments for which the optimal strategy is stochastic, and a very good catch. I'd expect naive action-value methods to have a lot of trouble in multi-agent scenarios.

The ease with which other optimization methods (i.e., policy optimization, which directly adjusts the likelihood of different actions, rather than using an estimate of the action-value function to choose actions) represent stochastic policies is one of their advantages over Q-learning, which can't really do so. That's probably one reason why extremely large-scale RL (e.g., Starcraft, Dota) tends to use more policy optimization (or some complicated mixture of both).
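To make the stochastic-policy point concrete, here is a small sketch using rock-paper-scissors as a stand-in (my choice of example, not something from the original discussion). Acting greedily on action-value estimates puts all probability on one action, which an opponent can exploit, while a policy that is free to be stochastic can play the uniform mix and cannot be exploited. The helper names are made up for illustration.

```python
# Payoff to the row player; rows and columns are rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def worst_case_value(policy):
    """Expected payoff of `policy` against an opponent who best-responds to it."""
    return min(
        sum(p * PAYOFF[a][b] for a, p in enumerate(policy))
        for b in range(3)
    )


def greedy_policy(q_values):
    """Act greedily on action values: all probability on one argmax action."""
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]


# With symmetric value estimates the greedy rule still has to commit to one
# action (here ties break toward rock), so a best-responding opponent wins.
greedy = greedy_policy([0.0, 0.0, 0.0])

# A directly parameterised policy can simply be the uniform distribution.
uniform = [1 / 3, 1 / 3, 1 / 3]

if __name__ == "__main__":
    print("greedy worst case:", worst_case_value(greedy))
    print("uniform worst case:", worst_case_value(uniform))
```

Whichever single action the greedy rule commits to, its worst case is a loss every round, whereas the uniform policy breaks even against any opponent.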

Re. the bullet list, that's a little too restrictive, at least in some places -- for instance, even if an agent doesn't know all (or even any) of the laws of physics, in the limit of infinite play action-value based methods can (I think provably) converge to true values. (After all, basic Q-learning never even tries to learn the transition function for the environment.)

I think Sutton & Barto or Bertsekas & Tsitsiklis would cover the complete criteria for Q-learning to be guaranteed to converge? Although of course in practice, my understanding is it's quite rare for environments to meet all the criteria and (sometimes!) the methods work anyhow.
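As an illustration of the model-free point, here is a minimal tabular Q-learning sketch on a tiny chain environment that I made up for this purpose (two non-terminal states, "left" and "right" actions, reward 1 for reaching the end). The learner only ever calls `step` as a black box and never builds a transition model, yet its estimates approach the true action values. The environment, constants, and names are all assumptions for the example.

```python
import random

GAMMA, ALPHA = 0.9, 0.5
TERMINAL = 2  # states 0 and 1 are non-terminal; reaching state 2 ends the episode


def step(state, action):
    """Black-box environment: the learner only ever sees its outputs.

    Action 1 moves right, action 0 moves left (clamped at state 0).
    Entering the terminal state pays reward 1; everything else pays 0.
    """
    nxt = state + 1 if action == 1 else max(state - 1, 0)
    reward = 1.0 if nxt == TERMINAL else 0.0
    return nxt, reward


def q_learning(episodes=2000, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(TERMINAL)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        while state != TERMINAL:
            action = rng.randrange(2)
            nxt, reward = step(state, action)
            # Bootstrap from the best estimated value of the next state.
            future = 0.0 if nxt == TERMINAL else max(q[nxt])
            q[state][action] += ALPHA * (reward + GAMMA * future - q[state][action])
            state = nxt
    return q


Q = q_learning()

if __name__ == "__main__":
    for s, row in enumerate(Q):
        print("state", s, "left", row[0], "right", row[1])
```

For this toy problem the optimal values can be worked out by hand: going right from state 1 is worth 1, going right from state 0 is worth 0.9, and both "left" actions are worth 0.81. The table converges toward those numbers even though the update rule never represents where an action leads.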
