Thanks, really appreciate the references!
If there were a feasible way to make the algorithm open, I think that would be good (of course FB would probably strongly oppose this). As you say, people wouldn't directly design / early adopt new algorithms, but once early adopters found an alternative algorithm that they really liked, word of mouth would lead many more people to adopt it. So I think you could eventually get widespread change this way.
Thanks for the feedback!
I haven't really dug into Gelman's blog, but the format you mention is a perfect example of the skill of understanding research. It's a very important skill, but not the same as actually conducting the research that goes into a paper.
Research consists of many skills put together. Understanding prior work and developing the taste to judge it is one of the more important individual skills in research (more so than programming, at least in most fields). So I think the blog example is indeed a central one.
In research, especially in
Thanks, sounds good to me!
Actually, another issue is that unsupervised translation isn't "that hard" relative to supervised translation--I think that you can get pretty far with simple heuristics, such that I'd guess making the model 10x bigger matters more than making the objective more aligned with getting the answer right (and that this will be true for at least a couple more 10x-ing of model size, although at some point the objective will matter more).
This might not matter as much if you're actually outputting explanations and not just translating from one language to another. Although it is probably true that for tasks that are far away from the ceiling, "naive objective + 10x larger model" will outperform "correct objective".
Thanks Paul, I generally like this idea.
Aside from the potential concerns you bring up, here is the most likely way I could see this experiment failing to be informative: rather than having checks and question marks in your tables above, really the model's ability to solve each task is a question of degree--each table entry will be a real number between 0 and 1. For, say, tone, GPT-3 probably doesn't have a perfect model of tone, and would get <100% performance on a sentiment classification task, especially if done few-shot.
The issue, then, is that the ... (read more)
This doesn't seem so relevant to capybaralet's case, given that he was choosing whether to accept an academic offer that was already extended to him.
I think if you account for undertesting, then I'd guess 30% or more of the UK was infected during the previous peak, which should reduce R by more than 30% (the people most likely to be infected are also most likely to spread further), and that is already enough to explain the drop.
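As a rough sketch of why uneven infection matters (the group sizes and infection rates below are made-up for illustration, not from any dataset):

```python
# Toy two-group model with assumed numbers, illustrating why 30% of the
# population being infected can cut R by more than 30%: the people most
# likely to have been infected are also the ones who transmit the most.
groups = [
    # (population fraction, relative transmission weight, fraction already infected)
    (0.5, 3.0, 0.50),  # high-contact half: assumed 50% already infected
    (0.5, 1.0, 0.10),  # low-contact half: assumed 10% already infected
]

pop_infected = sum(frac * inf for frac, _, inf in groups)
total_weight = sum(frac * w for frac, w, _ in groups)
remaining_weight = sum(frac * w * (1 - inf) for frac, w, inf in groups)
r_reduction = 1 - remaining_weight / total_weight

print(f"infected: {pop_infected:.0%}, R reduction: {r_reduction:.0%}")
# → infected: 30%, R reduction: 40%
```

With these made-up numbers, 30% population immunity yields a 40% reduction in onward transmission.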
I wasn't sure what you meant by more dakka, but do you mean just increasing the dose? I don't see why that would necessarily work--e.g. if the peptide just isn't effective.
I'm confused because we seem to be getting pretty different numbers. I asked another bio friend (who is into DIY stuff) and they also seemed pretty skeptical, and Sarah Constantin seems to be as well: https://twitter.com/s_r_constantin/status/1357652836079837189.
Not disbelieving your account, just noting that we seem to be getting pretty different outputs from the expert-checking process... (read more)
Ah got it, thanks!
Have you run this by a trusted bio expert? When I did this test (picking a bio person who I know personally, who I think of as open-minded and fairly smart), they thought that this vaccine is pretty unlikely to be effective and that the risks in this article may be understated (e.g. food grade is lower-quality than lab grade, and it's not obvious that inhaling food is completely safe). I don't know enough biology to evaluate their argument, beyond my respect for them.
I'd be curious if the author, or others who are considering trying this, have applied this... (read more)
In my case, yes. My bio expert indicated that it was likely to be effective (more than 50%, but less than 90%) and that the risks were effectively zero in terms of serious complications.
Regarding the food grade versus lab grade question, as well as inaccuracies or mistakes in construction of the vaccine, this was a question I spent a reasonable amount of time on. The TL;DR is that the engineering tolerances are incredibly wide; the molecular weight of the chitosan isn't that important, the mixing rate isn't that important other than it be fast ... (read more)
I don't think I was debating the norms, but clarifying how they apply in this case. Most of my comment was a reaction to the "pretty important" and "timeless life lessons", which would apply to Raemon's comment whether or not he was a moderator.
Often, e.g. Stanford profs claiming that COVID is less deadly than the flu for a recent and related example.
Hmm, important as in "important to discuss", or "important to hear about"?
My best guess based on talking to a smart open-minded biologist is that this vaccine probably doesn't work, and that the author understates the risks involved. I'm interpreting the decision to frontpage as saying that you think I'm wrong with reasonably high confidence, but I'm not sure if I should interpret it that way.
You should make a top-level comment about this. Chance that the vaccine works and the associated risks are object-level questions well-worth discussing.
In general, frontpage decisions are not endorsements (though I don't know Raemon's thoughts in this particular case), and this comment section is not the place for a debate about frontpaging norms. This is definitely the place to talk about chance the vaccine works and associated risks, though.
That seems irrelevant to my claim that Zvi's favored policy is worse than the status quo.
This isn't based on personal anecdote; studies that try to estimate this come up with 3x. See e.g. the MicroCovid page: https://www.microcovid.org/paper/6-person-risk
You may well be right. I guess we don't really know what the sampling bias is (it would have to be pretty strongly skewed towards incoming UK cases though to get to a majority, since the UK itself was near 50%).
See here: https://cov-lineages.org/global_report.html
I don't think it's correct to say that it remains stable at 0.5-1% of samples in Denmark. There were 13 samples of the new variant last week, vs. only 3 two weeks ago, if I understood the data correctly. If it went from 0.5% to 1% in a week then you should be alarmed. (3 and 13 are both small enough that it's hard to compute a growth rate, but it certainly seems consistent with the UK data to me.) I think better evidence against non-infectiousness would be Italy and Israel, where the variant seems to be dominant but there isn't runaway growth. But:... (read more)
Zvi, I still think that your model of vaccination ordering is wrong, and that the best read of the data is that frontline essential workers should be very highly prioritized from a DALY / deaths averted perspective. I left this comment on the last thread that explains my reasoning in detail, looking at both of the published papers I've seen that model vaccine ordering: link. I'd be happy to elaborate on it but I haven't yet seen anyone provide any disagreement.
More minor, but regarding rehab facilities, from a bureaucratic perspective they are "congregate ... (read more)
Zvi, I agree with you that the CDC's reasoning was pretty sketchy, but I think their actual recommendation is correct while everyone else (e.g. the UK) is wrong. I think the order should be something like:
Nursing homes -> HCWs -> 80+ -> frontline essential workers -> ...
(Possibly switching the order of HCWs and 80+.)
The public analyses saying that we should start with the elderly are these two papers:
Notably, both p... (read more)
Mo Bamba (NBA) and Cody Garbrandt (UFC) are both pro athletes who are still out of commission months later. I found this looking for NBA information, and only about 50 NBA players have gotten Covid, so this suggests at least 2% chance of pretty bad long term symptoms.
I think that the right level of effort leaves you tired but warm inside, like you look forward to doing this again, rather than just feeling you HAVE to do this again.
This is probably true in a practical sense (otherwise you won't sustain it as a habit), but I'm not sure it describes a well-defined level of effort. For me an extreme effort could still lead to me looking forward to it, if I have a concrete sense of what that effort bought me (maybe I do some tedious and exhausting footwork drills, but I understand the sense in which this will carr... (read more)
If most workouts are painful, then I agree you are probably overtraining. But if no workouts at all are painful, you're probably missing opportunities to improve. And many workouts should at least be uncomfortable in parts. E.g. when lifting, during the last couple of deadlift sets I often feel incredibly gassed and don't feel like doing another one. But this can be true even when I'm far from my limits (like, a month later I'll be lifting 30 pounds more and feel about as tired, rather than failing the lift).
My guess is that on average 1-2 work... (read more)
You could look at papers published on medrxiv rather than news articles, which would resolve the clickbait issue, though you'd still have to assess the study quality.
Have you tried googling for them yourself and been unable to find them? (Sorry that I'm too lazy to re-look them up myself, but given that LW is mostly leisure for me I don't feel like doing it, and I'd be somewhat surprised if you googled for stuff and didn't find it.)
I also think you are probably overestimating vaccine risks (the main risk is that its effectiveness wanes, and that it interferes with future antibody responses from similar vaccines; not that you'll get horrible side effects) but that isn't necessary to explain why people want the vaccine now.
I think cutting the IFR by a factor of 25 on the basis of one study is a mistake; the chance of the study being fatally flawed is greater than 1 in 25. On the other hand, 0.5% is the overall CFR and would be lower for young people.
I think it's hard to cut the risk of long-term effects by more than a factor of 10 from published estimates. Note there is evidence of long-term effects contrary to your claim, e.g. studies that do 6-week follow-ups and find people still with some symptom. This isn't 6 months but is still surprisingly long and should shift our belief about 6 months a... (read more)
I noticed the prudishness, but "rudeness" to me parses as people actually telling you what's on their mind, rather than the passive-aggressive fake niceness that seems to dominate in the Bay Area. I'll personally take the rudeness :).
On the other hand, the second-best place selects for people who don't care strongly about optimizing for legible signals, which is probably a plus. (An instance of this: In undergrad the dorm that, in my opinion, had the best culture was the run-down dorm that was far from campus.)
Many of the factors affecting number of deaths are beyond a place's control, such as how early on the pandemic spread to that place, and how densely populated the city is. I don't have a strong opinion about MA but measuring by deaths per capita isn't a good way of judging the response.
That's not really what a p-value means though, right? The actual replication rate should depend on the prior and the power of the studies.
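A toy Bayes calculation of what I mean, with purely hypothetical numbers for the prior and power:

```python
# The replication rate of p < 0.05 findings depends on the base rate of true
# hypotheses (prior) and statistical power, not just the significance threshold.
prior = 0.1   # assumed fraction of tested hypotheses that are actually true
power = 0.8   # assumed P(p < 0.05 | real effect)
alpha = 0.05  # P(p < 0.05 | null effect)

# P(effect is real | original study was significant), by Bayes' rule
p_real = prior * power / (prior * power + (1 - prior) * alpha)

# A same-powered replication then succeeds with probability:
replication_rate = p_real * power + (1 - p_real) * alpha

print(f"P(real | significant) = {p_real:.2f}, replication rate = {replication_rate:.2f}")
# → P(real | significant) = 0.64, replication rate = 0.53
```

So even with decent power, a significant result can be expected to replicate only about half the time under these assumptions.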
What are some of the recommendations that seem most off base to you?
My prediction: infections will either go down or only slowly rise in most places, with the exception of one or two metropolitan areas. If I had to pick one it would be LA, not sure what the second one will be. The places where people are currently talking about spikes won't have much correlation with the places that look bad two weeks from now (i.e. people are mostly chasing noise).
I'm not highly confident in this, but it's been a pretty reliable prediction for the past month at least...
Here is a study that a colleague recommends: https://www.medrxiv.org/content/10.1101/2020.05.03.20089854v3. Tweet version: https://mobile.twitter.com/gidmk/status/1270171589170966529?s=21
Their point estimate is 0.64% but with likely heterogeneity across settings.
I don't think bubble size is the right thing to measure; instead you should measure the amount of contact you have with people, weighted by time, distance, indoor/outdoor, mask-wearing, and how likely the other person is to be infected (i.e. how careful they are).
An important part of my mental model is that infection risk is roughly linear in contact time.
As a background assumption, I'm focused on the societal costs of getting infected, rather than the personal costs, since in most places the latter seem negligible unless you have pre-existing health conditions. I think this is also the right lens through which to evaluate Alameda's policy, although I'll discuss the personal calculation at the end.
From a social perspective, I think it's quite clear that the average person is far from being effectively isolated, since R is around 0.9 and you can only get to around half of that via household infection alone.
I think the biggest issue with the bubble rule is that the math doesn't work out. The secondary attack rate between house members is ~30% and probably much lower between other contacts. At that low a rate, these games with the graph structure buy very little and may be harmful because they increase the fraction of contact occurring between similar people (which is bad because the social cost of a pair of people interacting is roughly the product of their infection risks).
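A toy illustration of that last point, with made-up risk numbers: if the cost of a pair interacting is roughly the product of their infection risks, then assortative matching (careful with careful, risky with risky) is worse in total than mixing.

```python
# Hypothetical per-person infection risks: two careful people, two risky ones.
risks = [0.01, 0.01, 0.20, 0.20]

# Cost of a pairing ~ product of the pair's infection risks (toy assumption).
assortative = risks[0] * risks[1] + risks[2] * risks[3]  # careful+careful, risky+risky
mixed = risks[0] * risks[2] + risks[1] * risks[3]        # each careful person paired with a risky one

print(assortative, mixed)  # the assortative pairing costs ~10x more here
```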
I'm not trying to intimidate; I'm trying to point out that I think you're making errors that could be corrected by more research, which I hoped would be helpful. I've provided one link (which took me some time to dig up). If you don't find this useful that's fine, you're not obligated to believe me and I'm not obligated to turn a LW comment into a lit review.
The CFR will shift substantially over time and location as testing changes. I'm not sure how you would reliably use this information. IFR should not change much and tells you how bad it is for you personally to get sick.
I wouldn't call the model Zvi links expert-promoted. Every expert I talked to thought it had problems, and the people behind it are economists, not epidemiologists or statisticians.
For IFR you can start with seroprevalence data here and then work back from death rates: https://twitter.com/ScottGottliebMD/status/1268191059009581056
R... (read more)
Ben, I think you're failing to account for under-testing. You're computing the case fatality rate when you want the infection fatality rate. Most experts, as well as the well-done meta-analyses, place the IFR in the 0.5%-1% range. I'm a little bit confused why you're relying on this back-of-the-envelope estimate rather than the pretty extensive body of work on this question.
I don't understand why this is evidence that "EA Funds (other than the global health and development one) currently funges heavily with GiveWell recommended charities", which was Howie's original question. It seems like evidence that donations to OpenPhil (which afaik cannot be made by individual donors) funge against donations to the long-term future EA fund.
I like the general thrust here, although I have a different version of this idea, which I would call "minimizing philosophical pre-commitments". For instance, there is a great deal of debate about whether Bayesian probability is a reasonable philosophical foundation for statistical reasoning. It seems that it would be better, all else equal, for approaches to AI alignment to not hinge on being on the right side of this debate.
I think there are some places where it is hard to avoid pre-commitments. For instance, while this isn't quite a philo... (read more)
FWIW I understood Zvi's comment, but feel like I might not have understood it if I hadn't played Magic: The Gathering in the past.
EDIT: Although I don't understand the link to Sir Arthur's green knight, unless it was a reference to the fact that M:tG doesn't actually have a green knight card.
Thanks for writing this Aaron! (And for engaging with some of the common arguments for/against AI safety work.)
I personally am very uncertain about whether to expect a singularity/fast take-off (I think it is plausible but far from certain). Some reasons that I am still very interested in AI safety are the following:
Very minor nitpick, but just to add, FLI is as far as I know not formally affiliated with MIT. (FHI is in fact a formal institute at Oxford.)
I enjoy reading your posts because they often consist of clear explanations of concepts I wish more people appreciated. But I think this is the first instance where I feel I got something that I actually hadn't thought about before at all, so I wanted to convey extra appreciation for writing it up.
I think the conflation is "decades out" and "far away".
Galfour was specifically asked to write his thought up in this thread: https://www.lesserwrong.com/posts/BEtzRE2M5m9YEAQpX/there-s-no-fire-alarm-for-artificial-general-intelligence/kAywLDdLrNsCvXztL
It seems either this was posted to the wrong place, or there is some disagreement within the community (e.g. between Ben in that thread and the people downvoting).
Points 1-5 at the beginning of the post are all primarily about community-building and personal development externalities of the project, and not about the donation itself.