Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
Filler tokens don't allow for serially deeper cognition than what architectural limits allow
This depends on your definition of serial cognition; under the definitions I like most, serial depth scales logarithmically with the number of tokens. This is because increasing parallelism (in the sense you use above) also increases serial depth logarithmically.
The basic intuitions for this are:
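To make the logarithmic-scaling point a bit more concrete, here is a minimal sketch of my own (using the standard repeated-squaring trick; the matrices and numbers are arbitrary illustrative assumptions, not anything specific to filler tokens): a chain of N applications of an associative operation naively requires N dependent steps, but with more parallel work per step it collapses to roughly log2(N) dependent steps.

```python
# Minimal sketch (illustrative only): iterating an associative operation N times
# looks like it needs N dependent (serial) steps, but with repeated squaring --
# where each round does more parallel work -- only ~log2(N) dependent rounds
# are needed. This is the standard sense in which extra parallelism buys
# serial depth "logarithmically".
import numpy as np

def iterate_naive(M: np.ndarray, n: int) -> np.ndarray:
    """n dependent steps: each multiply waits on the previous result."""
    out = np.eye(M.shape[0])
    for _ in range(n):
        out = out @ M  # serial chain of length n
    return out

def iterate_doubling(M: np.ndarray, n: int) -> np.ndarray:
    """~log2(n) dependent rounds via repeated squaring."""
    out = np.eye(M.shape[0])
    base = M.copy()
    while n > 0:
        if n & 1:
            out = out @ base   # at most log2(n) of these
        base = base @ base     # squaring: one dependent step per round
        n >>= 1
    return out

M = np.array([[1.0, 1.0], [0.0, 1.0]])
assert np.allclose(iterate_naive(M, 1000), iterate_doubling(M, 1000))
```

The same trick underlies parallel prefix / associative scans, which is why fixed-depth-but-wide computation can emulate much longer serial chains for associative operations.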
Ah, I realized there was something else I should have highlighted. You mention you care about pre-ChatGPT takes toward shorter timelines -- while the compute-centric takeoff report was published two months after ChatGPT, I expect that its basic argument structure and conclusions were present well before the release of ChatGPT.
While I didn't observe that report in particular, in general Open Phil worldview investigations took > 1 year of serial time and involved a pretty significant and time-consuming "last mile" step where they got a bunch of expert review before publication. (You probably observed this "last mile" step with Joe Carlsmith's report; iirc Nate was one of the expert reviewers for that report.) Also, Tom Davidson's previous publications were in March 2021 and June 2021, so I expect he was working on the topic for some of 2021 and ~all of 2022.
I suppose a sufficiently cynical observer might say "ah, clearly Open Phil was averse to publishing this report that suggests short timelines and intelligence explosions until after the ChatGPT moment". I don't buy it, based on my observations of the worldview investigations team (I realize that it might not have been up to the worldview investigations team, but I still don't buy it).
I guess one legible argument I could make to the cynic would be that, on the cynical viewpoint, it should have taken Open Phil a lot longer to realize they should publish the compute-centric takeoff post. Does the cynic really think that, in just two months, a big broken org would be able to:
That's just so incredibly fast for big broken orgs to move.
I think I agree with all of that under the definitions you're using (and I too prefer the bounded rationality version). I think in practice I was using words somewhat differently than you.
(The rest of this comment is at the object level and is mostly for other readers, not for you)
Saying it's "crazy" means it's low probability of being (part of) the right world-description.
The "right" world-description is a very high bar (all models are wrong but some are useful), but if I go with the spirit of what you're saying I think I might not endorse calling bio anchors "crazy" by this definition, I'd say more like "medium" probability of being a generally good framework for thinking about the domain, plus an expectation that lots of the specific details would change with more investigation.
Honestly, I didn't mean anything very precise by "crazy" in my original comment; I was mainly using it as shorthand to gesture at the fact that the claim is in tension with reductionist intuitions, and also that the legibly written support for the claim is weak in an absolute sense.
Saying it's "the best we have" means it's the clearest model we have--the most fleshed-out hypothesis.
I meant a higher bar than this: more like "the most informative and relevant thing for informing your views on the topic" (beyond extremely basic stuff like observing that humanity can do science at all, or things like reference class priors). Like, I also claim it is better than "query your intuitions about how close we are to AGI, and how fast we are going, to come up with a time until we get to AGI". So it's not just the clearest / most fleshed-out; it's also the one that should move you the most, even including various illegible or intuition-driven arguments. (Obviously scoped only to the arguments I know about; for all I know other people have better arguments that I haven't seen.)
If it were merely the clearest model or most fleshed-out hypothesis, I agree it would usually be a mistake to make a large belief update or take big consequential actions on that basis.
I also want to qualify / explain my statement about it being a crazy argument. The specific part that worries me (and Eliezer, iiuc) is the claim that, at a given point in time, the delta between natural and artificial artifacts will tend to be approximately constant across different domains. This is quite intuition-bending from a mechanistic / reductionist viewpoint, and the current support for it seems very small and fragile (this 8-page doc). However, I can see a path by which I would believe it much more, which would involve things like:
I anticipate someone asking the follow-up "why didn't Open Phil do that, then?" I don't know what Open Phil was thinking, but I don't think I'd have made a very different decision. It's a lot of work, not many people can do it, and many of those people had better things to do; e.g. imo the compute-centric takeoff work was indeed more important and caused bigger updates than I think the work above would have (and was probably easier to do).
Would you agree that updating / investing "a lot" in an argument that's kind of crazy in some absolute sense, would be an epistemic / strategic mistake, even if that argument is the best available specific argument in a relative sense?
Hmm, maybe? What exactly is the alternative?
Some things that I think would usually be epistemic / strategic mistakes in this situation:
Some things that I don't think would immediately qualify as epistemic / strategic mistakes (of course they could still be mistakes depending on further details):
Some takes:
On the actual object level beliefs:
The first two paragraphs of my original comment were trying to do this.
(I have the same critique of the first two paragraphs, but thanks for the edit, it helps)
The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).
Fwiw, I am actively surprised that you have a p(doom) < 50%; I can name several lines of evidence in the opposite direction:
In terms of evidence that you have a p(doom) < 50%, I think the main thing that comes to mind is that you argued against Eliezer about this in late 2021, but that was quite a while ago (relative to the evidence above) and I thought you had changed your mind. (Also iirc the stuff you said then was consistent with p(doom) ~ 50%, but it's long enough ago that I could easily be forgetting things.)
However, I'd guess that I'm more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won't reliably improve things.
You could point to ~any reasonable subcommunity within AI safety (or the entire community) and I'd still be on board with the claim that there's at least a 10% chance that it will make things worse, which I might summarize as "they won't reliably improve things" -- so I still feel like this isn't quite capturing the distinction. (I'd include communities focused on "science" in that, but I do agree that they are more likely not to have a negative sign.) So I still feel confused about what exactly your position is.
On reflection, it's not actually about which position is more common. My real objection is that imo it was pretty obvious that something along these lines would be the crux between you and Neel (and the fact that it is a common position is part of why I think it was obvious).
Inasmuch as you are actually trying to have a conversation with Neel or address Neel's argument on its merits, it would be good to be clear that this is the crux. I guess perhaps you just don't care about that and are instead trying to influence readers without engaging with the OP's point of view, in which case fair enough. Personally I would find that distasteful / not in keeping with my norms around collective epistemics, but I do admit it's within LW norms.
(Incidentally, I feel like you still aren't quite pinning down your position -- depending on what you mean by "reliably" I would probably agree with "marginalist approaches don't reliably improve things". I'd also agree with "X doesn't reliably improve things" for almost any interesting value of X.)
What exactly do you mean by ambitious mech interp, and what does it enable? You focus on debugging here, but you didn't title the post "an ambitious vision for debugging", and indeed I think a vision for debugging would look quite different.
For example, you might say that the goal is to have "full human understanding" of the AI system, such that some specific human can answer arbitrary questions about the AI system (without just delegating to some other system). To this I'd reply that this seems like an unattainable goal: reality is very detailed, AIs inherit a lot of that detail, and a human can't contain all of it.
Maybe you'd say "actually, the human just has to be able to answer any specific question given a lot of time to do so", so that the human doesn't have to contain all the detail of the AI, and can just load in the relevant detail for a given question. To do this perfectly, you still need to contain the detail of the AI, because you need to argue that there's no hidden structure anywhere in the AI that invalidates your answer. So I still think this is an unattainable goal.
Maybe you'd then say "okay fine, but come on, surely via decent heuristic arguments the human's answer can get way more robust than via any of the pragmatic approaches, even if you don't get something like a proof". I used to be more optimistic about this, but things like self-repair and negative heads make it hard in practice, not just in theory. Perhaps more fundamentally, if you've retreated this far back, it's unclear to me why we're calling this "ambitious mech interp" rather than "pragmatic interp".
To be clear, I like most of the agendas in AMI and definitely want them to be part of the overall portfolio, since they seem especially likely to provide new affordances. I also think many of the directions are more future-proof (i.e. more likely to generalize to very different future AI systems). So it's quite plausible that we don't disagree much on what actions to take. I mostly just dislike gesturing at "it would be so good if we had <probably impossible thing>, so let's try to make it happen".
Cool result!
Why do you believe this is meta-cognition? (Or maybe the question is, what do you mean by meta-cognition?)
It seems like it could easily be something else. For example, when solving problems, the model probably looks at the past strategies it has used and tries some other strategy to increase the likelihood of solving the problem. It does this primarily in token space (looking at past reasoning and trying new stuff), but this also generalizes somewhat to activation space (looking at what past forward passes did and trying something else). So when you have filler tokens, the latter effect still happens, giving a slight best-of-N type boost that produces your observed results.
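As a rough sanity check on how big a "best-of-N type boost" could be, here is a small sketch under the simplifying assumption that each retried strategy is an independent attempt with the same per-attempt success probability p; the numbers below are hypothetical, chosen only for illustration.

```python
# Hedged sketch: treat each retried strategy as an independent attempt with
# per-attempt success probability p (a simplifying assumption; real attempts
# are correlated, so this overstates the gain). Numbers are hypothetical.
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** n

p = 0.30  # hypothetical single-attempt solve rate
for n in (1, 2, 4, 8):
    print(f"n={n}: P(solve at least once) = {best_of_n_success(p, n):.3f}")
# Even a couple of extra quasi-independent tries gives a modest bump, which
# could look like a small filler-token gain without any genuine meta-cognition.
```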