Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
On reflection, it's not actually about which position is more common. My real objection is that imo it was pretty obvious that something along these lines would be the crux between you and Neel (and the fact that it is a common position is part of why I think it was obvious).
Inasmuch as you are actually trying to have a conversation with Neel or address Neel's argument on its merits, it would be good to be clear that this is the crux. I guess you might just not care about that and are instead trying to influence readers without engaging with the OP's point of view, in which case fair enough. Personally I would find that distasteful / not in keeping with my norms around collective epistemics, but I do admit it's within LW norms.
(Incidentally, I feel like you still aren't quite pinning down your position -- depending on what you mean by "reliably" I would probably agree with "marginalist approaches don't reliably improve things". I'd also agree with "X doesn't reliably improve things" for almost any interesting value of X.)
What exactly do you mean by ambitious mech interp, and what does it enable? You focus on debugging here, but you didn't title the post "an ambitious vision for debugging", and indeed I think a vision for debugging would look quite different.
For example, you might say that the goal is to have "full human understanding" of the AI system, such that some specific human can answer arbitrary questions about the AI system (without just delegating to some other system). To this I'd reply that this seems like an unattainable goal; reality is very detailed, AIs inherit a lot of that detail, a human can't contain all of it.
Maybe you'd say "actually, the human just has to be able to answer any specific question given a lot of time to do so", so that the human doesn't have to contain all the detail of the AI, and can just load in the relevant detail for a given question. To do this perfectly, you still need to contain the detail of the AI, because you need to argue that there's no hidden structure anywhere in the AI that invalidates your answer. So I still think this is an unattainable goal.
Maybe you'd then say "okay fine, but come on, surely via decent heuristic arguments, the human's answer can get way more robust than via any of the pragmatic approaches, even if you don't get something like a proof". I used to be more optimistic about this but things like self-repair and negative heads make it hard in practice, not just in theory. Perhaps more fundamentally, if you've retreated this far back, it's unclear to me why we're calling this "ambitious mech interp" rather than "pragmatic interp".
To be clear, I like most of the agendas in AMI and definitely want them to be a part of the overall portfolio, since they seem especially likely to provide new affordances. I also think many of the directions are more future-proof (ie more likely to generalize to future very different AI systems). So it's quite plausible that we don't disagree much on what actions to take. I mostly just dislike gesturing at "it would be so good if we had <probably impossible thing> let's try to make it happen".
Fair, I've edited the comment with a pointer. It still seems to me to be a pretty direct disagreement with "we can substantially reduce risk via [engineering-type / category 2] approaches".
My claim is "while it certainly could be net negative (as is also the case for ~any action including e.g. donating to AMF), in aggregate it is substantially positive expected risk reduction".
Your claim in opposition seems to be "who knows what the sign is, we should treat it as an expected zero risk reduction".
Though possibly you are saying "it's bad to take actions that have a chance of backfiring, we should focus much more on robustly positive things" (because something something virtue ethics?), in which case I think we have a disagreement on decision theory instead.
I still want to claim that in either case, my position is much more common (among the readership here), except inasmuch as they disagree because they think alignment is very hard and that's why there's expected zero (or negative) risk reduction. And so I wish you'd flag when your claims depend on these takes (though I realize it is often hard to notice when that is the case).
Makes sense, I still endorse my original comment in light of this answer (as I already expected something like this was your view). Like, I would now say
Imo the vast, vast majority of progress in the world happens via "engineering-type / category 2" approaches, so if you do think you can win via "engineering-type / category 2" approaches you should generally bias towards them
while also noting that the way we are using the phrase "engineering-type" here includes a really large amount of what most people would call "science" (e.g. it includes tons of academic work), so it is important when evaluating this claim to interpret the words "engineering" and "science" in context rather than via their usual connotations.
2) the AI safety community doesn't try to solve alignment for smarter than human AI systems
I assume you're referring to "whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment".
Imo, chain of thought monitoring, AI control, amplified oversight, MONA, reasoning model interpretability, etc, are all things that could make the difference between "x-catastrophe via misalignment" and "no x-catastrophe via misalignment", so I'd say that lots of our work could "solve misalignment", though not necessarily in a way where we can know that we've solved misalignment in advance.
Based on Richard's previous writing (e.g. 1, 2) I expect he sees this sort of stuff as not particularly interesting alignment research / doesn't really help, so I jumped ahead in the conversation to that disagreement.
even if no one breakthrough or discovery "solves alignment", a general frame of "let's find principled approaches" is often more generative than "let's find the cheapest 80/20 approach"
Sure, I broadly agree with this, and I think Neel would too. I don't see Neel's post as disagreeing with it, and I don't think the list of examples that Richard gave is well described as "let's find the cheapest 80/20 approach".
I wish when you wrote these comments you acknowledged that some people just actually think that we can substantially reduce risk via what you call "marginalist" approaches. Not everyone agrees that you have to deeply understand intelligence from first principles else everyone dies. (EDIT: See Richard's clarification downthread.) Depending on how you choose your reference class, I'd guess most people disagree with that.
Imo the vast, vast majority of progress in the world happens via "marginalist" approaches, so if you do think you can win via "marginalist" approaches you should generally bias towards them.
Ah yeah, I think with that one the audiences were "researchers heavily involved in AGI Safety" (LessWrong) and "ML researchers with some interest in reward hacking / safety" (Medium blog)
We mostly just post more informal stuff directly to LessWrong / Alignment Forum (see e.g. our interp updates).
Having a separate website doesn't seem that useful to readers. I generally see the value proposition of a separate website as attaching the company's branding to the post, which helps the company build a better reputation. It can also help boost the reach of an individual piece of research, but this is a symmetric weapon, and so applying it to informal research seems like a cost to me, not a benefit. Is there some other value you see?
(Incidentally, I would not say that our blog is targeted towards laypeople. I would say that it's targeted towards researchers in the safety community who have a small amount of time to spend and so aren't going to read a full paper. E.g. this post spends a single sentence explaining what scheming is and then goes on to discuss research about it; that would absolutely not work in a piece targeting laypeople.)
AI capabilities progress is smooth, sure, but it's a smooth exponential.
In what sense is AI capabilities progress a smooth exponential? What units are you using to measure this? Why can't I just take the log of it and call that "AI capabilities", and then say it is a smooth linear increase?
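A minimal sketch of this units point (hypothetical numbers, not a claim about any real capability metric): any series that grows by a constant factor per step becomes linear after taking the log, so "exponential" vs. "linear" is a choice of units rather than a fact about the progress itself.

```python
import math

# Hypothetical capability series growing by a constant factor r per step.
C0, r = 1.0, 2.0
series = [C0 * r**t for t in range(6)]

# Taking the log makes it linear: consecutive differences are all exactly
# log(r), so relabeling the log as "AI capabilities" turns the very same
# progress into a smooth linear increase.
logs = [math.log(c) for c in series]
diffs = [b - a for a, b in zip(logs, logs[1:])]
```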
That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay "so small" as you say.
It seems like the load-bearing thing for you is that the gap between models gets larger, so let's try to operationalize what a "gap" might be.
We could consider the probability that AI_{N+1} beats AI_N on a prompt, in expectation over a wide variety of prompts. I think this is close-to-equivalent to a constant gap in ELO score on LMArena.[1] Then "gap increases" would roughly mean that the gap in ELO scores on LMArena between subsequent model releases would be increasing. I don't follow LMArena much but my sense is that LMArena top scores have been increasing relatively linearly w.r.t. time and sublinearly w.r.t. model releases (just because model releases have become more frequent). In either case I don't think this supports an "increasing gap" argument.
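For concreteness, under the standard Elo model (an assumption here; LMArena's exact fitting procedure may differ), a fixed win probability for AI_{N+1} over AI_N corresponds to a fixed rating gap, independent of where on the scale the two models sit:

```python
def elo_win_prob(gap):
    """Standard Elo win probability for the higher-rated player, given rating gap `gap`."""
    return 1.0 / (1.0 + 10 ** (-gap / 400))

# A ~60% win rate for each successive model over its predecessor maps to a
# constant gap of roughly 70 Elo points per release, wherever the models sit
# on the scale; "gap increases" would instead require the per-release win
# probability itself to keep rising.
```

For example, `elo_win_prob(0)` is exactly 0.5 and `elo_win_prob(70)` is about 0.60.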
Personally I prefer to look at benchmark scores. The Epoch Capabilities Index (which I worked on) can be handwavily thought of as ELO scores based on benchmark performance. Importantly, the data that feeds into it does not mention release date at all -- we put in only benchmark performance numbers to estimate capabilities, and then plot it against release date after the fact. It also suggests AI capabilities as operationalized by handwavy-ELO are increasing linearly over time.
I guess the most likely way in which you might think capabilities are exponential is by looking at the METR time horizons result? Of course you could instead say that capabilities are linearly increasing by looking at log time horizons instead. It's not really clear which of these units you should use.
Mostly I think you should not try to go from the METR results to "are gaps in intelligence increasing or staying constant", but if I had to opine on this: the result says that you have a constant doubling time T for the time horizon. One way to think about this is that the AI at time 2T completes, at a 50% success rate, work that the AI at time T could complete with only 25% probability given a decomposition into two pieces each of time T (each of which it completes with probability 50%). I kinda feel like this suggests more like "constant gap" rather than "increasing gap".
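The arithmetic here, spelled out (these are the comment's stylized numbers, not empirical values):

```python
# The time-T model completes each horizon-T piece with probability 0.5;
# given a decomposition of a horizon-2T task into two such pieces, it
# succeeds on the whole task with probability 0.5 * 0.5 = 0.25.
p_piece = 0.5
p_time_T_on_2T_task = p_piece ** 2

# The time-2T model completes the same horizon-2T task at 0.5 directly.
# That 50%-vs-25% relative edge is the same at every doubling, which is
# why this reads as a constant gap rather than an increasing one.
p_time_2T_on_2T_task = 0.5
```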
Note that I do expect the first two trends to become superlinear eventually via an intelligence explosion, and the METR trend to become superexponential / superlinear (depending on units) probably some time before that (though probably we will just become unable to measure it well). But your claim seems to be about current progress, and for current progress I think it's basically not true that the gap between successive models is increasing rather than staying constant.
Even in the intelligence explosion, capabilities progress is only superlinear w.r.t time, I expect it would still behave in the same way w.r.t inputs like compute and labor (where automated researchers should also count as labor). I'm not sure how I expect it to behave w.r.t successive model generations, partly because I'm not sure "successive model generations" will even be a sensible abstraction at that point. In any case, I don't expect that to be particularly important in assessing the chances of success of a bootstrapping-type plan.
- I am overall skeptical that the CoT will be that useful (since it's not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
It seems pretty wild to go from "it is possible for an AI to subvert a technique" to "the technique will not be that useful". Is that really what you mean? Are you bearish on all control work?
(At the object-level, I'd say that you're drastically limiting the power of the research sabotage that can be done if the model is forced not to put any of it in the CoT, and you should be very happy about this even if the model can still do some research sabotage.)
Tbc, I also don't expect CoT to be that useful for longer-term concerns, but that's mostly because I expect CoTs to become extremely illegible or to stop existing altogether (partly due to my having "long" timelines; on Anthropic-level short timelines I'd be quite bullish on CoT).
Though I don't know that much about LMArena and I expect in practice there are confounders, e.g. as they change the distribution of models that are being evaluated the meaning of the scores will change.
(I have the same critique of the first two paragraphs, but thanks for the edit, it helps)
Fwiw, I am actively surprised that you have a p(doom) < 50%, I can name several lines of evidence in the opposite direction:
In terms of evidence that you have a p(doom) < 50%, I think the main thing that comes to mind is that you argued against Eliezer about this in late 2021, but that was quite a while ago (relative to the evidence above) and I thought you had changed your mind. (Also iirc the stuff you said then was consistent with p(doom) ~ 50%, but it's long enough ago that I could easily be forgetting things.)
You could point to ~any reasonable subcommunity within AI safety (or the entire community) and I'd still be on board with the claim that there's at least a 10% chance that will make things worse, which I might summarize as "they won't reliably improve things", so I still feel like this isn't quite capturing the distinction. (I'd include communities focused on "science" in that, but I do agree that they are more likely not to have a negative sign.) So I still feel confused about what exactly your position is.