Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
Ah yeah, I think with that one the audiences were "researchers heavily involved in AGI Safety" (LessWrong) and "ML researchers with some interest in reward hacking / safety" (Medium blog)
We mostly just post more informal stuff directly to LessWrong / Alignment Forum (see e.g. our interp updates).
Having a separate website doesn't seem that useful to readers. I generally see the value proposition of a separate website as attaching the company's branding to the post, which helps the company build a better reputation. It can also help boost the reach of an individual piece of research, but this is a symmetric weapon, and so applying it to informal research seems like a cost to me, not a benefit. Is there some other value you see?
(Incidentally, I would not say that our blog is targeted towards laypeople. I would say that it's targeted towards researchers in the safety community who have a small amount of time to spend and so aren't going to read a full paper. E.g. this post spends a single sentence explaining what scheming is and then goes on to discuss research about it; that would absolutely not work in a piece targeting laypeople.)
> AI capabilities progress is smooth, sure, but it's a smooth exponential.
In what sense is AI capabilities a smooth exponential? What units are you using to measure this? Why can't I just take the log of it and call that "AI capabilities" and then say it is a smooth linear increase?
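To spell out the units point: if "capabilities" measured in some chosen unit grow as a smooth exponential, then the log of that measure grows as a smooth straight line, so "exponential" vs. "linear" here is a fact about the parameterization rather than about the progress itself:

$$C(t) = C_0 e^{kt} \quad\Longrightarrow\quad \log C(t) = \log C_0 + kt.$$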
> That means that the linear gap in intelligence between the previous model and the next model keeps increasing rather than staying constant, which I think suggests that this problem is likely to keep getting harder and harder rather than stay "so small" as you say.
It seems like the load-bearing thing for you is that the gap between models gets larger, so let's try to operationalize what a "gap" might be.
We could consider the expected probability that AI_{N+1} would beat AI_N on a prompt (in expectation over a wide variety of prompts). I think this is close-to-equivalent to a constant gap in Elo score on LMArena.[1] Then "gap increases" would roughly mean that the gap in Elo scores on LMArena between subsequent model releases would be increasing. I don't follow LMArena much, but my sense is that LMArena top scores have been increasing relatively linearly w.r.t. time and sublinearly w.r.t. model releases (just because model releases have become more frequent). In either case I don't think this supports an "increasing gap" argument.
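As a quick illustration of why "constant win probability" and "constant Elo gap" go together, here's a minimal sketch using the standard Elo win-probability formula (LMArena's actual Bradley-Terry-style fitting has more moving parts, but the gap-only dependence is the relevant feature):

```python
# Standard Elo win probability: it depends only on the rating *gap*, so a
# constant P(AI_{N+1} beats AI_N) corresponds to a constant Elo gap,
# regardless of how high the absolute ratings have climbed.

def p_win(elo_a: float, elo_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))

print(p_win(1300, 1200))  # ~0.64
print(p_win(1900, 1800))  # ~0.64 -- same 100-point gap, same win probability
```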
Personally I prefer to look at benchmark scores. The Epoch Capabilities Index (which I worked on) can be handwavily thought of as Elo scores based on benchmark performance. Importantly, the data that feeds into it does not mention release date at all -- we put in only benchmark performance numbers to estimate capabilities, and then plot the estimates against release date after the fact. It also suggests that AI capabilities, as operationalized by this handwavy Elo, are increasing linearly over time.
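To gesture at what "Elo scores based on benchmark performance" could mean, here is a toy Rasch-style (item response theory) sketch; to be clear, this is illustrative only and not Epoch's actual methodology. The point is just that you can estimate a single capability number per model from benchmark scores alone, with release dates appearing nowhere in the fit:

```python
import numpy as np

# Toy model: P(model m scores a point on benchmark b) = sigmoid(ability_m - difficulty_b).
# The fitted abilities play the role of Elo-like capability scores.

scores = np.array([        # rows: models, cols: benchmarks (fraction correct); made-up numbers
    [0.85, 0.40, 0.10],
    [0.92, 0.60, 0.25],
    [0.97, 0.80, 0.55],
])
ability = np.zeros(scores.shape[0])
difficulty = np.zeros(scores.shape[1])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gradient ascent on the Bernoulli log-likelihood.
for _ in range(2000):
    p = sigmoid(ability[:, None] - difficulty[None, :])
    grad = scores - p                    # d(log-likelihood) / d(logit)
    ability += 0.1 * grad.sum(axis=1)
    difficulty -= 0.1 * grad.sum(axis=0)
    difficulty -= difficulty.mean()      # fix the gauge: only relative values are meaningful

print(np.round(ability, 2))  # capability estimates; plot against release date only afterwards
```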
I guess the most likely way in which you might think capabilities are exponential is by looking at the METR time horizons result? Of course, you could instead say that capabilities are increasing linearly by looking at log time horizons. It's not really clear which of these units you should use.
Mostly I think you should not try to go from the METR results to "are gaps in intelligence increasing or staying constant", but if I had to opine on this: the result says that the time horizon has a constant doubling time T. One way to think about this is that the AI one doubling later (at time 2T rather than T) can do at a 50% success rate work that the earlier AI could do at only 25% probability, and even then only if you provide a decomposition into two half-length pieces (each of which it completes with probability 50%). I kinda feel like this suggests "constant gap" more than "increasing gap".
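Spelling out the arithmetic in that comparison (illustrative numbers, just restating the setup above):

```python
p_subtask = 0.5               # earlier AI's success rate on each half-length piece
earlier_ai = p_subtask ** 2   # must complete both pieces -> 0.25
later_ai = 0.5                # one doubling later: does the full task directly at 50%
print(earlier_ai, later_ai)   # 0.25 0.5
```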
Note that I do expect the first two trends to become superlinear eventually via an intelligence explosion, and the METR trend to become superexponential / superlinear (depending on units) probably some time before that (though probably we will just become unable to measure it well). But your claim seems to be about current progress, and for current progress I think it's basically not true that the gap between successive models is increasing rather than staying constant.
Even in the intelligence explosion, capabilities progress is only superlinear w.r.t. time; I expect it would still behave in the same way w.r.t. inputs like compute and labor (where automated researchers should also count as labor). I'm not sure how I expect it to behave w.r.t. successive model generations, partly because I'm not sure "successive model generations" will even be a sensible abstraction at that point. In any case, I don't expect that to be particularly important in assessing the chances of success of a bootstrapping-type plan.
> - I am overall skeptical that the CoT will be that useful (since it's not clear to me that a model doing AI safety research sabotage has to put that in the CoT).
It seems pretty wild to go from "it is possible for an AI to subvert a technique" to "the technique will not be that useful". Is that really what you mean? Are you bearish on all control work?
(At the object-level, I'd say that you're drastically limiting the power of the research sabotage that can be done if the model is forced not to put any of it in the CoT, and you should be very happy about this even if the model can still do some research sabotage.)
Tbc, I also don't expect CoT to be that useful for longer-term concerns, but that's mostly because I expect CoTs to become extremely illegible or to stop existing altogether (partly due to my having "long" timelines; on Anthropic-level short timelines I'd be quite bullish on CoT).
[1] Though I don't know that much about LMArena, and I expect in practice there are confounders; e.g. as they change the distribution of models being evaluated, the meaning of the scores will change.
Just switch to log_10 space and add? E.g. 10k times 300k = 1e4 * 3e5 = 3e9 = 3B. A bit slower but doesn't require any drills.
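Written out as a quick check (nothing beyond the arithmetic already above):

```python
import math

a, b = 1e4, 3e5                       # 10k and 300k
print(a * b)                          # 3000000000.0, i.e. 3e9 = 3 billion
print(math.log10(a) + math.log10(b))  # 4 + ~5.48 = ~9.48, and 10**9.48 ~= 3e9
```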
Seb is explicitly talking about AGI and not ASI. It's right there in the tweet.
Most people in policy and governance are not talking about what happens after an intelligence explosion. There are many voices in AI policy and governance and lots of them say dumb stuff, e.g. I expect someone has said the next generation of AIs will cause huge unemployment. Comparative advantage is indeed one reasonable thing to discuss in response to that conversation.
Stop assuming that everything anyone says about AI must clearly be a response to Yudkowsky.
> it'd be fine if you held alignment constant but dialed up capabilities.
I don't know what this means so I can't give you a prediction about it.
> I don't really see why it's relevant how aligned Claude is if we're not thinking about that as part of it
I just named three reasons:
1. Current models do not provide much evidence one way or another for existential risk from misalignment (in contrast to frequent claims that "the doomers were right").
2. Given tremendous uncertainty, our best guess should be that future models are like current models, and so future models will not try to take over, and so existential risk from misalignment is low.
3. Some particular threat model predicted that even at current capabilities we should see significant misalignment, but we don't see this, which is evidence against that particular threat model.
Is it relevant to the object-level question of "how hard is aligning a superintelligence"? No, not really. But people are often talking about many things other than that question.
For example, is it relevant to "how much should I defer to doomers"? Yes absolutely (see e.g. #1).
> I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?
I'm pretty sure it is not that. When people say this, they are usually just asking the question: "Will current models try to take over or otherwise subvert our control (including incompetently)?" and noticing that the answer is basically "no".[1] What they use this to argue for can then vary:
1. Current models don't provide much evidence one way or another about existential risk from misalignment.
2. Our best guess should be that future models are like current models, and so existential risk from misalignment is low.
3. Some particular threat model predicted significant misalignment even at current capabilities, and the absence of that is evidence against the threat model.
I agree with (1), disagree with (2) when (2) is applied to superintelligence, and for (3) it depends on details.
In Leo's case in particular, I don't think he's using the observation for much; it's mostly just a throwaway claim that's part of the flow of the comment. But inasmuch as it is being used, it is to say something like "current AIs aren't trying to subvert our control, so it's not completely implausible on the face of it that the first automated alignment researcher to which we delegate won't try to subvert our control". That's a pretty weak claim, seems fine, and doesn't imply any kind of extrapolation to superintelligence. I'd be surprised if this were an important disagreement with the "alignment is hard" crowd.
[1] There are demos of models doing stuff like this (e.g. blackmail), but only under conditions selected highly adversarially. These look fragile enough that overall I'd still say current models are more aligned than e.g. rationalists (who under adversarially selected conditions have been known to intentionally murder people).
E.g. One naive threat model says "Orthogonality says that an AI system's goals are completely independent of its capabilities, so we should expect that current AI systems have random goals, which by fragility of value will then be misaligned". Setting aside whether anyone ever believed in such a naive threat model, I think we can agree that current models are evidence against such a threat model.
> Which, importantly, includes every fruit of our science and technology.
I don't think this is the right comparison, since modern science / technology is a collective effort and so can only cumulate thinking through mostly-interpretable steps. (This may also be true for AI, but if so then you get interpretability by default, at least interpretability-to-the-AIs, at which point you are very likely better off trying to build AIs that can explain that to humans.)
In contrast, I'd expect individual steps of scientific progress that happen within a single mind often are very uninterpretable (see e.g. "research taste").
> If we understood an external superhuman world-model as well as a human understands their own world-model, I think that'd obviously get us access to tons of novel knowledge.
Sure, I agree with that, but "getting access to tons of novel knowledge" is nowhere close to "can compete with the current paradigm of building AI", which seems like the appropriate bar given you are trying to "produce a different tool powerful enough to get us out of the current mess".
To make this concrete, I'd wildly guess (with huge uncertainty) that this would involve an alignment tax of ~4 GPTs, in the sense that an interpretable world model from GPT-10, similar in quality to a human's understanding of their own world model, would be about as useful as GPT-6.
> a human's world-model is symbolically interpretable by the human mind containing it.
Say what now? This seems very false:
Tbc I can believe it's true in some cases, e.g. I could believe that some humans' far-mode abstract world models are approximately symbolically interpretable to their mind (though I don't think mine is). But it seems false in the vast majority of domains (if you are measuring relative to competent, experienced people in those domains, as seems necessary if you are aiming for your system to outperform what humans can do).
I wish that when you wrote these comments you acknowledged that some people just actually think we can substantially reduce risk via what you call "marginalist" approaches. Not everyone agrees that you have to deeply understand intelligence from first principles or else everyone dies. Depending on how you choose your reference class, I'd guess most people disagree with that.
Imo the vast, vast majority of progress in the world happens via "marginalist" approaches, so if you do think you can win via "marginalist" approaches you should generally bias towards them.