Previously "Lanrian" on here. Research analyst at Redwood Research. Views are my own.
Feel free to DM me, email me at [my last name].[my first name]@gmail.com or send something anonymously to https://www.admonymous.co/lukas-finnveden
Dario’s strategy is that we have a history of pulling through seemingly at the last minute under dark circumstances. You know, like Inspector Clouseau, The Flash or Buffy the Vampire Slayer.
He is the CEO of a frontier AI company called Anthropic.
Nice pun. I don't think that the anthropic shadow is real though. See e.g. here.
It seems like I have a tendency to get out of my bets too late (same thing happened with Bitcoin), which I'll have to keep in mind in the future.
Reporting from the future: Bitcoin has kept going up, so I don't think you made the mistake of getting out of it too late.
(I'm also confused about why you thought so in April 2020, given that Bitcoin's price was pretty high at the time.)
Nagy et al (2013) (h/t Carl) looks at time series data of 62 different technologies and compares Wright's law (cost decreases as a power law of cumulative production) vs. generalized Moore's law (technologies improve exponentially with time).
They find that they're both close to equally good, because exponential increase in production is so common, but that Wright's law is slightly better. (They also test various other measures, e.g. economies of scale with instantaneous production levels, time and experience, experience and scale, but find that Wright's law does best.)
I don't know what they find about semiconductors in particular, but given their larger dataset I'm inclined to prefer Wright's law over Moore's for novel domains.
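To make the comparison concrete, here's a minimal sketch (Python, with made-up placeholder numbers, not data from Nagy et al.) of fitting both functional forms to one technology's cost series: Wright's law regresses log(cost) on log(cumulative production), while generalized Moore's law regresses log(cost) on calendar time. If production grows roughly exponentially in time, the two fits are nearly indistinguishable, which is why the paper finds them close to equally good.

```python
# Sketch: comparing Wright's law vs. generalized Moore's law on one
# technology's cost series. The numbers below are made-up placeholders,
# not data from Nagy et al. (2013).
import numpy as np

years = np.array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007])
unit_cost = np.array([100.0, 80.0, 66.0, 55.0, 47.0, 40.0, 35.0, 30.0])
annual_production = np.array([1.0, 1.5, 2.2, 3.3, 5.0, 7.4, 11.0, 16.0])
cumulative_production = np.cumsum(annual_production)

# Wright's law: cost = a * (cumulative production)^(-b)
# => log(cost) is linear in log(cumulative production).
wright_slope, wright_intercept = np.polyfit(
    np.log(cumulative_production), np.log(unit_cost), deg=1
)

# Generalized Moore's law: cost = c * exp(-d * t)
# => log(cost) is linear in calendar time.
moore_slope, moore_intercept = np.polyfit(years, np.log(unit_cost), deg=1)

# Compare in-sample fit via residual sums of squares. (Nagy et al. instead
# compare out-of-sample forecast errors across 62 technologies.)
wright_resid = np.log(unit_cost) - (wright_intercept + wright_slope * np.log(cumulative_production))
moore_resid = np.log(unit_cost) - (moore_intercept + moore_slope * years)
print("Wright RSS:", np.sum(wright_resid**2))
print("Moore  RSS:", np.sum(moore_resid**2))
```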
(FYI @Daniel Kokotajlo, @ryan_greenblatt.)
Dario strongly implies that Anthropic "has this covered" and wouldn't be imposing a massively unreasonable amount of risk if Anthropic proceeded as the leading AI company with a small buffer to spend on building powerful AI more carefully. I do not think Anthropic has this covered and in an (optimistic for Anthropic) world where Anthropic had a 3 month lead I think the chance of AI takeover would be high, perhaps around 20%.
I didn't get this impression. (Or maybe I technically agree with your first sentence, if we remove the word "strongly", but I think the focus on Anthropic being in the lead is weird and that there's incorrect implicature from talking about total risk in the second sentence.)
As far as I can tell, the essay doesn't talk much at all about the difference between Anthropic being 3 months ahead vs. 3 months behind.
"I believe the only solution is legislation" + "I am most worried about societal-level rules" and associated statements strongly imply that there's significant total risk even if the leading company is responsible. (Or alternatively, that at some point, absent regulation, it will be impossible to be both in the lead and to take adequate precautions against risks.)
I do think the essay suggests that the main role of legislation is to (i) make the 'least responsible players' act roughly as responsibly as Anthropic, and (ii) to prevent the race & commercial pressures from heating up even further, which might make it "increasingly hard to focus on addressing autonomy risks" (thereby maybe forcing Anthropic to do less to reduce autonomy risks than they are now).
Which does suggest that, if Anthropic could keep spending their current amount of overhead on safety, then there wouldn't be a huge amount of risk coming from Anthropic's own models. And I would agree with you that this is very plausibly false, and that Anthropic will very plausibly be forced to either proceed in a way that creates a substantial risk of Claude taking over, or to massively increase their ratio of effort on safety vs. capabilities relative to where it is today. (In which case you'd want legislation to substantially reduce commercial pressures relative to where they are today, and not just make everyone invest about as much in safety as Anthropic is doing today.)
Thanks, that makes sense.
One way to help clarify the effect from (I) would be to add error bars to the individual data points. Presumably models with fewer data points would have wider error bars, and then it would make sense that they pull less hard on the regression (see the sketch below).
Error bars would also be generally great to better understand how much weight to give to the results. In cases where you get a low p-value, I have some sense of this. But in cases like figure 13 where it's a null result, it's hard to tell whether that's strong evidence of an absence of effect, or whether there's just too little data and not much evidence either way.
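To spell out what "pull less hard" could mean in practice: one standard option is inverse-variance weighted least squares, where each point gets weight 1/SE², so points with wider error bars influence the slope less (and the reported standard error on the slope helps distinguish "strong evidence of no effect" from "not much evidence either way"). A minimal sketch, with hypothetical numbers rather than your actual data:

```python
# Sketch: inverse-variance weighted regression, where points with larger
# error bars pull less hard on the fitted line. The data and standard
# errors below are hypothetical placeholders.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.8, 1.1, 2.0, 2.7, 4.5])
se = np.array([0.1, 0.1, 0.2, 0.3, 1.0])  # wider error bars on later points

X = sm.add_constant(x)  # intercept + slope
ols_fit = sm.OLS(y, X).fit()                       # unweighted: all points count equally
wls_fit = sm.WLS(y, X, weights=1.0 / se**2).fit()  # weighted: noisy points count less

print("OLS slope:", ols_fit.params[1], "+/-", ols_fit.bse[1])
print("WLS slope:", wls_fit.params[1], "+/-", wls_fit.bse[1])
```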
Thanks for studying this!
I'm confused about figure 13. The red line does not look like a best fit to the blue data-points. I tried to eyeball the x/y location of the datapoints and fit a line and got a slope of ~0.3 rather than -0.03. What am I missing?
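(For reference, the sanity check was roughly the following; the coordinates here are illustrative stand-ins, not the values I actually read off the figure.)

```python
# Sketch of the eyeball check: read off approximate (x, y) coordinates of
# the plotted points and fit a line. The numbers below are stand-ins, not
# the values I actually read off figure 13.
import numpy as np

x_approx = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y_approx = np.array([0.05, 0.12, 0.18, 0.22, 0.30])

slope, intercept = np.polyfit(x_approx, y_approx, deg=1)
print(f"eyeballed slope ~ {slope:.2f}")  # clearly positive, not ~ -0.03
```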
Thanks! Appreciate the clarification & pointer.
It's interesting that your framing is "high confidence there's no underlying corrigible motivation", and mine is more like "unlikely it starts without flaws and the improvement process is under-specified in ways that won't fix large classes of flaws".
I think this particular difference might've been downstream of somewhat uninteresting facts about how I interpreted various arguments. Something like: I read the post and was thinking "Jeremy believes that there's lots of events that can cause a model to act in unaligned ways, hm, presumably I'd evaluate that by looking at the events and seeing whether I agree that those could cause unaligned behavior, and presumably the argument about the high likelihood is downstream of there being a lot of (~independent) such potential events". And then reading this thread I was like "oh, maybe actually the important action is way earlier: I agree that if the model is fundamentally deep-down misaligned, then you can make a long list of events that could reveal that. What I'd need isn't a long list of (independent) events that could cause misaligned behavior, what I'd need is a long list of (independent) ways that the model could be fundamentally deep-down misaligned in a way that'd be catastrophic if revealed/understood".
But maybe the way to square this is just that most types of events in your list correspond to a separate way that the model could've been deep-down misaligned from the start, so it can just as well be read as either.
I'd be happy to video call if you want to talk about this, I think that'd be a quicker way for us to work out where the miscommunication is.
Appreciated! Probably won't prioritize this in the next couple of weeks but will keep it in mind as an option for when I want to properly figure out my views here.
I don't think of most of the distribution-shift-induced changes as changes in motivation; it's more like revealing/understanding the underlying motivation better.
So then is the whole argument premised on high confidence that there's no underlying corrigible motivation in the model? That the initial iterative process will produce an underlying motivation that, if properly understood by the agent itself, recommends rejecting corrigibility?
If so: What's the argument for that? (I didn't notice one in the OP.)
I wrote up a further unappealing implication of SIA+EDT here. (We've talked about it before, so I don't think anything there should be news to you.)
I think supermajorities could do things like this pretty reliably, if it's something they care a lot about. In the US, if a supermajority of people in Congress want something to happen, and are incentivized to vote their beliefs because a supermajority of voters agree, then they can probably pass a law to make it happen. The president would probably be part of the supermajority and therefore cooperative, and it might work even if they aren't. Laws can do a lot.
Of course, it's easy to construct supermajorities of citizens who can't do this kind of thing, if they disproportionately include non-powerful people and don't include powerful people. But that's more about power being unevenly distributed between humans, and less about humans as a collective being disempowered.