As in, Horizon fellows / people who work at Horizon?
What leads you to believe this?
FWIW this is also my impression but I'm going off weak evidence (I wrote about my evidence here), and Horizon is pretty opaque so it's hard to tell. A couple weeks ago I tried reaching out to them to talk about it but they haven't responded.
I think so, yeah. I think my probability of the next model being catastrophically dangerous is a bit higher than it was a year ago, mainly because of the IMO gold medal result and similar improvements in models' ability to reason about hard problems. An argument in the other direction is that the more data points you have along a capabilities curve, the more confident you can be that your model of the curve is accurate, although on balance I think this is probably outweighed by the fact that we are now closer to AGI than we were a year ago.
The next-gen LLM might pose an existential threat
I'm pretty sure that the next generation of LLMs will be safe. But the risk is still high enough to make me uncomfortable.
How sure are we that scaling laws are correct? Researchers have drawn curves predicting how AI capabilities scale based on how much goes into training them. If you extrapolate those curves, it looks like the next level of LLMs won't be wildly more powerful than the current level. But maybe there's a weird bump in the curve that happens in between GPT-5 and GPT-6 (or between Claude 4.5 and Claude 5), and LLMs suddenly become much more capable in a way that scaling laws didn't predict. I don't think we can be more than 99.9% confident that there's not.
How sure are we that current-gen LLMs aren't sandbagging (that is, deliberately hiding their true skill level)? I think they're still dumb enough that their sandbagging can be caught, and indeed they have been caught sandbagging on some tests. I don't think LLMs are hiding their true capabilities in general, and our understanding of AI capabilities is probably pretty accurate. But I don't think we can be more than 99.9% confident about that.
How sure are we that the extrapolated capability level of the next-gen LLM isn't enough to take over the world? It probably isn't, but we don't really know what level of capability is required for something like that. I don't think we can be more than 99.9% confident.
Perhaps we can be >99.99% confident that the next-gen LLM, at its extrapolated capability level, still won't be as smart as the smartest human. But an LLM has certain advantages over humans: it can work faster (at least on many sorts of tasks), it can copy itself, and it can operate computers in ways that humans can't.
Alternatively, GPT-6/Claude 5 might not be able to take over the world, but it might be smart enough to recursively self-improve, and that might happen too quickly for us to do anything about it.
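How sure are we that we aren't wrong about something else? I thought of three ways we could be disastrously wrong: about scaling laws, about sandbagging, and about the capability level required to take over the world.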
But we could be wrong about some entirely different thing that I didn't even think of. I'm not more than 99.9% confident that my list is comprehensive.
On the whole, I don't think we can say there's less than a 0.4% chance that the next-gen LLM forces us down a path that inevitably ends in everyone dying.
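Here is a rough sketch of the arithmetic behind that number. It assumes each of the four questions above carries exactly a 0.1% chance that we're wrong, and it treats the four failures as independent, which is a simplifying assumption for illustration and not something I've argued for. The function and variable names are just for this example.

```python
def chance_any_failure(confidences):
    """Probability that at least one assumption is wrong,
    assuming (hypothetically) that the failures are independent."""
    p_all_hold = 1.0
    for c in confidences:
        p_all_hold *= c
    return 1.0 - p_all_hold


# Four sources of doubt, each capped at 99.9% confidence:
# scaling laws, sandbagging, takeover capability threshold, unknown unknowns.
confidences = [0.999, 0.999, 0.999, 0.999]

independent = chance_any_failure(confidences)

# Union bound: simply add up the individual failure probabilities.
# This is an upper bound that holds without assuming independence.
union_bound = sum(1.0 - c for c in confidences)

print(f"independent failures: {independent:.4%}")
print(f"union bound:          {union_bound:.4%}")
```

Both approaches land at roughly 0.4%. At probabilities this small, the independence assumption barely changes the answer.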
As I understand, this is how scientific bodies' position statements get written. Scientists do not universally agree about the facts in their field, but they iterate on the statement until none of the signatories have any major objections.
I have a strong intuition that this isn't how it works:
When I have a positive experience, it is readily apparent to me that the experience is positive, and no amount of argument can convince me that actually I didn't enjoy myself.
Suppose I did something that I quite enjoyed, and then Omega came up to me and said "actually somebody else last week (or a simulation of you, or whatever) already experienced those exact same qualia, so your qualia weren't that valuable." I'd say, sorry Omega, that is wrong, my experience was good regardless of whether it had already happened before. I know it was good because I directly experienced its goodness.
Perhaps I'm misunderstanding your objection but I think the issue is that Claude is hosted on AWS servers (among other places), which means Amazon could steal Claude's model weights if it wanted to, and ASL-3 states that Claude needs to be secure against theft by other companies (including Amazon).
I don't know for sure that Zach's assertion is true, but I'm reasonably confident that a dedicated Amazon security team could steal the contents of any AWS server if they really wanted to.
The great thing is outside university nobody cares about how fast I can apply the gauss algorithm. It's just important that I know when to use it.
This particular fact sounds right but I think the generalization is often wrong. At my first software development job, I learned more slowly than my peers, and I took longer than usual to get promoted from entry-level to mid-level. This had a real material impact on my earnings and therefore how much money I could donate. It would have been better for the world if I had been able to learn faster.
But I still basically agree with the lesson, as I understand it: trying to go fast is overrated. I don't think "try to go fast" would've helped me at all. (In fact I often was trying to go fast, and it didn't help.)
Yeah I agree that it's not a good plan. I just think that if you're proposing your own plan, your plan should at least mention the "standard" plan and why you prefer to do something different. Like give some commentary on why you don't think alignment bootstrapping is a solution. (And I would probably agree with your commentary.)
Importantly, AFAICT some Horizon fellows are actively working against x-risk (pulling the rope backwards, not sideways). So Horizon's sign of impact is unclear to me. For a lot of people, "tech policy going well" means "regulations that don't impede tech companies' growth".