Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com
Thanks for the reply.
A post which focuses on the object-level implications for AI of a theory of rationality which looks very different from the AIXI-flavoured rat-orthodox view.
I'm working on this right now, actually. Will hopefully post in a couple of weeks.
I say this because those sorts of considerations convinced me that we're much less likely to be buggered.
That seems reasonable. But I do think there's a group of people who have internalized bayesian rationalism enough that the main blocker is their general epistemology, rather than the way they reason about AI in particular.
6 seems too general a claim to me. Why wouldn't it work for 1% vs 10%, and likewise 0.1% vs 1%? I.e. why doesn't this suggest that you should round down P(doom) to zero?
I think the point of 6 is not to say "here's where you should end up", but more to say "here's the reason why this straightforward symmetry argument doesn't hold".
7 I kinda disagree with. Those models of idealized reasoning you mention generalize Bayesianism/Expected Utility Maximization. But they are not far from the Bayesian or EU frameworks.
There's still something importantly true about EU maximization and bayesianism. I think the changes we need will be subtle but have far-reaching ramifications. Analogously, relativity was a subtle change to Newtonian mechanics that had far-reaching implications for how to think about reality.
Like Bayesianism, they do say there are correct and incorrect ways of combining beliefs (that beliefs should be isomorphic to certain structures), unless I'm horribly mistaken. Which is surely not what you're claiming to be the case in your points above.
Any epistemology will rule out some updates, but a problem with bayesianism is that it says there's exactly one correct update to make. Radical probabilism, for example, still imposes some constraints, just far fewer.
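To make the contrast concrete (a rough sketch, nothing more): Bayesian conditionalization pins down a unique posterior, while Jeffrey conditioning, the standard example of a radical-probabilist update, leaves a free parameter:

```latex
% Bayesian conditionalization: once you learn evidence E, the new credence is fixed uniquely.
P_{\mathrm{new}}(H) \;=\; P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)}

% Jeffrey conditioning: if experience only shifts your credence in E to some q \in (0,1),
% rather than all the way to 1, the update is
P_{\mathrm{new}}(H) \;=\; q\,P(H \mid E) \;+\; (1-q)\,P(H \mid \lnot E)
% The formalism doesn't dictate q, so many coherent updates are compatible with the same experience.
```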
Edited for clarity now.
Some opinions about AI and epistemology:
Part of my view here is that ARA agents will have unique affordances that no human organization will have had before (like having truly vast, vast amounts of pretty high-skill labor).
The more labor they have, the more detectable they are, and the easier they are to shut down. Also, are you picturing them gaining money from crimes, then buying compute legitimately? I think the "crimes" part is hard to stop but the "paying for compute" part is relatively easy to stop.
My guess is that you need to be a decent but not amazing software engineer to ARA.
Yeah, you're probably right. I still stand by the overall point though.
1) It’s not even clear people are going to try to react in the first place.
I think this just depends a lot on how large-scale they are. If they are using millions of dollars of compute, and are effectively large-scale criminal organizations, then there are many different avenues by which they might get detected and suppressed.
If we don't solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop.
A world which can pause AI development is one which can also easily throttle ARA AIs.
The central point is:
- At some point, ARA is unshutdownable unless you try hard with a pivotal cleaning act. We may be stuck with a ChaosGPT forever, which is not existential, but pretty annoying. People are going to die.
- The ARA evolves over time. Maybe this evolution is very slow, maybe fast. Maybe it plateaus, maybe it does not plateau. I don't know.
- This may take an indefinite number of years, but this can be a problem.
This seems like a weak central point. "Pretty annoying" and some people dying is just incredibly small compared with the benefits of AI. And "it might be a problem in an indefinite number of years" doesn't justify the strength of the claims you're making in this post, like "we are approaching a point of no return" and "without a treaty, we are screwed".
An extended analogy: suppose the US and China both think it might be possible to invent a new weapon far more destructive than nuclear weapons, and they're both worried that the other side will invent it first. In that analogy, worrying about ARAs feels like worrying about North Korea's weapons program: it could be a problem in some possible worlds, but it will always be a much smaller one, it will increasingly be left behind as the others progress, and if there's enough political will to solve the main problem (the US and China racing) then you can also easily solve the side problem (e.g. by China putting pressure on North Korea to stop).
you can find some comments I've made about this by searching my twitter
Link here, and there are other comments in the same thread. Was on my laptop, which has twitter blocked, so couldn't link it myself before.
However, it seems to me like ruling out ARA is a relatively natural way to mostly rule out relatively direct danger.
This is what I meant by "ARA as a benchmark"; maybe I should have described it as a proxy instead. And while I agree that ruling out ARA mostly rules out direct danger, I think that's because it's just quite a low bar. The sorts of tasks involved in buying compute etc. are ones most humans could do, whereas more plausible threat models involve expert-level or superhuman hacking. So I expect a significant gap between ARA and those threat models.
once you do have ARA ability, you just need some moderately potent self-improvement ability (including training successor models) for the situation to look reasonably scary
You'd need either really good ARA or really good self-improvement ability for an ARA agent to keep up with labs given the huge compute penalty they'll face, unless there's a big slowdown. And if we can coordinate on such a big slowdown, I expect we can also coordinate on massively throttling potential ARA agents.
I think the opposite: ARA is just not a very compelling threat model in my mind. The key issue is that AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down. While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources. Meanwhile, if they grow large enough to be spending serious amounts of money, they'll need to somehow fool standard law enforcement and general societal scrutiny.
Superintelligences could do all of this, and ARA of superintelligences would be pretty terrible. But for models in the broad human or slightly-superhuman ballpark, ARA seems overrated, compared with threat models that involve subverting key human institutions. Remember, while the ARA models are trying to survive, there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement). These seem much more concerning.
Why then are people trying to do ARA evaluations? Well, ARA was originally introduced primarily as a benchmark rather than a threat model. I.e. it's something that roughly correlates with other threat models, but is easier and more concrete to measure. But, predictably, this distinction has been lost in translation. I've discussed this with Paul and he told me he regrets the extent to which people are treating ARA as a threat model in its own right.
Separately, I think the "natural selection favors AIs over humans" argument is a fairly weak one; you can find some comments I've made about this by searching my twitter.
Such that you can technically do anything you want--you have maximal power/empowerment--but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.
I think any model of a rational agent needs to incorporate the fact that they're not arbitrarily intelligent, otherwise none of their actions make sense. So I'm not too worried about this.
If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into valuing only power.
Yeah, I agree that a lot of concepts get fragile in the context of superintelligence. But while I think of corrigibility as an actively anti-natural concept, empowerment seems like it could perhaps remain robust and well-founded for longer.
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.
Yep, totally agree (and in fact I'm at an s-risk retreat right now). Definitely a "could make it decide" rather than a "will make it decide".