Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.com
Thanks for the reply.
A post which focuses on the object-level implications for AI of a theory of rationality which looks very different from the AIXI-flavoured rat-orthodox view.
I'm working on this right now, actually. Will hopefully post in a couple of weeks.
I say this because those sorts of considerations convinced me that we're much less likely to be buggered.
That seems reasonable. But I do think there's a group of people who have internalized bayesian rationalism enough that the main blocker is their general epistemology, rather than the way they reason about AI in particular.
6 seems too general a claim to me. Why wouldn't it work for 1% vs 10%, and likewise 0.1% vs 1%? I.e. why doesn't this suggest that you should round down P(doom) to zero?
I think the point of 6 is not to say "here's where you should end up", but more to say "here's the reason why this straightforward symmetry argument doesn't hold".
7 I kinda disagree with. Those models of idealized reasoning you mention generalize Bayesianism/Expected Utility Maximization. But they are not far from the Bayesian or EU frameworks.
There's still something importantly true about EU maximization and bayesianism. I think the changes we need will be subtle but have far-reaching ramifications. Analogously, relativity was a subtle change to Newtonian mechanics that had far-reaching implications for how to think about reality.
Like Bayesianism, they do say there are correct and incorrect ways of combining beliefs (that beliefs should be isomorphic to certain structures), unless I'm horribly mistaken. Which is surely not what you're claiming to be the case in your points above.
Any epistemology will rule out some updates, but a problem with bayesianism is that it says there's exactly one correct update to make. Radical probabilism, for example, still imposes some constraints, just far fewer.
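To make the contrast concrete (a rough sketch, nothing more): Bayesian conditionalization pins down a unique posterior, while Jeffrey conditioning, the standard example of a radical-probabilist update, leaves a free parameter:

```latex
% Bayesian conditionalization: once you learn evidence E, the new credence is fixed uniquely.
P_{\mathrm{new}}(H) \;=\; P(H \mid E) \;=\; \frac{P(E \mid H)\,P(H)}{P(E)}

% Jeffrey conditioning: if experience only shifts your credence in E to some q \in (0,1),
% rather than all the way to 1, the update is
P_{\mathrm{new}}(H) \;=\; q\,P(H \mid E) \;+\; (1-q)\,P(H \mid \lnot E)
% The formalism doesn't dictate q, so many coherent updates are compatible with the same experience.
```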
Edited for clarity now.
Some opinions about AI and epistemology:
Part of my view here is that ARA agents will have unique affordances that no human organization will have had before (like having truly vast, vast amounts of pretty high-skill labor).
The more labor they have, the more detectable they are, and the easier they are to shut down. Also, are you picturing them gaining money from crimes, then buying compute legitimately? I think the "crimes" part is hard to stop but the "paying for compute" part is relatively easy to stop.
My guess is that you need to be a decent but not amazing software engineer to ARA.
Yeah, you're probably right. I still stand by the overall point though.
1) It’s not even clear people are going to try to react in the first place.
I think this just depends a lot on how large-scale they are. If they are using millions of dollars of compute, and are effectively large-scale criminal organizations, then there are many different avenues by which they might get detected and suppressed.
If we don't solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop.
A world which can pause AI development is one which can also easily throttle ARA AIs.
The central point is:
- At some point, ARA is unshutdownable unless you try hard with a pivotal cleaning act. We may be stuck with a ChaosGPT forever, which is not existential, but pretty annoying. People are going to die.
- The ARA evolves over time. Maybe this evolution is very slow, maybe fast. Maybe it plateaus, maybe it does not plateau. I don't know.
- This may take an indefinite number of years, but this can be a problem.
This seems like a weak central point. "Pretty annoying" and some people dying is just incredibly small compared with the benefits of AI. And "it might be a problem in an indefinite number of years" doesn't justify the strength of the claims you're making in this post, like "we are approaching a point of no return" and "without a treaty, we are screwed".
An extended analogy: suppose the US and China both think it might be possible to invent a new weapon far more destructive than nuclear weapons, and they're both worried that the other side will invent it first. In that analogy, worrying about ARAs feels like worrying about North Korea's weapons program: it could be a problem in some possible worlds, but it will always be a much smaller one, it will increasingly be left behind as the others progress, and if there's enough political will to solve the main problem (the US and China racing) then you can also easily solve the side problem (e.g. by China putting pressure on North Korea to stop).
you can find some comments I've made about this by searching my twitter
Link here, and there are other comments in the same thread. Was on my laptop, which has twitter blocked, so couldn't link it myself before.
However, it seems to me like ruling out ARA is a relatively natural way to mostly rule out relatively direct danger.
This is what I meant by "ARA as a benchmark"; maybe I should have described it as a proxy instead. And while I agree that ruling out ARA mostly rules out direct danger, I think that's because it's just quite a low bar. The sorts of tasks involved in buying compute etc. are ones most humans could do, whereas more plausible threat models involve expert-level or superhuman hacking. So I expect a significant gap between ARA and those threat models.
once you do have ARA ability, you just need some moderately potent self-improvement ability (including training successor models) for the situation to look reasonably scary
You'd need either really good ARA or really good self-improvement ability for an ARA agent to keep up with labs given the huge compute penalty they'll face, unless there's a big slowdown. And if we can coordinate on such a big slowdown, I expect we can also coordinate on massively throttling potential ARA agents.
I think the opposite: ARA is just not a very compelling threat model in my mind. The key issue is that AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down. While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources. Meanwhile, if they grow large enough to be spending serious amounts of money, they'll need to somehow fool standard law enforcement and general societal scrutiny.
Superintelligences could do all of this, and ARA of superintelligences would be pretty terrible. But for models in the broad human or slightly-superhuman ballpark, ARA seems overrated, compared with threat models that involve subverting key human institutions. Remember, while the ARA models are trying to survive, there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement). These seem much more concerning.
Why then are people trying to do ARA evaluations? Well, ARA was originally introduced primarily as a benchmark rather than a threat model. I.e. it's something that roughly correlates with other threat models, but is easier and more concrete to measure. But, predictably, this distinction has been lost in translation. I've discussed this with Paul and he told me he regrets the extent to which people are treating ARA as a threat model in its own right.
Separately, I think the "natural selection favors AIs over humans" argument is a fairly weak one; you can find some comments I've made about this by searching my twitter.
Such that you can technically do anything you want--you have maximal power/empowerment--but the super-majority of buttons and button combinations you are likely to push result in increasing the number of paperclips.
I think any model of a rational agent needs to incorporate the fact that they're not arbitrarily intelligent, otherwise none of their actions make sense. So I'm not too worried about this.
If you make an empowerment calculus that works for humans who are atomic & ideal agents, it probably breaks once you get a superintelligence who can likely mind-hack you into valuing only power.
Yeah, I agree that a lot of concepts get fragile in the context of superintelligence. But while I think of corrigibility as an actively anti-natural concept, empowerment seems like it could perhaps remain robust and well-founded for longer.
You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.
Yep, totally agree (and in fact I'm at an s-risk retreat right now). Definitely a "could make it decide" rather than a "will make it decide".