My biggest counterargument to the case for slowing down AI progress comes from porby's observation that AI systems fundamentally lack a property we theorize they have, a property that is the one foundational assumption of AI risk:

Instrumental convergence, and its corollaries like power-seeking.

The important point is that current and most plausible future AI systems don't have incentives to learn instrumental goals. The type of AI that has enough space and few enough constraints to learn instrumental goals, like RL with sufficiently unconstrained action spaces, is essentially useless for capabilities today, and the strongest RL agents use non-instrumental world models.

Thus, instrumental convergence for AI systems is fundamentally wrong. Given that this is the foundational assumption behind why superhuman AI systems would pose any risk that we couldn't handle, a lot of other arguments become unsound: arguments for why we might want to slow down AI, for why the alignment problem is hard, and much other discussion in the AI governance and technical safety spaces, especially on LW. At best they're reasoning from an uncertain foundation, and at worst they're reasoning from a false premise to reach many false conclusions, like the argument that we should slow AI progress.

Fundamentally, instrumental convergence being wrong would demand pretty vast changes to how we approach the AI topic, from alignment to safety and much more.

To be clear, the fact that the only flaw I could find in AI risk arguments is that they were founded on false premises is actually better than many other failure modes, because it at least shows fundamentally strong, locally valid reasoning on LW, rather than motivated reasoning or other biases that transform true statements into false statements.

One particular consequence of this insight is that OpenAI and Anthropic were fundamentally right in their AI alignment plans, because they have managed to avoid incentivizing instrumental convergence. In particular, LLMs can be extremely capable without being arbitrarily capable or having instrumental world models given resources.

I learned about the observation from this post below:

https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty

Porby talks about why AI isn't incentivized to learn instrumental goals, but given how much this assumption gets used in AI discourse, sometimes implicitly, I think it's of great importance that instrumental convergence is likely wrong.

I have other disagreements, but this is my deepest disagreement with your model (and with other models on which AI is especially dangerous).

EDIT: A new post on instrumental convergence came out, and it showed that many of the inferences made weren't just unsound, but invalid, and in particular that Nick Bostrom's Superintelligence was wildly invalid in using instrumental convergence to reach strong conclusions on AI risk.


Against John Searle, Gary Marcus, the Chinese Room thought experiment and its world

Noosphere8911d20

I have a better argument now, and the answer is that the argument fails in the conclusion.

The issue is that conditional on assuming that a computer program (speaking very generally here) is able to give a correct response to every input of Chinese characters, and it knows the rules of Chinese completely, then it must know/understand Chinese in order to do the things that Searle claims it to be doing, and in this instance we'd say that it does understand Chinese/decide Chinese for all purposes.

Basically, I'm claiming that the premises lead to a different, opposite conclusion.

These premises:

“Imagine a native English speaker who knows no Chinese locked in a room full of boxes of Chinese symbols (a data base) together with a book of instructions for manipulating the symbols (the program). Imagine that people outside the room send in other Chinese symbols which, unknown to the person in the room, are questions in Chinese (the input). And imagine that by following the instructions in the program the man in the room is able to pass out Chinese symbols which are correct answers to the questions (the output).”

assuming that every input has in fact been used, contradict this conclusion:

“The program enables the person in the room to pass the Turing Test for understanding Chinese but he does not understand a word of Chinese.”

The correct conclusion, including all assumptions, is that they do understand/decide Chinese completely.

The one-sentence slogan is "Look-up table programs are a valid form of intelligence/understanding, albeit the most inefficient form of intelligence/understanding."

What the argument does show is that, without any restrictions on how the program computes Chinese (or any other problem) beyond the requirement that it give a correct answer to every input, the answer to the question "Is it intelligent on this specific problem/does it understand this specific problem?" is always yes. For the answer to possibly be no, you need to add more restrictions than that.
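
A toy sketch of the slogan, under loudly stated assumptions: single-digit addition stands in for "every input of Chinese characters", and the names `add_by_rule`, `add_by_table`, and `behaviourally_identical` are hypothetical. It only illustrates that a look-up table and a compact rule cannot be told apart by their answers when correctness on every input is the sole requirement; it is not a model of language understanding.

```python
from itertools import product

# Assumption: a tiny, fully enumerable input domain stands in for "every input".
DIGITS = range(10)


def add_by_rule(a: int, b: int) -> int:
    """Answer the question by applying a compact rule."""
    return a + b


# The look-up table stores one precomputed answer per possible input.
# It needs 100 entries to cover what the rule covers in one line,
# which is the "most inefficient form" part of the slogan.
LOOKUP_TABLE = {(a, b): add_by_rule(a, b) for a, b in product(DIGITS, repeat=2)}


def add_by_table(a: int, b: int) -> int:
    """Answer the question by pure retrieval, with no computation at query time."""
    return LOOKUP_TABLE[(a, b)]


def behaviourally_identical() -> bool:
    """Check that both programs give the same answer on every input in the domain."""
    return all(
        add_by_rule(a, b) == add_by_table(a, b)
        for a, b in product(DIGITS, repeat=2)
    )
```

If the only criterion is "gives a correct answer to every input", both functions pass, so any verdict that separates them has to come from an extra restriction such as table size or lookup cost.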


When is Goodhart catastrophic?

Noosphere8911d31

Essentially, the paper's model requires, by assumption, that it is impossible to get any efficiency gains (like "don't sleep on the floor" or "use this more efficient design instead") or mutually-beneficial deals (like helping two sides negotiate and avoid a war).

Yeah, that was a different assumption that I didn't realize, because I thought the assumption was solely that we had a limited budget and every increase in a feature has a non-zero cost, which is a very different assumption.

I sort of wish the assumptions were distinguished, because these are very, very different assumptions (for example, you can have positive-sum interactions/trade so long as the cost is sufficiently low and the utility gain is sufficiently high, which is pretty usual.)
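
A minimal sketch of why the two assumptions come apart, using made-up numbers and a hypothetical helper `trade_outcome` (neither is from the paper): every feature increase has a non-zero cost and the budget is limited, yet the trade is still positive-sum because each party's gain exceeds its cost.

```python
def trade_outcome(budget: float, costs: list[float], gains: list[float]):
    """Evaluate a trade under a limited budget with non-zero costs.

    Returns (affordable, net_gains, positive_sum), where positive_sum is True
    only if the trade fits the budget and every party ends up strictly better off.
    """
    affordable = sum(costs) <= budget
    net_gains = [gain - cost for gain, cost in zip(gains, costs)]
    positive_sum = affordable and all(net > 0 for net in net_gains)
    return affordable, net_gains, positive_sum


# Illustrative values only: two parties, each paying a small non-zero cost.
affordable, net_gains, positive_sum = trade_outcome(
    budget=10.0, costs=[1.0, 2.0], gains=[5.0, 4.0]
)
```

With these illustrative values, both parties come out ahead while both "limited budget" and "non-zero cost" hold, so ruling out efficiency gains or mutually beneficial deals has to be a separate assumption.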


When is Goodhart catastrophic?

Noosphere8912d20

The real issue IMO is assumption 1, the assumption that utility strictly increases. Assumption 2 is, barring rather exotic regimes far into the future, basically always correct, and for irreversible computation, this always happens, since there's a minimum cost to increase the features IRL, and it isn't 0.

Increasing utility IRL is not free.

Assumption 1 is plausibly violated for some goods, provided utility grows slower than logarithmically, but the worry here is that status might actually be a utility that strictly increases, at least relatively speaking.


Inference cost limits the impact of ever larger models

Noosphere8914d20

My general prior on inference cost is that it is the same order of magnitude as training cost, and thus neither dominates the other in general, due to tradeoffs.

I don't remember where I got that idea from, though.


How does the ever-increasing use of AI in the military for the direct purpose of murdering people affect your p(doom)?

Answer by Noosphere89Apr 13, 202442

I basically agree with John Wentworth here that it affects p(doom) not at all, but one thing I will say is that it makes claims that humans will make the decisions and be accountable once AI gets very useful rather hard to believe.

More generally, one takeaway I see from the military's use of AI is that there are strong pressures to let AI systems operate on their own, and this is going to be surprisingly important in the future.


Ackshually, many worlds is wrong

Noosphere8915d2-1

My read of the post is not that many worlds is wrong, but rather that it's not uniquely correct, that many worlds has some issues of its own, and that other theories are at least coherent.

Is this a correct reading of this post?


Any evidence or reason to expect a multiverse / Everett branches?

Noosphere8917d90

What's the technical objection you have to it?


On green

Noosphere891mo3-2

Yeah, the basic failure mode of green is that it relies on cartoonish descriptions of nature that are much closer to Pocahontas, or really any Disney movie, than to real-life nature, and in general it is extremely non-self-reliant, in the sense that it relies heavily on both Blue's and Red's efforts to preserve the idealized Green.

Otherwise, it collapses into large scale black and arguably red personalities of nature.


Natural Latents: The Concepts

Noosphere891mo22

Your point on laws and natural abstractions expresses nicely a big problem with postmodernism that was always there, but wasn't clearly pointed out:

Natural abstractions, and more generally almost every concept, are subjective in the sense that people can change what a concept means. But that doesn't mean you can deny the concept/abstraction and instantly make it ineffective. You can't simply assign different meanings or different concepts to the same data and expect the concept to stop working; you have to do real work and, importantly, change things in the world. You actually have to change the behavior of lots of other humans, and if you fail, the concept is still real.

This also generalizes to a lot of other abstractions like gender or sexuality, where real work, especially in medicine and biotech, is necessary if you want concepts of gender or sex to change drastically.

This is why a lot of postmodernism is wrong to claim that denying concepts automatically negates their power. You have to do real work to change concepts, which is why I tend to favor technological progress.

I'll put the social concepts one in the link below, because it's so good as a response to postmodernism:

https://www.lesswrong.com/posts/mMEbfooQzMwJERAJJ/natural-latents-the-concepts#Social_Constructs__Laws
