Responding to the disagree reaction: while I do think the non-reaction isn't well explained by selfishness plus prioritizing near-term utility over long-run utility (people with those values would probably ask to shut it down, or potentially even to speed it up), I do think those values predict the AI arms-race dynamic fairly well. You no longer need an astronomically low probability of extinction before developing AI to ASI makes sense, it becomes even more important that your side win if you believe in anything close to the level of AI power that LW expects, and selfishness means the effects of generally increasing AI risk don't actually matter to you until it's likely that you personally die.
Indeed, the extinction risk someone will tolerate can easily go above 50%, depending on both their level of selfishness and how focused they are on the long term.
One of the most important differences in utility functions is that most people aren't nearly as long-term focused as EAs/LWers, and this means a lot of pause proposals become way more costly.
The other important difference is altruism: most EAs/LWers are far more altruistic than the median person.
Combine these two points, and both the AI race and the non-reaction to it are mostly explained.
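To make the "above 50%" point concrete, here's a toy expected-utility sketch. The `acceptable_extinction_risk` function and every number in it are my own invented illustration, not anyone's actual estimates:

```python
# Toy model: the extinction probability at which racing to ASI breaks even.
# All parameters are illustrative assumptions, not estimates.

def acceptable_extinction_risk(selfishness: float, longtermism: float,
                               win_value: float = 10.0,
                               personal_death_cost: float = 1.0,
                               longterm_loss_cost: float = 100.0) -> float:
    """Break-even extinction probability p for racing vs. not racing.

    Utility of racing = (1 - p) * win_value - p * cost, where cost blends
    personal death (weighted by selfishness) and the loss of the long-run
    future (weighted by longtermism). Setting this to zero gives
    p = win_value / (win_value + cost).
    """
    cost = selfishness * personal_death_cost + longtermism * longterm_loss_cost
    return win_value / (win_value + cost)

# A selfish, near-term-focused actor tolerates enormous risk...
print(acceptable_extinction_risk(selfishness=1.0, longtermism=0.0))  # ~0.91
# ...while a long-term-focused altruist tolerates very little.
print(acceptable_extinction_risk(selfishness=0.2, longtermism=1.0))  # ~0.09
```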
My guess is that the main issue with current transformers turns out to be that they don't have a long-term state/memory, which I think is a pretty critical part of how humans are able to learn on the job as effectively as they do.
The trouble, as I've heard it, is that the approaches which do incorporate a long-run state/memory are apparently much harder to train reasonably well than transformers, plus transformers benefit from first-mover effects.
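As a rough sketch of the distinction I mean (toy dimensions and a stand-in update rule, not any real architecture): a vanilla transformer recomputes its output from whatever fits in the context window, while a stateful model carries a memory forward between steps, so old information can persist outside the window:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16            # toy hidden size
WINDOW = 8        # toy context window

def transformer_step(context: np.ndarray) -> np.ndarray:
    """Stateless: output depends only on what fits in the context window.
    (Mean-pooling stands in for attention to keep the sketch tiny.)"""
    return context.mean(axis=0)

def recurrent_step(x: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """Stateful: a persistent memory vector is updated and carried forward,
    so information from arbitrarily far back can survive."""
    return np.tanh(0.9 * memory + 0.1 * x)

tokens = rng.normal(size=(1000, D))

# Transformer: anything older than WINDOW tokens is simply gone.
out_transformer = transformer_step(tokens[-WINDOW:])

# Stateful model: every token has had a chance to shape the memory.
memory = np.zeros(D)
for x in tokens:
    memory = recurrent_step(x, memory)
out_recurrent = memory
```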
For example, I believe @abramdemski really wants to implement a version of UDT and @Vanessa Kosoy really wants to implement an IBP agent. They are both working on a normative theory which they recognize is currently slightly idealized or incomplete, but I believe that their plan routes through developing that theory to the point that it can be translated into code. Another example is the program synthesis community in computational cognitive science (e.g. Josh Tenenbaum, Zenna Tavares). They are writing functional programs to compete with deep learning right now.
For a criticism of this mindset, see my (previous in this sequence) discussion of why glass-box learners are not necessarily safer. Also, (relatedly) I suspect it will be rather hard to invent a nice paradigm that takes the lead from deep learning. However, I am glad people are working on it and I hope they succeed; and I don't mean that in an empty way. I dabble in this quest myself - I even have a computational cognitive science paper.
For what it's worth, IBP avoids the issue of glass-box learners not necessarily being safe by focusing on desiderata rather than on specific algorithms.
In particular, you could in principle prove things about a black box, so long as it satisfies some desiderata, rather than white-boxing the algorithm and proving things about that.
@Steven Byrnes has talked about this before:
https://www.lesswrong.com/posts/SzrmsbkqydpZyPuEh/my-take-on-vanessa-kosoy-s-take-on-agi-safety
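To illustrate the interface-level framing, here is my own toy example of checking a desideratum against a black box; it is not IBP, and sampling-based testing is of course far weaker than proof, but it shows how a property can be stated purely about input/output behavior:

```python
import random

def satisfies_desideratum(policy, safe_action: str, trials: int = 10_000) -> bool:
    """Sample a black-box policy's input/output behavior and check a toy
    desideratum: on high-stakes inputs, it must take the safe action.
    Nothing here looks at how the policy works internally."""
    for _ in range(trials):
        state = {"high_stakes": random.random() < 0.5,
                 "observation": random.random()}
        if state["high_stakes"] and policy(state) != safe_action:
            return False
    return True

# Any implementation can be plugged in: learned or hand-written, opaque or not.
opaque_policy = lambda s: "shutdown" if s["high_stakes"] else "act"
print(satisfies_desideratum(opaque_policy, safe_action="shutdown"))  # True
```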
The reason I said that is that "human potential", strictly speaking, is indifferent to the values of the humans who make up that potential, and, importantly, existential risks pretty much have to be against everyone's instrumental goals for the concept to have a workable definition.
In particular, human potential is indifferent to the diversity of human values, so long as any humans at all remain alive.
And as Gwern said, the claim that chimpanzees can make a good life for themselves in their societies despite their lack of intelligence has huge asterisk marks at best, and at worst isn't actually true:
https://www.lesswrong.com/posts/DfrSZaf3JC8vJdbZL/?commentId=rNnWduiufEmKFACL4
For what it's worth, I consider problem 1 to be somewhat less of a showstopper than you do, because of things like AI control (which, while unlikely to scale to arbitrary intelligence levels, is probably useful for the problem of instrumental goals).
However, I do think problems 2 and 3 are a big reason why I'm less of a fan of deploying ASI/AGI widely like @joshc wants to do.
Something close to proliferation concerns (especially around bioweapons) is a big reason why I disagree with @Richard_Ngo about AI safety agreeing to be cooperative with open-source demands, or having a cooperative strategy toward open source in the endgame.
Eventually, we will build AIs that could be used safely by small groups, but that cannot be released to the public, except through locked-down APIs with counter-measures against misuse, without everyone or almost everyone dying.
However, I think we can mitigate misuse concerns without requiring much jailbreak robustness, à la @ryan_greenblatt's post on managing catastrophic misuse without robust AIs:
https://www.lesswrong.com/posts/KENtuXySHJgxsH2Qk/managing-catastrophic-misuse-without-robust-ais
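As a loose sketch of what API-level countermeasures can look like (my own toy placeholders: the blocklist, rate limit, and review queue stand in for whatever real classifiers and processes a deployer would use; this is not a summary of Ryan's post):

```python
# Toy defense-in-depth at the API layer: the model weights never need to be
# jailbreak-robust, because misuse is caught by cheap filters, rate limits,
# and human escalation wrapped around the model.

from collections import defaultdict

BLOCKLIST = ("synthesize the pathogen",)   # stand-in for a trained misuse classifier
query_counts = defaultdict(int)
review_queue = []                          # stand-in for human review

def looks_like_misuse(text: str) -> bool:
    return any(phrase in text.lower() for phrase in BLOCKLIST)

def guarded_generate(user_id: str, prompt: str, model_generate) -> str:
    query_counts[user_id] += 1
    if query_counts[user_id] > 100:                 # per-user rate limit
        return "[rate limited]"
    if looks_like_misuse(prompt):                   # input-side filter
        review_queue.append((user_id, prompt))      # escalate to humans
        return "[request declined]"
    completion = model_generate(prompt)
    if looks_like_misuse(completion):               # output-side filter
        review_queue.append((user_id, prompt))
        return "[response withheld]"
    return completion

print(guarded_generate("alice", "Explain how PCR works", lambda p: "PCR amplifies DNA..."))
```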
I like your thoughts on problem 4, and yeah memory complicates a lot of considerations around alignment in interesting ways.
I agree with you that instruction following should be used as a stepping stone to value alignment, and I even have a specific proposal in mind, which at the moment is the Infra-Bayes Physicalist Super-Imitation.
I agree with your post on this issue, so I'm just listing out more considerations.
There are some pretty important caveats:
@Jozdien talks more about this below:
2. As Asher stated, this result would be consistent with a world where RL increased capabilities arbitrarily, so long as the outputs become less diverse, and this paper alone doesn't give us the means to rule out RL increasing capabilities to the point where you do want to use the reasoning model over the base model.
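Here is a toy pass@k calculation (invented numbers, not the paper's data) showing why a pass@k crossover by itself can't distinguish "RL added no capability" from "RL added capability but collapsed sample diversity":

```python
# Toy pass@k: an RL'd model with concentrated (low-diversity) samples can win
# at pass@1 while the more diverse base model overtakes it at large k, even
# though in this toy world the RL'd model is strictly more capable at k=1.

def pass_at_k(per_sample_success: float, k: int) -> float:
    """P(at least one of k independent samples solves the problem)."""
    return 1 - (1 - per_sample_success) ** k

# Base model: modest per-sample success on every problem, but diverse samples.
base = lambda k: pass_at_k(0.10, k)

# RL'd model: solves 40% of problems essentially every time, but low sample
# diversity means extra samples add almost nothing on the remaining 60%.
rl = lambda k: 0.40 + 0.60 * pass_at_k(0.005, k)

for k in (1, 10, 100, 1000):
    print(k, round(base(k), 3), round(rl(k), 3))
# At k=1 the RL'd model wins (0.403 vs 0.100); by k=10 the base model has
# overtaken it (0.651 vs 0.429), giving the kind of crossover being discussed.
```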
Similarly, 200 years of improvements to biological simulations would help more than zero with predicting the behavior of engineered biosystems, but that's not the bar. The bar is "build a functional general purpose biorobot more quickly and cheaply than the boring robotics/integration with world economy path". I don't think human civilization minus AI is on track to be able to do that in the next 200 years.
I don't think it's on track to do so, but that's mostly because the coming population decline makes a regression in tech very likely.
If I instead assumed that the human population would expand in a manner similar to the AI population, and that we were willing to rewrite/ignore regulations, I'd put a 70-80% chance that we could build bio-robots more quickly and cheaply than the boring robotics path within 200 years, with the remaining 20-30% on the possibility that biotech is just fundamentally far more limited than people think.
Links to long comments that I want to pin, but which are too long to be pinned:
https://www.lesswrong.com/posts/Zzar6BWML555xSt6Z/?commentId=aDuYa3DL48TTLPsdJ
https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/?commentId=Gcigdmuje4EacwirD