A potentially high impact differential technological development area

Noosphere89

Specifically, this is boosting LLMs and Non-RL AI relative to RL AI.

I am making this post because I want to signal boost an observation by porby, which could be lossily compressed as, "Instrumental goals/Instrumental convergence is at best a debatable assumption for LLMs and Non-RL AI, and may not be there at all for LLMs/Non-RL AI."

Expressed in this post and in a comment below:

https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty

https://www.lesswrong.com/posts/rmfjo4Wmtgq8qa2B7/?commentId=GKhn2ktBuxjNhmaWB

I want to point this out because analyses of AI risk on LW/EA that conclude it's a big problem (especially the more pessimistic ones) generally require instrumental convergence/instrumental goals in order to get AI risk that isn't part of AI ethics or AI misuse concerns.

So it being false, or at least having a much higher probability of being false for LLMs/Non-RL AI, is a very big deal, because the concern over AI risk that isn't a standard misuse/ethics story pretty much relies on instrumental convergence being right.

The basic reason for this, again lossily compressed, is that there aren't many intermediate steps between the input and the output, unlike in RL. So there's far less incentive for instrumental goals/instrumental convergence, since instrumental goals are much less useful when there are only a few steps between the input and output.

Given this new reality, I have identified a pretty high impact area for making AI progress safer:

Help OpenAI/Microsoft and Anthropic win the race to AGI, relative to DeepMind, or at least have Non-RL AI win the race to AGI, relative to RL AI.

The reason I'm suggesting this as a way to make safer AI is that Non-RL AI, especially SOTA LLMs, has far, far less incentive to have instrumental goals, relative to RL. And if you can avoid instrumental convergence, you have a much easier alignment problem: you don't have to solve the adversarial cases of AI Alignment, only the non-adversarial cases, and in general, solving a problem in the non-adversarial case is radically easier than solving it in adversarial cases.

This also makes OpenAI's alignment plan (linked below) work a lot better, particularly the part where they want to make AI that solves the AI alignment problem. That plan has been criticized before, in a comment and in some posts, like Akash's post on OpenAI's alignment plan (also linked below). But in light of the probable incorrectness of instrumental convergence, OpenAI's alignment plan looks a lot better than before: it's much easier to align an LLM AI alignment/safety researcher than a standard RL AI, because there's little incentive for power-seeking where we don't want it, so almost every AI risk story that isn't misuse or ethics falls apart.

In particular, since the base AI is easy to align, it's much easier to recursively align successors of LLMs.

The OpenAI alignment plan and the criticism of it are linked below:

https://openai.com/blog/our-approach-to-alignment-research

https://www.lesswrong.com/posts/tD9zEiHfkvakpnNam/?commentId=MWQpJrTsYzFoQiPoE

https://www.lesswrong.com/posts/FBG7AghvvP7fPYzkx/my-thoughts-on-openai-s-alignment-plan-1

Implications of this new finding

I'd say the most important implication of the probable non-existence of instrumental convergence in LLMs is that a lot of the standard LW framing on the AI alignment problem is basically wrong, because we aren't dealing with an adversarial problem, we are dealing with a problem that normal science can probably solve given more resources.

Equivalently, Eliezer Yudkowsky's and others' emphasis on security mindset is fundamentally oversold and the wrong frame for the problem.

It also implies that fairly radical downshifting of p(DOOM) is warranted for a lot of people, or equivalently that we should be way more optimistic about the AI Alignment problem being solved.

The people that are skeptical of AI doom stories or optimistic around AI progress are basically right, or at least a lot more right than the pessimists, even if it's for the wrong reasons.

AI governance should probably be careful about restricting AI, given that at least one AI technology path to AGI is way, way safer than the others.

OpenAI was fundamentally right about its claims to being a safe AGI company, compared to every other company. In particular, people working on non-RL AI are making safer AI, at least compared to other paths.

And finally, while the AI Alignment problem still matters, I'd say it probably matters a lot less than it used to. In particular, I suspect the shift is big enough to downgrade the problem such that it matters way less than other problems, so LWers should be less fixated on AI existential risks, compared to other problems.

I agree that LLMs are less intrinsically agent-y and may be less prone to instrumental convergence than other AI approaches. (I'm less sure whether this can't be overcome with relatively simple code that calls LLMs as a part of its operation, a la some better version of AutoGPT). I think the central argument of the post, and the linked discussions (which I see as something like "As long as we try our best at each LLM iteration to use what we have for alignment research, we can get better at aligning the next iteration"), is plausible, iff we slow down capabilities research long enough to do so. "We" meaning everyone. So far, we have not done anything close to that, despite some noises in that direction. Nor do we have anything like a reliable plan for achieving such a capabilities slowdown, and no single company or country is in a position to create and enforce one. Possibly no proper subset of countries or companies is in a position to do so. I'd love to see a good way to get to that as a stable equilibrium where we make sure we're only increasing capabilities and trying new algorithms after we've done enough to be sufficiently sure they're safe. What would that look like?

Hm, I don't agree that at least some slowdown is probably necessary, mostly because I don't think that the capabilities of a system actually correlate all that well with how instrumentally convergent a system is. But my key point here is that yes, the iterative approach does in fact work. Most importantly, instrumental convergence either likely doesn't exist or is way more bounded than what LWers imagine. And newly, according to a new post, instrumental convergence probably doesn't even support the claim that AI is existentially risky, even if true. So a lot of the inferences that LW made may be false, given that the implicit use of the assumption was pretty large, even conditional on my argument being false.

The post is below:

https://www.lesswrong.com/posts/w8PNjCS8ZsQuqYWhD/instrumental-convergence-draft

Now to get at a point that seems less appreciated: the argument is more or less uncorrelated with capabilities, in that AI can get essentially arbitrarily capable of things and still have much more bounded instrumental convergence.

So no, we mostly don't have to slow down AI, because the capabilities of an AI are not negatively correlated with safety. And in more probable scenarios like AI misuse, you might actually want to race to the finish line; an argument for this is given by Simeon (simeon_c) in the post below:

https://www.lesswrong.com/posts/oadiC5jmptAbJi6mS/the-cruel-trade-off-between-ai-misuse-and-ai-x-risk-concerns
