Noosphere89


My biggest counterargument to the case that AI progress should be slowed down comes from an observation made by porby about a property we theorize AI systems will have but which they fundamentally lack, and which is the foundational assumption behind AI risk:

Instrumental convergence, and its corollaries like power-seeking.

The important point is that current and most plausible future AI systems have no incentive to learn instrumental goals. The type of AI that does have enough room and few enough constraints to learn them, like RL with sufficiently unconstrained action spaces, is essentially useless for capabilities today, and the strongest RL agents use non-instrumental world models.

Thus, instrumental convergence for AI systems is fundamentally wrong. Given that this is the foundational assumption behind the claim that superhuman AI systems pose risks we couldn't handle, many other arguments, for why we might want to slow down AI, for why the alignment problem is hard, and much of the discussion in the AI governance and technical safety spaces, especially on LW, become unsound: at best they reason from an uncertain foundation, and at worst they reason from a false premise to many false conclusions, like the argument that we should reduce AI progress.

Fundamentally, instrumental convergence being wrong would demand pretty vast changes to how we approach AI, from alignment to safety and much more.

To be clear, the fact that the only flaw I could find in AI risk arguments is that they were founded on false premises is actually better than many other failure modes, because it at least shows fundamentally strong, locally valid reasoning on LW, rather than motivated reasoning or other biases that transform true statements into false ones.

One particular consequence of the insight is that OpenAI and Anthropic were fundamentally right in their AI alignment plans, because they have managed to avoid incentivizing instrumental convergence, and in particular LLMs can be extremely capable without being arbitrarily capable or developing instrumental world models when given more resources.

I learned about the observation from this post below:

https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty

Porby talks about why AI isn't incentivized to learn instrumental goals, but given how much this assumption gets used in AI discourse, sometimes implicitly, I think it's of great importance that instrumental convergence is likely wrong.

I have other disagreements, but this is my deepest disagreement with your model (and with other models on which AI is especially dangerous).

EDIT: A new post on instrumental convergence came out, and it showed that many of the inferences made weren't just unsound but invalid; in particular, Nick Bostrom's Superintelligence was wildly invalid in applying instrumental convergence to reach strong conclusions about AI risk.

My question is: why do you consider most work on concentration-of-power risk net-negative?

This wasn't specifically connected to the post, just providing general commentary.

If I were to take anything away from this, it's that you can have cognition/intelligence that is efficient, or cognition that is rational/unexploitable like full-blown Bayesianism, but not both.

And that given the constraints of today, it is far better to have efficient cognition than rational/unexploitable cognition, because the former can actually be implemented, while the latter can't be implemented at all.
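To make the efficiency gap concrete, here is a minimal sketch (with assumed, illustrative numbers, not from the original discussion): exact Bayesian updating over the joint space of n binary features has to track 2^n hypotheses, while a cheap factored approximation tracks only n numbers, at the cost of being exploitable when the features correlate.

```python
def exact_bayes_hypothesis_count(n_features: int) -> int:
    """Exact Bayesian updating over n binary features must track a
    probability for every one of the 2**n joint hypotheses."""
    return 2 ** n_features


def factored_approx_parameter_count(n_features: int) -> int:
    """A naive factored approximation tracks one number per feature:
    efficient, but exploitable when features are correlated."""
    return n_features


for n in (10, 30, 100):
    print(f"n={n:>3}: exact={exact_bayes_hypothesis_count(n):.2e}, "
          f"approx={factored_approx_parameter_count(n)}")
```

The point is only about scaling: the exact, unexploitable updater is already hopeless at a few hundred features, while the approximate one stays trivially cheap.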

My point isn't that the easier option always exists, or even that a problem can't be impossible.

My point is that if you are facing a problem that requires a complete one-shot plan, with no second try, you need to do something else.

There is a line where a problem becomes too difficult to work on productively, and that constraint, if it really exists, is a strong sign of an impossible problem.

I was focusing on runs eligible for the prize in this short linkpost.

Plans obviously need some robustness to things going wrong, and in that sense I weakly agree with John Wentworth that robustness is a necessary feature of a plan, and that some verification is actually necessary.

But I have to agree that moridinamael and Quintin Pope identify a real failure mode, namely perfectionism: discarding ideas too quickly as not useful. This constraint is the essence of that perfectionism:

I have an exercise where I give people the instruction to play a puzzle game ("Baba is You"). Normally you can move around and interact with the world to experiment and learn things; instead, you need to make a complete plan for solving the level, and you aim to get it right on your first try.

It asks both for a complete plan to solve the whole level and for the plan to work on the first try; outside of this context, that combination implies either that the problem is likely unsolvable or that you are being too perfectionist in your demands.

In particular, I think that Quintin Pope's comment here is genuinely something that applies in lots of science and problem solving, and that it's actually quite difficult to reason well about the world in general without many experiments.

What I take away from this is that they should have separated the utility of an assumption being true from the probability/likelihood of it being true, and indeed this shows some calibration problems.

There is a tendency to slip into more convenient worlds for reasons based on utility rather than evidence, which is a problem (assuming the problem is actually solvable for you).

This is an important takeaway, but I don't think your other takeaways help as much as this one.
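As a toy illustration of keeping those two quantities separate (a minimal sketch with made-up numbers, not something from the original exercise): the value you would get if an assumption were true belongs in the expected-utility calculation, weighted by an evidence-based probability, and should not inflate that probability.

```python
def expected_utility(p_true: float, u_if_true: float, u_if_false: float) -> float:
    """Expected utility of acting on an assumption. Utility enters only
    here, weighted by a probability that comes from the evidence."""
    return p_true * u_if_true + (1.0 - p_true) * u_if_false


# The evidence says the convenient assumption is unlikely (p = 0.1),
# even though the world where it holds would be very valuable.
print(expected_utility(0.1, u_if_true=100.0, u_if_false=-5.0))  # 5.5

# Wishful thinking bumps the probability up to 0.6 because the payoff
# is attractive, conflating the utility with the likelihood.
print(expected_utility(0.6, u_if_true=100.0, u_if_false=-5.0))  # 58.0
```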

That said, in real life this constraint makes almost all problems impossible for humans and AIs:

I have an exercise where I give people the instruction to play a puzzle game ("Baba is You"). Normally you can move around and interact with the world to experiment and learn things; instead, you need to make a complete plan for solving the level, and you aim to get it right on your first try.

In particular, if such a constraint exists, it's a big red flag that the problem you are solving is impossible to solve under that constraint.

Almost all plans fail on the first try, even really competent plans from really competent humans, and outside of very constrained regimes essentially zero plans work out on the first try.

Thus, if you are truly in a situation where you face such constraints, you should give up on the problem ASAP, after pausing a little to make sure that the constraint actually exists.

So while this is a fun experiment with real takeaways, I'd warn people that requiring a plan to be complete and to work on the first try makes lots of problems impossible for humans and AIs to solve.
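A rough back-of-the-envelope for why first-try success is so rare (a minimal sketch with assumed per-step numbers): if a plan has n steps that all have to go right, and each step independently succeeds with probability p, the whole plan works on the first try with probability p^n, which collapses quickly as plans get longer.

```python
def first_try_success(p_step: float, n_steps: int) -> float:
    """Probability a plan works on the first try, assuming n_steps
    independent steps that each succeed with probability p_step."""
    return p_step ** n_steps


# Even optimistic per-step reliability collapses over a longer plan.
for p in (0.99, 0.95, 0.90):
    print(f"p={p}: 10 steps -> {first_try_success(p, 10):.2f}, "
          f"50 steps -> {first_try_success(p, 50):.3f}")
# p=0.99: 10 steps -> 0.90, 50 steps -> 0.605
# p=0.95: 10 steps -> 0.60, 50 steps -> 0.077
# p=0.9:  10 steps -> 0.35, 50 steps -> 0.005
```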

Very interesting. Yeah, I'm starting to doubt the idea that the Reversal Curse is any sort of problem for LLMs at all; it's probably trivial to fix.

In retrospect, I probably should have updated much less than I did. I thought that it was actually testing a real LLM, which makes me less confident in the paper.
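For context on why a fix might be cheap, here is a hedged sketch of the mitigation people usually suggest (my own illustration, not something the paper or any particular lab actually implements): augment the training data so that simple "A is B" facts also appear as "B is A", so the model sees both directions.

```python
import re


def add_reversed_facts(facts: list[str]) -> list[str]:
    """Naive augmentation sketch: for simple 'A is B.' statements, also
    emit 'B is A.' so a model trained on the data sees both directions."""
    augmented = list(facts)
    for fact in facts:
        match = re.fullmatch(r"(.+?) is (.+?)\.", fact)
        if match:
            a, b = match.group(1), match.group(2)
            augmented.append(f"{b} is {a}.")
    return augmented


print(add_reversed_facts(["Tom Cruise's mother is Mary Lee Pfeiffer."]))
# ["Tom Cruise's mother is Mary Lee Pfeiffer.",
#  "Mary Lee Pfeiffer is Tom Cruise's mother."]
```

Real augmentation would need more careful parsing than this regex, but the basic idea is just to present both orderings during training.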

Should have responded long ago, but responding now.
