Sequences

An Opinionated Guide to Computability and Complexity

Comments

My biggest counterargument to the case that AI progress should be slowed down comes from an observation made by porby: AI systems fundamentally lack a property we theorize them to have, and that property is the one foundational assumption behind AI risk:

Instrumental convergence, and its corollaries like power-seeking.

The important point is that current and most plausible future AI systems don't have incentives to learn instrumental goals. The type of AI that has enough space and few enough constraints to learn instrumental goals, like RL with sufficiently unconstrained action spaces, is essentially useless for capabilities today, and the strongest RL agents use non-instrumental world models.

Thus, instrumental convergence for AI systems is fundamentally wrong. Given that this is the foundational assumption for why superhuman AI systems would pose any risk that we couldn't handle, many other arguments become unsound: arguments for why we might want to slow down AI, for why the alignment problem is hard, and much of the other discussion in the AI governance and technical safety spaces, especially on LW. At best they are reasoning from an uncertain foundation, and at worst they are reasoning from a false premise to reach many false conclusions, like the argument that we should reduce AI progress.

Fundamentally, instrumental convergence being wrong would demand pretty vast changes to how we approach the AI topic, from alignment to safety and much more.

To be clear, the fact that the only flaw I could find in AI risk arguments is that they were founded on false premises is actually better than many other failure modes, because it at least shows fundamentally strong, locally valid reasoning on LW, rather than motivated reasoning or other biases that transform true statements into false statements.

One particular case of the insight is that OpenAI and Anthropic were fundamentally right in their AI alignment plans, because they have managed to avoid incentivizing instrumental convergence, and in particular LLMs can be extremely capable without being arbitrarily capable or having instrumental world models given resources.

I learned about the observation from this post below:

https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty

Porby talks about why AI isn't incentivized to learn instrumental goals, but given how much this assumption gets used in AI discourse, sometimes implicitly, I think it's of great importance that instrumental convergence is likely wrong.

I have other disagreements, but this is my deepest disagreement with your model (and with other models on which AI is especially dangerous).

EDIT: A new post on instrumental convergence came out, and it showed that many of the inferences made weren't just unsound, but invalid, and in particular Nick Bostrom's Superintelligence was wildly invalid in applying instrumental convergence to strong conclusions on AI risk.

I actually hope this gets done sometime in the future, but I'm okay with focusing on other things for now.

(specifically the Training vs Out Of Distribution test performance experiment, especially on more realistic neural nets.)

Odd that ‘a model autonomously engaging in a sustained sequence of unsafe behavior’ only counts as an ‘AI safety incident’ if it is not ‘at the request of a user.’ If a user requests that, aren’t you supposed to ensure the model doesn’t do it?

I actually agree with this. This is a good thing since a lot of the bill's provisions are useful in the case of misalignment, but not misuse. In particular, I would not support a lot of the provisions like fully shutting down AI in the misuse case, so I'm happy for that.

Overall, I must say that as an optimist on AI safety, I am reasonably happy with the bill. Admittedly, the devil is in what standards of evidence are required to not have a positive safety determination, and how much evidence they would need.

I want to note that just because the probability is 0 for X happening does not in general mean that X can never happen.

A good example of this is that you can decide with probability 1 whether a program halts, but that doesn't let you turn this into a decision procedure on a Turing Machine that will analyze every arbitrary Turing Machine and decide whether it halts or not, for well-known reasons.

(Oracles and hypercomputation in general can, but that's not the topic for today here.)

In general, one of the most common confusions on LW is assuming that probability 0 means the event can never happen, and that probability 1 means the event must happen.
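As a minimal illustration of the distinction (a standard textbook example, not taken from the post being discussed, and with the uniform distribution chosen purely as an assumption for simplicity), take a random variable drawn uniformly from the unit interval:

```latex
% Assumption for illustration: X is uniform on [0,1], with density f(t) = 1.
\[
  X \sim \mathrm{Uniform}[0,1], \qquad
  P(X = x) = \int_{x}^{x} 1 \, dt = 0 \quad \text{for every } x \in [0,1].
\]
% Yet every draw produces some value x, so an event of probability 0
% occurs on every single draw. Conversely,
\[
  P\!\left(X \neq \tfrac{1}{2}\right) = 1 - P\!\left(X = \tfrac{1}{2}\right) = 1,
\]
% even though the outcome X = 1/2 is not logically excluded.
```

So "probability 0" here only says the event has measure zero, and "probability 1" only says its complement does; neither is a statement about what is possible.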

This is a response to this part of the post.

And while 0 is the mode of this distribution, it’s still just a single point of width 0 on a continuum, meaning the probability of any given effect size being exactly 0, represented by the area of the red line in the picture, is almost 0.

That's a much more reasonable claim, though it might still be too high.

Potentially, but that would require a lot of bitcoin people to admit that government intervention in their activity is at least sometimes good. Given all the other flaws of bitcoin, like having irreversible transactions, it truly is one of those products that isn't valuable at all in the money role except in extreme edge cases, and pretty much all other inventions have had more use than this. That is why I think that in order for crypto to be useful, you need to entirely remove the money aspect via some means, and IMO, governments are the most practical means of doing so.

My primary concern here is that biology remains substantial as the most important cruxes of value to me such as love, caring and family all are part and parcel of the biological body.

I'm starting to think a big crux of my non-doominess probably rests on basically rejecting this premise, alongside a related premise that holds that value is complex and fragile. The arguments for these premises are surprisingly weak, and the evidence in neuroscience is coming to the opposite conclusion: values and capabilities are fairly intertwined, and the value generators are about as simple and general as we could have gotten. This makes me much less worried about several alignment problems like deceptive alignment.

people have written what I think are good responses to that piece; many of the comments, especially this one, and some posts.

There are responses by Quintin Pope and Ryan Greenblatt that addressed their points. Ryan Greenblatt pointed out that the argument used in support of autonomous learning is only distinguishable from supervised learning if there are data limitations, and that we can tell an analogous story about supervised learning having a fast takeoff without data limitations. Quintin Pope has massive comments that I can't really summarize, but one is a general-purpose response to Zvi's post, and the other adds context to the debate between Quintin Pope and Jan Kulveit on culture:

https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn#hkqk6sFphuSHSHxE4

https://www.lesswrong.com/posts/Wr7N9ji36EvvvrqJK/response-to-quintin-pope-s-evolution-provides-no-evidence#PS84seDQqnxHnKy8i

https://www.lesswrong.com/posts/wCtegGaWxttfKZsfx/we-don-t-understand-what-happened-with-culture-enough#YaE9uD398AkKnWWjz

Yep, that's what I was talking about, Seth Herd.

I agree with the claim that deception could arise without deceptive alignment, and mostly agree with the post, but I do still think it's very important to recognize that if/when deceptive alignment fails to work, it changes a lot of the conversation around alignment.
