Noosphere89

A lot of my hope is definitely in the ‘we don’t find a way to build an AGI soon’ bucket.

My biggest hopes against doom lie in two facts: instrumental convergence is too weak an assumption, absent other assumptions, to be a good argument for doom; and unbounded instrumental convergence is actually useless for capabilities compared to much more bounded instrumental convergence, which makes alignment far easier (still hard, but not nearly as hard as many doomers probably think).

Cf. this post on how instrumental convergence is mostly wrong as a predictor of AI doom:

https://www.lesswrong.com/posts/w8PNjCS8ZsQuqYWhD/instrumental-convergence-draft

But now, onto my main comment here:

None. Of. That. Has. Anything. To. Do. With. Us. Not. Dying.

It is deeply troubling to see the question of extinction risk not even dismissed. It is ignored entirely.

I do like the discussion about releasing low-confidence findings with warnings attached, rather than censoring low-confidence results. You love to see it.

I'm going to be blunt and say that the attitude expressed in the first sentence is representative of an attitude I hate on LW: the idea that the scientific method, which generally includes empirical evidence, is fundamentally irrelevant to a field. This is a very big problem I see here, and Richard Ngo, in my view, correctly identified what is wrong with the attitude that the scientific method and empirical evidence are irrelevant to AI safety:

Historically, the way that great scientists have gotten around this issue is by engaging very heavily with empirical data (like Darwin did) or else with strongly predictive theoretical frameworks (like Einstein did). Trying to do work which lacks either is a road with a lot of skulls on it. And that's fine, this might be necessary, and so it's good to have some people pushing in this direction, but it seems like a bunch of people around here don't just ignore the skulls, they seem to lack any awareness that the absence of the key components by which scientific progress has basically ever been made is a red flag at all.

In particular, this is essentially why I'm so concerned about the epistemics of AI safety, especially on LW: this dismissal of empirical evidence and the scientific method is, to put it bluntly, a good example of failing to realize that basically all of our ability to know much of anything rests on them.

I really, really wish LWers weren't nearly as hostile to admitting that the scientific method and empirical evidence matter as Zvi is showing here.

This is actually interesting, because it implies that instrumental convergence is too weak, on its own, to be much of an argument around AI x-risk without other assumptions. That makes it a bit interesting, as I was arguing against the inevitability of instrumental convergence, given that enough space for essentially unbounded instrumental goals is essentially useless for capabilities, compared to the lack of instrumental convergence, or perhaps very bounded instrumental convergence.

On the one hand, this makes my argument less important, since instrumental convergence matters less than I believed it did; on the other hand, it means that a lot of LW reasoning is probably invalid, not just unsound, because it incorrectly assumes that instrumental convergence alone is sufficient to predict very bad outcomes.

And in particular, it implies that LWers, including Nick Bostrom, incorrectly applied instrumental convergence as if it were somehow a good predictor of future AI behavior, beyond very basic behavior.

I'd especially recommend reading footnote 3, because it gave me a very important insight into why instrumental convergence is actually bad for capabilities, or at least not obviously good for capabilities or incentivized, especially with a lot of space to roam:

This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don't know how to correctly specify bounds.

Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from "complete failure" to "insane useless crap." It's notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.

The point I'm trying to make is that the types of AI that are best for capabilities, including some of the more general capabilities like automating alignment research, also don't have much space for instrumental convergence. That matters because it makes it very easy to get alignment research, and safe AI by default, essentially for free, without disturbing capabilities research: the most unconstrained power-seeking AIs are very incapable, so in practice the most capable AIs, the ones that could solve the full problem of alignment and safety, are safe by default, because instrumental convergence currently harms capabilities.

In essence, the set of AI systems that are both capable enough to do alignment and safety research on future AI systems and instrumentally convergent is a much smaller subset of capable AIs, and enough space for extreme instrumental convergence harms capabilities today, so it isn't incentivized.

This matters because it's much, much easier to bootstrap alignment and safety, and it means that OpenAI/Anthropic's plans of automating alignment research have a good chance of working.

It's not that we cannot lose or go extinct, but that it is no longer the default, and in particular a lot of changes to how we do alignment research are necessary as a first step. But the instrumental convergence assumption runs so deep that even if it is wrong only up until a much later point in AI capability increases, that matters a lot more than you might think.

EDIT: A footnote in porby's post actually expresses it a bit cleaner than I said it, so here goes:

This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don't know how to correctly specify bounds.

Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from "complete failure" to "insane useless crap." It's notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.

The fact that instrumental goals with very few constraints are actually useless compared to non-instrumentally-convergent models is really helpful, as it means that a capable system is inherently easy to align and safe by default, or equivalently, that there is a strong anti-correlation between capabilities and instrumentally convergent goals.

I actually don't think the distinction between slow and fast takeoff matters much here, at least compared to what the lack of instrumental convergence offers us. The important part is that AI misuse is a real problem, but one that is importantly much more solvable, because misuse isn't as convergent as the hypothesized instrumental convergence. It matters, but it is a problem that calls for drastically different methods, and it still importantly reduces the danger expected from AI.

My biggest counterargument to the case that AI progress should be slowed down comes from an observation made by porby about the fundamental lack of a property we theorize AI systems to have, the one foundational assumption behind AI risk:

Instrumental convergence, and its corollaries like power-seeking.

The important point is that current, and most plausible future, AI systems don't have incentives to learn instrumental goals; the type of AI that has enough space and very few constraints to learn instrumental goals, like RL with sufficiently unconstrained action spaces, is essentially useless for capabilities today, and the strongest RL agents use non-instrumental world models.

Thus, instrumental convergence for AI systems is fundamentally wrong. Given that it is the foundational assumption of why superhuman AI systems pose any risk that we couldn't handle, a lot of other arguments (for why we might want to slow down AI, for why the alignment problem is hard, and much of the other discussion in the AI governance and technical safety spaces, especially on LW) become unsound, because they reason from an uncertain foundation, and at worst reason from a false premise to many false conclusions, like the argument that we should slow AI progress.

Fundamentally, instrumental convergence being wrong would demand pretty vast changes to how we approach the AI topic, from alignment to safety and much more.

To be clear, the fact that the only flaw I could find within AI risk arguments is that they were founded on false premises is actually better than many other failure modes, because it at least shows fundamentally strong locally valid reasoning on LW, rather than motivated reasoning or other biases that transform true statements into false statements.

One particular case of the insight is that OpenAI and Anthropic were fundamentally right in their AI alignment plans, because they have managed to avoid incentivizing instrumental convergence, and in particular LLMs can be extremely capable without being arbitrarily capable given resources.

I learned about the observation from this post below:

https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty

Porby talks about why AI isn't incentivized to learn instrumental goals; given how often this assumption gets used in AI discourse, sometimes implicitly, I think it's of great importance that instrumental convergence is likely wrong.

I have other disagreements, but this is my deepest disagreement with your model (and other models on which AI is especially dangerous).

The point I was trying to make is that we click on and read negative news, and this skews our perception of what's happening. Critically, the negativity bias operates regardless of the actual severity of the problem: it doesn't distinguish between things that are very bad, things that are merely bad but solvable, and things that are not bad at all.

In essence, I'm positing a selection effect: we keep hearing more about the bad things and hear less, or nothing, about the good things, so we are biased to believe that our world is more negative than it actually is.

And to connect it to the first comment, the reason you keep noticing precursors to existentially risky technology but not precursors to existentially safe technology, i.e. why this is happening:

To me it is as if technologies that tend to do more good than harm, or at least, would improve our odds by their introduction, social or otherwise, do not exist. That can't be right, surely?...

is essentially an aspect of negativity bias: your information sources emphasize negative news over positive news, no matter what reality looks like.
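The selection effect I'm describing can be made concrete with a toy simulation (all numbers here, like the base rate of bad events and the reporting boost, are made-up assumptions for illustration): even when bad events are a minority of what happens, a feed that reports them preferentially makes them look like the majority.

```python
import random

random.seed(0)

def perceived_vs_actual(n_events=100_000, p_bad=0.2, negative_boost=5.0):
    """Toy model of a negativity-biased news feed.

    Each event is bad with probability p_bad. The feed reports bad
    events, but reports a good event only 1/negative_boost of the
    time, so the share of bad news a reader sees overstates the
    true base rate, no matter what that base rate actually is.
    """
    events = [random.random() < p_bad for _ in range(n_events)]
    actual_rate = sum(events) / n_events

    reported = [is_bad for is_bad in events
                if is_bad or random.random() < 1.0 / negative_boost]
    perceived_rate = sum(reported) / len(reported)
    return actual_rate, perceived_rate

actual, perceived = perceived_vs_actual()
print(f"true fraction of bad events:    {actual:.2f}")
print(f"fraction of bad news in feed:   {perceived:.2f}")
```

With these assumed numbers, only about 20% of events are bad, but roughly half of what the reader sees is bad news, which is the gap between the world and the feed that I'm pointing at.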

The link where I got this idea is below:

https://archive.is/lc0aY

It's essentially a frame that views things in a negative light, or equivalently, a frame that treats a certain issue as negative by default unless action is taken.

For example, climate change can be viewed through a negative frame, where we have to solve the problem or we all die, or through a positive frame, where we can solve the problem with green tech.

Shouting Boo just delays it a little and makes it more likely to be good instead of bad. (Currently it is quite likely to be bad.)

I wouldn't be nearly as confident as a lot of LWers here, and in particular I suspect this depends on some details and assumptions that aren't made explicit here.

The problem is that focusing on a negative frame enabled by negativity bias will blind you to solutions, and is in general a great way to get depressed fast, which kills your ability to solve problems. Even more importantly, the problems might be imaginary, created by negativity biases.