yieldthought — LessWrong

Signals of war in August 2021

This is a fair point. I don’t know what economic cost Russia paid by reducing gas supplies, nor whether they could expect to make that up by shipping more later on. Perhaps this was a relatively low-cost and sensible extension of the military positioning.

I guess I have updated to: could we have known that Putin was fully prepared for war and making a credible threat of invasion? I didn’t really see discussion of that so early, and would still love to find sources that did so.

Also: a threat implies demands and negotiation. If we think in these terms, did Putin make demands that could genuinely have been fulfilled and that would have avoided the war? Or was he driven by internal needs?

Signals of war in August 2021

yieldthought3y10

This is a good one and the timing suggests it is true at least in the short term. The Olympics only started in Feb ‘22 though. Do we have any indication that China made Putin wait for several months?

Loss of Alignment is not the High-Order Bit for AI Risk

yieldthought3y10

I guess my point is that individual humans are already misaligned with humanity’s best interests. If each human had the power to cause extinction at will, would we survive long enough for one of them to do it by accident?

Loss of Alignment is not the High-Order Bit for AI Risk

yieldthought3y21

To the extent that reinforcement models could damage the world or become a self-replicating plague, they will do so much earlier in the takeoff when given direct aligned reward for doing so.

Loss of Alignment is not the High-Order Bit for AI Risk

yieldthought3y10

Consider someone consistently giving each new AI release the instructions “become superintelligent and then destroy humanity”. This is not the control problem, but surely doing this will manifest x-risk behaviour at least somewhat earlier than innocuous instructions would?

Loss of Alignment is not the High-Order Bit for AI Risk

yieldthought3y55

A thoughtful decomposition. If we take the time dimension out and consider that AGI just appears ready to go, I think I would directionally agree with this.

My key assertion is that we will get sub-AGI capable of causing meaningful harm when deliberately used for this purpose significantly ahead of getting full AGI capable of causing meaningful harm through misalignment. I should unpack that a little more:

Alignment primarily becomes a problem when solutions produced by an AI are difficult for a human to comprehensively verify. Stable Diffusion could be embedding hypnosis-inducing mind viruses that will cause all humans to breed cats in an effort to maximise the cute catness of the universe, but nobody seriously thinks this is taking place because the model has no representation of any of those things nor the capability to do so.
Causing harm becomes a problem earlier. Stable Diffusion can be used to cause harm, as can AlphaFold. Future models that offer more power will have meaningfully larger envelopes for both harm and good.
Given that we will have the harm problem first, we will have to solve it in order to have a strong chance of facing the alignment problem at all.
If, when we face the alignment problem, we have already solved the harm problem, addressing alignment becomes significantly easier and arguably is now a matter of efficiency rather than existential risk.

It's not quite as straightforward as this, of course: whatever techniques we come up with for avoiding deliberate harm by sub-AGIs might be subverted by stronger AGIs. Still, the primary contention of the essay is that assigning a 15% x-risk to alignment implicitly assumes a solution to the harm problem, and that the harm problem is not currently receiving similar or appropriate levels of investment.

In essence, alignment is not unimportant but alignment-first is the wrong order, because to face an alignment x-risk we must first overcome an unstated harm x-risk.

In this formulation, you could argue that the alignment x-risk is 15% conditional on us solving the harm problem. But given that current investment in AI safety is dramatically weighted towards alignment and not harm, the unconditional alignment x-risk is well below 5%. That figure accounts for the additional outcomes in which we never face the alignment problem because we fail an AI-harm filter, or in which solving AI-harm de-risks alignment, or in which AI-harm proves sufficiently difficult that AI research is significantly impacted, slowing or stopping us from reaching the alignment x-risk filter by 2070 (cf. global moratoriums on nuclear and biological weapons research, which dramatically slowed progress in those areas).
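
To make the arithmetic in that last paragraph concrete, here is a rough sketch. It assumes, as a simplification of my own, that the three outcomes listed can be folded into a single probability of reaching the alignment filter by 2070 with the risk still intact, and that the unconditional risk is just that probability multiplied by the 15% conditional figure. The names `unconditional_risk`, `max_p_reach` and `p_reach` are hypothetical and only there for illustration.

```python
# Assumed simplification: unconditional risk = P(reach alignment filter) * P(x-risk | reached).
CONDITIONAL_RISK = 0.15  # the 15% conditional figure discussed above


def unconditional_risk(p_reach: float, conditional_risk: float = CONDITIONAL_RISK) -> float:
    """Unconditional alignment x-risk, given the chance we face the alignment filter at all."""
    return p_reach * conditional_risk


def max_p_reach(target_risk: float, conditional_risk: float = CONDITIONAL_RISK) -> float:
    """Largest P(reach) that keeps the unconditional risk at or below target_risk."""
    return target_risk / conditional_risk


if __name__ == "__main__":
    # How likely can reaching the filter be while the unconditional risk stays under 5%?
    print(f"max P(reach) for 5% unconditional risk: {max_p_reach(0.05):.3f}")
    # An illustrative (made-up) 30% chance of reaching the filter.
    print(f"unconditional risk at P(reach)=0.30: {unconditional_risk(0.30):.3f}")
```

Under these assumptions, "well below 5%" amounts to saying there is less than roughly a one-in-three chance that we arrive at the alignment problem with its full 15% risk still in place.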
