tl;dr: because policy discussions follow contextualizing norms, value judgements are often interpreted as endorsements of (uncooperative) actions; people used to decoupling norms should carefully account for this in order to communicate effectively.

One difficulty in having sensible discussions of AI policy is a gap between the norms used in different contexts - in particular the gap between decoupling and contextualizing norms. Chris Leong defines them as follows:

  • Decoupling norms: It is considered eminently reasonable to require the truth of your claims to be considered in isolation - free of any potential implications. An insistence on raising these issues despite a decoupling request is often seen as sloppy thinking or an attempt to deflect.
  • Contextualising norms: It is considered eminently reasonable to expect certain contextual factors or implications to be addressed. Not addressing these factors is often seen as sloppy or an intentional evasion.

LessWrong is one setting which follows very strong decoupling norms. Another is discussion of axiology in philosophy (i.e. which outcomes are better or worse than others). In discussions of axiology, it's taken for granted that claims are made while setting aside cooperative or deontological considerations. For example, if somebody said "a child dying by accident is worse than an old person being murdered, all else equal", then the local discussion norms would definitely not treat this as an endorsement of killing old people to save children from accidents; everyone would understand that there are other constraints in play.

By contrast, in environments with strong contextualizing norms, claims about which outcomes are better or worse than others can be interpreted as endorsements of related actions. Under these norms, the sentence above about accidents and murders could be taken as (partial) endorsement of killing old people in order to save children, unless the speaker added relevant qualifications and caveats.

In particular, I claim that policy discussions tend to follow strong contextualizing norms. I think this is partly for bad reasons (people in politics avoid decoupled statements because they're easier to criticise) and partly for good reasons. Specifically, decoupling norms make it easier to:

  • Construct negative associations such as stereotypes by saying literally true things about non-central examples of a category.
  • "Set the agenda" in underhanded ways—e.g. by steering conversations towards hypotheticals which put the burden of proof on one's opponents, while retreating to the defence of "just asking questions".
  • Exploit the lossiness of some communication channels (e.g. by getting your opponents to say things which will be predictably misinterpreted by the media).

However, I'm less interested in arguing about the extent to which these norms are a good idea, and more interested in the implications of these norms for one's ability to communicate effectively. One implication: there are many statements where saying them directly in policy discussions will be taken as implying other statements that the speaker didn't mean to imply. Most of these are not impossible to say, but instead need to be said much more carefully in order to convey the intended message. The additional effort required may make some people decide it's no longer worthwhile to say those statements; I think this is not dishonesty, but rather responsiveness to costs of communication. To me it seems analogous to how there are many statements that need to be said very carefully in order to convey the intended message under high-decoupling norms (e.g. under high-decoupling norms, saying that someone is arguing in bad faith is a serious accusation which requires strong justification).

In particular, under contextualizing norms, saying "outcome X is worse than outcome Y" can be seen as an endorsement of acting in ways which achieve outcome Y instead of outcome X. There are a range of reasons why you might not endorse this despite believing the original statement (even aside from reputational/coalitional concerns). For example, if outcome Y is "a war":

  • You might hold yourself to deontological constraints about not starting wars.
  • You might worry that endorsing some wars would make other non-endorsed wars more likely.
  • You might hold yourself to decision-theoretic constraints like "people only gave me the ability to start the war because they trusted that I wouldn't do it".
  • If many people disagree with the original claim, then you might think that unilaterally starting the war is defecting in an epistemic prisoner's dilemma.
  • If many people have different values from you, then you might think that unilaterally starting the war is defecting in a moral prisoner's dilemma.

One way of summarizing these points is as different facets of cooperative/deontological morality. To restate the point, then: under contextualizing norms, unless you carefully explain that you're setting aside cooperative constraints, claims like "X is worse than Y" can often reasonably be interpreted as a general endorsement of causing Y in order to avoid X. And I strongly recommend that people who are used to decoupling norms take that into account when interacting in environments with contextualizing norms, especially public ones.

This also goes the other way, though. For example, I think discussions of international policy are higher-decoupling in some ways than discussions of domestic policy or individual morality. In particular, advocacy of widely-supported interventions backed by violence in the international context is not taboo; whereas advocacy of violence by individuals, or by governments against their citizens, is strongly taboo. I think this taboo is very important, and urge others not to break it; nor to conflate advocacy of lawful force (carried out by legitimate (Schelling point) authorities) with unlawful force (carried out unilaterally, in a way that undermines our general ability to coordinate to avoid violence).

61 comments

One of the big differences between decoupling norms and contextualizing norms is that, in practice, it doesn't seem possible to make statements with too many moving parts without contextualizers misinterpreting. Under contextualizing norms, saying "X would imply Y" will be interpreted as meaning both X and Y. Under decoupling norms, a statement like that usually means a complicated inference is being set up, and you are supposed to hold X, Y, and this relation in working memory for a moment while that inference is explained. There's a communication-culture interpretation of this, where interpretations and expectations interact; if people translate "X would imply Y" into "X and Y", and speakers anticipate this, then that's what it means.

But I don't think the communication culture interpretation correctly identifies what's happening. I think what the phrase "contextualizing norms" points at is not a norm at all.

For most people, holding two sentences and a relationship between them in working memory is not something they can do reliably. If you say "X would imply Y", they will lose track of the relationship, and think that you said X directly and you said Y directly. Not because this is a correct interpretation under their communication culture, but because they literally can't track the distinction, without slowing down or using a diagram or something.
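
To make the gap concrete (this formalization is my gloss, not something stated in the comment): the speaker is asserting a conditional, while the contextualizing listener records a conjunction,

$$\underbrace{X \rightarrow Y}_{\text{asserted}} \qquad \text{vs.} \qquad \underbrace{X \wedge Y}_{\text{heard}},$$

and the two come apart precisely when X is false - which is often the very case the speaker is trying to reason about.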

This generalizes across most things that have subtlety. "Decoupling norms" make it easy to say things that contextualizing norms make it impossible to say.

Eliezer wrote:

Make it explicit in international diplomacy that preventing AI extinction scenarios is considered a priority above preventing a full nuclear exchange, and that allied nuclear countries are willing to run some risk of nuclear exchange if that’s what it takes to reduce the risk of large AI training runs.

What's supposed to happen with this sentence, cognitively speaking, is that you read the sentence, slot it into a preexisting model of how deterrence and red lines work. If you don't have that model, because you're not an international relations expert and haven't read Eliezer's previous writing on the subject, then you have to hold onto the whole sentence, with all its moving parts and all its subtlety. This isn't cognitively feasible for everyone, so they substitute a simpler sentence, and unfortunately in this case that means substituting something crazy that was not actually said.

The alternative would have been to embed a small lecture about international relations into the article. I think in this specific case that would have made things better, but that habitually dumbing things down in that way would be catastrophic. Because we still need to discover a solution to AI alignment, and I don't think it's possible to do that without discussing a lot of high-complexity things that can't be said at all under contextualizing norms.

The alternative would have been to embed a small lecture about international relations into the article.

I don't think that's correct; there are cheap ways of making sentences like this one more effective as communication. (E.g. less passive/vague phrasing than "run some risk", which could mean many different things.) And I further claim that most smart people, if they actually spent 5 minutes by the clock thinking of the places where there's the most expected disvalue from being misinterpreted, would have identified that the sentences about nuclear exchanges are in fact likely to be the controversial ones, and that those sentences are easy to misinterpret (or prime others to misinterpret). Communication is hard in general, and we're not seeing all the places where Eliezer did make sensible edits to avoid being misinterpreted, but I still think this example falls squarely into the "avoidable if actually trying" category.

habitually dumbing things down in that way would be catastrophic. Because we still need to discover a solution to AI alignment, and I don't think it's possible to do that without discussing a lot of high-complexity things that can't be said at all under contextualizing norms

That's why you do the solving in places that are higher-fidelity than twitter/mass podcasts/open letters/etc, and the communication or summarization in much simpler forms, rather than trying to shout sentences that are a very small edit distance from crazy claims in a noisy room of people with many different communication norms, and being surprised when they're interpreted differently from how you intended. (Edit: the "shouting" metaphor is referring to twitter, not to the original Time article.)

What's supposed to happen with this sentence, cognitively speaking, is that you read the sentence, slot it into a preexisting model of how deterrence and red lines work.

I think it’s a mistake to qualify this interpretation as an example of following decoupling norms. Deterrence and red lines aren’t mentioned in Eliezer’s comment at all; they’re just extra context that you’ve decided to fill in. That’s generally what people do when they read things under contextualizing norms. Interpreting this comment as a suggestion to consider initiating a nuclear exchange is also a contextualized reading, just with a different context filled in.

A highly-decoupled reading, by contrast, would simply interpret “some risk of nuclear exchange” as, well, some unquantified/unspecified risk.

Extremely insightful point, and for some reason it seems I deeply disagree with the aggregate point, but I can't figure out why at the moment. Strong upvoted, though.

Not because this is a correct interpretation under their communication culture, but because they literally can't track the distinction, without slowing down or using a diagram or something.

I'm unsure, but I suspect that in many of the cases you're thinking of, this is incorrect. I think people can track implication when it's something that "makes sense" to them, that they care about. I suspect that at least in a certain subset of these cases, what's really happening is this:

They believe not-X (in some fashion). You start trying to say "X implies Y". They get confused and are resistant to your statement. When pressed, it comes to light that they are refusing to think about possible worlds in which X is the case. They're refusing because they believe not-X (in some fashion), and it's pointless to think about worlds that are impossible--it won't affect anything because it's unreal, and it's impossible to reason about because it's contradictory. (They wouldn't be able to say all of this explicitly.)

This is great context. With Eliezer being brought up in White House Press Corps meetings, it looks like a flood of people might soon enter the AI risk discourse. Tyler Cowen has been making some pretty bad arguments on AI lately, but I thought this quote was spot on:

“This may sound a little harsh, but the rationality community, EA movement, and the AGI arguers all need to radically expand the kinds of arguments they are able to process and deal with. By a lot. One of the most striking features of the “six-month Pause” plea was how intellectually limited and non-diverse — across fields — the signers were. Where were the clergy, the politicians, the historians, and so on? This should be a wake-up call, but so far it has not been.”

Nearly all reasonable discussions of AI x-risk have taken place in the peculiar cultural bubble of rationality and EA. These past efforts could be multiplied by new interest from mainstream folks in AI, policy, philosophy, economics, and other fields. Or they could be misunderstood and discarded in favor of distractions that claim the mantle of AI safety. Hopefully we can find ways of communicating with new people in other disciplines that will lead to productive conversations on AI x-risk.

Where were the clergy, the politicians, the historians, and so on? This should be a wake-up call, but so far it has not been.

The problem is that for those groups to show up, they'd need to think there is a problem, and this specific problem, first. In my bubble, from more humanities-minded people, I have seen reactions such as "the tech people have scared themselves into a frenzy because they believe their own hype", or even "they're doing this on purpose to pass the message that the AI they sell is powerful". The association with longtermists also seems harmful, in an "if they're involved, there must be something shady about it" way.

I think if one wants involvement from those parts of the public there needs to be a way to overcome the barrier of denial of AI capabilities, and a reframing of the problem in more human terms (for example, how developing dangerous AI is another form of corporations privatising profits and socialising costs, same as with many forms of environmental damage).

This is already a solved problem for smaller-stakes communication: send irrevocably costly signals.

For example, a startup founder mortgages their aging parents' house and then puts that money in their startup to show skin in the game to attract serious investors.

And investors find that pretty reliable because they can call up the mortgage lender and confirm it's real.

The question is what kind of irrevocably costly signals can be coordinated around that the counter-parties can also reliably confirm. And how many folks would actually pay the price.

One would imagine that companies saying "our product might kill everyone" and trying to suggest stopping operations for six months at an industry level would be one such cost: usually companies are really loath to admit the dangers of their products, let alone such extreme ones. Yet some people have worked themselves up into believing that's only marketing.

One would imagine that companies saying "our product might kill everyone" and trying to suggest stopping operations for six months at an industry level would be one such cost:

Which companies have said this on the record?

None explicitly, but the argument that "AI companies drum up the idea that AI can kill everyone for marketing" has emerged a lot after the moratorium open letter, seeing how many signatories are in one way or another involved in the industry. There are also plenty of examples of employees at least admitting that existential risk from AI is a thing.

That's the issue, though: "the clergy, the politicians, the historians" have not heard of these people, so in their view it's barely better than totally random people saying it.

If a major company said this on the record, that's different, because everyone's heard of Microsoft or Google, and their corporate credibility, reputation, etc., is in aggregate literally worth millions of times more than that of even the most influential individual who has signed on so far.

I mean, the open letter made front page news. And there have been a few mainstream news stories about the topic, like this one from CBS.

How does that relate to the perception by "the clergy, the politicians, the historians"?

"One of the most striking features of the “six-month Pause” plea was how intellectually limited and non-diverse — across fields — the signers were."
[...]
Nearly all reasonable discussions of AI x-risk have taken place in the peculiar cultural bubble of rationality and EA.

Some counter-examples that come to mind: Yoshua Bengio, Geoffrey Hinton, Stephen Hawking, Bill Gates, Steve Wozniak. Looking at the Pause Giant Experiments Open letter now, I also see several signatories from fields like history and philosophy, and some signers identifying as teachers, priests, librarians, psychologists, etc.

(Not that I disagree broadly with your point that the discussion has been strongly weighted in the rationality and EA communities.)

My clergy spouse wishes to remind people that there are some important religious events this week, so many clergy are rather busy. I'm quite hopeful that there will be a strong religious response to AI risk, as there is already to climate risk.

Great article! This helped me reframe some of the strong negative reactions to Yudkowsky's article on Twitter.

I suspect a lot of the negative reactions to Yudkowsky's article aren't about norms, exactly, but rather a disagreement about how far we should be willing to go to slow down AI.

Yudkowsky is on the extreme end of the spectrum, viewing airstrikes leading to global nuclear warfare as okay if AI is slowed down.

Suffice it to say, if you don't believe that doom is certain, then you will have massive issues with going this far for AI safety.

Yes, this is why I take issue with his position.

Despite disagreements with you and TurnTrout's models of how optimistic alignment is, I also have many issues with Eliezer's position on AI safety.

My problem isn't people disagreeing; it's none of the people disagreeing actually pointing out what they think is the specific flaw in EY's worries, and what we are doing to avoid them materialising. When so many of the people who are confident there's no danger don't seem to understand key points of the argument, I just get early COVID vibes all over again.

none of the people disagreeing actually pointing out what they think is the specific flaw in EY's worries

Here's a ~12,000 word post of me doing exactly that: My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

I'll explain the issues I have with Eliezer Yudkowsky's position in a nutshell:

  1. Alignment is almost certainly easier than Yudkowsky or the extremely pessimistic people think. In particular, alignment is progressing way more than the extremely pessimistic models predicted.

  2. I don't think that slowing AI down instead of accelerating alignment is the best choice, primarily because I think we should mostly try to improve our chances on the current path rather than overturn it.

  3. Given that I am an optimist on AI safety, I don't really agree with Eliezer's suggestions on how AI should be dealt with.

No. 1 would convince me more if we were seeing good alignment in the existing, still subhuman models. I honestly think there are multiple problems with alignment as a concept, but I also expect that there would be a significant difficulty jump when dealing with superhuman AI (for example, RLHF becomes entirely useless).

No. 2 I don't quite understand - we care about the relative speed of the two, wouldn't anything that says "let's move people from capabilities to alignment research" like the moratorium asks do exactly what you say? You can't arbitrarily speed up one field without affecting the other, really; human resources are limited, and you won't get much mileage out of suddenly funding thousands of CS graduates to work on alignment while the veterans all advance capabilities. There are trade-offs at work. You need to actively rebalance your resources to avoid alignment just playing catch-up. It's kind of an essential part of the work anyway; imagine NASA having hundreds of engineers developing gigantic rocket engines and a team of like four guys working on control systems. You can't go to the Moon with raw power alone.

No. 3 depends heavily on what we think the consequences of misaligned AGI are. How dangerous Eliezer's proposal is also depends on how much countries want to develop AGI. Consider e.g. biological weapons, where it's probably fairly easy to get a consensus on "let's just not do that" once everyone has realised how expensive, complex and ultimately still likely to blow up in your face they are. Vice versa, if alignment is easy, there's probably no reason why anyone would put such an agreement in place; but there needs to be some evidence of alignment being easy, and we need it soon. We can't wait for the point where if we're wrong the AI destroys the world to find out. That's not called a plan, that's just called being lucky, if it goes well.

No. 1 would convince me more if we were seeing good alignment in the existing, still subhuman models. I honestly think there are multiple problems with alignment as a concept, but I also expect that there would be a significant difficulty jump when dealing with superhuman AI (for example, RLHF becomes entirely useless).

The good news is we have better techniques than RLHF, which as you note is not particularly useful as an alignment technique.

On alignment not making sense as a concept, I agree somewhat. In the case of an AI and a human, I think that alignment is sensible, but as you scale up, it increasingly devolves into nonsense until you just force your own values.

No. 2 I don't quite understand - we care about the relative speed of the two, wouldn't anything that says "let's move people from capabilities to alignment research" like the moratorium asks do exactly what you say?

Not exactly, though what I'm envisioning is that you can use a finetuned AI to do alignment research, and while there are capabilities externalities, they may be necessary depending on how much feedback we need in order to solve the alignment problem.

Also, I think part of the disagreement is we are coming from different starting points on how much progress we've made on alignment.

No. 3 depends heavily on what we think the consequences of misaligned AGI are.

This is important, but there are other considerations.

For example, the most important thing to think about with Eliezer's plan is what ethics/morals you use.

Next, you need to consider the consequences of both aligned and misaligned AGIs. And I suspect they net out to much smaller consequences for AGI once you sum up the positives and negatives, assuming a consequentialist ethical system.

The problem is Eliezer's treaty would basically imply GPU production is enough to start a war. This is a much more severe consequence than almost any treaty ever made, and it has very negative impacts under a consequentialist ethical system.
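
(To make the comparison being gestured at here explicit - this formalization is my gloss, with placeholder symbols, not numbers anyone in the thread has committed to - a consequentialist accounting would compare something like

$$\mathbb{E}[\text{continue}] = p\,V_{\text{aligned}} + (1-p)\,V_{\text{misaligned}} \qquad \text{vs.} \qquad \mathbb{E}[\text{enforced pause}] = q\,V_{\text{conflict}} + (1-q)\,V_{\text{delay}},$$

where both the probabilities and the values of each outcome are exactly the quantities under dispute in this thread.)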

evidence of alignment being easy, and we need it soon. We can't wait for the point where if we're wrong the AI destroys the world to find out. That's not called a plan, that's just called being lucky, if it goes well.

I think this is another point of disagreement. While I wouldn't like to test the success without dignity hypothesis, also known as luck, I do think there's a non-trivial probability of that happening, compared to other alignment people who think the chance is effectively epsilon.

Under international law, counterfeiting another nation's currency is considered an act of war and you can "legally" go to war to stop it... if you can bomb a printing press, is it ridiculous to have a treaty that says you can bomb a GPU foundry?

(The two most recent cases of a government actually counterfeiting another nation's currency were Nazi Germany, which made counterfeit British pounds during World War II as part of its military strategy, and North Korea, which produced the "supernote" US dollar.)

And in the end no one bombed North Korea, because saying something is an act of war doesn't imply automatic war anyway; it's subtler than that. Honestly in the hypothetical "no GPUs" world you'd probably have all the major States agreeing it's a danger to them and begrudgingly cooperating on those lines, and the occasional pathetic attempt by some rogue actor with nothing to lose might be nipped in the bud via sanctions or threats. The big question really is how detectable such attempts would be compared to developing e.g. bacteriological weapons. But if tomorrow we found out that North Korea is developing Super Smallpox and plans to release it, what would we do? We are already in a similar world; we just don't think much about it because we've gotten used to this being the precarious equilibrium we exist in.

Next, you need to consider the consequences of both aligned and misaligned AGIs. And I suspect they net out to much smaller consequences for AGI once you sum up the positives and negatives, assuming a consequentialist ethical system.

I find this sort of argument kinda nonsensical. Like, yes, it's useful to conceptualise goods and harms as positives and negatives you balance, but in practice you can't literally put numbers on them and run the sums, especially not with so many uncertainties at stake. It's always possible to fudge the numbers and decide that some values are unimportant and some are super important and lo and behold, the calculation turns in your favour! In the end it's no better than deontology or simply saying "I think this is good"; there is no point trying to vest it with a semblance of objectivity that just isn't there. I am a consequentialist and I think that overall AGI is on the net probably bad for humanity, and I also include some possible outcomes from aligned AGI in there.

I do think there's a non-trivial probability of that happening, compared to other alignment people who think the chance is effectively epsilon.

I don't think it's that improbable either, I just think it's irresponsible either way when so much is at stake. I think the biggest possible points of failure of the doom argument are:

  • we just aren't able to build AGI any time soon (but in that case the whole affair turns out to be much ado about nothing), or

  • we are able to build AGI, but then AGI can't really push past to ASI. This might be pure chance, or the result of us using approaches that merely "copy" human intelligence but aren't able to transcend it (for example, if becoming superintelligent would require being trained on text written by superintelligent entities)

So, sure, we may luck out, though that leaves us "only" with human-level AGI, which is already plenty disruptive. Regardless, this makes the world potentially a much more unstable powder keg. Even without going specifically down the road EY mentions, I think nuclear and MAD analogies do apply because the power in play is just that great (in fact I am writing a post on this; it will go up tomorrow if I can finish it).

It's always possible to fudge the numbers and decide that some values are unimportant and some are super important and lo and behold, the calculation turns in your favour! In the end it's no better than deontology or simply saying "I think this is good"; there is no point trying to vest it with a semblance of objectivity that just isn't there.

Is this not simply the fallacy of gray?

As the saying goes, it's easy to lie with statistics, but even easier to lie without them. Certainly you can fudge the numbers to make the result say anything, but if you show your work then the fudging gets more obvious.

I agree that laying out your thinking at least forces you to specifically elucidate your values. That way people can criticise the precise assumptions they disagree with, and you can't easily back out of them. I don't think the "lying with statistics" saying applies in its original meaning because really this is entirely about subjective terminal values. "Because I like it this way" is essentially what it boils down to no matter how you slice it.

In the end it's no better than deontology or simply saying "I think this is good"; there is no point trying to vest it with a semblance of objectivity that just isn't there.

You're right that it isn't an objective calculation, and apparently it requires more subjective assumptions, so I'll agree that we really shouldn't be treating this as though it's an objective calculation.

I don't think it's that improbable either, I just think it's irresponsible either way when so much is at stake.

I agree that testing that hypothesis is dangerously irresponsible, given the stakes involved. That's why I still support alignment work.

I think that if success without dignity happens, it will be due to some of the following factors:

  1. Alignment turns out to be really easy by default; that is, naive ideas like RLHF just work, or it turns out that value learning is almost trivial.

  2. Corrigibility is really easy or trivial to do, such that alignment isn't relevant, because humans can redirect its goals easily. In particular, it's easy to get AIs to respect a shutdown order.

  3. We can't make AGI, or it's too hard to progress AGI to ASI.

These are the major factors I view as likely in a success-without-dignity case, i.e. one where we survive AGI/ASI via luck.

I find 1 unlikely, 2 almost impossible (or rather, it would imply partial alignment, in which at least you managed to impress Asimov's Second Law of Robotics into your AGI above all else), and 3 the most likely, but also unstable (what if your 10^8 instances of AGI engineers suddenly achieve a breakthrough after 20 years of work?). So this doesn't seem particularly satisfying to me.

Responding to your #1, do you think we're on track to handle the cluster of AGI Ruin scenarios pointed at in 16-19? I feel we are not making any progress here other than towards verifying some properties in 17.

16: outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.
17: on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over. 
18: There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned'
19: there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment

This is not something he said, and not something he thinks. If you read what he wrote carefully, through a pedantic decoupling lens, or alternatively with the context of some of his previous writing about deterrence, this should be pretty clear. He says that AI is bad enough to put a red line on; nuclear states put red lines on lots of things, most of which are nowhere near as bad as nuclear war is.

In response to the question,

“[Y]ou’ve gestured at nuclear risk. … How many people are allowed to die to prevent AGI?”,

he wrote:

“There should be enough survivors on Earth in close contact to form a viable reproductive population, with room to spare, and they should have a sustainable food supply. So long as that's true, there's still a chance of reaching the stars someday.”

He later deleted that tweet because he worried it would be interpreted by some as advocating a nuclear first strike.

I’ve seen no evidence that he is advocating a nuclear first strike, but it does seem to me to be a fair reading of that tweet that he would trade nuclear devastation for preventing AGI.

I’ve seen no evidence that he is advocating a nuclear first strike, but it does seem to me to be a fair reading of that tweet that he would trade nuclear devastation for preventing AGI.

Most nuclear powers are willing to trade nuclear devastation for preventing the other side's victory. If you went by sheer "number of surviving humans", your best reaction to seeing the ICBMs fly towards you should be to cross your arms, make your peace, and let them hit without lifting a finger. Less chance of a nuclear winter and extinction that way. But the way deterrence prevents that from happening is by pre-commitment to actually just blowing it all up if someone ever tries something funny. That is hardly less insane than what EY suggests, but it kinda makes sense in context (but still, with a God's eye view on humanity, it's insane, and just the best way we could solve our particular coordination problem).

There’s a big difference between pre-committing to X so you have a credible threat against Y, vs. just outright preferring X over Y. In the quoted comment, Eliezer seems to have been doing the latter.

"Most humans die in a nuclear war, but human extinction doesn't happen" is presumably preferable to "all biological life on Earth is eaten by nanotechnology made by an unaligned AI that has worthless goals". It should go without saying that both are absolutely terrible outcomes, but one actually is significantly more terrible than the other.

Note that this is literally one of the examples in the OP - discussion of axiology in philosophy.

Right, but of course the absolute, certain implication from “AGI is created” to “all biological life on Earth is eaten by nanotechnology made by an unaligned AI that has worthless goals” requires some amount of justification, and that justification for this level of certainty is completely missing.

In general, such confidently made predictions about the technological future have a poor historical track record; there are multiple holes in the Eliezer/MIRI story, and there is no formal, canonical write-up of why they’re so confident in their apparently secret knowledge. There’s a lot of informal, non-canonical, nontechnical stuff like List of Lethalities, security mindset, etc. that’s kind of gesturing at ideas, but there are too many holes and potential objections to justify their claimed level of confidence, and they haven’t published anything formal since 2021, and very little since 2017.

We need more than that if we’re going to confidently prefer nuclear devastation over AGI.

The trade-off you're gesturing at is really risk of AGI vs. risk of nuclear devastation. So you don't need absolute certainty on either side in order to be willing to make it.

Did you intend to say risk off, or risk of?

If the former, then I don't understand your comment and maybe a rewording would help me.

If the latter, then I'll just reiterate that I'm referring to Eliezer's explicitly stated willingness to trade off the actuality of (not just some risk of) nuclear devastation to prevent the creation of AGI (though again, to be clear, I am not claiming he advocated a nuclear first strike). The only potential uncertainty in that tradeoff is the consequences of AGI (though I think Eliezer's been clear that he thinks it means certain doom), and I suppose what follows after nuclear devastation as well.

And how credible would your precommitment be if you made it clear that you actually prefer Y, you're just saying you'd do X for game theoretical reasons, and you'd do it, swear? These are the murky cognitive waters in which sadly your beliefs (or at least, your performance of them) affect the outcome.

One's credibility would be less of course, but Eliezer is not the one who would be implementing the hypothetical policy (that would be various governments), so it's not his credibility that's relevant here.

I don't have much sense he's holding back his real views on the matter.

But on the object level, if you do think that AGI means certain extinction, then that's indeed the right call (consider also that a single strike on a data centre might mean a risk of nuclear war, but that doesn't mean it's a certainty. If one listened to Putin's barking, every bit of help given to Ukraine is a risk of nuclear war, but in practice Russia just swallows it up and lets it go, because no one is actually very eager to push that button, and they still have way too much to lose from it).

The scenario in which Eliezer's approach is just wrong is if he is vastly overestimating the risk of an AGI extinction event or takeover. This might be the case, or might become so in the future (for example imagine a society in which the habit is to still enforce the taboo, but alignment has actually advanced enough to make friendly AI feasible). It isn't perfect, it isn't necessarily always true, but it isn't particularly scandalous. I bet lots of hawkish pundits during the Cold War said that nuclear annihilation would have been preferable to the worldwide victory of Communism, and that is a substantially more nonsensical view.

I agree that if you're absolutely certain AGI means the death of everything, then nuclear devastation is preferable.

I think the absolute certainty that AGI does mean the death of everything is extremely far from called for, and is itself a bit scandalous.

(As to whether Eliezer's policy proposal is likely to lead to nuclear devastation, my bottom line view is it's too vague to have an opinion. But I think he should have consulted with actual AI policy experts and developed a detailed proposal with them, which he could then point to, before writing up an emotional appeal, with vague references to air strikes and nuclear conflict, for millions of lay people to read in TIME Magazine.)

I think the absolute certainty in general terms would not be warranted; the absolute certainty if AGI is being developed in a reckless manner is more reasonable. Compare someone researching smallpox in a BSL-4 lab versus someone juggling smallpox vials in a huge town square full of people, and what probability each of them makes you assign to a smallpox pandemic being imminent. I still don't think AGI would necessarily mean doom, simply because I don't fully buy that its ability to scale up to ASI is 100% guaranteed.

However, I also think in practice that would matter little, because states might still see even regular AGI as a major threat. Having infinite cognitive labour is such a broken hax tactic that it basically makes you Ruler of the World by default if you have exclusive control over it. That alone might make it a source of tension.

We don’t know with confidence how hard alignment is, or whether something roughly like the current trajectory (even if reckless) leads to certain death if it reaches superintelligence.

There is a wide range of opinion on this subject from smart, well-informed people who have devoted themselves to studying it. We have a lot of blog posts and a small number of technical papers, all usually making important (and sometimes implicit and unexamined) theoretical assumptions which we don’t know are true, plus some empirical analysis of much weaker systems.

We do not have an established, well-tested scientific theory like we do with pathogens such as smallpox. We cannot say with confidence what is going to happen.

Yeah, at the very least it's calling for billions dead across the world, because once we realize what Eliezer wants, this is the only realistic outcome.

[This comment is no longer endorsed by its author]

I don’t agree billions dead is the only realistic outcome of his proposal. Plausibly it could just result in actually stopping large training runs. But I think he’s too willing to risk billions dead to achieve that.

Agree with this post.

Another way I think about this is, if I have a strong reason to believe my audience will interpret my words as X, and I don’t want to say X, I should not use those words. Even if I think the words are the most honest/accurate/correct way of precisely conveying my message.

People on LessWrong have a high honesty and integrity bar, but the same language conveys different info in other contexts and may therefore be de facto less honest in those contexts.

This being said, I can see a counterargument: it is fundamentally more honest if people use consistent language and don’t adapt their language to scenarios, as it is easier for any other agent to model and extract facts from a truthful agent + consistent language than from a truthful agent + adaptive language.

I think this is the motivation behind ideas like tabooing solutions and defining the problem first. This sets an explicit context of decoupling for talking about the facts of the problem, and then a different explicit context of “contextualizing” the agreed-upon problem in terms of proposing solutions.

I wonder if running these as an ongoing, alternating pattern would be a useful method for breaking long-standing log jams.

You wrote "causing Y in order to achieve X" but I believe you meant "causing Y to prevent X"

Decoupling is like pre-training, refining good features and potential capability, while contextualizing is like fine-tuning, assembling the features into prosaically aligned behaviors.

Thus "decoupling" might be a noncentral way of describing this natural role, it should instead be about receptiveness to learning all contexts, all points of view, all ideologies and paradigms. Decoupling is only a relevant move to describe this when there is a sticky contextualization that needs to be suppressed to enable exploration. But the new thing that's arrived-at (when studying something in particular) is not in itself "decoupled", it might well be as contextualizing as the original frame.

Thank you for this post. I didn't get anything out of your other one on Knightian norms, but found this one much more straightforward to engage with.

high-decoupling

Did you mean high-contextualizing here?

Ah, no line number. Context:

To me it seems analogous to how there are many statements that need to be said very carefully in order to convey the intended message under high-decoupling norms, like claims about how another person's motivations or character traits affect their arguments.

Nope, I meant high decoupling - because the most taboo thing in high decoupling norms is to start making insinuations about the speaker rather than the speech.

I see. I guess I hadn't made the connection of attributing benefits to high-contextualizing norms. I only got as far as observing that certain conversations go better with comp lit friends than with comp sci peers. That was the only sentence that gave me a parse failure. I liked the post a lot.

This is really good! I am commenting because merely upvoting doesn't feel sufficient.