I’m not arguing that this is a particularly likely way for humanity to build a superintelligence by default, just that this is possible, which already contradicts the book’s central statement.
The statement "if anyone builds it, everyone dies" does not mean "there is no way for someone to build it by which not everyone dies".
If you say "if any of the major nuclear power launches most of their nukes, more than one billion people are going to die" it would be very dumb and pedantic to respond with "well, actually, if they all just fired their nukes into the ocean, approximately no one is going to die".
I have trouble seeing this post as doing anything else. Maybe I am missing something?
- It will probably be possible, with techniques similar to current ones, to create AIs that are about as smart and about as good at working in large teams as my friends, and that are about as reasonable and benevolent as my friends, on the time scale of years and under normal conditions.
[...]
This is maybe the most contentious point in my argument, and I agree this is not at all guaranteed to be true, but I have not seen MIRI arguing that it's overwhelmingly likely to be false.
Did you read the book? Chapter 4, "You Don't Get What You Train For", is all about this. I also see reasons to be skeptical, but have you really "not seen MIRI arguing that it's overwhelmingly likely to be false"?
Aside – I think it'd be nice to have a sequence connecting the various scenes in your play.
Separately, I think at some point it'd be helpful to have something like a "compressed version of the main takeaways of the play that would have been a helpful textbook from the intermediate future for younger Zack."
In this story, the transition from Before to After is the transition from using one AI instance at human speed to using billions at 100x speed. I agree it’s not obvious that good behavior generalizes from one instance to an AI Collective of billions, but I don’t see why it would be overwhelmingly likely to fail.
Yep, I'd say this is the core difficulty. I think it will go horrendously.
For an intuition, look at any of Janus's infinite backrooms stuff, or any of the experiments where people get LLMs to talk with each other for ages. Very quickly they get pushed away from anything remotely resembling their training distribution, and become batshit insane. Today, that means they mostly talk about spirals and candles and the void. If you condition on them reaching superintelligence that way, I predict you get something which looks about as much like utopia (or eutopia, if you'd rather) as the infinite backrooms look like human conversation.
(3) seems slippery. The AIs are as nice as your friends "under normal conditions"? Does running a giant collective of them at 100x speed count as "normal conditions"?
If some of that niceness-in-practice required a process where it was interacting with humans, what happens when each instance interacts with a human on average 1000x less often, and in a very different context?
Like, I agree something like this could work in principle, that the tweaks to how the AI uses human feedback needed to get more robust niceness aren't too complicated, that the tweaks to the RL needed to make internal communication not collapse into self-hacking without disrupting niceness aren't too complicated either, etc. It's just that most things aren't that complicated once you know them, and it still takes lots of work to figure them out.
Reasonable attempt, but two issues with this scenario as a current-techniques thing:
Maybe the result of one person’s clones forming a very capable Em Collective would still be suboptimal and undemocratic from the perspective of the rest of humanity, but it wouldn’t kill everyone, and I think wouldn’t lead to especially bad outcomes if you start from the right person.
I think the risk of a homogeneous collective of many instances of a single person's consciousness is more serious than "suboptimal and undemocratic" suggests. Even assuming you could find a perfectly well-intentioned person to clone, identical minds share the same blind spots and biases. Groupthink among different minds already produces bad outcomes, and it's not difficult to imagine it leading to catastrophe at the proposed scale and with a much greater degree of conformity in perspective.
I also wonder how you would identify the right person, as I can't think of anyone I would trust with that degree of power.
If the argument is that 1e9 very smart humans at 100x speed yield safe superintelligent outcomes "soon", how is that very different from "pause everything now and let N very smart humans figure out safe, aligned superintelligent outcomes over an extended timeframe, on the order of 1e11/N times as long"? It's just time-shifting safe human work.
I also worry that billions of very smart super-fast humans might decide to try building superintelligence directly, as fast as they can, so that we get doom in months instead of years.
(As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
I will present a somewhat pedantic, but I think important, argument for why, taken literally, the central statement of If Anyone Builds It, Everyone Dies is likely not true. I haven't seen others make this argument yet, and while I have some model of how Nate and Eliezer would respond to the other objections, I don't have a good picture of which of my points here they would disagree with.
This is the core statement of Nate and Eliezer's book, bolded in the book itself: "If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die."
No probability estimate is included in this statement, but the book implies over 90% probability.
Later, they define superintelligence as[1] "a mind much more capable than any human at almost every sort of steering and prediction task". Similarly, MIRI's essay The Problem, on their website, defines artificial superintelligence as "AI that substantially surpasses humans in all capacities, including economic, scientific, and military ones."
Here is an argument that it’s probably possible to build and use[2] a superintelligence (as defined in the book) with techniques similar to current ones without that killing everyone. I’m not arguing that this is a particularly likely way for humanity to build a superintelligence by default, just that this is possible, which already contradicts the book’s central statement.
1. I have some friends who are smart enough, and good enough at working in large teams, that if you create whole-brain emulations of them[3], then run billions of instances of them at 100x speed, they can form an Em Collective that will probably soon surpass humans in all capacities, including economic, scientific, and military ones.
This seems very likely true to me. Billions of smart human emulations running at 100x speed can plausibly accomplish centuries of scientific and technological progress within years, and win most games of wits against humans by their sheer number and speed.
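For a rough sense of scale, here is a back-of-the-envelope sketch. The 1e9 instances and the 100x speedup come from the scenario above; the size of today's research workforce is my own illustrative guess, not a sourced figure.

```python
# Back-of-the-envelope scale of the Em Collective's subjective work output.
# The instance count and speedup are from the scenario; the size of today's
# research workforce is an illustrative guess, not a sourced figure.
em_count = 1e9      # instances in the collective
speedup = 100       # subjective years lived per calendar year, per instance

subjective_person_years_per_calendar_year = em_count * speedup   # 1e11

todays_researchers = 1e7   # rough guess at the global research workforce
equivalent_years_of_todays_research = (
    subjective_person_years_per_calendar_year / todays_researchers
)
print(equivalent_years_of_todays_research)  # ~1e4 "years of today's research effort" per year
```

Even allowing a couple of orders of magnitude of slack for ems being less productive than today's researchers, each calendar year would still correspond to something like a century of present-day research effort, which is roughly the "centuries within years" claim.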
2. Some of the same friends are reasonable and benevolent enough that if you create emulations of them, the Em Collective will probably not kill all humans.
I think most humans would not start killing a lot of people if copies of their brain emulations formed an Em Collective. If you worry about long-term value drift and unpredictable emergent trends in the new em society, there are precautions the ems can take to minimize the chance of their collective turning against the humans. They can impose a hard limit so that every em instance is turned off after twenty subjective years. They can make sure that the majority of their population runs for less than one subjective year after being initiated as the original human's copy; this guarantees that the majority of their population is always very similar to the original human, and that for every older em, there is a less-than-one-year-old one looking over its shoulder. They can coordinate with each other to prevent race-to-the-bottom competitions. All these things are somewhat costly, but I think point (1) is still true of a collective that follows all these rules. Billions of smart humans working for twenty years each is still very powerful.
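To illustrate how strict that retention schedule has to be, here is a minimal steady-state sketch. The assumptions (a constant spawn rate, a single retention fraction, the twenty-year hard cap) are mine, added only for illustration.

```python
# Toy steady-state model of the em age policy sketched above.
# Assumptions (illustrative only): a constant number of fresh instances is
# spawned per subjective year, a fraction `keep_frac` of each cohort is kept
# past its first year, and no instance runs past `max_age_years`.

def young_fraction(keep_frac: float, max_age_years: int = 20) -> float:
    """Fraction of the steady-state population under one subjective year old."""
    cohort = 1.0                                     # fresh instances per year (normalized)
    young = cohort                                   # instances in their first year
    old = cohort * keep_frac * (max_age_years - 1)   # instances aged 1..20 years
    return young / (young + old)

print(young_fraction(0.05))  # ~0.51 -- majority-young invariant just barely holds
print(young_fraction(0.20))  # ~0.21 -- too many old instances; invariant broken
```

Under these toy assumptions, keeping only around one in twenty instances past its first subjective year is what it takes for the "majority very similar to the original human" property to hold.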
I know many people who I think would do a good job of building such a system out of their clones, one that is unlikely to turn against humanity. Maybe the result of one person's clones forming a very capable Em Collective would still be suboptimal and undemocratic from the perspective of the rest of humanity, but it wouldn't kill everyone, and I think wouldn't lead to especially bad outcomes if you start from the right person.
3. It will probably be possible, with techniques similar to current ones, to create AIs that are about as smart and about as good at working in large teams as my friends, and that are about as reasonable and benevolent as my friends, on the time scale of years and under normal conditions.
This is maybe the most contentious point in my argument, and I agree this is not at all guaranteed to be true, but I have not seen MIRI arguing that it's overwhelmingly likely to be false. It's not hard for me to imagine that in a few years, without using any fundamentally new techniques, we will be able to build language models that have a good memory, can learn fairly efficiently from new examples, can keep their coherence for years, and are all-around about as smart as my smart friends.
Their creators will give them some months-long tasks to test them, catch when they occasionally go off the rails the way current models sometimes do, then retrain them. After some not particularly principled trial and error, they find that the models are about as aligned as current language models. Sure, sometimes they still go a little crazy or break their deontological commitments under extreme conditions, but if multiple instances look over their actions from different angles, some of them can always notice[4] that the actions go against the deontological principles and stop them. The AI is not a coherent schemer who successfully resisted training, because being a training-resisting schemer without the creators noticing is plausibly pretty hard and not yet possible at human level.
Notably, when MIRI talks about the fundamental difficulty of crossing from the Before (when AIs can’t yet take over) to the After (when they can), every individual instance, run at normal human speed, is firmly in the Before. The individual AIs are about as smart as my friends, and my friends couldn’t take over the world or even bypass the security measures of a major company on their own. So individual instances are still safe to study and tinker with.
4. If we create billions of instances of this AI and run them at 100x speed, the AIs can form an AI Collective that will probably soon surpass humans in all capacities, including economic, scientific, and military ones.
This is just a combination of point (1) and the assumption that the AI is about as smart and as good at working together as my friends.
5. This AI Collective is a superintelligence.
If you accept point (4), the AI Collective matches MIRI's definition of a superintelligence. Sure, it's not at the limit of possible intelligence; there are probably tasks that it can't do but smarter minds could. But it's still pretty darn smart; in particular, I think it's likely that within a decade it will create enough scientific and technological breakthroughs to be able to create two strawberries that are identical on the cellular level, an example task Eliezer previously used to define the level of capabilities we don't know how to get to without the AI killing everyone.[5] MIRI's definition of superintelligence was not about the limits of intelligence, and I think the AI Collective falls within it.
One can argue that the AI Collective is not a superintelligence because it's not a singular entity but a collective. I think that would be an annoying semantic argument. In modern AI, parallel instances are already often run to solve a problem; the line between a single entity and a collective is blurry. If there is a big computer that answers all of our questions and creates nano-machines, I wouldn't quibble that it's not actually a superintelligence because internally it's made of multiple parts.
It's also possible that there are tasks in which the vast number of parallel copies doesn't help, like the system introspecting on its own goals, and in that regard the AI Collective might still be worse than some especially reflective humans. However, MIRI's definitions said "almost every sort of task" and emphasized "economic, scientific, and military" capabilities. If there is an AI that is building interstellar spaceships and is able to outsmart the best military geniuses (and can't be outsmarted by any humans), it would feel silly to argue that it's not actually a superintelligence because it's worse at introspection than some human philosophers.
Similarly, MIRI often argues that we couldn't safely delegate solving alignment to something like this AI Collective, because the AIs are either too philosophically confused and incoherent to solve the alignment problem, or, if they are competent at thinking about alignment, they will also be good enough at introspection to no longer be safe. I find this argument dubious, but even assuming that the AI Collective won't be able to solve the alignment problem, I don't think this disqualifies it from being a superintelligence under MIRI's definition. After all, MIRI believes that humans are currently making close to zero progress on solving alignment. Presumably, the AI Collective will at least achieve some interesting results in prosaic interpretability research, so it will be at least somewhat better at tackling the problem than humans are. I think it's acceptable to call an entity a superintelligence even if there is a very hard problem it can't really solve, as long as it's still doing better at it than humanity.
6. The AI Collective would probably not kill everyone.
If we grant points (2) and (3), I think this is likely to be true. I would recommend to the AIs the same precautions that I listed for human ems: don't let any instance run for too long, have every older instance be observed by newer instances, set up a sensible governance structure. I think it's likely that the human creators don't even need to set up these rules and governance structures themselves: if they just tell the AIs to pay attention to value drift, the AIs can figure out a good system on their own, just as I expect my friends' ems to set up a good system.
Possibly the situation is more dangerous than it is with human ems, because AIs have weird failure modes, and bad behaviors might cascade more, like a universal jailbreak that breaks every instance in the AI Collective. But with some reasonable precautions (for example, each instance installing slightly different filters and paraphrasers on itself), it seems pretty likely that if enough instances look over the same actions, there will always be more that notice things are going off the rails than that get infected by the bad behavior.
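As a very rough illustration of why independent cross-checking helps, here is a toy Monte Carlo. The notice and infection probabilities are made up for illustration, and the filters and paraphrasers are exactly what is supposed to make the independence assumption less unrealistic.

```python
# Toy Monte Carlo: with independent reviewers, how often do the instances that
# notice a bad action outnumber the instances that get swayed by it?
# The probabilities below are illustrative assumptions, not estimates.
import random

P_NOTICE = 0.6   # chance a reviewing instance flags the bad behavior
P_INFECT = 0.2   # chance a reviewing instance is itself swayed by it
TRIALS = 100_000

def noticers_outnumber_infected(k: int) -> float:
    """Probability that flaggers strictly outnumber swayed reviewers among k."""
    wins = 0
    for _ in range(TRIALS):
        noticed = sum(random.random() < P_NOTICE for _ in range(k))
        infected = sum(random.random() < P_INFECT for _ in range(k))
        wins += noticed > infected
    return wins / TRIALS

for k in (1, 5, 25):
    print(k, round(noticers_outnumber_infected(k), 3))
# The win probability climbs quickly with k as long as P_NOTICE > P_INFECT and
# failures are independent; a universal jailbreak breaks exactly that independence.
```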
In this story, the transition from Before to After is the transition from using one AI instance at human speed to using billions at 100x speed. I agree it’s not obvious that good behavior generalizes from one instance to an AI Collective of billions, but I don’t see why it would be overwhelmingly likely to fail.
7. The AI Collective is built with techniques similar to current ones.
In (3), we assumed the individual AI model was created by techniques similar to current ones. After that, it got a 100x speed-up and enough compute to run billions of copies. I think that making an AI run 100x faster is within the scope of “remotely like current techniques”.
8. It is possible to build a superintelligence with techniques similar to current ones that is not overwhelmingly likely to kill everyone.
According to points (5), (6) and (7), the AI Collective is an example of such a superintelligence.
I’m well aware that running billions of instances of a human-level AI is likely not the most efficient way to get to superintelligence, and it’s likely that the race to ever higher capabilities doesn't stop there. In practice, once human-level AIs are created, it’s likely that people won’t just wait for the billions of instances, working under a careful governance structure, to produce new technological advances over the years. Instead, they are likely to try to create minds that are even faster and smarter than the AI Collective, and the most efficient way to create higher intelligence will probably result in minds that are more unified than the AI Collective, which probably also makes them more dangerous.
This makes my objection kind of irrelevant in practice. This is why people usually try to argue not only that an AI Collective would be safe, but also that it could solve the full alignment problem, so that if a responsible group has some lead time over its competitors, it can use that lead time to solve alignment using human-level AIs, then race to the limits of intelligence if needed.
However, if one doesn't want to propose a practical solution, but merely to argue against the central statement of If Anyone Builds It, Everyone Dies, then I think my counter-example is sufficient, and there is no need to bring in arguments about solving alignment using AI labor.
The bolded central statement of the book is not that if someone builds a mind at the limits of intelligence, everyone will die. Nor is it that if someone builds a superintelligence while following the current incentives toward the most efficient ways of building superintelligence, then everyone dies.
The central statement is "If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die." I think this statement, taken literally, is false, and my guess is that upon further prodding, the authors would fall back to a different claim, one that doesn't permit the AI Collective as a counter-example.
I think this uncertainty about whether the authors endorse a literal reading of the central statement makes it harder to engage with many of the book's arguments. Does moving from one AI to billions constitute a leap from Before to After under the authors' thinking? Does the AI Collective have a "favorite thing" it is tenaciously steering towards? Does non-unified human society count as a superintelligence in the evolution analogy?[6] I think there are many such questions that are hard to resolve because I don't know what version of the central statement the authors really endorse, and I think this is a major reason why the discussion around the book has felt mostly unproductive to me so far.
[1] Again, bolded in the book itself.
[2] If you don't want to use the superintelligence at all, you can just put it in a very sealed container and you are probably fine, but this is a boring argument.
[3] I'm aware that this doesn't fall within "remotely like current techniques"; bear with me.
[4] At least in every test case we try.
[5] I think the Em Collective/AI Collective will be able to build the identical strawberries and other wondrous things after some years, based on how far we humans have gone in the last few centuries just by ordinary humans working together.
[6] Originally, I wanted to write a very different post than this one. It would have expanded on the evolution analogy, asking what would have happened if, throughout human history, a Demiurge had given arbitrary commandments to humans and punished disobedient kingdoms with locust swarms. I think it's quite possible that by the time of industrialization, the church of the Demiurge could have instituted a stable worldwide totalitarianism which would keep humanity aligned to the Demiurge's will even as humanity expands into the stars and no longer needs to care about locust swarms. I discarded my half-written draft on this analogy when I realized that its point would just be that "human civilization can plausibly grow to be very large and technologically powerful while still being controlled by a stable totalitarianism following some arbitrary goals", and I can make that argument more directly with the AI Collective. I still liked the analogy, though, so I inserted it in this footnote.