I am a volunteer organizer with PauseAI and PauseAI US, a pro forecaster, and some other things that are currently much less important.
The risk of human extinction from artificial intelligence is a near-term threat. Time is short, p(doom) is high, and anyone can take simple, practical actions right now to help prevent the worst outcomes.
I found this after a brief Wikipedia rabbit hole: an article covering the 1982 North American Computer Chess Championship. https://www.cgwmuseum.org/galleries/issues/softline_1.3.pdf
On the evening of the last round, there was some discussion amongst tournament participants about when or whether a computer program might become chess champion of the world. Monroe Newborn, programmer of McGill University's Ostrich, predicted it could happen within five years. Valvo thought it would be more like ten, and the Spracklens were betting on fifteen years. Thompson thought it would be more than twenty years before a program could be written that would beat all comers, and a few others said it would never happen.
The most widely held view, however, was that a computer program would become world champion by or shortly after the year 2000. Considering both the complexity of the game and the complexity of the human mind, that seems like a remarkably positive outlook on the future of computing.
Garry Kasparov believed as late as 1989 that machines would never completely best humans in chess, and thought he personally would never be beaten by a machine. https://www.chesshistory.com/winter/extra/kasparovinterviews.html
Question: Two top grandmasters have gone down to chess computers: Portisch against “Leonardo” and Larsen against “Deep Thought”. It is well known that you have strong views on this subject. Will a computer be world champion, one day?
Kasparov: Ridiculous! A machine will always remain a machine, that is to say a tool to help the player work and prepare. Never shall I be beaten by a machine! Never will a program be invented which surpasses human intelligence. And when I say intelligence, I also mean intuition and imagination. Can you see a machine writing a novel or poetry? Better still, can you imagine a machine conducting this interview instead of you? With me replying to its questions?
I was able to confirm that directly in the magazine here: https://escaleajeux.fr/?principal=/jeu/js_55?
The aesthetics of strategies of this shape are unattractive to most rationalists, since they rely on evoking tribalism. Rationalism instructs against tribalism as one of the first steps toward thinking well (as it should!), but when stoking tribalism in others is actually a winning strategy, the internalised moralism of non-tribalism can override the rational pursuit of winning in favor of the irrational pursuit of rationalism as its own end.
I think worlds in which we survive are likely ones in which "anger toward the outgroup" among the general public is mobilized as a blunt weapon against the pro-ASI-development memeplex. I think we are likely to see much more of this humanist angle in the coming year.
I guess this goes in the opposite direction of Richard Ngo's point about how this represents an escalation in memetic warfare between AI safety and accelerationism. Now I feel kinda bad for essentially manufacturing ammunition for that.
Can you elaborate on the downsides from your perspective? It's very important to me that we survive, which implies winning, which involves fighting, which requires good ammunition.
The alternative seems to me to be that we survive without winning, or win without fighting, or fight without ammunition, and each of those sounds less viable. It may be the case that successionism remains such an extremely distasteful ideology that simply not engaging with it is an effective strategy. But I wouldn't bet too strongly on that, given that this ideology is still being platformed by large podcasts, and is intellectually tolerated on sites like LessWrong.
Even phrases like "stop trying to murder our children, you sick freaks" are hostile and less intellectually satisfying, but I would be hard pressed to make an argument for why they don't have a place in the public discourse.
Speaking to my own values: Preventing the rise of human successionism (and ultimately preventing human succession) is orders of magnitude more important to me than having a good understanding of memeplexes more broadly.
I am generally horrified when this pattern does not hold in other people, and I instinctively model them as not valuing my life, or as actively wanting me and my loved ones to be killed.
I have come back to this post, re-read it with the added explainer boxes, and tried my best to grapple with it.
A brief summary of the core of my thoughts up front:
I'm not sure what substrate FAAI will actually run on, how it will be configured, and where it will run. This uncertainty brings up many questions about the generalizability of these arguments. (I am also not sure if FAAI must be self-modifying at its core. I touch on that at the very end, but that could be its own discussion.)
Questions:
I think I still don't fully understand why the traditional conception of hardware (whose relevance I dispute above) leads to evolution as outlined. It may simply be too far outside of my current knowledge and intellectual capacity to really get that without a much more intuitive explanation. I additionally notice that I may be confused about precisely what it means for values to drift, rather than for the same principles to be adapted to new knowledge and circumstances.
Now onto my responses to specific claims:
But pre-imposing reliable constraints also limits the system's capacity to adapt to complex nonlinear contexts of the world (e.g. the organic ecosystem that's supposed to not be accidentally killed off). Arguably, an AI's capacity to learn heuristics that fit potential encountered contexts would be so constrained that it would not reach full autonomy in the first place.
This is sensible as an introductory intuition. I would be curious to see an argument for where the frontier of this trade-off really is. (The mere existence of those limits doesn't tell me whether they occur at thresholds that end up being important.)
Within a hardware part, the functioning of components is even more rigidly dependent on other surrounding components. It only takes one transistor in a given serial circuit to break, for all the transistors in that circuit to stop functioning. And it only takes more than the redundant circuits to break for the entire chip to stop functioning.
This is a contingent fact of design, no? I would not expect this to be true of hardware that the FAAI designs for itself. Can't the FAAI design a configuration on the Pareto frontier of "consistent, high-bandwidth processing" and "flexible, complete functioning" that allows it to achieve near-maximally-useful capability while containing enough redundancy to strongly resist drift?
As such, FAAI does not contain just one genotype, but all genotypes stored across all its hardware parts. Each hardware part stores a tiny portion of all the code, as a smaller codeset. Parts are nested inside clusters, each storing an intermediate-size codeset. The complete codeset is therefore a code pool – the set of all code in the entire population. ... The genotype of an organic lifeform gets copied into each diploid cell, and can thus be sequenced from DNA strands found in a single cell. This is unlike the set of all code stored in FAAI, which cannot be found at a single microscopic place.
Why can't the FAAI's code be stored this way? Redundancy seems useful, to resist one's own value drift.
Where FAAI's hardware parts keep being replaced and connected up to new parts, it is not a stably physically bounded unit (like a human body is).
Why can't it be? Why must there be more than one central controlling agent, with no other fully autonomous systems that can rival it?
Can a system configured out of all that changing code be relied upon to track and correct its effects recursively feeding back over the world?
The implied answer "no" feels to me like it proves too much. Can an FAAI have any preference that is stable over time that isn't about its own survival? i.e., are survival and its correlates the only possible coherent value set in our universe? I don't have a good answer here, but I find the potential broader conclusion philosophically alarming, in an abstract sense.
As a result, the controller would have to either become an FAAI or merge with an existing FAAI. But the new FAAI would also have to be controlled so as to not cause human extinction. This requires another controller, a solution that leads to infinite regress. ... A related problem is that the controller is meant to correct for 'errors' in the AGI and/or world. But what corrects for errors in the controller? And what corrects the meta-corrector? This too is a problem of infinite regress.
Redundancy can act as a simple error correction mechanism. By this mechanism, the controller could be rendered immutable. The controller could be the only FAAI -- an intelligence core that is inseparable from its alignment/goals. The things it then controls could be fully-controllable non-FAAI systems that are mutable and adaptive.
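To make that redundancy point concrete, here is a toy sketch of the kind of mechanism I have in mind: triple modular redundancy with majority voting, where the controller's core specification is stored in several independent replicas and any single corrupted copy gets outvoted and repaired. The class and names below are my own hypothetical illustration, not anything from the original post.

```python
# Toy sketch of redundancy-based error correction (triple modular redundancy).
# Hypothetical illustration only: the replica scheme and names are my own.

import hashlib
from collections import Counter

class RedundantStore:
    """Keeps N independent copies of an immutable core specification.

    Reads take a majority vote across copies; any copy that disagrees with
    the majority is overwritten, so a single corrupted replica cannot
    propagate drift.
    """

    def __init__(self, core_spec: bytes, n_replicas: int = 3):
        self.replicas = [core_spec for _ in range(n_replicas)]

    def read(self) -> bytes:
        # Majority vote over replica contents.
        counts = Counter(self.replicas)
        majority, _ = counts.most_common(1)[0]
        # Repair any replica that drifted from the majority.
        self.replicas = [majority for _ in self.replicas]
        return majority

    def checksum(self) -> str:
        return hashlib.sha256(self.read()).hexdigest()


store = RedundantStore(b"core goals v1")
store.replicas[1] = b"corrupted"          # simulate a single-replica fault
assert store.read() == b"core goals v1"   # majority vote masks and repairs it
```

Obviously this only masks faults under the assumption that a majority of replicas stay correct, and it says nothing about whether the values encoded in that spec are the right ones; it is just meant to show that "redundancy resists drift" is a mundane, well-understood engineering pattern rather than a speculative one.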
I notice that this last objection of mine is equivalent to the claim that FAAI can be reflectively stable and not autopoietic / not ever change its core intellect and goals, even while learning new information. Does a strong refutation of this idea exist somewhere?
To the degree that I have understood them (which is questionable), I think I agree with most of the other claims that are not synonymous with or tightly related to the claims I addressed here.
My takeaway was that "they hate us for our freedoms" was roughly correct as an entailment of their religious motivations, at least, or most clearly, in the case of the Islamic State.
Within my worldview, an important aspect of the "Change Overton Window" plan is that humanity will need to do some pretty nuanced things. It's not enough to get humanity to do one discrete action that you burn through the epistemic commons to achieve. We need an epistemic-clarity-win that's stable at the level of a few dozen world/company leaders.
This seems pretty much correct to me (though the sought-after epistemic-clarity-win may turn out to be a broader target with less requisite clarity than many of us suspect). I think it is important to be aware of what kinds of trades we are making, and there is no sense in selling the car for gas money. I also think there are some things worth selling/burning that aren't needed in order to unlock a "good ending."
There are real strategic tensions to navigate. I think engaging in some mild motte-and-bailey in order to use a broad coalition of AI-worriers as a battering ram against x-risk is not obviously a bad idea. Or more honestly: I think that is a component of the wisest available path for AI Safety advocacy.
For example, I think it is a good idea to put the phrases "[all these people you like] agree that superintelligence shouldn't be built" and "a rogue superintelligence might kill us all" next to each other on TV, and it is usually not a good idea to spend any audience attention on clarifying that the people in that list don't all agree with the latter phrase. However, directly falsely stating that they do agree about something they don't agree about is probably a mistake on all fronts.
I would much rather walk on broad, flat ground than on a tightrope, but the slope appears to be slippery on both sides, so onwards I walk.
The moment we find ourselves in is an exceptional one in many ways. That doesn't mean all our hard-earned wisdom can lightly be cast aside. In fact, we are going to have to rely on its inertia to keep our balance. But it does mean that we will have to do an unusual amount of work to evaluate each action on its own strategic merits and its specific likely effects, even if it belongs to a class of actions that are typically frowned upon.
Don't burn down too much. Stay sane. Leave yourself room to retreat. Leave yourself room to get lucky. Leave yourself room to win. With that, my call is to not cling too tightly to outward performances of epistemic hygiene if they ever stand in the way of reaching the people you need to reach.
Thinking about the situation where a slightly-broadly-superhuman AI finds the successor-alignment problem difficult, I wonder about certain scenarios that from my perspective could put us in very weird territory:
I don't know how likely these scenarios are, but I find them very interesting for how bizarre they could be. (AI causes a warning shot on purpose? Gets humans to help it solve alignment, rather than the reverse? Does very confusing, alien, power-seeking things, but not to the degree of existential catastrophe -- until catastrophe comes from another direction?)
I'd like to hear your thoughts, especially if you have insights that collapse these bizarre scenarios back down onto ground more well-trodden.
It would be nice to have vocabulary to differentiate between [word-stuck-because-branching-paths], [word-stuck-because-complicated], [word-stuck-because-can't-remember-words], and [word-stuck-because-put-off-balance] (tongue-tied).
Though it also comes to mind that with the right crowd, a sufficiently explanatory bracketed statement can just be said in full to achieve the same effect.
My friends still frequently say "I have been a good Bing" because of my telling of this story ages ago.
It's not memory-holed as far as I can tell, but it isn't the best example anymore of most misalignment-related things that I want examples of.