Worlds Where Iterative Design Fails

[-]Steven Byrnes3yΩ91810

Fast takeoff: there will be a sudden phase shift in capabilities, and the design of whatever system first undergoes that phase shift needs to be right on the first try.

I would have said "irreversible catastrophe", not "fast takeoff". Isn't that the real problem? Iterative design presumably gets you a solution eventually, if one exists, but it's not guaranteed to get you a solution after N iterations, where N is some number determined ex ante. In extreme fast takeoff, we need to solve alignment with N=0 iterations. In slow takeoff (in a competitive uncoordinated world), we need to succeed within N<(whatever) iterations. The latter is less bad than the former, but as long as there's a deadline, there's a chance we'll miss it.

(Slow takeoff is a bit worse than that because arguably as the AIs get gradually more capable, the problem keeps changing; you're not iterating on exactly the same problem for the whole takeoff.)

[-]johnswentworth3yΩ670

You are correct. I was trying to list the two frames which I think people most often use, not necessarily the best versions of those frames, since I wanted to emphasize that there are lots of other ways the iterative design loop fails.

[-]Dave92F13y*153

Many good thoughts here.

One thing I think you underappreciate is that our society has already evolved solutions (imperfect-but-pretty-good ones, like most solutions) to some of these problems. Mostly, evolved these thru distributed trial-and-error over long time periods (much the way biological evolution works).

Most actors in society - businesses, governments, corporations, even families - aren't monolithic entities with a single hierarchy of goals. They're composed of many individuals, each with their own diverse goals.

We use this as a lever to prevent some of the pathologies you describe from getting too extreme - by letting organizations die, while the constituent individuals live on.

Early on you said The only reason we haven’t died of [hiding problems] yet is that it is hard to wipe out the human species with only 20th-century human capabilities.

I think instead that long before these problems get serious enough to threaten those outside the organization, the organization itself dies. The company goes out of business, the government loses an election, suffers a revolution, is conquered by a neighbor, the family breaks up. The individual members of the organization scatter and re-join other, heathier, organizations.

This works because virtually all organizations in modern societies face some kind of competition - if they become too dysfunctional, they lose business, lose support, lose members, and eventually die.

As well, we have formal institutions such as law, which is empowered to intervene from the outside when organizational behavior gets too perverse. And concepts like "human rights" to help delineate exactly what is "too" perverse. To take your concluding examples:

Corporations will deliver value to consumers as measured by profit. Eventually this mostly means manipulating consumers, capturing regulators, extortion and theft.

There's always some of that, but it's limited by the need to continue to obtain revenue from customers. And by competing corporations which try to redirect that revenue to themselves, by offering better deals. And in extremis, by law.

Investors will “own” shares of increasingly profitable corporations, and will sometimes try to use their profits to affect the world. Eventually instead of actually having an impact they will be surrounded by advisors who manipulate them into thinking they’ve had an impact.

Investors vary in tolerance and susceptibility to manipulation. Every increase in manipulation will drive some investors away (to other advisors or other investments) at the margin.

Law enforcement will drive down complaints and increase reported sense of security. Eventually this will be driven by creating a false sense of security, hiding information about law enforcement failures, suppressing complaints, and coercing and manipulating citizens.

Law enforcement competes for funding with other government expenses, and its success in obtaining resources is partly based on citizen satisfaction. In situations where citizens are free to leave the locality ("voting with their feet"), poorly secured areas depopulate themselves (see: Detroit). The exiting citizens take their resources with them.

Legislation may be optimized to seem like it is addressing real problems and helping constituents. Eventually that will be achieved by undermining our ability to actually perceive problems and constructing increasingly convincing narratives about where the world is going and what’s important.

For a while, and up to a point. When citizens feel their living conditions trail behind that of their neighbors, they withdraw support from the existing government. If able, they physically leave (recall the exodus from East Germany in 1989).

These are all examples a general feedback mechanism, which appears to work pretty well:

There are many organizations of any given type (and new ones are easy to start)
Each requires resources to continue
Resources come from individuals who if dissatisfied withhold them, or redirect those resources at different (competing) organizations

These conditions limit how much perversity and low performance organizations can produce and still survive.

The failure of an organization is rarely a cause for great concern - there are others to take up the load, and failures are usually well-deserved. Individual members/employees/citizens continue even as orgs die.

[-]Archimedes3y76

Most actors in society - businesses, governments, corporations, even families - aren't monolithic entities with a single hierarchy of goals. They're composed of many individuals, each with their own diverse goals.

The diversity of goals of the component entities is good protection to have. In the case of an AI, do we still have the same diversity? Is there a reason why a monolithic AI with a single hierarchy of goals cannot operate on the level of a many-human collective actor?

I'm not sure how the solutions our society have evolved apply to an AI due to the fact that it isn't necessarily a diverse collective of individually motivated actors.

[-]Noosphere893y*54

Even more importantly, the biggest reason our world is stable is that humans have a very narrow range of capabilities, and this importantly applies to intelligence, which is normally distributed, meaning that societies can usually defeat outlier humans. AI capabilities will not nearly be this constrained, and the variance is worrying because there a real chance that one AI will be far more intelligent than any human that has ever lived, and it's relatively easy to cross the human range, ala Go and Starcraft. It's a similar reason why superpowers in the real world would doom us by default.

EDIT: I no longer think superpowers would doom us by default.

[-]Richard_Ngo3y*Ω9121

In worlds where the iterative design loop works for alignment, we probably survive AGI. So, if we want to improve humanity’s chances of survival, we should mostly focus on worlds where, for one reason or another, the iterative design loop fails. ... Among the most basic robust design loop failures is problem-hiding. It happens all the time in the real world, and in practice we tend to not find out about the hidden problems until after a disaster occurs. This is why RLHF is such a uniquely terrible strategy: unlike most other alignment schemes, it makes problems less visible rather than more visible. If we can’t see the problem, we can’t iterate on it.

This argument is structurally invalid, because it sets up a false dichotomy between "iterative design loop works" and "iterative design loop fails". Techniques like RLHF do some work towards fixing the problem and some work towards hiding the problem, but your bimodal assumption says that the former can't move us from failure to success. If you've basically ruled out a priori the possibility that RLHF helps at all, then of course it looks like a terrible strategy!

By contrast, suppose that there's a continuous spectrum of possibilities for how well iterative design works, and there's some threshold above which we survive and below which we don't. You can model the development of RLHF techniques as pushing us up the spectrum, but then eventually becoming useless if the threshold is just too high. From this perspective, there's an open question about whether the threshold is within the regime in which RLHF is helpful; I tend to think it will be if not overused.

[-]johnswentworth3yΩ464

The argument is not structurally invalid, because in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF. Working on RLHF does not particularly increase our chances of survival, in the worlds where RLHF doesn't make things worse.

That said, I admit that argument is not very cruxy for me. The cruxy part is that I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1. And I think the various examples/analogies in the post convey my main intuition-sources behind that claim. In particular, the excerpts/claims from Get What You Measure are pretty cruxy.

[-]Richard_Ngo3yΩ240

in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF

In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn't RLHF be one of them?

In particular, the excerpts/claims from Get What You Measure are pretty cruxy.

It seems pretty odd to explain this by quoting someone who thinks that this effect is dramatically less important than you do (i.e. nowhere near causing a ~100% probability of iterative design failing). Not gonna debate this on the object level, just flagging that this is very far from the type of thinking that can justifiably get you anywhere near those levels of confidence.

[-]johnswentworth3yΩ233

In worlds where iterative design works, it works by iteratively designing some techniques. Why wouldn't RLHF be one of them?

Wrong question. The point is not that RLHF can't be part of a solution, in such worlds. The point is that working on RLHF does not provide any counterfactual improvement to chances of survival, in such worlds.

Iterative design is something which happens automagically, for free, without any alignment researcher having to work on it. Customers see problems in their AI products, and companies are incentivized to fix them; that's iterative design from human feedback baked into everyday economic incentives. Engineers notice problems in the things they're building, open bugs in whatever tracking software they're using, and eventually fix them; that's iterative design baked into everyday engineering workflows. Companies hire people to test out their products, see what problems come up, then fix them; that's iterative design baked into everyday processes. And to a large extent, the fixes will occur by collecting problem-cases and then training them away, because ML engineers already have that affordance; it's one of the few easy ways of fixing apparent problems in ML systems. That will all happen regardless of whether any alignment researchers work on RLHF.

When I say that "in worlds where iterative design works, we probably survive AGI without anybody (intentionally) thinking about RLHF", that's what I'm talking about. Problems which RLHF can solve (i.e. problems which are easy for humans to notice and then train away) will already be solved by default, without any alignment researchers working on them. So, there is no counterfactual value in working on RLHF, even in worlds where it basically works.

[-]Richard_Ngo3yΩ476

I think you're just doing the bimodal thing again. Sure, if you condition on worlds in which alignment happens automagically, then it's not valuable to advance the techniques involved. But there's a spectrum of possible difficulty, and in the middle parts there are worlds where RLHF works, but only because we've done a lot of research into it in advance (e.g. exploring things like debate); or where RLHF doesn't work, but finding specific failure cases earlier allowed us to develop better techniques.

[-]johnswentworth3yΩ56-5

Yeah, ok, so I am making a substantive claim that the distribution is bimodal. (Or, more accurately, the distribution is wide and work on RLHF only counterfactually matters if we happen to land in a very specific tiny slice somewhere in the middle.) Those "middle worlds" are rare enough to be negligible; it would take a really weird accident for the world to end up such that the iteration cycles provided by ordinary economic/engineering activity would not produce aligned AI, but the extra iteration cycles provided by research into RLHF would produce aligned AI.

[-]Richard_Ngo3yΩ462

Upon further thought, I have another hypothesis about why there seems like a gap here. You claim here that the distribution is bimodal, but your previous claim ("I do in fact think that relying on an iterative design loop fails for aligning AGI, with probability close to 1") suggests you don't actually think there's significant probability on the lower mode, you essentially think it's unimodal on the "iterative design fails" worlds.

I personally disagree with both the "significant probability on both modes, but not in between" hypothesis, and the "unimodal on iterative design fails" hypothesis, but I think that it's important to be clear about which you're defending - e.g. because if you were defending the former, then I'd want to dig into what you thought the first mode would actually look like and whether we could extend it to harder cases, whereas I wouldn't if you were defending the latter.

[-]johnswentworth3yΩ221

Yeah, that's fair. The reason I talked about it that way is that I was trying to give what I consider the strongest/most general argument, i.e. the argument with the fewest assumptions.

What I actually think is that:

nearly all the probability mass is on worlds the iterative design loop fails to align AGI, but...
conditional on that being wrong, nearly all the probability mass is on the number of bits of optimization from iterative design resulting from ordinary economic/engineering activity being sufficient to align AGI, i.e. it is very unlikely that adding a few extra bits of qualitatively-similar optimization pressure will make the difference. ("We are unlikely to hit/miss by a little bit" is the more general slogan.)

The second claim would be cruxy if I changed my mind on the first, and requires fewer assumptions, and therefore fewer inductive steps from readers' pre-existing models.

[-]Richard_Ngo3yΩ253

In general I think it's better to reason in terms of continuous variables like "how helpful is the iterative design loop" rather than "does it work or does it fail"?

My argument is more naturally phrased in the continuous setting, but if I translated it into the binary setting: the problem with your argument is that conditional on the first being wrong, then the second is not very action-guiding. E.g. conditional on the first, then the most impactful thing is probably to aim towards worlds in which we do hit or miss by a little bit; and that might still be true if it's 5% of worlds rather than 50% of worlds.

[-]johnswentworth3yΩ232

(Thinking out loud here...) In general, I am extremely suspicious of arguments that the expected-impact-maximizing strategy is to aim for marginal improvement (not just in alignment - this is a general heuristic); I think that is almost always false in practice, at least in situations where people bother to explicitly make the claim. So let's say I were somehow approximately-100% convinced that it's basically possible for iterative design to produce an AI. Then I'd expect AI is probably not an X-risk, but I still want to reduce the small remaining chance of alignment failure. Would I expect that doing more iterative design is the most impactful approach? Most probably not. In that world, I'd expect the risk is dominated by some kind of tail risks which iterative design could maybe handle in principle, but for which iterative design is really not the optimal tool - otherwise they'd already be handled by the default iterative design processes.

So I guess at that point I'd be looking at quantitative usefulness of iterative design, rather than binary.

General point: it's just really hard to get a situation where "do marginally more of the thing we already do lots of by default" is the most impactful strategy. In nearly all cases, there will be problems which the things-we-already-do-lots-of-by-default handle relatively poorly, and then we can have much higher impact by using some other kind of strategy which better handles the kind of problems which are relatively poorly handled by default.

[+][comment deleted]3yΩ340

[-]curtisrussell3yΩ390

I wonder, almost just idle curiosity, whether or not the "measuring-via-proxy will cause value drift" is something we could formalize and iterate on first. Is the problem stable on the meta-level, or is there a way we can meaningfully define "not drifting from the proxy" without just generally solving alignment.

Intuitively I'd guess this is the "don't try to be cute" class of thought, but I was afraid to post at all and decided that I wanted to interact, even at the cost of (probably) saying something embarassing.

[-]johnswentworth3y30

It is at least not obvious to me that this is a "don't try to be cute" class of thought, though not obvious that it isn't either. Depends on the details.

[-]curtisrussell3y30

This started out as more of an intuition, so this is mostly an attempt to verbalize that in a concrete way.

If we could formalize a series of relatively simple problems, and similarly formalize what "drifting" from the core values of those toy problems would look like, I wonder if we would either find new patterns, rules, or intuitions.

(I had a pithy remark to the effect of While(drifting) { dont() } )

I think I'm wondering if we can expand and formalize our knowledge of what values drift means, in a way that generalizes independent of any specific, formalized values.

[-]Soren3y96

I think this is a very useful post that is talking about many of the right things. One question though: isn't it only worth focusing on the worlds where iterative design does not work for alignment to the extent to which progress can still be made towards mitigating those worlds? It appears to me that progress in technical fields is usually accomplished through iterative design, so it makes sense to have a high prior on non-iterative approaches being less effective. Depending on your specific numbers here, it seems like it could be worth it to pay attention to the areas more tractable for iterative design or less. I think its also misleading to think of iterative design as either working or failing. Fields have gradations of ability for prompt and high-quality feedback and ability for repeated trials. It also seems like problems that initially seem hard to iterate on can often be formulated in ways that allows better iteration (like the ELK problem being formulated in a way that allows for testing toy solutions and counterexamples). I worry that trying to focus in an unnuanced way about worlds where iterative design fails may miss out on opportunities to formulate some of these hard problems in ways that might make them easier to iterate on.

[-]johnswentworth3y51

Certainly making some of the hard-to-see stuff visible enough to iterate on is one of the main lines of attack; that's the central reason why interpretability work is so valuable.

[-]phillchris3yΩ241

I wonder if implications for this kind of reasoning go beyond AI: indeed, you mention the incentive structure for AI as just being a special case of failing to incentivize people properly (e.g. the software executive), and the only difference being AI occurring at a scale which has the potential to drive extinction. But even in this respect, AI doesn't really seem unique: take the economic system as a whole, and "green" metrics, as a way to stave off catastrophic climate change. Firms, with the power to extinguish human life through slow processes like gradual climate destruction, will become incentivized towards methods of pollution that are easier to hide as regulations on carbon and greenhouse gases become more stringent. This seems like just a problem of an error-prone humanity having greater and greater control of our planet, and our technology and metrics, as a reflection of this, also being error-prone, only with greater and greater consequence for any given error.

Also, what do you see, more concretely, as a solution to this iterative problem? You argue that coming up with the right formalism for what we want, for example, as a way to do this, but this procedure is ultimately also iterative: we inevitably fail to specify our values correctly on some subset of scenarios, and then your reasoning equally applies on the meta-iteration procedure of specifying values, and waiting to see what it does in real systems. Whether with RL from human feedback or RL from human formalism, a sufficiently smart agent deployed on a sufficiently hard task will always find unintended easy ways to optimize an objective, and hide them, vs. solving the original task. Asking that we "get it right", and figuring out what we want, seems kind of equivalent to waiting for the right iteration of human feedback, except on a different iteration pipeline (which, to me, don't seem fundamentally different on the AGI scale).

[-]Caspar Oesterheld3yΩ440

Nice post!

What would happen in your GPT-N fusion reactor story if you ask it a broader question about whether it is a good idea to share the plans?

Perhaps relatedly:

>Ok, but can’t we have an AI tell us what questions we need to ask? That’s trainable, right? And we can apply the iterative design loop to make AIs suggest better questions?

I don't get what your response to this is. Of course, there is the verifiability issue (which I buy). But it seems that the verifiability issue alone is sufficient for failure. If you ask, "Can this design be turned into a bomb?" and the AI says, "No, it's safe for such and such reasons", then if you can't evaluate these reasons, it doesn't help you that you have asked the right question.

[-]johnswentworth3yΩ341

My response to the "get the AI to tell us what questions we need to ask" is that it fails for multiple reasons, any one of which is sufficient for failure. One of them is the verifiability issue. Another is the Gell-Mann Amnesia thing (which you could view as just another frame on the verifiability issue, but up a meta level). Another is the "get what we measure" problem.

Another failure mode which this post did not discuss is the Godzilla Problem. In the frame of this post: in order to work in practice the iterative design loop needs to be able to self-correct; if we make a mistake at one iteration it must be fixable at later iterations. "Get the AI to tell us what questions we need to ask" fails that test; just one iteration of acting on malicious advice from an AI can permanently break the design loop.

[-]Shmi3y4-4

I love the way you explained the iterative design failure modes! Accessible and clear. Well, the lead gasoline inventors knew what they were doing was harmful to humans, but they were driven by profits. But there are plenty of examples where the unexpected long-term effects kill you, like the CFC ozone depletion.

My main takeaway is similar to what I had gestured at some years ago: you can't use an AGI as a tool. Instead an AGI's goal would be to understand the universe, including humans in it, and only act well within the boundaries of its understanding. So, no little fusicles in every phone unless it cannot be exploited by malevolent humans, without anyone thinking to check. And refuse to do anything unexpectedly x-risky humans may ask or try:

Or even https://archive.is/30TA6.

That would imply a frustrating experience where an AGI would refuse a seemingly perfectly reasonable request for the reasons unfathomable to us. May not quite be the "CEV", but at least the chances of survival will go up.

[-]metachirality2y30Review for 2022 Review

Doesn't say anything particularly novel but presents it in a very clear and elegant manner which has value in and of itself.

[-]johnswentworth2y50

Doesn't say anything particularly novel

On the one hand: I know, right?

On the other hand: somehow lots of people kept being like "You think RLHF is terrible? You must just not be thinking about the value of an iterative design loop!" and I'm like "... I am not the one here who has not thought through strategy around iterative design loops".

Like, one major thing which prompted this post was Richard Ngo outright saying "I take this comment as evidence that John would fail an intellectual turing test for people who have different views than he does about how valuable incremental empiricism is.". I promptly and by his own admission proved him wrong about that, but my major update from that exchange (and a few similar exchanges on that and adjacent threads) was: a lot of people just have some vague halo around iterative development, and have not actually thought through the gears and the failure-modes (especially in the context of AI).

(Also I'd be interested to hear @Richard_Ngo's response to this. He did leave a comment on OP which seemed to me like a somewhat-confused tangent at the time, but I haven't heard him respond to what I'd consider the central point of the post: iterative design loops are not vague magic good things, they have specific predictable failure modes, and a bunch of stuff like RLHF which people say is good because handwave iterative design handwave in fact look quite terrible when we actually think about the failure modes of iterative design.)

[-]metachirality2y30

I'm actually somewhat surprised. Maybe this idea has saturated my water supply to the point where it seems trivial.

[-]Stephen McAleese3y32

One more reason why iterative design could fail is if we build AI systems with low corrigibility. If we build a misaligned AI with low corrigibility that isn't doing what we want, we might have difficulty shutting it down or changing its goal. I think that's one of the reasons why Yudkowsky believes we have to get alignment right on the first try.

[-]Aprillion3yΩ230

“Can’t we test whether the code works without knowing anything about programming?”

Knowing what to test to reliably decrease uncertainty about "whether the code works" includes knowing "a fair bit" about software engineering.

I agree with the distinction that being a programmer is not the only way how to know about programming, many hiring managers are not programmers themselves, they just have to know a fair bit about software engineering.

[-]Gurkenglas3y30

The only reason we haven’t died of it yet is that it is hard to wipe out the human species with only 20th-century human capabilities.

In fact, the power of hiding problems once saved everyone!

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

227

Worlds Where Iterative Design Fails

227

Ω 70

227

Ω 70

Basics: Hiding Problems

Example/Analogy: The Software Executive

Why RLHF Is Uniquely Terrible

Generalization: Iterate Until We Don’t See Any Problems

Does This Prove Too Much?

Less Basic: Knowing What To Look For

Example/Analogy: The Fusion Power Generator

Example/Analogy: Gunpowder And The Medieval Lord

Example/Analogy: Leaded Gasoline

Meta Example/Analogy: Expertise and Gell-Mann Amnesia

More Fundamental: Getting What We Measure

Summary & Takeaways