# 24

Epistemic status: more confident in the conclusion than in any particular model.

Suppose we have a fire insurance company with \$2.5M in monthly revenue, and claims following a power law:

• Most months, claims are low
• Once every ~1 year, they see around \$5M in claims
• Once every ~10 years, they see around \$25M in claims
• In general, every ~N^1.5 months, they see around N million dollars of claims, for large N.

Ignoring bankruptcy laws and other breakdowns of the model, what’s the expected profit of this company in a random month?

Correct answer: negative infinity. This is a classic black swan scenario: there’s a well-defined distribution of always-finite events with infinite expected value. Sooner or later this company is going to go broke.

(Math, for those who partake: when $1 < \alpha \le 2$, the integral $\int_1^\infty \frac{1}{x^\alpha}\,dx$ is finite but the integral $\int_1^\infty x \cdot \frac{1}{x^\alpha}\,dx$ is infinite. So, if events have probability density proportional to $\frac{1}{x^\alpha}$, then they’ll have a well-defined probability distribution, they’ll always be finite, but they’ll have infinite expectation.)

Let’s change the scenario a bit - let’s make our insurance company more proactive. They want to go out and prevent those big fires which eat into their profits. Whenever there’s a big fire, they find some way to prevent that sort of fire in the future. Now, different sizes of fire tend to result from different kinds of problems; preventing the once-a-year fires doesn’t really stop the once-a-decade or once-a-century fires. But within a few years, the stream of roughly-once-a-year \$5M claims has all but disappeared. Within a few decades, the stream of roughly-once-a-decade \$25M claims has all but disappeared.

Iterative improvement - fixing problems as they come up - has eliminated 95% of the fires. Now what’s the expected profit of this company in a random month?

Correct answer: still negative infinity. The black swans were always the problem, and they haven’t been handled at all. If anything, the problem is worse, because now the company has eliminated most of the “warning bells” - the more-frequent fires which are big but not disastrous.

Moral of the story: when the value is in the long tail, iterative improvement - i.e. fixing problems as they come up - cannot unlock the bulk of the value. Long tails cannot be handled by iterative feedback and improvement.
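The parenthetical math in the opening scenario can be checked numerically. What follows is a minimal sketch, assuming a claim-size density proportional to x^(-alpha) on x >= 1. The function names, the exponent `ALPHA = 1.5` (an illustrative value from the range 1 < alpha <= 2, not a fit to the dollar figures in the scenario), and the removed `BAND` are all assumptions made for illustration. It shows that the expected claim size capped at a cutoff keeps growing as the cutoff rises, and that deleting a band of mid-sized events barely changes that.

```python
import math

def tail_mass(alpha: float, cutoff: float) -> float:
    """P(X > cutoff) for the density (alpha - 1) * x**(-alpha) on x >= 1."""
    return cutoff ** (1.0 - alpha)

def partial_mean(alpha: float, lo: float, hi: float) -> float:
    """Integral of x * (alpha - 1) * x**(-alpha) over [lo, hi]."""
    if math.isclose(alpha, 2.0):
        # Special case: the integrand is 1/x, so the integral is a logarithm.
        return math.log(hi / lo)
    k = 2.0 - alpha
    return (alpha - 1.0) * (hi ** k - lo ** k) / k

ALPHA = 1.5         # illustrative exponent (assumption)
BAND = (5.0, 25.0)  # hypothetical band of event sizes eliminated by iterative fixes

for cutoff in (1e2, 1e4, 1e6, 1e8):
    full = partial_mean(ALPHA, 1.0, cutoff)
    fixed = full - partial_mean(ALPHA, *BAND)
    print(f"cutoff={cutoff:.0e}  P(X>cutoff)={tail_mass(ALPHA, cutoff):.1e}  "
          f"capped mean={full:9.1f}  band removed={fixed:9.1f}")
```

The probability of exceeding the cutoff shrinks toward zero while the capped mean grows without bound, which is the sense in which the distribution is well-defined but has infinite expectation. Removing the band subtracts a fixed amount from a quantity that keeps growing.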

## What this really looks like

The fire insurance example seems a bit unrealistic. In the real world, things like bankruptcy laws and government responses to extreme disasters would kick in, and our fire insurance company wouldn’t actually have a valuation of negative infinity dollars.

Nonetheless, I do think the basic idea holds up: long tails cannot be handled by iterative improvement. A fire insurance company might dodge that problem by passing responsibility for the longest of the tail to the government, but there are plenty of value-in-the-tail scenarios where that won’t work.

Value of the Long Tail opens with the example of a self-driving car project. If the car is safe 99% of the time - i.e. a human driver only needs to intervene to prevent an accident on one trip in a hundred - then this car will generate very little of the value of a full self-driving car.

What happens when we apply iterative improvement to this sort of project? We drive the car around a bunch, and every time a problem comes up, we fix it. Soon we’ve picked the low-hanging fruit, and the remaining problems are all fairly complicated; failures arise from interactions between half a dozen subsystems. If we have 100 subsystems, and any given six can interact in novel ways to generate problems, then that’s 100^6 = 1 trillion qualitatively distinct places where problems can occur. If 30% of our problems look like this, then we can spend millions of man-hours iteratively fixing problems as they come up and never even make a dent - and that’s not even counting how much time the cars must be driven around to discover each problem!
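The counting in this paragraph is easy to reproduce. A rough sketch: 100^6 counts ordered six-tuples of subsystems (repeats allowed), while counting distinct unordered sets of six gives a smaller number, and the qualitative point survives either way. The `FIXES` budget of ten million man-hours at one fix per hour is a hypothetical figure chosen only to make "millions of man-hours" concrete.

```python
import math

N_SUBSYSTEMS = 100
GROUP_SIZE = 6

# The post's figure: ordered six-tuples, repeats allowed.
ordered = N_SUBSYSTEMS ** GROUP_SIZE
# A more conservative count: distinct unordered sets of six subsystems.
unordered = math.comb(N_SUBSYSTEMS, GROUP_SIZE)

FIXES = 10_000_000  # hypothetical budget: one fix per man-hour (assumption)

print(f"ordered tuples:  {ordered:,}")
print(f"unordered sets:  {unordered:,}")
print(f"share of ordered tuples covered: {FIXES / ordered:.4%}")
print(f"share of unordered sets covered: {FIXES / unordered:.4%}")
```

Even under the conservative count and a generous fix rate, the budget covers less than one percent of the six-way interactions.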

The real world is high-dimensional. Rare problems are an issue in practice largely because high-dimensional spaces have exponentially many corners, and therefore exponentially many corner-cases. Each individual problem is rare and qualitatively different from other problems - patching one won’t fix the others. That means iterative feedback and improvement won’t ever push the rate of failures close to zero.


One problem with this model is that the insurance company probably couldn't possibly have a payout greater than some number. The long tail does not actually exist beyond some point. Even if every building the company covers burned down - say, a 10 billion dollar event - that will happen once in a million months - about 83,000 years. This integral is now, instead of negative infinity, only 300 times worse than the expectation for the events that would happen in an average year. This means that the expected risk in a given year is less than the profit, and the company is profitable. The math is of course fungible, but the point is, long tails with an infinite loss don't usually exist.
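The recurrence figures in this comment follow directly from the post's rule that claims of around N million dollars show up every ~N^1.5 months. A small sketch (the function name `recurrence_months` is made up for illustration) reproduces the once-a-year, once-a-decade, and once-in-a-million-months numbers:

```python
def recurrence_months(claim_millions: float) -> float:
    """Months between events of roughly this size, per the post's N**1.5 rule."""
    return claim_millions ** 1.5

for claim in (5, 25, 10_000):  # $5M, $25M, and the $10B "everything burns" event
    months = recurrence_months(claim)
    print(f"${claim:,}M in claims: about every {months:,.0f} months "
          f"({months / 12:,.1f} years)")
```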

Ah, you beat me to it :)

And for the car, real failures in 99%-safe-per-trip cars don't actually look like six subsystems failing independently. They look like "we classified this woman holding a bicycle as a woman riding a bicycle." The number of possible failures is large, but that doesn't make them likely, just numerous.

So, I sort of agree in general that it's better to solve long tail problems with solutions that try to generalize or exploit dimensionality, but I don't agree so much that I think this means you can't make a superhuman driver just by fixing problems "one at a time" in the way we've currently lumped together failure modes into discrete problems.

> And for the car, real failures in 99%-safe-per-trip cars don't actually look like six subsystems failing independently. They look like "we classified this woman holding a bicycle as a woman riding a bicycle."

Yeah, at this point in time the engineering of our self driving cars is such complete shit that a single point of failure in the software is sufficient to cause a problem. I would say that self-driving car engineers who run into problems like this haven't even really started working on the long tail yet. Humans at least have backup heuristics like "don't hit things" which do not depend on highly-reliable object classification.

> I don't agree so much that I think this means you can't make a superhuman driver just by fixing problems "one at a time" in the way we've currently lumped together failure modes into discrete problems.

How about this framing: when people build highly-reliable complex software, how often do they do it by starting with a buggy piece of software and then fixing problems as they come up until problems stop coming up? My guess would be "basically never"; at a bare minimum, things like 100% test coverage are going to be involved (not just tackling problems as they come up), and often more complicated things like formal specifications and proofs of correctness, stress testing, white-box tests specifically designed to break the system, etc.

> How about this framing: when people build highly-reliable complex software, how often do they do it by starting with a buggy piece of software and then fixing problems as they come up until problems stop coming up?

Funnily enough, I would use this same example to illustrate the opposite point: in fact we build extremely complex software by writing buggy software and then fixing problems as they come up (this describes ~all of the big tech giants, I believe), which suggests that at least within the domains those companies work in, you didn't have to solve literally all of the problems to capture a lot of economic value.

Though maybe this reduces to our previous disagreement about whether anything that does not have guarantees can contribute to AI safety.

> in fact we build extremely complex software by writing buggy software and then fixing problems as they come up (this describes ~all of the big tech giants, I believe)

Totally agree, and anyone who's worked for one of those tech giants will tell you that their software is absolutely packed with bugs. Those are indeed domains where solving the long tail of problems is not necessary for unlocking tons of value. That software does not need to be highly reliable, and indeed it is not highly reliable. If even just one in a hundred bugs at Google or Facebook or Microsoft killed somebody, those companies would have been sued out of business for gross negligence years ago.

I don't think this reduces to a disagreement about the necessity of guarantees; I think it reduces to a disagreement about whether the value of AGI (and the risk of AGI) resides primarily in the long tail. I wrote the OP in large part because, in our most recent discussion, it sounded like the claim "value/risk of AI is mainly in the long tail" was something you found plausible/likely, but you also thought we could eliminate most of the risk by fixing problems as they come up. The point of the OP is that these are mutually exclusive: if the value/risk is in the tail, then fixing problems as they come up cannot handle it.

(Side note: I don't think I communicated very well in that thing about whether "guarantees are necessary for anything to be useful for AI safety"; that's not really how I view it. Value/risk in the long tail is one of the main generators of that view. It's not that a guarantee is necessary, more that the sort of thing which actually handles the long tail is necessary - guarantees are one way of handling a long tail, but they're certainly not the only way. I do still stand by the analogy I used there: if something wouldn't be considered good enough for bridge safety engineering on a completely novel bridge design, then it shouldn't be considered good enough for AI safety engineering.)

> in our most recent discussion

My claim there was that in a world where alignment is about translation you could just do testing / reversibility etc. I do find this post somewhat persuasive evidence that that claim was wrong.

Nonetheless, I don't think the power law dynamic really matches my model of the situation. I was more imagining a model with some sort of threshold effect:

1. Economic value is often tied to high levels of reliability, perhaps because:

1a. Unacceptable risks (self-driving cars)

1b. Small failure rates still lead to many failures at scale (e.g. imagine if cars broke down once every 10K miles -- this is a low failure rate, but many people would have to deal with this multiple times a year)

1c. Other people would like to build on top of your product, and can't deal with an abstraction that perpetually leaks, because that vastly increases the complexity of using the product. (True of nearly all software, even end user software -- if Google Docs had a 0.1% chance of failing to save my work, I would not use it.)

2. All of these lead to ~threshold effects: once the failure rate drops below some threshold t, it becomes economically valuable and people start producing it; this leads to more investment that reduces the failure rate further, making it even more valuable. Notably, these are not power laws. (In practice, they aren't sharp thresholds -- maybe at a failure rate of 0.1%, you get 1% of the potential market, at 0.01%, you get 50%, and at 0.001% you get 99%.)

3. So when I agree with "the value is in the long tail", I mostly mean "the threshold t is very very low; the amount of effort it takes to get there is typically higher than people expect". But the threshold t still varies across domains, and it's still possible for testing-style approaches to reach the threshold; it depends on the particular domain at hand.

I think this argument applies both to self-driving cars, and traditional software (a la big tech companies), which is why I still used big tech companies as an example where value is in the tail.
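The soft-threshold numbers in point 2 (1% of the market at a 0.1% failure rate, 50% at 0.01%, 99% at 0.001%) happen to fit a logistic curve in log10 of the failure rate. Here is a minimal sketch assuming that functional form, which the comment itself does not specify; `market_share`, `threshold`, and `steepness` are hypothetical names introduced here.

```python
import math

def market_share(failure_rate: float,
                 threshold: float = 1e-4,
                 steepness: float = math.log(99)) -> float:
    """Fraction of the potential market captured at a given failure rate.

    Assumed form: logistic in log10(failure_rate), centered on `threshold`.
    The default steepness is chosen so one decade either side of the
    threshold moves the share between 1% and 99%.
    """
    decades_above = math.log10(failure_rate) - math.log10(threshold)
    return 1.0 / (1.0 + math.exp(steepness * decades_above))

for rate in (1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    print(f"failure rate {rate:.4%} -> market share {market_share(rate):.2%}")
```

Under this assumed curve, almost all of the value arrives within two decades of failure rate around the threshold t, which is a different shape from a power law, where value keeps accruing at every scale.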

Agree, and you've articulated this much better than I had in my head. Thank you.

> it sounded like the claim "value/risk of AI is mainly in the long tail" was something you found plausible/likely, but you also thought we could eliminate most of the risk by fixing problems as they come up.

So I don't think that we can eliminate most of the risk from AI systems making dumb mistakes; I do in fact see that as quite likely. And plausibly such mistakes are even bad enough to cost lives.

What I think we can eliminate is the risk of an AI very competently and intelligently optimizing against us, causing an x-risk; that part doesn't seem nearly as analogous to "long tail" problems.

I could break this down into a few subclaims:

1. It is very hard to cause existential catastrophes via "mistakes" or "random exploration", such that we can ignore this aspect of risk. Therefore, we only have to consider cases where an AI system is "trying" to cause an existential catastrophe.

2. To cause an existential catastrophe, an AI system will have to be very good at generalization (at the very least, there will not have been an existential catastrophe in the past that it can learn from).

3. An AI system that is good at generalization would be good at the long tail (or at the very least, it would learn as it experienced the long tail).

A counterargument would be that your AI system could be great at generalizing at capabilities / impacting the world, but not great at generalizing alignment / motivation / translation of human objectives into AI objectives.

I think this is plausible, but I find the "value of long tail" argument much less compelling when talking about alignment / motivation, conditioned on having good generalization in capabilities. I wouldn't agree with the "value of long tail" argument as applied to humans: for many tasks, it seems like you can explain to a human what the task is, and they are quickly able to do it without too many mistakes, or at least they know when they can't do the task without too high a risk of error; it seems like this comes from our general reasoning + knowledge of the world, both of which the AI system presumably also has.

> A counterargument would be that your AI system could be great at generalizing at capabilities / impacting the world, but not great at generalizing alignment / motivation / translation of human objectives into AI objectives.

I think this is roughly the right counterargument, modulo the distinction between "the AI has a good model of what humans want" and "the AI is programmed to actually do what humans want". (I don't think that distinction is key to this discussion, but might be for some people who come along and read this.)

I do think there's one really strong argument that generalizing alignment / motivation / translation of human objectives is harder than generalizing capabilities: what happens in the limit of infinite data and compute? In that limit, an AI can get best-possible predictive power by Bayesian reasoning on the entire microscopic state of the universe. That's what best-possible generalizing capabilities look like. The argument in Alignment as Translation was that alignment / motivation / translation of human objectives is still hard, even in that limit, and the way-in-which-it-is-hard involves a long tail of mistranslated corner cases. In other words: generalizable predictive power is very clearly not a sufficient condition for generalizable alignment.

I'd say there's a strong chance that generalizable predictive power will be enough for generalizable alignment in practice, with realistic data/compute, but we don't even have a decent model to predict when it will fail - other than that it will fail, once data and compute pass some unknown threshold. Such a model would presumably involve an epistemic analogue of instrumental convergence: it would tell us when two systems with different architectures are likely to converge on similar abstractions in order to model the same world.

Basically agree with all of this.

> I do think there's one really strong argument that generalizing alignment / motivation / translation of human objectives is harder than generalizing capabilities: what happens in the limit of infinite data and compute?

Strongly agree. I have two arguments for work on AI safety that I really do buy and find motivating; this is one of them. (The other one is the one presented in Human Compatible.)

But with both of these arguments, I see them as establishing that we can't be confident given our current knowledge that alignment happens by default; therefore given the high stakes we should work on it. This is different from making a prediction that things will probably go badly.

(I don't think this is actually disagreeing with you anywhere.)

> other than that it will fail, once data and compute pass some unknown threshold.

I want to flag a note of confusion here -- it feels like it should be possible for a mostly-aligned system to become more aligned, such that it never fails at any threshold (along the lines of there being a broad basin of corrigibility). But I haven't really made this perspective play nicely with the perspective of alignment as translation.

> This is different from making a prediction that things will probably go badly.

Thinking about it, I really should have been more explicit about this before: I do think there's a strong chance of alignment-by-default of AGI (at least 20%, maybe higher), as well as a strong chance of non-doom via other routes (e.g. decreasing marginal returns of intelligence or alignment becoming necessary for economic value in obvious ways).

Related: one place where I think I diverge from many/most people in the area is that I'm playing to win, not just to avoid losing. I see alignment not just as important for avoiding doom, but as plausibly the hardest part of unlocking most of the economic value of AGI.

My goal for AGI is to create tons of value and to (very very reliably) avoid catastrophic loss. I see alignment-in-the-sense-of-translation as the main bottleneck to achieving both of those simultaneously; I expect that both the value and the risk are dominated by exponentially large numbers of corner-cases.

> I want to flag a note of confusion here -- it feels like it should be possible for a mostly-aligned system to become more aligned, such that it never fails at any threshold (along the lines of there being a broad basin of corrigibility). But I haven't really made this perspective play nicely with the perspective of alignment as translation.

This was exactly why I mentioned the distinction between "the AI has a good model of what humans want" and "the AI is programmed to actually do what humans want". I haven't been able to articulate it very well, but here's a few things which feel like they're pointing to the same idea:

• If our AI is learning what humans value by predicting some data, then it won't matter how clever the AI is if the data-collection process is not robustly pointed at human values.
• More generally, if the source-of-truth for human values does not correctly and robustly point to human values, no amount of clever AI architecture can overcome that problem (though note that the source-of-truth may include e.g. information about human values built into a prior)
• Abram's stuff on stable pointers to values
• In translation terms, at some point we have to translate some directive for the AI, something of the form "do X". X may include some mechanism for self-correction, but if that initial mechanism for self-correction is ever insufficient, there will not be any way to fix it later (other than starting over with a whole new AI).

Continuing with the translation analogy: suppose we could translate the directive "don't take these instructions literally, use them as evidence to figure out what I want and then do that" - and of course other instructions would include further information about how to figure out what you want. That's the sort of thing which would potentially give a broad(er) basin of alignment if we're looking at the problem through a translation lens.

> I do think there's a strong chance of alignment-by-default of AGI (at least 20%, maybe higher), as well as a strong chance of non-doom via other routes (e.g. decreasing marginal returns of intelligence or alignment becoming necessary for economic value in obvious ways).

Ah, got it. In that case I think we broadly agree.

> one place where I think I diverge from many/most people in the area is that I'm playing to win, not just to avoid losing.

Yeah, this is a difference. I don't think it's particularly decision-relevant for me personally given the problems we actually face, but certainly it makes a difference in other hypotheticals (e.g. in the translation post I suggested testing + reversibility as a solution; that's much more about not losing than it is about winning).

> Continuing with the translation analogy: suppose we could translate the directive "don't take these instructions literally, use them as evidence to figure out what I want and then do that" - and of course other instructions would include further information about how to figure out what you want. That's the sort of thing which would potentially give a broad(er) basin of alignment if we're looking at the problem through a translation lens.

Yeah, I think that's right. There's also the directive "assist me" / "help me get what I want". It feels like these should be easier to translate (though I can't say what makes them different from all the other cases where I expect translation to be hard).

What's the corresponding story here for trading bots? Are they designed in a sufficiently high-assurance way that new tail problems don't come up, or do they not operate in the tails?

Great question. Let's talk about Knight Capital.

Ten years ago, Knight Capital was the largest high-frequency trader in US equities. On August 1 2012, somebody deployed a bug. Knight's testing platform included a component which generated random orders and sent them to a simulated market; somebody accidentally hooked that up to the real market. It's exactly the sort of error testing won't catch, because it was a change outside of the things-which-are-tested; it was partly an error in deployment, and partly code which did not handle partial deployment. The problem was fixed about 45 minutes later. That was the end of Knight Capital.

So yes, trading bots definitely operate in the tails.

When the Knight bug happened, I was interning at the largest high-frequency trading company in US options. Even before that, the company was more religious about thorough testing than any other I've worked at. Everybody knew that one bug could end us, Knight was just a reminder (specifically a reminder to handle partial deployment properly).

I agree that this is true for large-downside events, which is why the second half of the post exists. In reality, long-tail problems mostly don't come from black swans; magnitude of events is bounded. Rather, long-tail problems come from large numbers of unrelated rare events - i.e. corner cases - any one of which has significant but bounded consequences. It's the aggregate frequency of individually rare events, rather than the magnitude of the events, which makes the long tail an issue.

(Though there are exceptions to this, most notably in X-risk. There, it really is the magnitude of individual rare events which matters.)
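The aggregate-frequency point can be made concrete with a back-of-the-envelope calculation. This is a sketch under strong simplifying assumptions: the corner cases are treated as independent and equally likely, and the values `P_EACH`, `N_CASES`, and `PATCHED` are made up for illustration rather than taken from any real system.

```python
import math

def p_any_failure(p_each: float, n_cases: int) -> float:
    """P(at least one of n independent corner cases occurs), each with prob p_each.

    Computes 1 - (1 - p_each)**n_cases via log1p/expm1 for numerical stability.
    """
    return -math.expm1(n_cases * math.log1p(-p_each))

P_EACH = 1e-7        # hypothetical per-trip probability of any single corner case
N_CASES = 1_000_000  # hypothetical number of distinct corner cases
PATCHED = 1_000      # corner cases fixed so far by fixing problems as they come up

before = p_any_failure(P_EACH, N_CASES)
after = p_any_failure(P_EACH, N_CASES - PATCHED)

print(f"failure rate per trip before patching: {before:.4%}")
print(f"failure rate per trip after patching:  {after:.4%}")
print(f"relative improvement: {(before - after) / before:.3%}")
```

Each corner case is a one-in-ten-million event, yet together they produce a failure on roughly one trip in ten, and patching a thousand of them one at a time improves the aggregate rate by about a tenth of a percent.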

> If anything, the problem is worse, because now the company has eliminated most of the “warning bells” - the more-frequent fires which are big but not disastrous.

Why would preventing small fires, which are qualitatively different from and causally unrelated to supervolcano eruptions, eliminate any of the "warning bells" suggesting that supervolcano eruptions are a thing?

"Ignoring breakdowns of the model" means the same thing as "using the model where it is useless"; that can serve an illustrative purpose, but it means that in order to apply that metaphor to something real, you must first demonstrate that the negative impact of that thing /actually/ follows the behavior of the power law even for very large N; you can't just observe it for small N and extrapolate.

For example, insurance companies have a hard cap on liability. If every policy they have outstanding is filed for the policy limit, there is no additional source of liability to be had - their tail actually has a hard cutoff. That still allows actual claims to exactly match a power law for all observed cases.

I really like this post.

Self-driving cars are currently illegal, I assume largely because of these unresolved tail risks. But setting illegality aside, I'm not sure their economic value is zero -- I could imagine cases where people would use self-driving cars if they wouldn't be caught doing it. Does this seem right to people?

Intuitively it doesn't seem like economic value tails and risk tails should necessarily go together, which makes me concerned about cases similar to self-driving cars that are harder to regulate legally.

> I could imagine cases where people would use self-driving cars if they wouldn't be caught doing it. Does this seem right to people?

Rather than people straight-up ignoring the risks, I imagine things like cruise control or automatic emergency braking; these are example self-driving use-cases which don't require solving all the tail risks. The economic value of marginal improvements is not zero, although it's nowhere near the value of giving every worker in the country an extra hour every weekday (roughly the average commute time).

> Intuitively it doesn't seem like economic value tails and risk tails should necessarily go together...

Totally agree with this. I do think that when we know some area has lots of tail risk, we tend to set up regulation/liability, which turns the risk tail into an economic value tail. That's largely the point of (idealized) liability law: to turn risks directly into (negative) value for someone capable of mitigating the risks. But there's plenty of cases where risk tails and value tails won't go together:

• Cases where there's a positive value tail without any particular risks involved.
• Cases where we don't know there's a risk tail.
• Cases where liability law sucks. (Insert punchline here.)

I don't think self-driving cars are actually a hard case here, they're just a case which has to be handled by liability law (i.e. lawsuits post-facto) rather than regulatory law (i.e. banning things entirely).