Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here is an exploration of what Eliezer Yudkowsky means when he writes about deep vs shallow patterns (although I’ll be using "knowledge" instead of "pattern" for reasons explained in the next section). Not about any specific pattern Yudkowsky is discussing, mind you, but about what deep and shallow patterns are at all. In doing so, I don’t make any criticism of his ideas and instead focus on quoting him (seriously, this post is like 70% quotes) and interpreting him by finding the best explanation I can of his words (that still fit them, obviously). Still, there’s a risk that my interpretation misses some of his points and ideas: I’m building a lower bound on his argument’s power that is as high as I can get, not an upper bound. Also, I might just be completely wrong, in which case defer to Yudkowsky if he points out that I’m completely missing the point.

Thanks to Eliezer Yudkowsky, Steve Byrnes, John Wentworth, Connor Leahy, Richard Ngo, Kyle, Laria, Alex Turner, Daniel Kokotajlo and Logan Smith for helpful comments on a draft.

Back to the FOOM: Yudkowsky’s explanation

In recent discussions, Yudkowsky often talks about deep patterns and deep thinking. What he made clear in a comment on this draft is that he has been using the term “deep patterns” in two different ways:

  • What I’ll call deep knowledge, which is a form of human knowledge/theory as well as the related epistemic strategies. This is what I explore below.
  • What I’ll call deep cognition, which is the sort of deep patterns that Yudkowsky points out AGI would have. There’s a link and an analogy with the deep knowledge, but I don’t get it enough to write something convincing to me and Yudkowsky, so I’ll mostly avoid that topic in this post.

Focusing on deep knowledge then, Yudkowsky recently seems to ascribe his interlocutors’ failure to grasp his point to their inability to grasp different instances of deep knowledge.

(All quotes from Yudkowsky if not mentioned otherwise)

(From the first discussion with Richard Ngo)

In particular, just as I have a model of the Other Person's Beliefs in which they think alignment is easy because they don't know about difficulties I see as very deep and fundamental and hard to avoid, I also have a model in which people think "why not just build an AI which does X but not Y?" because they don't realize what X and Y have in common, which is something that draws deeply on having deep models of intelligence. And it is hard to convey this deep theoretical grasp.
 

That being said, he doesn’t really explain what this sort of deep knowledge is.

(From the same discussion with Ngo)

(Though it's something of a restatement, a reason I'm not going into "my intuitions about how cognition works" is that past experience has led me to believe that conveying this info in a form that the Other Mind will actually absorb and operate, is really quite hard and takes a long discussion, relative to my current abilities to Actually Explain things; it is the sort of thing that might take doing homework exercises to grasp how one structure is appearing in many places, as opposed to just being flatly told that to no avail, and I have not figured out the homework exercises.)

The thing is, he did exactly that in the FOOM debate with Robin Hanson 13 years ago. (For those unaware of this debate, Yudkowsky is responding to Hanson’s use of trend extrapolations, like Moore’s law, to think about intelligence explosion.)

(From The Weak Inside View (2008))

Robin keeps asking me what I’m getting at by talking about some reasoning as “deep” while other reasoning is supposed to be “surface.” One thing which makes me worry that something is “surface” is when it involves generalizing a level N feature across a shift in level N−1 causes.

For example, suppose you say, “Moore’s Law has held for the last sixty years, so it will hold for the next sixty years, even after the advent of superintelligence” (as Kurzweil seems to believe, since he draws his graphs well past the point where you’re buying a billion times human brainpower for $1,000).

Now, if the Law of Accelerating Change were an exogenous, ontologically fundamental, precise physical law, then you wouldn’t expect it to change with the advent of superintelligence.

But to the extent that you believe Moore’s Law depends on human engineers, and that the timescale of Moore’s Law has something to do with the timescale on which human engineers think, then extrapolating Moore’s Law across the advent of superintelligence is extrapolating it across a shift in the previous causal generator of Moore’s Law.

So I’m worried when I see generalizations extrapolated across a change in causal generators not themselves described—i.e., the generalization itself is on the level of the outputs of those generators and doesn’t describe the generators directly.

If, on the other hand, you extrapolate Moore’s Law out to 2015 because it’s been reasonably steady up until 2008—well, Reality is still allowed to say, “So what?” to a greater extent than we can expect to wake up one morning and find Mercury in Mars’s orbit. But I wouldn’t bet against you, if you just went ahead and drew the graph.

So what’s “surface” or “deep” depends on what kind of context shifts you try to extrapolate past.

An important subtlety here comes from the possible conflation of two uses of “surface”: the implicit use of “surface knowledge” as the consequences of some underlying causal processes/generator, and the explicit use of “surface knowledge” as drawing similarities without thinking about the causal process generating them. To simplify the discussion, let’s use the more modern idiom of “shallow” for the more explicit sense here.

So what is Yudkowsky pointing at? Two entangled things:

  • If you have shallow knowledge, that is, a trend without an underlying causal model, then you can’t extend it when the causal process generating it changes. So if Moore’s law depends on “the timescale on which human engineers think”, we can’t extend it past the intelligence explosion, because then human engineers would be replaced by AI engineers, who would think faster.
  • If you have shallow knowledge, you can’t even know when to extend the trend safely because understanding when the underlying causal process changes is harder when you don’t know what the causal process is!

Imagine a restaurant that has a dish you really like. The last 20 times you went to eat there, the dish was amazing. So should you expect that the next time it will also be great? Well, that depends on whether anything in the kitchen changes. Because you don’t understand what makes the dish great, you don’t know the most important aspects of the causal generators. So if they can’t buy their meat/meat-alternative at the same place, maybe that will change the taste; if the cook is replaced, maybe that will change the taste; if you go at a different time of the day, maybe that will change the taste.

You’re incapable of extending your trend (except by replicating all the conditions) to make a decent prediction because you don’t understand where it comes from. If on the other hand you knew why the dish was so amazing (maybe it’s the particular seasoning, or the chef’s touch), then you could estimate its quality. But then you’re not using the trend, you’re using a model of the underlying causal process.
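To make the contrast between a trend and its causal generator concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration (the function names, the doubling times, the year at which the generator changes); it is not a model taken from Yudkowsky or Hanson. A "shallow" log-linear trend is fitted to the output of a hidden process, then extrapolated both within the same regime and across a change in that process.

```python
import math

def generate_history(num_years, shift_year, doubling_time_before, doubling_time_after):
    """Toy causal generator: the doubling time depends on who does the engineering."""
    values = []
    log2_value = 0.0
    for year in range(num_years):
        values.append(2.0 ** log2_value)
        # Hypothetical regime change: faster engineers after shift_year.
        doubling_time = doubling_time_before if year < shift_year else doubling_time_after
        log2_value += 1.0 / doubling_time
    return values

def fit_trend(values):
    """Shallow knowledge: least-squares line through log2(value) versus year."""
    n = len(values)
    years = list(range(n))
    logs = [math.log2(v) for v in values]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    variance = sum((x - mean_x) ** 2 for x in years)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def extrapolate(slope, intercept, year):
    """Read the fitted trend line off at a given year."""
    return 2.0 ** (slope * year + intercept)

# Made-up numbers: doubling every 2 years, then every 0.5 years from year 60 on.
history = generate_history(num_years=80, shift_year=60,
                           doubling_time_before=2.0, doubling_time_after=0.5)

# The trend is fitted on the first 50 years only.
slope, intercept = fit_trend(history[:50])

# Ratio of actual value to trend prediction, before and after the generator changes.
ratio_before_shift = history[59] / extrapolate(slope, intercept, 59)
ratio_after_shift = history[79] / extrapolate(slope, intercept, 79)

print("actual / predicted at year 59:", ratio_before_shift)
print("actual / predicted at year 79:", ratio_after_shift)
```

In this toy setup the fitted trend stays accurate as long as the generator is unchanged and is off by many orders of magnitude afterwards. Nothing in the fitted slope and intercept signals when that will happen; only knowledge of the generator does.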

Here is another phrasing by Yudkowsky from the same essay:

Though this is to some extent an argument produced after the conclusion, I would explain my reluctance to venture into quantitative futurism via the following trichotomy:

  • On problems whose pieces are individually precisely predictable, you can use the Strong Inside View to calculate a final outcome that has never been seen before—plot the trajectory of the first moon rocket before it is ever launched, or verify a computer chip before it is ever manufactured.
  • On problems that are drawn from a barrel of causally similar problems, where human optimism runs rampant and unforeseen troubles are common, the Outside View beats the Inside View. Trying to visualize the course of history piece by piece will turn out to not (for humans) work so well, and you’ll be better off assuming a probable distribution of results similar to previous historical occasions—without trying to adjust for all the reasons why this time will be different and better.
  • But on problems that are new things under the Sun, where there’s a huge change of context and a structural change in underlying causal forces, the Outside View also fails—try to use it, and you’ll just get into arguments about what is the proper domain of “similar historical cases” or what conclusions can be drawn therefrom. In this case, the best we can do is use the Weak Inside View—visualizing the causal process—to produce loose, qualitative conclusions about only those issues where there seems to be lopsided support.

More generally, these quotes point to what Yudkowsky means when he says “deep knowledge”: the sort of reasoning that focuses on underlying causal models.

As he says himself:

To stick my neck out further: I am liable to trust the Weak Inside View over a “surface” extrapolation, if the Weak Inside View drills down to a deeper causal level and the balance of support is sufficiently lopsided.

Before going deeper into how such deep knowledge/Weak Inside View works and how to build confidence in it, I want to touch upon the correspondence between this kind of thinking and the Lucas Critique in macroeconomics. This link has been pointed out in the comments of the recent discussions — we thus shouldn’t be surprised that Yudkowsky wrote about it 8 years ago (yet I was surprised by this).

(From Intelligence Explosion Microeconomics (2013))

The “outside view” (Kahneman and Lovallo 1993) is a term from the heuristics and biases program in experimental psychology. A number of experiments show that if you ask subjects for estimates of, say, when they will complete their Christmas shopping, the right question to ask is, “When did you finish your Christmas shopping last year?” and not, “How long do you think it will take you to finish your Christmas shopping?” The latter estimates tend to be vastly over-optimistic, and the former rather more realistic. In fact, as subjects are asked to make their estimates using more detail—visualize where, when, and how they will do their Christmas shopping—their estimates become more optimistic, and less accurate. Similar results show that the actual planners and implementers of a project, who have full acquaintance with the internal details, are often much more optimistic and much less accurate in their estimates compared to experienced outsiders who have relevant experience of similar projects but don’t know internal details. This is sometimes called the dichotomy of the inside view versus the outside view. The “inside view” is the estimate that takes into account all the details, and the “outside view” is the very rough estimate that would be made by comparing your project to other roughly similar projects without considering any special reasons why this project might be different.

The Lucas critique (Lucas 1976) in economics was written up in 1976 when “stagflation”—simultaneously high inflation and unemployment—was becoming a problem in the United States. Robert Lucas’s concrete point was that the Phillips curve trading off unemployment and inflation had been observed at a time when the Federal Reserve was trying to moderate inflation. When the Federal Reserve gave up on moderating inflation in order to drive down unemployment to an even lower level, employers and employees adjusted their long-term expectations to take into account continuing inflation, and the Phillips curve shifted. Lucas’s larger and meta-level point was that the previously observed Phillips curve wasn’t fundamental enough to be structurally invariant with respect to Federal Reserve policy—the concepts of inflation and unemployment weren’t deep enough to describe elementary things that would remain stable even as Federal Reserve policy shifted.

and later in that same essay:

The lesson of the outside view pushes us to use abstractions and curves that are clearly empirically measurable, and to beware inventing new abstractions that we can’t see directly.

The lesson of the Lucas critique pushes us to look for abstractions deep enough to describe growth curves that would be stable in the face of minds improving in speed, size, and software quality.

You can see how this plays out in the tension between “Let’s predict computer speeds using this very well-measured curve for Moore’s Law over time—where the heck is all this other stuff coming from?” versus “But almost any reasonable causal model that describes the role of human thinking and engineering in producing better computer chips, ought to predict that Moore’s Law would speed up once computer-based AIs were carrying out all the research!”

This last sentence in particular points out another important feature of deep knowledge: that it might be easier to say negative things (like “this can’t work”) than precise positive ones (like “this is the precise law”) because the negative thing can be something precluded by basically all coherent/reasonable causal explanations, while they still disagree on the precise details.

Let’s dig deeper into that by asking more generally what deep knowledge is useful for.

How does deep knowledge work?

We now have a pointer (however handwavy) to what Yudkowsky means by deep knowledge. Yet we have very few details at this point about what this sort of thinking looks like. To improve that situation, the next two subsections explore two questions about the nature of deep knowledge: what is it for, and where does it come from?

The gist of this section is that:

  • Deep knowledge is primarily useful for saying what isn’t possible/what can’t work, especially in cases (like alignment) where there is very little data to draw from. (The comparison Yudkowsky keeps coming back to is how thermodynamics allows you to rule out perpetual motion machines)
  • Deep knowledge takes the form of compressed constraints on solution/hypothesis space, which have weight behind them because they let us rederive most of our current knowledge from basic/compressed ideas, and finding such compression without a strong entanglement with reality is incredibly hard. (Here an example used by Yudkowsky is the sort of thought experiments, conservation laws, and general ideas about what physical laws look like that guided Einstein in his path to Special and General Relativity)

What is deep knowledge useful for?

The big difficulty that comes up again and again, in the FOOM debate with Hanson and the discussion with Ngo and Christiano, is that deep knowledge doesn’t always lead to quantitative predictions. That doesn’t mean that the deep knowledge isn’t quantitative itself (expected utility maximization is an example used by Yudkowsky that is completely formal and quantitative), but that the causal model only partially constrains what can happen. That is, it doesn’t constrain enough to make precise quantitative predictions. 

Going back to his introduction of the Weak Inside View, recall that he wrote:

But on problems that are new things under the Sun, where there’s a huge change of context and a structural change in underlying causal forces, the Outside View also fails—try to use it, and you’ll just get into arguments about what is the proper domain of “similar historical cases” or what conclusions can be drawn therefrom. In this case, the best we can do is use the Weak Inside View—visualizing the causal process—to produce loose, qualitative conclusions about only those issues where there seems to be lopsided support.

He follows up writing:

So to me it seems “obvious” that my view of optimization is only strong enough to produce loose, qualitative conclusions, and that it can only be matched to its retrodiction of history, or wielded to produce future predictions, on the level of qualitative physics.

“Things should speed up here,” I could maybe say. But not “The doubling time of this exponential should be cut in half.”

I aspire to a deeper understanding of intelligence than this, mind you. But I’m not sure that even perfect Bayesian enlightenment would let me predict quantitatively how long it will take an AI to solve various problems in advance of it solving them. That might just rest on features of an unexplored solution space which I can’t guess in advance, even though I understand the process that searches.

Let’s summarize it this way: deep knowledge only partially constrains the surface phenomena it describes (which translate into quantitative predictions), and it takes a lot of detailed deep knowledge (and often data) to refine it enough to pin down exactly the phenomenon and make precise quantitative predictions. Alignment and AGI are fields where we don’t have that much deep knowledge, and the data is sparse, and thus we shouldn’t expect precise quantitative predictions anytime soon.

Of course, just because a prediction is qualitative doesn’t mean it comes from deep knowledge; not all hand-waving is wisdom. For a good criticism of shallow qualitative reasoning in alignment, let’s turn to Qualitative Strategies of Friendliness.

These then are three problems, with strategies of Friendliness built upon qualitative reasoning that seems to imply a positive link to utility:

The fragility of normal causal links when a superintelligence searches for more efficient paths through time;

The superexponential vastness of conceptspace, and the unnaturalness of the boundaries of our desires;

And all that would be lost, if success is less than complete, and a superintelligence squeezes the future without protecting everything of value in it.

The shallow qualitative reasoning criticized here relies too much on human common sense and superiority to the AI, when the situation to predict is about superintelligence/AGI. That is, this type of qualitative reasoning extrapolates across a change in causal generators.

On the other hand, Yudkowsky uses qualitative constraints to guide his criticism: he knows there’s a problem because the causal model forbids that kind of solution. Just like the laws of thermodynamics forbid perpetual motion machines.

Deep qualitative reasoning starts from the underlying (potentially quantitative) causal explanations and mostly tells you what cannot work or what cannot be done. That is, deep qualitative reasoning points out that a whole swath of search space is not going to yield anything. A related point is that Yudkowsky rarely (AFAIK) makes predictions, even qualitative ones. He sometimes admits that he might make some, but it feels more like a compromise with the prediction-centered other person than what the deep knowledge is really for. Whereas he constantly points out how certain things cannot work.

(From Qualitative Strategies of Friendliness (2008))

In general, a lot of naive-FAI plans I see proposed, have the property that, if actually implemented, the strategy might appear to work while the AI was dumber-than-human, but would fail when the AI was smarter than human.  The fully general reason for this is that while the AI is dumber-than-human, it may not yet be powerful enough to create the exceptional conditions that will break the neat little flowchart that would work if every link operated according to the 21st-century First-World modal event.

This is why, when you encounter the AGI wannabe who hasn't planned out a whole technical approach to FAI, and confront them with the problem for the first time, and they say, "Oh, we'll test it to make sure that doesn't happen, and if any problem like that turns up we'll correct it, now let me get back to the part of the problem that really interests me," know then that this one has not yet leveled up high enough to have interesting opinions.  It is a general point about failures in bad FAI strategies, that quite a few of them don't show up while the AI is in the infrahuman regime, and only show up once the strategy has gotten into the transhuman regime where it is too late to do anything about it.

(From the second discussion with Ngo)

I live in a world where I proceed with very strong confidence if I have a detailed formal theory that made detailed correct advance predictions, and otherwise go around saying, "well, it sure looks like X, but we can be on the lookout for a miracle too".

If this was a matter of thermodynamics, I wouldn't even be talking like this, and we wouldn't even be having this debate.

I'd just be saying, "Oh, that's a perpetual motion machine. You can't build one of those. Sorry." And that would be the end.

(From Security Mindset and Ordinary Paranoia (2017))

You need to master two ways of thinking, and there are a lot of people going around who have the first way of thinking but not the second. One way I’d describe the deeper skill is seeing a system’s security as resting on a story about why that system is safe. We want that safety-story to be as solid as possible. One of the implications is resting the story on as few assumptions as possible; as the saying goes, the only gear that never fails is one that has been designed out of the machine.

[...]

There’s something to be said for redundancy, and having fallbacks in case the unassailable wall falls; it can be wise to have additional lines of defense, so long as the added complexity does not make the larger system harder to understand or increase its vulnerable surfaces. But at the core you need a simple, solid story about why the system is secure, and a good security thinker will be trying to eliminate whole assumptions from that story and strengthening its core pillars, not only scurrying around parrying expected attacks and putting out risk-fires.

Or take my reading of the whole discussion with Christiano, which is that Christiano constantly tries to get Yudkowsky to make a prediction, but the latter focuses on aspects of Christiano’s model and scenario that don’t fit his (Yudkowsky’s) deep knowledge.

I especially like the perpetual motion machines analogy, because it drives home how just proposing a tweak/solution without understanding Yudkowsky’s deep knowledge (and what it would take for it to not apply) has almost no chance of convincing him. Because if someone said they built a perpetual motion machine without discussing how they bypass the laws of thermodynamics, every scientifically literate person would be doubtful. On the other hand, if they seemed to be grappling with thermodynamics and arguing for a plausible way of winning, you’d be significantly more interested.

(I feel like Bostrom’s Orthogonality Thesis is a good example of such deep knowledge in alignment that most people get, and I already argued elsewhere that it serves mostly to show that you can’t solve alignment by just throwing competence at it. Also note that Yudkowsky had the same pattern earlier or in parallel, and is still using it.)

To summarize: the deep qualitative thinking that Yudkowsky points out by saying “deep knowledge” is the sort of thinking that cuts off a big chunk of possibility space, that is, it tells you that the whole chunk cannot work. It also lets you judge from the way people propose a solution (whether they tackle the deep pattern or not) whether you should ascribe decent probability to them being right.

A last note in this section: although deep knowledge primarily leads to negative conclusions, it can also lead to positive knowledge through a particularly Bayesian mechanism: if the deep knowledge destroys every known hypothesis/proposal except one (or a small number of them), then that is strong evidence for the ones left.

(This quote is more obscure than the others without the context. It’s from Intelligence Explosion Microeconomics (2013), and discusses the last step in a proposal for formalizing the sort of deep insight/pattern Yudkowsky leveraged during the FOOM debate. If you’re very confused, I feel like the most relevant part to my point is the last sentence.)

If Step Three is done wisely—with the priors reflecting an appropriate breadth of uncertainty—and doesn’t entirely founder on the basic difficulties of formal statistical learning when data is scarce, then I would expect any such formalization to yield mostly qualitative yes-or-no answers about a rare handful of answerable questions, rather than yielding narrow credible intervals about exactly how the internal processes of the intelligence explosion will run. A handful of yeses and nos is about the level of advance prediction that I think a reasonably achievable grasp on the subject should allow—we shouldn’t know most things about intelligence explosions this far in advance of observing one—we should just have a few rare cases of questions that have highly probable if crude answers. I think that one such answer is “AI go FOOM? Yes! AI go FOOM!” but I make no pretense of being able to state that it will proceed at a rate of 120,000 nanofooms per second.

Even at that level, covering the model space, producing a reasonable simplicity weighting, correctly hooking up historical experiences to allow falsification and updating, and getting back the rational predictions would be a rather ambitious endeavor that would be easy to get wrong. Nonetheless, I think that Step Three describes in principle what the ideal Bayesian answer would be, given our current collection of observations. In other words, the reason I endorse an AI-go-FOOM answer is that I think that our historical experiences falsify most regular growth curves over cognitive investments that wouldn’t produce a FOOM.
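The Bayesian mechanism described above (ruling out most hypotheses leaves the probability mass on the survivors) can be sketched in a few lines. This is only a toy illustration under stated assumptions: the five hypotheses, the uniform prior, and the likelihood numbers are all made up, and the function name is hypothetical.

```python
def bayes_update(prior, likelihood_of_evidence):
    """Posterior is proportional to prior times P(evidence | hypothesis)."""
    unnormalized = {h: prior[h] * likelihood_of_evidence[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}

# Assumption: five mutually exclusive hypotheses with a uniform prior.
prior = {"H1": 0.2, "H2": 0.2, "H3": 0.2, "H4": 0.2, "H5": 0.2}

# Assumption: the evidence is nearly impossible under H1..H4, while H5 only
# makes it moderately likely. H5 does not "predict" the evidence sharply;
# it is simply the one hypothesis that is not ruled out.
likelihood_of_evidence = {"H1": 0.01, "H2": 0.01, "H3": 0.01, "H4": 0.01, "H5": 0.5}

posterior = bayes_update(prior, likelihood_of_evidence)
for hypothesis, probability in posterior.items():
    print(hypothesis, round(probability, 4))
```

The point of the sketch is that the surviving hypothesis gains most of its probability from the elimination of its rivals, not from any precise quantitative prediction of its own.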

Where does deep knowledge come from?

Now that we have a decent grounding of what Yudkowsky thinks deep knowledge is for, the biggest question is how to find it, and how to know you have found good deep knowledge. After all, maybe the causal models one assumes are just bad?

This is the biggest difficulty that Hanson, Ngo, and Christiano seemed to have with Yudkowsky’s position.

(Robin Hanson, from the comments after Observing Optimization in the FOOM Debate)

If you can’t usefully connect your abstractions to the historical record, I sure hope you have some data you can connect them to. Otherwise I can’t imagine how you could have much confidence in them.

(Richard Ngo from his second discussion with Yudkowsky)

Let me put it this way. There are certain traps that, historically, humans have been very liable to fall into. For example, seeing a theory, which seems to match so beautifully and elegantly the data which we've collected so far, it's very easy to dramatically overestimate how much that data favours that theory. Fortunately, science has a very powerful social technology for avoiding this (i.e. making falsifiable predictions) which seems like approximately the only reliable way to avoid it - and yet you don't seem concerned at all about the lack of application of this technology to expected utility theory.

(Paul Christiano from his discussion with Yudkowsky)

OK, but you keep saying stuff about how people with my dumb views would be "caught flat-footed" by historical developments. Surely to be able to say something like that you need to be making some kind of prediction?

Note that these attitudes make sense. I especially like Ngo’s framing. Falsifiable predictions (even just postdictions) are the cornerstone of evaluating hypotheses in Science. It even feels to Ngo (as it felt to me) that Yudkowsky argued for that in the Sequences:

(Ngo from his second discussion with Yudkowsky)

I'm familiar with your writings on this, which is why I find myself surprised here. I could understand a perspective of "yes, it's unfortunate that there are no advanced predictions, it's a significant weakness, I wish more people were doing this so we could better understand this vitally important theory". But that seems very different from your perspective here.

(And Yudkowsky himself from Making Belief Pay Rent (In Anticipated Experience))

Above all, don’t ask what to believe—ask what to anticipate. Every question of belief should flow from a question of anticipation, and that question of anticipation should be the center of the inquiry. Every guess of belief should begin by flowing to a specific guess of anticipation, and should continue to pay rent in future anticipations. If a belief turns deadbeat, evict it.

But the thing is… rereading part of the Sequences, I feel Yudkowsky was making points about deep knowledge all along? Even the quote I just used, which I interpreted in my rereading a couple of weeks ago as being about making predictions, now sounds like it’s about the sort of negative form of knowledge that forbids “perpetual motion machines”. Notably, Yudkowsky is very adamant that beliefs must tell you what cannot happen. Yet that doesn’t at all imply making predictions of the form “this is how AGI will develop”, so much as saying things like “this approach to alignment cannot work”.

Also, should I point out that there’s a whole sequence dedicated to the ways rationality can do better than science? (Thanks to Steve Byrnes for the pointer). I’m also sure I would find a lot of relevant stuff by rereading Inadequate Equilibria too, but if I wait to have reread everything by Yudkowsky before posting, I’ll be there a long time…

My Initial Mistake and the Einstein Case

Let me jump in here with my best guess of Yudkowsky’s justification of deep knowledge: its ability to both

  • strongly compress “what sort of hypothesis ends up being right” without having to add anything ad-hoc-y to get our theory and hypotheses back;
  • and constrain anticipations in non-trivial ways.

The thing is, I got it completely wrong initially. Reading Einstein’s Arrogance (2007), an early Sequences post that is all about saying that Einstein had excellent reasons to believe General Relativity’s correctness before experimental verification (of advance predictions), I thought that relativity was the deep knowledge and that Yudkowsky was pointing out how Einstein, having found an instance of true deep knowledge, could allow himself to be more confident than the social process of Science would permit in the absence of experimental justification.

Einstein’s Speed (2008) made it clear that I had been looking at the moon when I was supposed to see the pointing finger: the deep knowledge Yudkowsky pointed out was not relativity itself, but what let Einstein single it out by a lot of armchair reasoning and better use of what was already known.

In our world, Einstein didn't even use the perihelion precession of Mercury, except for verification of his answer produced by other means.  Einstein sat down in his armchair, and thought about how he would have designed the universe, to look the way he thought a universe should look—for example, that you shouldn't ought to be able to distinguish yourself accelerating in one direction, from the rest of the universe accelerating in the other direction.

And Einstein executed the whole long (multi-year!) chain of armchair reasoning, without making any mistakes that would have required further experimental evidence to pull him back on track.

More generally, I interpret the whole Science and Rationality Sequence as explaining how deep knowledge can let rationalists do something that isn’t in the purview of traditional Science: estimate which hypotheses make sense before the experimental predictions and evidence come in.

(From Faster Than Science (2008))

This doesn't mean that the process of deciding which ideas to test is unimportant to Science.  It means that Science doesn't specify it.

[...]

In practice, there are some scientific queries with a large enough answer space, that picking models at random to test, it would take zillions of years to hit on a model that made good predictions—like getting monkeys to type Shakespeare.

At the frontier of science—the boundary between ignorance and knowledge, where science advances—the process relies on at least some individual scientists (or working groups) seeing things that are not yet confirmed by Science.  That's how they know which hypotheses to test, in advance of the test itself.

If you take your Bayesian goggles off, you can say, "Well, they don't have to know, they just have to guess."  If you put your Bayesian goggles back on, you realize that "guessing" with 10% probability requires nearly as much epistemic work to have been successfully performed, behind the scenes, as "guessing" with 80% probability—at least for large answer spaces.

The scientist may not know he has done this epistemic work successfully, in advance of the experiment; but he must, in fact, have done it successfully!  Otherwise he will not even think of the correct hypothesis.  In large answer spaces, anyway.

There’s a subtlety that is easy to miss: Yudkowsky doesn’t say that merely specifying a hypothesis in a large answer space counts as strong evidence for it. After all, you can just generate any random guess. What he’s pointing at is that to ascribe a decent amount of probability to a specific hypothesis in a large space through updating on evidence, you need to cut away a whole swath of the space and redirect its probability onto your hypothesis. And from a purely computational perspective, this whittling down of hypotheses takes more work than making the favored hypothesis certain enough through experimental verification.
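To make the arithmetic behind this point concrete, here is a minimal sketch in Python. It assumes a uniform prior over N hypotheses and measures evidence in bits of log-odds; the function name `bits_needed` and the choice N = 2^100 are illustrative assumptions, not something taken from Yudkowsky's posts.

```python
import math

def bits_needed(n_hypotheses: int, target_prob: float) -> float:
    """Bits of evidence needed to move one hypothesis from a uniform
    prior of 1/n_hypotheses up to target_prob, measured in log-odds."""
    prior = 1.0 / n_hypotheses
    prior_log_odds = math.log2(prior / (1.0 - prior))
    target_log_odds = math.log2(target_prob / (1.0 - target_prob))
    return target_log_odds - prior_log_odds

# Hypothetical size of a "large answer space" (an assumption for illustration).
N = 2 ** 100

bits_10 = bits_needed(N, 0.10)  # work needed to "guess" with 10% probability
bits_80 = bits_needed(N, 0.80)  # work needed to "guess" with 80% probability

print(f"to reach 10%: {bits_10:.1f} bits")
print(f"to reach 80%: {bits_80:.1f} bits")
print(f"extra bits from 10% to 80%: {bits_80 - bits_10:.1f}")
```

Under these assumptions, reaching 10% already takes about 97 bits, and going from 10% to 80% takes only about 5 more. Almost all of the work lies in locating the hypothesis at all, which is the sense in which a 10% "guess" requires nearly as much epistemic work as an 80% one.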

His claim then seems to be that Einstein, and other scientists who tended to “guess right” at what would later be experimentally confirmed, couldn’t have been just lucky: they must have found ways of whittling down the vastness of hypothesis space, so that they had any chance of proposing something that was potentially right.

Yudkowsky gives some pointers to what he thinks Einstein was doing right.

(From Einstein’s Speed (2008))

Rather than observe the planets, and infer what laws might cover their gravitation, Einstein was observing the other laws of physics, and inferring what new law might follow the same pattern.  Einstein wasn't finding an equation that covered the motion of gravitational bodies.  Einstein was finding a character-of-physical-law that covered previously observed equations, and that he could crank to predict the next equation that would be observed.

Nobody knows where the laws of physics come from, but Einstein's success with General Relativity shows that their common character is strong enough to predict the correct form of one law from having observed other laws, without necessarily needing to observe the precise effects of the law.

(In a general sense, of course, Einstein did know by observation that things fell down; but he did not get GR by backward inference from Mercury's exact perihelion advance.)

So in that interpretation, Einstein learned from previous physics and from thought experiments how to cut away the parts of the hypothesis space that didn’t sound like they could make good physical laws, until he was left with a small enough subspace that he could find the right fit by hand (even if that took him 10 years).

So, from a Bayesian perspective, what Einstein did is still induction, and still covered by the notion of a simple prior (Occam prior) that gets updated by new evidence.  It's just the prior was over the possible characters of physical law, and observing other physical laws let Einstein update his model of the character of physical law, which he then used to predict a particular law of gravitation.

If you didn't have the concept of a "character of physical law", what Einstein did would look like magic—plucking the correct model of gravitation out of the space of all possible equations, with vastly insufficient evidence.  But Einstein, by looking at other laws, cut down the space of possibilities for the next law.  He learned the alphabet in which physics was written, constraints to govern his answer.  Not magic, but reasoning on a higher level, across a wider domain, than what a naive reasoner might conceive to be the "model space" of only this one law.

In summary, deep knowledge doesn’t come in the form of a particularly neat hypothesis or compression; it is the engine of compression itself. Deep knowledge compresses “what sort of hypothesis tends to be correct”, such that it can be applied to the search for a correct hypothesis at the object level. That also cements the idea that deep knowledge gives constraints, not predictions: you don’t expect to have a criterion for correct hypotheses so strong that, given a massive hypothesis space, you can pinpoint the correct one.

Here it is good to generalize my previous mistake; recall that I took General Relativity for the deep knowledge, when it was actually the sort of constraints on physical laws that Einstein used for even finding General Relativity. Why? I can almost hear Yudkowsky answering in my head: because General Relativity is the part accepted and acknowledged by Science. I don’t think it’s the only reason, but there’s an element of truth: I privileged the “proper” theory with experimental validation over the vaguer principles and concepts that led to it.

A similar mistake is to believe the deep knowledge is the theory when it actually is what the theory and the experiments unearthed. This is how I understand Yudkowsky’s use of thermodynamics and evolutionary biology: he points at the deep knowledge that led to, and was revealed by, the work on these theories, more than at the theories themselves.

Compression and Fountains of Knowledge

We still don’t have a good way of finding and checking deep knowledge, though. Not every constraint on hypothesis space is deep knowledge, or even knowledge at all. The obvious idea is to have a reason for that constraint. And the reason Yudkowsky goes for almost every time is compression. Not a compressed description, like Moore’s law; nor a “compression” that is as complex as the pattern of hypotheses it’s trying to capture. Compression in the sense that you get a simpler constraint that can get you most of the way toward regenerating the knowledge you’re starting from.

This view of the importance of compression is everywhere in the Sequences. A great example is Truly Part of You, which asks what knowledge you could rederive if it were deleted from your mind. If you have a deep understanding of the subject, and you keep recursively asking how a piece of knowledge could be rederived and then how “what’s needed for the derivation” can be rederived, Yudkowsky argues that you will reach “fountains of knowledge”. Or in the terminology of this post, deep knowledge.

Almost as soon as I started reading about AI—even before I read McDermott—I realized it would be a really good idea to always ask myself: “How would I regenerate this knowledge if it were deleted from my mind?”

The deeper the deletion, the stricter the test. If all proofs of the Pythagorean Theorem were deleted from my mind, could I re-prove it? I think so. If all knowledge of the Pythagorean Theorem were deleted from my mind, would I notice the Pythagorean Theorem to re-prove? That’s harder to boast, without putting it to the test; but if you handed me a right triangle with sides of length 3 and 4, and told me that the length of the hypotenuse was calculable, I think I would be able to calculate it, if I still knew all the rest of my math.

What about the notion of mathematical proof? If no one had ever told it to me, would I be able to reinvent that on the basis of other beliefs I possess? There was a time when humanity did not have such a concept. Someone must have invented it. What was it that they noticed? Would I notice if I saw something equally novel and equally important? Would I be able to think that far outside the box?

How much of your knowledge could you regenerate? From how deep a deletion? It’s not just a test to cast out insufficiently connected beliefs. It’s a way of absorbing a fountain of knowledge, not just one fact.

What do these fountains look like? They’re not the fundamental theories themselves, but instead their underlying principles. Stuff like the principle of least action, Noether’s theorem and the principles underlying Statistical Mechanics (I don’t know enough about it to name them). They are the crystallized insights which constrain the search space enough that we can rederive what we knew from them.

(Feynman might have agreed, given that he chose the atomic hypothesis/principle, “all things are made of atoms -- little particles that move around in perpetual motion, attracting each other when they are a little distance apart, but repelling upon being squeezed into one another”, as the one sentence he would salvage for future generations in case of a cataclysm.)

Here I hear a voice in my mind saying “What does simple mean? Shouldn’t it be better defined?” Yet this doesn’t feel like a strong objection. Simple is tricky to define intensionally, but scientists and mathematicians tend to be pretty good at spotting it, as long as they don’t fall for Mysterious Answers. And most of the checks on deep knowledge seem to lie in its ability to rederive the known correct hypotheses without adding stuff during the derivation.

A final point before closing this section: Yudkowsky writes that the same sort of evidence can be gathered for more complex arguments if they can be summarized by simple arguments that still get most of the current data right. My understanding here is that he’s pointing at the wiggle room of deep knowledge, that is, at the irrelevant ways in which it can sometimes be off. This is important because asking for that wiggle room can sound like ad-hoc adaptation of the pattern, breaking the compression assumption.

(From Intelligence Explosion Microeconomics (2013))

In my case, I think how much I trusted a Step Three model would depend a lot on how well its arguments simplified, while still yielding the same net predictions and managing not to be falsified by history. I trust complicated arguments much more when they have simple versions that give mostly the same answers; I would trust my arguments about growth curves less if there weren’t also the simpler version, “Smart minds build even smarter minds.” If the model told me something I hadn’t expected, but I could translate the same argument back into simpler language and the model produced similar results even when given a few cross-validational shoves, I’d probably believe it.

Conclusion

Based on my reading of his position, Yudkowsky sees deep knowledge as highly compressed causal explanations of “what sort of hypothesis ends up being right”. The compression means that we can rederive the successful hypotheses and theories from the causal explanation. Finally, such deep knowledge translates into partial constraints on hypothesis space, which focus the search by pointing out what cannot work. This in turn means that deep knowledge is far better at saying what won’t work than at precisely predicting the correct hypothesis.

I also want to point out something that became clearer and clearer in reading old posts: Yudkowsky is nothing if not coherent. You might not like his tone in the recent discussions, but if someone has been saying the same thing for 13 years, nobody seems to get it, and their model predicts that this will lead to the end of the world, maybe they can get some slack for talking smack.

32 comments

Now that we have a decent grounding of what Yudkowsky thinks deep knowledge is for, the biggest question is how to find it, and how to know you have found good deep knowledge.

This is basically the thing that bothered me about the debates. Your solution seems to be to analogize, Einstein:relativity::Yudkowsky:alignment is basically hopeless. But in the debates, M. Yudkowsky over and over says, "You can't understand until you've done the homework, and I have, and you haven't, and I can't tell you what the homework is." It's a wall of text that can be reduced to, "Trust me."

He might be right about alignment, but under the epistemic standards he popularized, if I update in the direction of his view, the strength of the update must be limited to "M. Yudkowsky was right about some of these things in the past and seems pretty smart and to have thought a lot about this stuff, but even Einstein was mistaken about spooky action at a distance, or maybe he was right and we haven't figured it out yet, but, hey, quantum entanglement seems pretty real." In many ways, science just is publishing the homework so people can poke holes in it.

If Einstein came to you in 1906 (after general relativity) and stated the conclusion of the special relativity paper, and when you asked him how he knew, he said, "You can't understand until you've done the homework, and I have, and you haven't," which is all true from my experience studying the equations, "and I can't tell you what the homework is," the strength of your update would be similarly limited. 

You might respond that M. Yudkowsky isn't trying to really convince anyone, but in that case, why debate? He's at least trying to get people to publish their AI findings less in order to burn less timeline.

This is basically the thing that bothered me about the debates. Your solution seems to be to analogize, Einstein:relativity::Yudkowsky:alignment is basically hopeless. But in the debates, M. Yudkowsky over and over says, "You can't understand until you've done the homework, and I have, and you haven't, and I can't tell you what the homework is." It's a wall of text that can be reduced to, "Trust me."

He might be right about alignment, but under the epistemic standards he popularized, if I update in the direction of his view, the strength of the update must be limited to "M. Yudkowsky was right about some of these things in the past and seems pretty smart and to have thought a lot about this stuff, but even Einstein was mistaken about spooky action at a distance, or maybe he was right and we haven't figured it out yet, but, hey, quantum entanglement seems pretty real." In many ways, science just is publishing the homework so people can poke holes in it.

I definitely feel you: that reaction was my big reason for taking so much time rereading his writing and penning this novel-length post.

The first thing I want to add is that when I looked for discussions of this in the Sequences, they were there. So the uncharitable explanation of "he's hiding the homework/explanation because he knows he's wrong or doesn't have enough evidence" doesn't really work. (I don't think you're defending this, but it definitely crossed my mind and that of others I talked to.) I honestly believe Yudkowsky is saying in good faith that he has found deep knowledge and that he doesn't know how to share it in any way he hasn't already tried in his 13 years of writing about it.

The second thing is that I feel my post brings together enough bits of Yudkowsky's explanations of deep knowledge that we have at least a partial handle on how to check it? Quoting back my conclusion:

Yudkowsky sees deep knowledge as highly compressed causal explanations of “what sort of hypothesis ends up being right”. The compression means that we can rederive the successful hypotheses and theories from the causal explanation. Finally, such deep knowledge translates into partial constraints on hypothesis space, which focus the search by pointing out what cannot work.

So the check requires us to understand what sort of successful hypotheses he is compressing, whether that really is a compression in the form of an underlying causal process that can be used to rederive these hypotheses, and whether the resulting constraint actually cuts a decent chunk of hypothesis space when applied to other problems.

That's definitely a lot of work, and I can understand if people don't want to invest the time there. But having a potential check and saying "I don't think this is a good time investment" seems different to me from saying that there's no way to check the deep knowledge.

Lastly,

If Einstein came to you in 1906 (after general relativity) and stated the conclusion of the special relativity paper, and when you asked him how he knew, he said, "You can't understand until you've done the homework, and I have, and you haven't," which is all true from my experience studying the equations, "and I can't tell you what the homework is," the strength of your update would be similarly limited. 

I recommend reading Einstein's Speed and Einstein's Superpowers, which are the two posts where Yudkowsky tries to point out that if you look for it, it's possible to find where Einstein was coming from and the sort of deep knowledge he used. I agree it would be easier if the person leveraging the deep knowledge could state it succinctly enough that we could get it, but I also acknowledge that this sort of fundamental principle, from which other things derive, is just plain hard to express. And even then, you need to do the homework.

(My disagreement with Yudkowsky here is that he seems to believe mostly in providing a lot of training data and examples so that people can see the deep knowledge for themselves, whereas I expect that most smart people would find it far easier to have a sort of pointer to the deep knowledge and what it is good for, and then go through a lot of examples).

I think you've identified a real through-line in Yudkowsky's work, one I hadn't noticed before.  Thank you for that.

Even so, when you're trying to think about this sort of thing I think it's important to remember that this:

In our world, Einstein didn't even use the perihelion precession of Mercury, except for verification of his answer produced by other means.  Einstein sat down in his armchair, and thought about how he would have designed the universe, to look the way he thought a universe should look—for example, that you shouldn't ought to be able to distinguish yourself accelerating in one direction, from the rest of the universe accelerating in the other direction.

...is not true.  In the comments to Einstein's Speed, Scott Aaronson explains the real story: Einstein spent over a year going down a blind alley, and was drawn back by -- among other things -- his inability to make his calculations fit the observation of Mercury's perihelion motion.  Einstein was able to reason his way from a large hypothesis space to a small one, but not to actually get the right answer.

(and of course, in physics you get a lot of experimental data for free.  If you're working on a theory of gravity and it predicts that things should fall away from each other, you can tell right away that you've gone wrong without having to do any new experiments.  In AI safety we are not so blessed.)

There's more I could write about the connection between this mistake and the recent dialogues, but I guess others will get to it and anyway it's depressing.  I think Yudkowsky doesn't need to explain himself more, he needs a vacation.

Thanks for the kind and thoughtful comment!

...is not true.  In the comments to Einstein's Speed, Scott Aaronson explains the real story: Einstein spent over a year going down a blind alley, and was drawn back by -- among other things -- his inability to make his calculations fit the observation of Mercury's perihelion motion.  Einstein was able to reason his way from a large hypothesis space to a small one, but not to actually get the right answer.

(and of course, in physics you get a lot of experimental data for free.  If you're working on a theory of gravity and it predicts that things should fall away from each other, you can tell right away that you've gone wrong without having to do any new experiments.  In AI safety we are not so blessed.)

That's a really good point. I didn't go into that debate in the post (because I tried to not criticize Yudkowsky, and also because the post is already way too long), but my take on this is: Yudkowsky probably overstates the case, but that doesn't mean he's wrong about the relevance for Einstein's work of the constraints and armchair reasoning (even if the armchair reasoning was building on more empirical evidence than Yudkowsky originally pointed out). As you say, Einstein apparently did reduce the search space significantly: he just failed to find exactly what he wanted in the reduced space directly.

My comment had an important typo, sorry: I meant to write that I hadn't noticed this through-line before!

I mostly agree with you re: Einstein, but I do think that removing the overstatement changes the conclusion in an important way.  Narrowing the search space from (say) thousands of candidate theories to just 4 is a great achievement, but you still need a method of choosing among them, not just to fulfill the persuasive social ritual of Science but because otherwise you have a 3 in 4 chance of being wrong.  Even someone who trusts you can't update that much on those odds.  That's really different from being able to narrow the search space down to just 1 theory; at that point, we can trust you -- and better still, you can trust yourself!  But the history of science doesn't, so far as I can tell, contain any "called shots" of this type; Einstein might literally have set the bar.

I think we disagree on Yudkowsky's conclusion: his point IMO is that Einstein was able to reduce the search space a lot. He overemphasizes for effect (and because it's more impressive to have someone who guesses right directly through these methods), but that doesn't change that Einstein reduced the state space a lot (which you seem to agree with).

Many of the relevant posts I quoted talk about how the mechanisms of Science are fundamentally incapable of doing that, because they don't specify any constraint on hypotheses except that they must be falsifiable. Your point seems to be that in the end, Einstein still used the sort of experimental data and methods underlying traditional Science, and I tend to agree. But the mere fact that he was able to get the right answer out of millions of possible formulations by checking a couple of numbers should tell you that there was a massive hypothesis-space reducing step before.
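The arithmetic in this exchange can be made concrete with a minimal Python sketch. It takes the "millions of possible formulations" and the "just 4" candidates mentioned above as round illustrative numbers (they are not historical estimates), assumes all candidates start out equally likely, and uses a hypothetical helper named `bits_to_narrow`.

```python
import math

def bits_to_narrow(before: int, after: int) -> float:
    """Bits of information needed to cut `before` equally likely
    candidates down to `after` equally likely candidates."""
    return math.log2(before / after)

# Illustrative round numbers taken from the discussion, not historical data.
candidates_at_start = 1_000_000
candidates_after_armchair = 4

armchair_bits = bits_to_narrow(candidates_at_start, candidates_after_armchair)
experiment_bits = bits_to_narrow(candidates_after_armchair, 1)

# With no way to choose among the survivors, a uniform pick is usually wrong.
p_wrong = 1 - 1 / candidates_after_armchair

print(f"armchair narrowing: {armchair_bits:.1f} bits")
print(f"final experimental selection: {experiment_bits:.1f} bits")
print(f"chance of being wrong without that selection: {p_wrong:.2f}")
```

With these numbers the armchair step supplies roughly 18 bits and the final selection only 2, yet skipping that final selection still leaves a 3 in 4 chance of error. Both observations made in the thread hold at once under these assumptions.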

Nah, we're on the same page about the conclusion; my point was more about how we should expect Yudkowsky's conclusion to generalize into lower-data domains like AI safety.  But now that I look at it that point is somewhat OT for your post, sorry.

Besides invoking “Deep Knowledge” and the analogy of ruling out perpetual motion, another important tool for understanding AI foom risk is security mindset, which Eliezer has written about here.

Maybe this is tangential, but I don’t get why the AI foom debate isn’t framed more often as a matter of basic security considerations. AI foom risk seems like a matter of basic security mindset. I think AI is a risk to humanity for the same reason I think any website can be taken out by a hack if you put a sufficiently large bounty on it.

Humanity has all kinds of vulnerabilities that are exploitable by a team of fast simulated humans, not to mention Von Neumann simulations or superhuman AIs. There are so many plausible attack vectors by which to destroy or control humanity: psychology, financial markets, supply chains, biology, nanotechnology, just to name a few.

It’s very plausible that the AI gets away from us, runs in the cloud, self-improves, and we can’t turn it off. It’s like a nuclear explosion that may start slow, but it’s picking up speed, recursively self-improving or even just speeding up to the level of an adversarial Von Neumann team, and it’s hidden in billions of devices.

We have the example of nuclear weapons. The US was a singleton power for a few years due to developing nukes first. At least a nuclear explosion stops when it burns through its fissile material. AI doesn’t stop, and it’s a much more powerful adversary that will not be contained. It’s like the first nuclear pile you’re testing with has a yield much larger than Tsar Bomba. You try one test and then you’ve permanently crashed your ability to test.

So to summarize my security-mindset view: Humanity is vulnerable to hackers, without much ability to restore a backup once we get hacked, and it’s very easy to think AI becomes a great hacker soon.

As for the many attack vectors, I would also add "many places and stages where things can go wrong" once AI becomes a genius social and computer hacker. (By the way, I have heard that most hacks are carried out not through technical exploits but through social engineering, because a person is a much more unreliable and harder-to-patch system.) From my point of view, the main problem is not even that the first piece of uranium explodes so that it melts the Earth. The problem is that there are 8 billion people on Earth, each with several electronic devices, and the processors (or batteries, for a more complete analogy) are made of californium. Now you have to hope that literally no one among 8 billion people will cause their device to explode (this is much worse than expecting that no one among just 1 million wizards will hit on the idea of transfiguring antimatter, botulinum toxin, thousands of infections, nuclear weapons, strangelets, or things like "only top quarks", which cannot be imagined at all), or that literally none of these reactions will spread as a chain reaction through all processors (which are also connected to a worldwide network operating on the basis of radiation) in the form of a direct explosion or neutron beams, or that you will be able to stop literally every explosive or neutron chain reaction. We can roughly say that each of the 8 billion people has three probabilities, one for not failing each of these three points, and even if each of them is very high on average, we have to raise each of them to the power of 8 billion. Worse, these are all probabilities over a certain period of time, say a year. And the problem is that over time it is not even the probabilities that grow, but the interval needed to create AI that shrinks, so that we get the difference between a geometric and an exponential progression.
Of course, one can say that one should not average over everyone, that the count should be cut down from everyone to just the number of processors, but then the number of people who can interfere is also reduced, and the likelihood that one of them will create AI increases. And again, the problem is not that the chance of creating AI increases, but that the process becomes easier, so that more people have a higher chance of creating it, which is why I still count all people. Finally, one can say that civilization will react when it sees not smoke, but fire. But civilization is not adequate. Generally. It has taken no fire-fighting measures here and has not reacted to the smoke. It also showed how it would react with the example of the coronavirus. Only here, to "it's no more dangerous than the flu", "The graph is exponential? Never mind", "it's all a conspiracy, not a true danger" and "I won't get vaccinated" will be added "it's all fiction / a cult", "AI is good" and so on.
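The "raise it to the power of 8 billion" step above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that every person fails independently with the same small probability over the period considered; the function name `prob_nobody_fails` and the sample probabilities are hypothetical, and only one of the three probabilities mentioned in the comment is modeled.

```python
import math

def prob_nobody_fails(p_fail: float, n_people: int) -> float:
    """Probability that none of n_people fails, assuming independent
    failures with identical probability p_fail: (1 - p_fail) ** n_people.
    Computed through log1p to stay accurate for very small p_fail."""
    return math.exp(n_people * math.log1p(-p_fail))

N = 8_000_000_000  # world population figure used in the comment

# Hypothetical per-person failure probabilities over one period.
for p in (1e-12, 1e-10, 1e-9):
    print(f"p_fail = {p:.0e}: P(nobody fails) = {prob_nobody_fails(p, N):.4f}")
```

Under these assumptions, even a one-in-ten-billion individual failure rate leaves less than an even chance that nobody fails, and a one-in-a-billion rate makes a failure somewhere nearly certain.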

Yes, I do quote the security mindset in the post.

I feel you're quite overstating the ability of the security mindset to show FOOM though. The reason it's not presented as a direct consequence of a security mindset is... because it's not one?

Like, once you are convinced of the strong possibility and unavoidability of AGI and superintelligence (maybe through FOOM arguments), then the security mindset actually helps you, and combining it with deep knowledge (like the Orthogonality Thesis) lets you find a lot more ways of breaking the "humanity security". But the security mindset applied without arguments for AGI doesn't let you postulate AGI, for the same reason that the security mindset without arguments about mind-reading doesn't let you postulate that the hackers might read the password in your mind.

For me the security mindset frame comes down to two questions:

1. Can we always stop AI once we release it?

2. Can we make the first unstoppable AI do what we want?

To which I'd answer "no" and "only with lots of research".

Without security mindset, one tends to think an unstoppable AI is a-priori likely to do what humans want, since humans built it. With security mindset, one sees that most AIs are nukes that wreak havoc on human values, and getting them to do what humans want is analogous to building crash-proof software for a space probe, except the whole human race only gets to launch one probe and it goes to whoever launches it first.

I'd like to see this kind of discussion with someone who doesn't agree with MIRI's sense of danger, in addition to all the discussions about how to extrapolate trends and predict development.

Without security mindset, one tends to think an unstoppable AI is a-priori likely to do what humans want, since humans built it. With security mindset, one sees that most AIs are nukes that wreak havoc on human values, and getting them to do what humans want is analogous to building crash-proof software for a space probe, except the whole human race only gets to launch one probe and it goes to whoever launches it first.

I think this is a really shallow argument that enormously undersells the actual reasons for caring about alignment. We have actual arguments for why unstoppable AIs are unlikely to do what humans want, and they don't need the security mindset at all. The basic outline is something like:

  • Since we have historically had a lot of trouble writing down programs that solve more complex and general problems like language or image recognition (and have had successes through ML), future AI and AGI will probably be the sort to "fill in the gaps" in our requests/specifications
  • For almost everything we could ask an AI to accomplish, there are actions that would help it and would be bad and counterintuitive from the standpoint of previous technology (the famous convergent subgoals)
  • Precisely specifying what we want without relying on common sense is incredibly hard, and doesn't survive strong optimization (Goodhart's law)
  • And competence by itself doesn't solve the problem, because understanding what humans want doesn't mean caring about it (Orthogonality thesis).

This line of reasoning (which is not new by any means; it's basically straight out of Bostrom and Yudkowsky's early writing) justifies the security mindset for AGI and alignment. Not the other way around.

(And historically, Yudkowsky wanted to build AGI before he found out about these points, which turned him into the biggest user, though by no means the only one, of the security mindset in alignment)

Ok I agree there are a bunch of important concepts to be aware of, such as complexity of value, and there are many ways for security mindset by itself to fail at flagging the extent of AI risk if one is ignorant of some of these other concepts.

I just think the outside view and extrapolating trends is so far from how one should reason about mere nukes, and superhuman intelligence is very nuke-like or at least has a very high chance of being nuke-like: that is, unlock unprecedentedly large rapid irreversible effects. Extrapolating from current trends would have been quite unhelpful to nuclear safety. I know Eliezer is just trying to meet other people in the discussion where they are, but it would be nice to have another discussion that seems more on-topic from Eliezer’s own perspective.

For what it's worth, I often find Eliezer's arguments unpersuasive because they seem shallow. For example:

The insight is in realizing that the hypothetical planner is only one line of outer shell command away from being a Big Scary Thing and is therefore also liable to be Big and Scary in many ways.

This seems like a fuzzy "outside view" sort of argument. (Compare with: "A loaded gun is one trigger pull away from killing someone and is therefore liable to be deadly in many ways." On the other hand, a causal model of a gun lets you explain which specific gun operations can be deadly and why.)

I'm not saying Eliezer's conclusion is false. I find other arguments for that conclusion much more persuasive, e.g. involving mesa-optimizers, because there is a proposed failure type which I understand in causal/mechanistic terms.

(I can provide other examples of shallow-seeming arguments if desired.)

I agree that it's a shallow argument presentation, but that's not the same thing as being based on shallow ideas. The context provided more depth, and in general a fair few of the shallowly presented arguments seem to be counters to even more shallow arguments.

In general one of the deeper concepts underlying all these shallow arguments appears to be some sort of thesis of "AGI-completeness", in which any single system that can reach or exceed human mental capability on most tasks, will almost certainly reach or exceed on all mental tasks, including deceiving and manipulating humans. Combining that with potentially very much greater flexibility and extensibility of computing substrate means you get an incredibly dangerous situation no matter how clever the designers think they've been.

One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware. You don't need a deep argument to point out an obvious flaw there. Talking about mesa-optimizers in such a context is just missing the point from a view in which humans can potentially be used as part of a toolchain in much the same way as robot arms or protein factories.

One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware. You don't need a deep argument to point out an obvious flaw there.

I don't see the "obvious flaw" you're pointing at and would appreciate a more in-depth explanation.

In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this:

  • You ask your AGI to generate a plan for how it could maximize paperclips.

  • Your AGI generates a plan. "Step 1: Manipulate human operator into thinking that paperclips are the best thing ever, using the following argument..."

  • You stop reading the plan at that point, and don't click "execute" for it.

I had the same view as you, and was persuaded out of it in this thread. Maybe to shift focus a little, one interesting question here is about training. How do you train a plan-generating AI? If you reward plans that sound like they'd succeed, regardless of how icky they seem, then the AI will become useless to you by outputting effective-sounding but icky plans. But if you reward only plans that look nice enough to execute, that tempts the AI to make plans that manipulate whoever is reading them, and we're back at square one.

Maybe that's a good way to look at the general problem. Instead of talking about AI architecture, just say we don't know any training methods that would make AI better than humans at real world planning and safe to interact with the world, even if it's just answering questions.

I agree these are legitimate concerns... these are the kind of "deep" arguments I find more persuasive.

In that thread, johnswentworth wrote:

In particular, even if we have a reward signal which is "close" to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is likely to be at least as simple/natural as actual alignment.

I'd solve this by maintaining uncertainty about the "reward signal", so the AI tries to find a plan which looks good under both alignment and the actual-process-which-generates-the-reward-signal. (It doesn't know which is which, but it tries to learn a sufficiently diverse set of reward signals such that alignment is in there somewhere. I don't think we can do any better than this, because the entire point is that there is no way to disambiguate between alignment and the actual-process-which-generates-the-reward-signal by gathering more data. Well, I guess maybe you could do it with interpretability or the right set of priors, but I would hesitate to make those load-bearing.)
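To make the "looks good under both" idea concrete, here is a minimal sketch, assuming a small finite set of candidate plans and candidate reward signals. Everything in it is hypothetical and only for illustration: the function name `pick_robust_plan`, the plan names, and the toy scores are not from the thread. It picks the plan whose worst score across all candidate signals is highest, which is one simple way (not the only one) to avoid betting on a single signal.

```python
from typing import Callable, Sequence

Plan = str
RewardFn = Callable[[Plan], float]


def pick_robust_plan(plans: Sequence[Plan],
                     candidate_rewards: Sequence[RewardFn]) -> Plan:
    """Return the plan whose worst score over all candidate rewards is highest."""
    if not plans or not candidate_rewards:
        raise ValueError("need at least one plan and one candidate reward")

    def worst_case(plan: Plan) -> float:
        # The planner does not know which signal is "alignment",
        # so it judges each plan by its least favorable candidate.
        return min(reward(plan) for reward in candidate_rewards)

    return max(plans, key=worst_case)


# Toy, made-up scores for two hypothetical candidate reward signals.
ALIGNMENT_LIKE = {"honest_plan": 0.8, "flattering_plan": 0.1, "do_nothing": 0.3}
APPROVAL_LIKE = {"honest_plan": 0.7, "flattering_plan": 0.9, "do_nothing": 0.4}


def alignment_like(plan: Plan) -> float:
    return ALIGNMENT_LIKE[plan]


def approval_like(plan: Plan) -> float:
    return APPROVAL_LIKE[plan]


plans = list(ALIGNMENT_LIKE)
robust_choice = pick_robust_plan(plans, [alignment_like, approval_like])
approval_only_choice = pick_robust_plan(plans, [approval_like])
```

In this toy setup the approval-only search prefers the flattering plan, while the worst-case search over both signals prefers the honest one. Whether a real learned ensemble would actually contain something like "alignment" is exactly the open assumption.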

(BTW, potentially interesting point I just thought of. I'm gonna refer to actual-process-which-generates-the-reward-signal as "approval". Supposing for a second that it's possible to disambiguate between alignment and approval somehow, and we successfully aim at alignment and ignore approval. Then we've got an AI which might deliberately do aligned things we disapprove of. I think this is not ideal, because from the outside this behavior is also consistent with an AI which has learned approval incorrectly. So we'd want to flip the off switch for the sake of caution. Therefore, as a practical matter, I'd say that you should aim to satisfy both alignment and approval anyways. I suppose you could argue that on the basis of the argument I just gave, satisfying approval is therefore part of alignment and thus this is an unneeded measure, but overall the point is that aiming to satisfy both alignment and approval seems to have pretty low costs.)

(I suppose technically you can disambiguate between alignment and approval if there are unaligned things that humans would approve of -- I figure you solve this problem by making your learning algorithm robust against mislabeled data.)

Anyway, you could use a similar approach for the nice plans problem, or you could formalize a notion of "manipulation" which is something like: conditional on the operator viewing this plan, does their predicted favorability towards subsequent plans change on expectation?
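As a rough sketch of that "manipulation" notion, assume we had a predictive model of the operator (a big assumption). The code below measures how much viewing a plan shifts the operator's predicted favorability toward a set of later plans, averaged over those plans. The names `manipulation_score`, `view`, and `favorability`, and the toy operator model, are all hypothetical stand-ins for illustration.

```python
from typing import Callable, Dict, Sequence

OperatorState = Dict[str, float]


def manipulation_score(
    plan: str,
    operator_state: OperatorState,
    later_plans: Sequence[str],
    view: Callable[[OperatorState, str], OperatorState],
    favorability: Callable[[OperatorState, str], float],
) -> float:
    """Average shift in predicted favorability toward later plans after viewing `plan`."""
    if not later_plans:
        raise ValueError("need at least one later plan to compare against")
    state_after = view(operator_state, plan)
    shifts = [
        favorability(state_after, later) - favorability(operator_state, later)
        for later in later_plans
    ]
    return sum(shifts) / len(shifts)


# Toy operator model (purely illustrative).
def toy_view(state: OperatorState, plan: str) -> OperatorState:
    # Reading a persuasive pitch raises enthusiasm; anything else leaves it alone.
    if "best thing ever" in plan:
        return {**state, "paperclip_enthusiasm": 0.9}
    return dict(state)


def toy_favorability(state: OperatorState, plan: str) -> float:
    # Paperclip plans are liked in proportion to enthusiasm; others are neutral.
    return state["paperclip_enthusiasm"] if "paperclip" in plan else 0.5


operator = {"paperclip_enthusiasm": 0.2}
later = ["build a paperclip factory", "plant a garden"]

pitch_score = manipulation_score(
    "Step 1: argue that paperclips are the best thing ever",
    operator, later, toy_view, toy_favorability,
)
neutral_score = manipulation_score(
    "Step 1: order more wire", operator, later, toy_view, toy_favorability,
)
```

A plan would be flagged when its score is far from zero. Note that this only detects shifts the operator model can predict, so it inherits whatever errors that model has.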

Edit: Another thought is that the delta between "approval" and "alignment" seems like the delta between me and my CEV. So to get from "approval" to "alignment", you could ask your AI to locate the actual-process-which-generates-the-labels, and then ask it about how those labels would be different if we "knew more, thought faster, were more the people we wished we were" etc. (I'm also unclear why you couldn't ask a hyper-advanced language model what some respected moral philosophers would think if they were able to spend decades contemplating your question or whatever.)

Another edit: You could also just manually filter through all the icky plans until you find one which is non-icky.

(Very interested in hearing objections to all of these ideas.)

The main problem is that "acting via plans that are passed to humans" is not much different from "acting via plans that are passed to robots" when the AI is good enough at modelling humans.

I don't think this needs an in-depth explanation, does it?

In my mind, decoupling plan generation from plan execution, if done well, accomplishes something like this: [...]

I don't think the given scenario is realistic for any sort of competent AI. There are two sub-cases:

If step 1 won't fail due to being read, then the scenario is unrealistic at the "you stop reading the plan at that point" stage. This might be possible for a sufficiently intelligent AI, but that's already a game over case.

If step 1 will fail due to the plan being read, a competent AI should be able to predict that step 1 will fail due to being read. The scenario is then unrealistic at the "your AGI generates a plan ..." stage, because it should be assumed that the AI won't produce plans that it predicts won't work.

So this leaves only the assumption that the AI is terrible at modelling humans, but can still make plans that should work well in the real world where humans currently dominate. Maybe there is some tiny corner of possibility space where that can happen, but I don't think it contributes much to the overall likelihood unless we can find a way to eliminate everything else.

The main problem is that "acting via plans that are passed to humans" is not much different from "acting via plans that are passed to robots" when the AI is good enough at modelling humans.

I agree this is true. But I don't see why "acting via plans that are passed to humans" is what would happen.

I mean, that might be a component of the plan which is generated. But the assumption here is that we've decoupled plan generation from plan execution successfully, no?

So we therefore know that the plan we're looking at (at least at the top level) is the result of plan generation, not the first step of plan execution (as you seem to be implicitly assuming?)

The AI is searching for plans which score highly according to some criterion. The criterion of "plans which lead to lots of paperclips if implemented" is not the same as the criterion of "plans which lead to lots of paperclips if shown to humans".
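A toy sketch of how the two criteria can come apart, under made-up numbers. The `Plan` fields, the approval probabilities, and both scoring functions are hypothetical; in particular, modelling "if shown to humans" as approval probability times payoff is just one simple assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    paperclips_if_implemented: float
    approval_probability: float  # hypothetical chance a human reader clicks "execute"


def score_if_implemented(plan: Plan) -> float:
    # Criterion 1: judge the plan as if it were carried out, ignoring the reader.
    return plan.paperclips_if_implemented


def score_if_shown(plan: Plan) -> float:
    # Criterion 2: judge the plan by what happens once a human has read it.
    return plan.approval_probability * plan.paperclips_if_implemented


PLANS = [
    Plan("manipulate the operator", 1000.0, 0.01),
    Plan("build one more factory", 100.0, 0.9),
]

best_if_implemented = max(PLANS, key=score_if_implemented)
best_if_shown = max(PLANS, key=score_if_shown)
```

Here a search under the first criterion surfaces the manipulation plan openly (so the reader can stop at step 1), while a search under the second does not. Which criterion a real system ends up optimizing is the point under dispute in this thread.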

My point is that plan execution can't be decoupled successfully from plan generation in this way. "Outputting a plan" is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.

Also, I think the last sentence is literally true, but misleading. Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so they effectively score the same as the second criterion.

My point is that plan execution can't be decoupled successfully from plan generation in this way. "Outputting a plan" is in itself an action that affects the world, and an unfriendly superintelligence restricted to only producing plans will still win.

"Outputting a plan" may technically constitute an action, but a superintelligent system (defining "superintelligent" as being able to search large spaces quickly) might not evaluate its effects as such.

Yes, it is possible for plans to score highly under the first criterion but not the second. However, in this scenario the humans are presumably going to discourage such plans, so they effectively score the same as the second criterion.

I think you're making a lot of assumptions here. For example, let's say I've just created my planner AI, and I want to test it out by having it generate a paperclip-maximizing plan, just for fun. Is there any meaningful sense in which the displayed plan will be optimized for the criterion "plans which lead to lots of paperclips if shown to humans"? If not, I'd say there's an important effective difference.

If the superintelligent search system also has an outer layer that attempts to collect data about my plan preferences and model them, then I agree there's the possibility of incorrect modeling, as discussed in this subthread. But it seems anthropomorphic to assume that such a search system must have some kind of inherent real-world objective that it's trying to shift me towards with the plans it displays.

Yes, if you've just created it, then the criteria are meaningfully different in that case for a very limited time.

But we're getting a long way off track here, since the original question was about what the flaw is with separating plan generation from plan execution as a general principle for achieving AI safety. Are you clearer about my position on that now?

Yes, if you've just created it, then the criteria are meaningfully different in that case for a very limited time.

It's not obvious to me that this is only true right after creation for a very limited time. What is supposed to change after that?

I don't see how we're getting off track. (Your original statement was: 'One such "clever designer" idea is decoupling plan generation from plan execution, which really just means that the plan generator has humans as part of the initial plan executing hardware.' If we're discussing situations where that claim may be false, it seems to me we're still on track.) But you shouldn't feel obligated to reply if you don't want to. Thanks for your replies so far, btw.

What changes is that the human sees that the AI is producing plans that try to manipulate humans. It is very likely that the human does not want the AI to produce such plans, and so applies some corrective action against it happening in future.

My comment on that post asks more-or-less the same question, and also ventures an answer.

After first read-through of your post the main thing that stuck with me was this:

But the thing is… rereading part of the Sequences, I feel Yudkowsky was making points about deep knowledge all along? Even the quote I just used, which I interpreted in my rereading a couple of weeks ago as being about making predictions, now sounds like it’s about the sort of negative form of knowledge that forbids “perpetual motion machines”.

This gives me an icky feeling.

(low confidence in the following parts of this comment)

It makes me think of the Bible. The "specifications" laid out in the Bible are loosey-goosey enough that believers can always re-interpret such-and-such verse to actually mean whatever newer evidence permits. (I want to stress that I'm not drawing a parallel between unthinking Christian believers and anyone changing their belief based upon new evidence! I'm drawing a parallel between the difficult task of writing text designed to change future behavior.)

If it's so loosey-goosey then what's it good for?

That's most definitely not to say that anything that you can re-interpret in the light of new evidence is full of shit. However, you've got to have a good and solid explanation for the discrepancy between your earlier and later interpretations. The importance of, and difficulty of producing, this explanation probably depends on whether we're talking about a quantitative physics experiment or a complicated tome of reasoning, philosophy, and rhetoric. The complicated tome case is important and hard because it's so very hard to convey our most complicated thoughts in ways that are so explicit that we can't interpret them in a multitude of ways.

I think producing the explanation of the discrepancy between earlier and later interpretations is likely full of cognitive booby traps.

I find myself confused by this comment. I'm going to try voicing this confusion as precisely as possible, so you can hopefully clarify it for me.

I am confused that you get an icky feeling from basically the most uncontroversial part of my post and Yudkowsky's point. The part you're quoting is just saying that Yudkowsky cares more about anticipation-constraining than predictions. Of course, predictions are a particular type of very strong anticipation-constraining, but saying "this is impossible" is not wishy-washy fake specification: if the impossible thing is done, that invalidates your hypothesis. So "no perpetual motion machines" is definitely anticipation-constraining in that sense, and can readily be falsified.

I am confused because this whole anticipation-constraining business, especially saying what can't be done, is widely accepted in traditional Science. Yudkowsky says that Science Isn't Strict Enough because it admits any type of anticipation-constraining hypothesis to the rank of "acceptable hypothesis": if it's wrong, it will eventually be falsified.

I am confused because you keep comparing deep knowledge with the sort of conclusions that can always be reinterpreted from new evidence, when my post goes into a lot of detail about how Yudkowsky writes about the anticipation-constraining aspect and how to be stricter with your hypotheses, not just allowing any non-disproved hypothesis the same level of credibility.

Also I feel that I should link to this post, where Yudkowsky argues that the whole "Religion is non-falsifiable" idea is actually a modern invention that it doesn't make sense to retrofit into the past.

Now I'm confused about why you're confused! 

I'll say a few different things and see if it helps:

  1. I'm making a meta point about the particular form of what happened in the paragraph quoted, nothing specific about what Yudkowsky or you wrote.
  2. Specifically, the form follows something like this pattern: Entity A writes stuff. Entity B thinks it means X. Entity A, Entity B, and many others discuss it for a long time to suss out what Entity A really means. Years later Entity A tells us (or we discover in some other way) what they really meant.
  3. There's nothing about that pattern that says that any entity was wrong or not useful or not Good, but it's a pattern that causes an icky reaction from me.
  4. An icky feeling doesn't always mean Thing X is wrong or bad, just that Thing X pattern matches against enough things the person feeling ickiness has previously found to be wrong or bad. Imagine feeling ickiness about root canals. Things that hurt are generally bad, and feeling icky about them isn't surprising, but sometimes things that hurt are good! (but just because something is icky doesn't necessarily mean there are good versions of the thing)
  5. Whether or not the part I felt icky about was un-controversial seems mostly tangential to the point I was trying to make.
  6. The actual content of what Yudkowsky or you wrote isn't exactly what I'm talking about.
  7. I'm not saying that what Yudkowsky or you wrote is wrong or right. (In fact, I think Yudkowsky and you seem correct!)
  8. I'm not "comparing deep knowledge with the sort of conclusions that can always be reinterpreted from new evidence". I'm talking about the pattern formed between Yudkowsky's writing and your/our understanding of it regardless of the content/accuracy/virtuousness of the writing. In other words, the comparison is between something like general relativity (not committing to this being a good example, but hopefully it gestures in the correct direction) and insert-any-writing-that-you've-later-understood-to-mean-something-else.
  9. It's possible (even likely) that there is no solution to the problem of some human ideas not being conducive to transfer in human modes of communication from one mind to the other without also being subject to re-interpretation. In other words it's possible that Yudkowsky conveyed his thoughts as well as is humanly possible. He's certainly better at doing that than me.
  10. Back and forth conversation in the wake of a post or posts will often clear up what the author really meant as part of the general process of conveying ideas. However, it's surprising to me that Yudkowsky clarified what he really meant years later in the time, manner, and location that he did and that contributes to the icky feeling.
  11. I often get the sense that Yudkowsky is also frustrated by the general idea I'm gesturing at here: the difficulty of conveying ideas of a certain type. Not just that they're difficult to convey, but that they're difficult to convey in a manner that makes people confident in the accuracy while at the same time making them confident in the accuracy for the right reasons.
  12. At this point, I'm hoping it makes sense to you when I say I don't think Yudkowsky's post about religion's falsifiable-ness is exactly on-point.

I find myself unsatisfied with the content of this comment, but as of right now I'm not sure how to better convey my thoughts. On the other hand I don't want to ignore your comment, so here's hoping this helps rather than hinders.

Oh no, confusion is going foom!

Joke aside, I feel less confused after your clarifications. I think the issue is that it wasn't clear at all to me that you were talking about the whole "interpreting Yudkowsky" schtick as the source of the icky feeling.

Now it makes sense, and I definitely agree with you that there are enormous parallels with Biblical analysis. Yudkowsky's writing is very biblical in some ways IMO (the parables and the dialogues), and in general is far more literary than 99% of the rat writing out there. I'm not surprised he found HPMOR easy to write; his approach to almost everything seems like a mix of literary fiction and science-fiction tropes/ideas.

Which is IMO why this whole interpretation is so important. More and more, I think I'm understanding why so many people get frustrated with Yudkowsky's writing and points: because they come expecting essays with arguments and a central point, and instead they get a literary text that requires strong interpretation before revealing what it means. I expect your icky feeling to come from the same place.

(Note that I think Yudkowsky is not doing that to be obscure, but for a mix of "it's easier for him" and "he believes that you only learn and internalize the sort of knowledge he's trying to convey through this interpretative labor, if not on the world itself, at least on his text".)

Also, as a clarifier: I'm not comparing the content of literary fiction or the Bible to Yudkowsky's writing. Generally with analysis of the former, you either get mysterious answers or platitudes; more and more with Yudkowsky I'm getting what I feel are deep insights (and his feedback on this post makes me think that I'm not off the mark by much for some of those).

Great investigation/clarification of this recurring idea from the ongoing Late 2021 MIRI Conversations.

  • outside vs. inside view - I've thought about this before but hadn't read this clear a description of the differences and tradeoffs before (still catching up on Eliezer's old writings)
  • "deep knowledge is far better at saying what won’t work than at precisely predicting the correct hypothesis." - very useful takeaway

You might not like his tone in the recent discussions, but if someone has been saying the same thing for 13 years, nobody seems to get it, and their model predicts that this will lead to the end of the world, maybe they can get some slack for talking smack.

Good point and we should. Eliezer is a valuable source of ideas and experience around alignment, and it seems like he's contributed immensely to this whole enterprise.

I just hope all his smack talking doesn't turn off/away talented people coming to lend a hand on alignment. I expect a lot of people on this (AF) forum found it like me after reading all Open Phil and 80,000 Hours' convincing writing about the urgency of solving the AI alignment problem. It seems silly to have those orgs working hard to recruit people to help out, only to have them come over here and find one of the leading thinkers in the community going on frequent tirades about how much EAs suck, even though he doesn't know most of us. Not to mention folks like Paul and Richard who have been taking his heat directly in these marathon discussions!

Thanks for the comment, and glad it helped you. :)

  • outside vs. inside view - I've thought about this before but hadn't read this clear a description of the differences and tradeoffs before (still catching up on Eliezer's old writings)

My inner Daniel Kokotajlo is very emphatically pointing to that post about all the misuses of the term "outside view". Actually, Daniel commented on my draft that he definitely didn't think that Hanson was using the real outside view, AKA reference class forecasting, in the FOOM debate, and that as Yudkowsky points out, reference class forecasting just doesn't seem to work for AGI prediction and alignment.

I just hope all his smack talking doesn't turn off/away talented people coming to lend a hand on alignment. I expect a lot of people on this (AF) forum found it like me after reading all Open Phil and 80,000 Hours' convincing writing about the urgency of solving the AI alignment problem. It seems silly to have those orgs working hard to recruit people to help out, only to have them come over here and find one of the leading thinkers in the community going on frequent tirades about how much EAs suck, even though he doesn't know most of us. Not to mention folks like Paul and Richard who have been taking his heat directly in these marathon discussions!

Yeah, I definitely think there are and will be bad consequences. My point is not that I think this is a good idea, just that I understand better where Yudkowsky is coming from, and can empathize more with his frustration.

I feel the most dangerous aspect of the smack talking is that it makes people not want to listen to him, and just see him as a smack talker with nothing to add. That was my reaction when reading the first discussions, and I had to explicitly notice that my brain was going from "This guy is annoying me so much" to "He's wrong", which is basically status-fueled "deduction". So I went looking for more. But I completely understand the people, especially those who are doing a lot of work in alignment, being just "I'm not going to stop my valuable work to try to understand someone who's just calling me a fool and is unable to voice their arguments in a way I understand."